New Jina primitive data types
Some time ago I wrote a post on how to extract data from a PDF in Jina. A lot has happened since then, and we now have our own Primitive Data Types in Jina 🎉🎉🎉 So I decided to refactor those tests and write this little post on everything you’ll need now.
I’ll be honest: every time something needs to be refactored, I feel the end of the world is near. But don’t worry, this time it’s actually fast and painless, and you’ll find the code looks better at the end.
The first thing we need to know is which data types we have in Jina now. There is already a great post that covers that, so here we will only talk about what needs to change in the PDFExtractor tests.
So let’s take a look at our code. The first thing we had to do was import the modules needed to work with Protobuf:
from jina.drivers.helper import array2pb
from jina.proto import jina_pb2
But using Protobuf directly? Pfff! That’s so pre-pandemic, ain’t nobody got time for that. So we change it to:
from jina import Document
And that is of course much cleaner! And to use it, instead of having:
d = jina_pb2.Document()
we simply use it as a normal object:
d = Document()
And just as in the original example, we have two ways to access our file: we receive either the path of the PDF or its bytes directly. To handle both, we just need to check which one we are receiving:
def search_generator(path: str, buffer: bytes):
    d = Document()
    d.update_id()
    if buffer:
        d.buffer = buffer
    if path:
        d.content = path
    yield d
What we did here was simply create our Document and update its ID. After that it is ready to use, either directly with the bytes or with the URI.
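To see that branching in isolation, here is a minimal sketch that runs without Jina installed. StandInDocument is a hypothetical placeholder I made up for this illustration. It is not Jina’s Document, it only stores the two attributes we touch, and it skips update_id(). The generator body otherwise mirrors the one above.

```python
from typing import Iterator, Optional


class StandInDocument:
    """Hypothetical stand-in for illustration only; NOT jina.Document."""

    def __init__(self) -> None:
        self.buffer: Optional[bytes] = None
        self.content: Optional[str] = None


def search_generator(
    path: Optional[str], buffer: Optional[bytes]
) -> Iterator[StandInDocument]:
    d = StandInDocument()
    # Whichever input we received ends up on the document.
    if buffer:
        d.buffer = buffer
    if path:
        d.content = path
    yield d


# Call it with raw bytes only, or with a path only (both values are made up).
doc_from_bytes = next(search_generator(path=None, buffer=b"%PDF-1.4 fake bytes"))
doc_from_path = next(search_generator(path="some_file.pdf", buffer=None))
```

With the real Document the calls would look the same; only the class changes.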
If you need the data in a different format, you can also use Jina’s convert_ methods, for example:
def search_generator(path: str, buffer: bytes):
    d = Document()
    d.update_id()
    if path:
        # set the URI first so there is something to convert
        d.content = path
        # I could convert this URI to buffer if needed
        d.convert_uri_to_buffer()
    yield d
This wasn’t necessary for this example, but it could be useful in other cases. And as I said, we also have this very detailed post with all the information you need to start using our new data types 🦄🦄🦄
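Conceptually, turning a local file URI into a buffer just means reading the file’s bytes. The sketch below shows that idea in plain Python. read_uri_as_bytes is a hypothetical helper named for this post, not Jina’s implementation of convert_uri_to_buffer, and it assumes the URI is a path to a local file.

```python
import os
import tempfile


def read_uri_as_bytes(uri: str) -> bytes:
    """Hypothetical helper: read a local file path into bytes."""
    with open(uri, "rb") as f:
        return f.read()


# Write a tiny throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 demo")
    tmp_path = tmp.name

buffer = read_uri_as_bytes(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
```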
And that’s it! This is the first time in my life that refactoring actually feels nice, so I hope this works for you too.
Don’t forget to check our other examples to see what other new things we have now.