New Jina primitive data types

Some time ago I wrote this post on how to extract data from a PDF in Jina. A lot has happened since then, and we now have our own Primitive Data Types in Jina 🎉🎉🎉 So I thought I’d refactor those tests and write this little post on everything you’ll need now.

I’ll be honest: every time something needs to be refactored, I feel like the end of the world is near. But don’t worry, this time it’s actually fast and painless, and the code looks better at the end.

The first thing we need to know is which data types Jina has now. There is already a great post that covers that, so here we will only talk about what needs to change in the PDFExtractor tests.

So let’s take a look at our code. The first thing we had to do was import the modules needed to work with Protobuf:

from jina.drivers.helper import array2pb
from jina.proto import jina_pb2

But using Protobuf directly? Pfff! That’s so pre-pandemic, ain’t nobody got time for that. So we change it to:

from jina import Document

And that is of course much cleaner! And to use it, instead of having:

d = jina_pb2.Document()

we simply use it as a normal object:

d = Document()

Just as in the original example, we have two ways to access our file: we receive either the path of the PDF or its bytes directly. To handle both, we just need to check which one we are receiving:

def search_generator(path: str, buffer: bytes):
    d = Document()
    d.update_id()
    if buffer:
        d.buffer = buffer
    if path:
        d.content = path
    yield d

What we did here was simply create our Document and update its ID. After that, it is ready to use with either the bytes or the URI.
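To see how the generator above behaves with each kind of input, here is a minimal sketch that runs without Jina installed. Note that `StubDocument` is a hypothetical stand-in, not Jina’s `Document`: it only mimics the `buffer` and `content` attributes and the `update_id()` call used above, so the branching logic can be checked in isolation. The file name `example.pdf` and the sample bytes are placeholders too.

```python
from typing import Iterator, Optional


class StubDocument:
    """Hypothetical stand-in for jina.Document (NOT the real class)."""

    def __init__(self) -> None:
        self.id: Optional[str] = None
        self.buffer: Optional[bytes] = None
        self.content: Optional[str] = None

    def update_id(self) -> None:
        # Assumption: we only record that an ID was assigned.
        self.id = "stub-id"


def search_generator(path: Optional[str] = None,
                     buffer: Optional[bytes] = None) -> Iterator[StubDocument]:
    d = StubDocument()
    d.update_id()
    if buffer:
        d.buffer = buffer  # raw bytes of the PDF
    if path:
        d.content = path   # path (URI) of the PDF
    yield d


# Same generator, two kinds of input (placeholder values)
doc_from_path = next(search_generator(path="example.pdf"))
doc_from_bytes = next(search_generator(buffer=b"%PDF-1.4 ..."))
```

Whichever argument is passed in is the only field that gets filled, which is all the test needs to tell the two cases apart.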

If you need the data in a different format, you can also use Jina’s convert_ methods, for example:

def search_generator(path: str, buffer: bytes):
    d = Document()
    d.update_id()
    if path:
        # Set the path first so there is a URI to convert
        d.content = path
        # Convert this URI to a buffer if needed
        d.convert_uri_to_buffer()
    yield d

This wasn’t necessary for this example, but it could be useful in other cases. And as I said, we also have this very detailed post with all the information you need to start using our new data types 🦄🦄🦄
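Conceptually, going from a URI to a buffer just means loading the bytes behind that location. The plain-Python sketch below illustrates the idea for a local file path. It is not Jina’s implementation of `convert_uri_to_buffer()`, and the helper name `path_to_buffer` is made up for this example.

```python
import os
import tempfile
from pathlib import Path


def path_to_buffer(path: str) -> bytes:
    """Hypothetical helper: read the file at `path` and return its raw bytes."""
    return Path(path).read_bytes()


# Write a tiny throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 fake content")
    tmp_path = tmp.name

buffer = path_to_buffer(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
```

Once you hold the bytes, you can hand them to the generator as the buffer instead of the path.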

And that’s it! It’s the first time in my life that refactoring actually feels nice, so I hope this works for you too.

Don’t forget to check our other examples to see what other new things we have now.