How To: Extract data from a PDF in Jina

Susana Guzmán

Hello everyone!

Over the past few weeks I worked on a PDFExtractor, so today I’ll show you what I learned and how to extract data from your PDFs in Jina. Along the way I also learned what a Segmenter does and how it differs from a Crafter, so maybe that helps you too.

I will split this post into two sections: the first covers how it works, and the second shows how you can use it. If you are not interested in how exactly this works right now, or you already know and just want to see the Flow example, go straight to the second part. Otherwise, let’s get into the code.

1) What kind of sorcery is this? (aka how does that work?)

OK, the file we’ll be talking about today is the PDFExtractorSegmenter. As you can see, it doesn’t even have anything in the __init__ function; everything happens in the craft function.

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def craft(self, uri: str, buffer: bytes, *args, **kwargs) -> List[Dict]:
        import fitz
        import PyPDF2

        # Open file
        if uri:
            ...  # open the PDF from its path
        elif buffer:
            ...  # open the PDF from the raw bytes
        else:
            raise ValueError('No value found in "buffer" or "uri"')

        chunks = []
        # Extract images
        with pdf_img:
            ...  # append the image chunks

        # Extract text
        with pdf_text:
            ...  # append the text chunks
        return chunks

So the first thing I noticed here* was that the Segmenter returns a List[Dict] instead of a Dict as Crafters do. Why is that? One of the core differences between a Crafter and a Segmenter is that the Crafter transforms documents 1-to-1, while the Segmenter transforms them 1-to-n.

*When I say “the first thing I noticed here”, what I actually mean is “after I finished everything, I realised it wasn’t working and I had no idea why. It turns out I couldn’t use a Crafter because a PDF can have images and text, which are multiple documents, and therefore I needed a Segmenter”. But as you can see, that was too long to write.
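
To make the 1-to-1 versus 1-to-n difference concrete, here is a minimal sketch in plain Python. It does not use Jina at all: `craft_one_to_one` and `segment_one_to_many` are hypothetical stand-ins I made up for illustration, the input dict is a toy document rather than a real PDF, and the `image/png` mime type is an assumption.

```python
from typing import Dict, List


def craft_one_to_one(doc: Dict) -> Dict:
    """Crafter-style: one document in, one (modified) document out."""
    return {**doc, 'text': doc['text'].strip().lower()}


def segment_one_to_many(doc: Dict) -> List[Dict]:
    """Segmenter-style: one document in, a list of chunks out."""
    chunks = []
    # Hypothetical toy document: it carries a list of images and one text
    for image in doc.get('images', []):
        chunks.append({'blob': image, 'mime_type': 'image/png'})  # assumed mime type
    if doc.get('text'):
        chunks.append({'text': doc['text'], 'mime_type': 'text/plain'})
    return chunks


toy_pdf = {'text': 'A poem about cats', 'images': ['cat_1', 'cat_2']}
crafted = craft_one_to_one({'text': '  Cats ARE Awesome  '})
chunks = segment_one_to_many(toy_pdf)
print(crafted)
print(len(chunks))
```

Two images plus one text give three chunks from a single input, which is exactly what a function returning a single Dict could not express.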

Once we know this, we see why we need a Segmenter here. OK, got it, now what? The next part is to load the PDF document. In Jina we can use the URI of the document, such as “cats_are_awesome.pdf”, or directly the buffer data from the file. So the first step is to check what kind of input we are using:

        # Open file (this snippet uses io, so the module needs "import io")
        if uri:
            pdf_img = fitz.open(uri)
            pdf_text = open(uri, 'rb')
        elif buffer:
            pdf_text = io.BytesIO(buffer)
            pdf_img = fitz.open(stream=buffer, filetype="pdf")
        else:
            raise ValueError('No value found in "buffer" or "uri"')

At the time of writing, I didn’t find a good way to extract both images and text from a PDF with a single library, so I decided to use two libraries that each work very well for one of them: PyMuPDF (imported as fitz) for images and PyPDF2 for text. If you know of one that can handle both, you are super welcome to let us know or make a PR on our repo, we are all for Open Source 🎉. That’s why you see that we read the file twice: once to extract the images and once to extract the text.
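
Reading the same data twice does not mean the two readers get in each other’s way. Here is a minimal sketch with only the standard library, where `fake_pdf` is a made-up stand-in for real PDF bytes: each `io.BytesIO` wrapper keeps its own read position, so one consumer can read from its stream without affecting another one working on the same buffer.

```python
import io

# Made-up stand-in for the raw bytes of a PDF file
fake_pdf = b'%PDF-1.4 pretend the rest of the file follows'

# Two independent file-like views over the same bytes
text_stream = io.BytesIO(fake_pdf)
image_stream = io.BytesIO(fake_pdf)

header = text_stream.read(8)  # moves only text_stream's position
print(header, text_stream.tell(), image_stream.tell())
```

In the Segmenter itself only the text reader gets a `BytesIO`; fitz receives the bytes directly through its `stream` argument. The second wrapper here is just to show that the views are independent.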

The next part is pretty straightforward: we simply use those libraries and append each result to our list of chunks (aka documents). When we append a chunk, we also set whether it is text or a blob (the images, in this case) and its mime_type.

if text:
    chunks.append(dict(text=text, weight=1.0, mime_type="text/plain"))

2) Sure sure, but how can I use it?????

Good, now that we know how that works we can go on our way to extract all beautiful c̶a̶t̶s̶ data from our PDFs.

In the tests folder we can see two tests; one of them also shows how the PDFExtractorSegmenter works within your Flow.

So let’s say you have your “cats_are_awesome.pdf”, which has images (of cats, shocking) and text (a cat poem, no less), and you want everything from it:

def test_pdf_flow_mix():
    path = os.path.join(cur_dir, 'cats_are_awesome.pdf')
    f = Flow().add(uses='PDFExtractorSegmenter', array_in_pb=True)
    with f:
        f.search(input_fn=search_generator(path=path, buffer=None), output_fn=validate_mix_fn)

So the first thing will be to create your Flow:

`f = Flow().add(uses='PDFExtractorSegmenter', array_in_pb=True)`

There we say that we will use the PDFExtractorSegmenter. For the moment there is no need to go into the array_in_pb details, but set it to True if you are using images and text combined.

Then we can start our search. Because I’m using .search instead of .search_lines or similar, I need to create the protobuf document manually, and I do that in search_generator:

def search_generator(path: str, buffer: bytes):
    d = jina_pb2.Document()
    if buffer:
        d.buffer = buffer
    if path:
        d.uri = path
    yield d

Only thing to do there is specify the buffer or the URI to be used. After this, I will validate the results:
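
If you would rather send the file contents than a path, a sketch like the following could produce the `buffer` argument. `read_pdf_bytes` is a hypothetical helper, not part of Jina, and the `%PDF` check is only a cheap sanity test on the file header, not real validation.

```python
import os
import tempfile


def read_pdf_bytes(path: str) -> bytes:
    """Read a file into memory and check that it starts with the PDF header."""
    with open(path, 'rb') as fp:
        data = fp.read()
    if not data.startswith(b'%PDF'):
        raise ValueError(f'{path} does not look like a PDF')
    return data


# Demo with a tiny fake file; a real call would point at cats_are_awesome.pdf
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pdf')
    with open(demo_path, 'wb') as fp:
        fp.write(b'%PDF-1.4\n% not a real PDF, just a header\n')
    demo_buffer = read_pdf_bytes(demo_path)
```

The result could then be handed to the generator as `search_generator(path=None, buffer=demo_buffer)`.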

def validate_mix_fn(resp):
    for d in resp.search.docs:
        for chunk in range(len(d.chunks) - 1):
            img = Image.open(os.path.join(cur_dir, f'test_img_{chunk}.jpg'))
            blob = d.chunks[chunk].blob
            # Parentheses matter: without them the comparison becomes the assert message
            assert (blob.shape[1], blob.shape[0]) == img.size
        assert expected_text == d.chunks[2].text

Here is where you could take the data from your PDF and do all kinds of amazing stuff with it; I’m just checking that I get the data I expected. The first assert validates the images and the second validates the text.
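
For example, you might want to separate the text chunks from the image chunks before doing anything else. The sketch below uses `SimpleNamespace` objects as stand-ins for the chunks in the response so that it runs without Jina. `split_chunks` is a hypothetical helper, and it assumes each chunk exposes `mime_type`, `text` and `blob`, with `image/png` as an assumed mime type for the images.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def split_chunks(chunks: Iterable) -> Tuple[List[str], List]:
    """Return (texts, blobs), sorted out by each chunk's mime_type."""
    texts, blobs = [], []
    for chunk in chunks:
        if chunk.mime_type.startswith('text/'):
            texts.append(chunk.text)
        else:
            blobs.append(chunk.blob)
    return texts, blobs


# Stand-in chunks: two images and one text, like the test PDF
fake_chunks = [
    SimpleNamespace(mime_type='image/png', blob='cat_0', text=''),
    SimpleNamespace(mime_type='image/png', blob='cat_1', text=''),
    SimpleNamespace(mime_type='text/plain', blob=None, text='A cat poem'),
]
texts, blobs = split_chunks(fake_chunks)
```

Sorting by mime_type rather than by position avoids relying on the text always being the last chunk.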

And that’s it! You saw what this PDFExtractorSegmenter is doing and how to use it in your Flow 💃🏼💃🏼

Feel free to run this with your own data and contact us if you have any questions/comments/pics of your cat to show.

You can follow us on Twitter or Github, or join our Slack community.

By Susana Guzmán on September 24, 2020.
