Before reading this, be sure to check the prior posts in the series:
I first started building my PDF search engine so the Jina AI community could have an example to fork and adapt to their needs. PDF search is a popular topic after all.
That said, there are so many different kinds of PDF content, and so many different use cases, that one size doesn’t fit all.
My initial version is here:
It aimed to:
- Be general purpose and work well with any kind of PDF data (emphasis on work well: just because it returns results doesn’t mean they’re good results — it needs to return quality results).
- Search cross-modally, so you could use image/text as both input/output.
In short, the kitchen sink approach.
So, how amazing was your first version?
The result was…sub-optimal. I used a few PDFs I downloaded from Wikipedia as an example dataset and went from there:
- Wikipedia page on chocolate
- Wikipedia page on rabbits
- Wikipedia page on tigers
- Wikipedia page on Linux
- Wikipedia page on cake
But I forgot the cardinal rule of data science: kicking your data into shape for your use case is like 90% of the work. And since every potential use case has wildly different data, it doesn’t make much sense to spend hours kicking Wikipedia PDFs into shape. Especially since I got bad results when it came to the search.
For example, when searching “rabbit ears”:
- I would get short text snippets that were bare URLs (which I assume were picked up from endnotes in the article).
- Or I would get strings just a few words long (which I assume were titles).
- Most of these were just about rabbits, with no mention of ears, despite sentences about rabbits’ ears being indexed.
- The most relevant matches got worse scores than some less-relevant matches (with cosine distance, a lower score means higher relevance).
- No matter what I typed, I never got any images returned (despite typing descriptions of said images and them definitely being indexed).
On the plus side, it did bring up stuff about rabbits instead of chocolate, but that’s all that can be said for it.
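A quick aside on that scoring convention, since it trips people up: cosine similarity goes up as vectors align, while cosine distance (one minus similarity) goes down. Here’s a minimal sketch with made-up three-dimensional vectors — real embeddings have hundreds of dimensions:

```python
# Toy illustration of cosine similarity vs. cosine distance.
# The vectors below are invented for the example, not real embeddings.
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: LOWER score means a CLOSER match."""
    return 1.0 - cosine_similarity(a, b)

query = [1.0, 0.0, 1.0]
close_match = [0.9, 0.1, 1.1]  # points in nearly the same direction as the query
poor_match = [0.0, 1.0, 0.1]   # points somewhere else entirely

# The close match should get the lower (better) distance.
assert cosine_distance(query, close_match) < cosine_distance(query, poor_match)
```

So a good search result should surface the lowest-distance matches first; in my case the ordering looked inverted relative to actual relevance.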
Why does it suck?
- Perhaps the encoder or model itself isn’t great (I’m not convinced of this — CLIP isn’t ideal for text, but it’s serviceable at least, as can be seen in our fashion search)
- Perhaps the dataset itself is just too small (I don’t believe this. Sure, it’s just a few PDFs but it’s broken down into several thousand chunks, many of which are full sentences or images)
- Perhaps the index is so full of junk (endnotes, text fragments, references to book pages) rather than “real content” (sentences about rabbit ears, etc) and the encoder just throws up its hands in despair because it can’t make any sense of most of it. (I have a hunch this may be the main issue)
- Some other reason? I was trying to do so much with the initial version that it could be down to the phase of the moon for all I know. Your ideas are welcome!
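If the junk-in-the-index hunch is right, a cheap experiment would be pre-filtering chunks before they ever reach the encoder. This is a hypothetical sketch — the heuristics and thresholds are my guesses, not something battle-tested:

```python
import re

# Hypothetical pre-filter for index chunks. MIN_WORDS and the URL pattern
# are illustrative guesses; real PDFs would need tuning.
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")
MIN_WORDS = 5  # drop stray titles and tiny fragments

def looks_like_real_content(chunk: str) -> bool:
    """Keep chunks that read like sentences; drop bare URLs and fragments."""
    if URL_ONLY.match(chunk):
        return False
    if len(chunk.split()) < MIN_WORDS:
        return False
    return True

chunks = [
    "http://en.wikipedia.org/wiki/Rabbit",  # bare URL, e.g. from the endnotes
    "European rabbit",                      # probably a section title
    "A rabbit's ears can rotate to pinpoint the source of a sound.",
]
clean = [c for c in chunks if looks_like_real_content(c)]
# Only the full sentence survives the filter.
```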
But we’re not retreating. Never give up, never surrender and all that. We’re just going to re-think the use case.
What are other people trying to build? And how do we adapt our PDF search to fit?
I’ve been in discussions with the community about PDF search engines and their use cases. So far most people want to focus on just text. That’s a good thing!
You see, previously we were trying to search both text and images, so we needed an encoder that could embed both into a common vector space — namely CLIPEncoder.
In my experience CLIP is pretty dang good at images, but my experience with text (i.e. the PDF search engine I’ve been building) has been not so great.
What can we use instead of CLIP?
If we’re only dealing with text, we can use other, text-only encoders instead.
I’ve used a couple of these before, and I’m particularly fond of the spaCy-based one. When I was indexing meme captions, it took about four minutes to blast through 200,000 memes on my laptop. I like that kind of speed.
What exactly will we build?
I think a ground-up approach is going to be better than a “let’s do everything” approach here. We’ll be able to see what works more clearly since there are fewer moving parts.
So we’ll build a search engine for simple text PDFs that don’t require too much pre-processing like:
- Removing page numbers.
- Removing endnotes, footnotes.
- Dealing with lots of references and unexpected punctuation (like Fly-fishing for Dummies, 1988 Penguin Press, A. Albrechtson et al, pp. 3–8. http://penguin.com/flyfishing).
- Merging text chunks between page breaks.
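Those cleanup steps could start out as a handful of simple heuristics. The regexes below are illustrative guesses, not a tested pipeline — real PDFs will need per-document tuning:

```python
import re

# Sketch of the cleanup steps above. All patterns are illustrative guesses.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$", re.MULTILINE)  # lines that are just a page number
FOOTNOTE_MARK = re.compile(r"\[\d+\]")                  # [3]-style reference markers
HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")               # words split across line breaks

def clean_pdf_text(text: str) -> str:
    text = PAGE_NUMBER.sub("", text)
    text = FOOTNOTE_MARK.sub("", text)
    text = HYPHEN_BREAK.sub(r"\1\2", text)  # re-join hyphenated words
    # Merge lines broken mid-sentence (e.g. across a page break); \n+ also
    # swallows the blank line left behind by a removed page number.
    text = re.sub(r"(?<=[a-z,])\n+(?=[a-z])", " ", text)
    return text

raw = "Rabbits have long\n42\nears that swivel indepen-\ndently.[3]"
cleaned = clean_pdf_text(raw)
```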
In short, stripping out everything that could trip up the encoder. Once that’s up and running, we can start thinking about:
- More complex PDFs (ramping up gradually).
- Multilingual search (models already exist for this).
- Searching text and images.
Ideas for simple PDFs are more than welcome! Drop a reply to this tweet if you have anything in mind, or ping me on our community Slack!
Next time
Stay tuned for part 4 in this seemingly infinite saga of blog posts, where we’ll refactor our code, choose a simpler dataset, and go from there. Fingers crossed!