Tech blog
November 01, 2022

Search PDFs with AI and Python: Part 3

In part 3, I summarize the lessons learned so far: the pitfalls and perils of building a PDF search engine.
Alex C-G • 6 minute read
💡
This article is light on the code, and heavy on the lessons learned. You’ve been warned!

Before reading this, be sure to check the prior posts in the series:

  • Search PDFs with AI and Python: Part 1
  • Search PDFs with AI and Python: Part 2

I first started building my PDF search engine so the Jina AI community could have an example to fork and adapt to their needs. PDF search is a popular topic after all.

That said, there are so many different kinds of PDF content and so many different use cases that one size doesn't fit all.

My initial version is here:

GitHub - alexcg1/example-pdf-search: Search PDFs using Jina, DocArray and Jina Hub

It aimed to:

  • Be general purpose and work well with any kind of PDF data (emphasis on "work well": just because it returns results doesn't mean they're good; it needs to return quality results).
  • Search cross-modally, so you could use images or text as both input and output.

In short, the kitchen sink approach.
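
Roughly, the first version wired everything up in a single Jina Flow. The sketch below is illustrative rather than a copy of the repo: it assumes the PDFSegmenter, CLIPEncoder and SimpleIndexer executors from Jina Hub, and it skips the chunk-level traversal and preprocessing details the real code needs.

```python
from docarray import Document, DocumentArray
from jina import Flow

# Hypothetical "kitchen sink" indexing pipeline (executor names from Jina Hub;
# chunk traversal settings and preprocessing omitted for brevity)
flow = (
    Flow()
    .add(uses='jinahub://PDFSegmenter')   # split each PDF into text and image chunks
    .add(uses='jinahub://CLIPEncoder')    # embed text and images into one vector space
    .add(uses='jinahub://SimpleIndexer')  # store the embeddings for nearest-neighbour search
)

pdfs = DocumentArray(
    [Document(uri='rabbit.pdf'), Document(uri='chocolate.pdf')]
)

with flow:
    flow.post(on='/index', inputs=pdfs)
```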

So, how amazing was your first version?

The result was…sub-optimal. I used a few PDFs I downloaded from Wikipedia as an example dataset and went from there:

  • Wikipedia page on chocolate
  • Wikipedia page on rabbits
  • Wikipedia page on tigers
  • Wikipedia page on Linux
  • Wikipedia page on cake

But I forgot the cardinal rule of data science: kicking your data into shape for your use case is like 90% of the work. And since every potential use case has wildly different data, it doesn’t make much sense to spend hours kicking Wikipedia PDFs into shape. Especially since I got bad results when it came to the search.

For example, when searching “rabbit ears”:

  • I would get short text snippets that were bare URLs (which I assume were picked up from endnotes in the article).
  • Or I would get strings just a few words long (which I assume were titles).
  • Most of these were just about rabbits, with no mention of ears, despite sentences about rabbits' ears being indexed.
  • The most relevant matches got worse scores than some less-relevant ones (with cosine distance, a lower score means a closer, more relevant match).
  • No matter what I typed, I never got any images back, despite typing descriptions of images that were definitely indexed.
The results? Not great

On the plus side, it did bring up stuff about rabbits instead of chocolate, but that’s all that can be said for it.
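
For reference, the queries above were run along these lines. This is a hedged sketch reusing the Flow from the earlier snippet; SimpleIndexer scores matches by cosine distance, so lower values mean closer matches.

```python
from docarray import Document

# Query the index with a text Document and inspect the top matches.
# Note: these are cosine *distances*, so lower = more relevant.
with flow:
    response = flow.post(on='/search', inputs=Document(text='rabbit ears'))

for match in response[0].matches[:5]:
    print(f"{match.scores['cosine'].value:.4f}  {match.text or match.uri}")
```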

Why does it suck?

  • Perhaps the encoder or model itself isn't great (I'm not convinced of this: CLIP isn't ideal for text, but it's at least serviceable, as our fashion search shows).
  • Perhaps the dataset itself is just too small (I don't believe this. Sure, it's just a few PDFs, but they're broken down into several thousand chunks, many of which are full sentences or images).
  • Perhaps the index is so full of junk (endnotes, text fragments, references to book pages) rather than "real content" (sentences about rabbit ears, etc.) that the encoder just throws up its hands in despair, unable to make sense of most of it. (I have a hunch this is the main issue.)
  • Some other reason? I was trying to do so much with the initial version that it could be down to the phase of the moon for all I know. Your ideas are welcome!

But we’re not retreating. Never give up, never surrender and all that. We’re just going to re-think the use case.

We’re not retreating. We’re strategically reconsidering our approach with all due alacrity

What are other people trying to build? And how do we adapt our PDF search to fit?

I’ve been in discussions with the community about PDF search engines and their use cases. So far most people want to focus on just text. That’s a good thing!

You see, previously we were trying to search both text and images, so we needed an encoder that could embed both into a common vector space. Namely, CLIPEncoder:

CLIPEncoder | latest | Jina AI Cloud

In my experience, CLIP is pretty dang good with images, but my experience with text (i.e. the PDF search engine I've been building) has not been so great.
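
The appeal of CLIP is that both modalities land in the same vector space, so a text query can be compared directly against image embeddings. A minimal sketch, assuming the hub CLIPEncoder accepts text Documents and image tensors:

```python
from docarray import Document, DocumentArray
from jina import Flow

docs = DocumentArray([
    Document(text='a rabbit with long ears'),
    Document(uri='rabbit.png').load_uri_to_image_tensor(),  # load pixels for the encoder
])

with Flow().add(uses='jinahub://CLIPEncoder') as f:
    embedded = f.post(on='/', inputs=docs)

# Text and image embeddings share the same dimensionality,
# so cosine similarity between them is meaningful.
print(embedded[0].embedding.shape, embedded[1].embedding.shape)
```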

What can we use instead of CLIP?

If we’re only dealing with text we can use other encoders like:

SpacyTextEncoder | latest | Jina AI Cloud
SpacyTextEncoder: supports several languages, is fast, and is good for general-purpose text.
TransformerTorchEncoder | latest | Jina AI Cloud
TransformerTorchEncoder: supports a whole bunch of languages and special use cases (e.g. medical text search), and can pull any model from Hugging Face.

I’ve used both of these before, and I’m particularly fond of SpaCy. When I was indexing meme captions it took about 4 minutes to blast through 200,000 memes on my laptop. I like that kind of speed.

What exactly will we build?

💡
More brass tacks, less kitchen sink

I think a ground-up approach is going to be better than a “let’s do everything” approach here. We’ll be able to see what works more clearly since there are fewer moving parts.

So we’ll build a search engine for simple text PDFs that don’t require too much pre-processing (a rough cleanup sketch follows the list below), like:

  • Removing page numbers.
  • Removing endnotes, footnotes.
  • Dealing with lots of references and unexpected punctuation (like Fly-fishing for Dummies, 1988 Penguin Press, A.Albrechtson et al, pp.3–8. http://penguin.com/flyfishing).
  • Merging text chunks between page breaks.
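
For a sense of what that cleanup might look like, here's a minimal, hypothetical pass over a text chunk; it's plain regexes, nothing Jina-specific.

```python
import re

def clean_chunk(text: str) -> str:
    """A minimal, hypothetical cleanup pass for chunks from simple text PDFs."""
    text = re.sub(r'\[\d+\]', '', text)                          # endnote markers like [12]
    text = re.sub(r'https?://\S+', '', text)                     # bare URLs from references
    text = re.sub(r'^\s*\d+\s*$', '', text, flags=re.MULTILINE)  # lines that are only a page number
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)                 # re-join words hyphenated at line breaks
    return re.sub(r'\s+', ' ', text).strip()                     # collapse whitespace left by page breaks

print(clean_chunk("Rabbits' ears are long.[3]\n12\nSee http://example.com"))
# -> "Rabbits' ears are long. See"
```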

In short, stripping out everything that could trip up the encoder. Once that’s up and running, we can start thinking about:

  • More complex PDFs (ramping up gradually).
  • Multilingual search (models already exist for this).
  • Searching text and images.

Ideas for simple PDFs are more than welcome! Drop a reply to this tweet if you have anything in mind, or ping me on our community Slack!

Next time

Stay tuned for part 4 in this seemingly infinite saga of blog posts, where we’ll refactor our code, choose a simpler dataset, and go from there. Fingers crossed!
