Building a Search System with Jina and Faiss

So, today let’s create a search system using Jina and Facebook AI Similarity Search, or Faiss for us pals. I’m using the example that is already in the Jina documentation and going a bit deeper into the parts that I had trouble getting right.

But first of all, let’s clone the code from there and make sure to get the docker image too.

Prepare the data

Ok, so now you have the code example and the docker image. You might think this is enough data preparation, but not really, ’cause machine learning REALLY loves preparing data. So now let’s get the dataset we’ll use.

For this example we will use ANN_SIFT10k as our dataset. It is already divided into 3 parts, so we don’t need to worry about doing that ourselves:

  1. Vectors to index
  2. Vectors to train
  3. Vectors to query

We will create a temporary folder and store the dataset there. You can do all that with this command:

./get_siftsmall.sh
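If you want to peek at the vectors before moving on, here is a minimal sketch of a reader for the .fvecs layout the SIFT files are distributed in: each record is a little-endian int32 dimension followed by that many float32 values. The function names (read_fvecs, write_fvecs) and the demo file are made up for illustration, and this is not how the example's own scripts load the data.

```python
import os
import struct
import tempfile


def read_fvecs(path):
    """Read an .fvecs file into a list of float vectors.

    Each record is a little-endian int32 dimension d followed by d float32 values.
    """
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) < 4:
                raise ValueError("truncated record header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector")
            vectors.append(list(struct.unpack("<%df" % dim, payload)))
    return vectors


def write_fvecs(path, vectors):
    """Write vectors in the same layout (only used here to build a demo file)."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack("<%df" % len(vec), *vec))


# Round-trip a tiny made-up file so the sketch runs without the dataset.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fvecs")
write_fvecs(demo_path, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
demo_vectors = read_fvecs(demo_path)
print(len(demo_vectors), "vectors of dimension", len(demo_vectors[0]))
```

To try it on the real data, point read_fvecs at one of the .fvecs files the script downloaded. The exact file names depend on what get_siftsmall.sh leaves in the temporary folder.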

Then we will create a workspace directory, where we take the dataset we just fetched, convert it into binary format, and use it as training data:

./generate_training_data.sh

The cool part after the boring part

So we have a lot of vectors now… yay? Now what? The first thing we’ll do is index the data:

python app.py -t index -n $batch_size

You can use whatever you want as the batch size. I used 30 as a reminder that 30s are the new 20s.

Because we passed the flag “index”, the first thing app.py will do is load the Flow defined in flow-index.yml.

Jina uses Flows as a cool way to abstract high-level tasks, like indexing in this case and querying later in the example. There’s really nice documentation on how this works in the Jina docs. But even if you haven’t seen it and you’ll “check it out later”, you can still get the gist of the example. For the moment, just think of a Flow as a high-level task, and of this one as the indexing task.

In flow-index.yml we use 3 Pods: crafter, encoder, and indexer.

Don’t worry about the crafter and the encoder for now. Let’s check what the indexer does.

The indexer.yml sets the name of the index file (faiss_index.tgz) and the workspace path (./workspace), and configures the indexer as a NumpyIndexer.

Okaaaaaay, that was a lot of stuff, but now we have our index ready.

Query time!

So now that we have our index ready, it’s time to search for something. Since ANN_SIFT10k is already divided, we can use the query subset as is, so there’s nothing else to do besides running the command:

python app.py -t query

Uhm, yeah, ok? And what is happening here? What are those results? What is that recall? What’s the meaning of life? you may ask. I’m so glad you asked, thank you.

So what is happening here is that just as last time we used the flow-index.yml, this time we will use the Flow flow-query.yml, and here is finally where we use the Faiss library.

This time we have 4 Pods: the crafter, the encoder, the indexer, and the ranker.

Let’s keep ignoring the crafter and the encoder for now ¯\_(ツ)_/¯ and see what the indexer is doing, since it’s the interesting part.

Here we see that we are using the docker image we got from the GitHub link, as well as the query-indexer.yml. If we check the query-indexer, we can see it has a FaissIndexer, WHAT A COINCIDENCE, this is exactly what we’re talking about.

Now, the FaissIndexer has 3 parameters:

  • index_key: This determines the type of vector index we use. Here it is IVF10, which builds an inverted file index with 10 clusters. You can swap in a more complex index from the Faiss index factory if you want to trade off speed and accuracy differently for your data.
  • train_filepath: This is the path to our training data. An IVF index has to be trained before vectors can be added to it.
  • ref_indexer: This points to the NumpyIndexer, since that is the format we used when creating the index in the previous step. It is also where we set the name of the file that stores the index.
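To make the IVF10 idea less magical, here is a toy sketch in plain Python of what an inverted file index does. To be clear, this is not Faiss's implementation and not what the FaissIndexer runs, and every function name here is made up for illustration. The training vectors are clustered into 10 centroids with k-means, each indexed vector goes into the list of its nearest centroid, and a query only scans the lists of the closest centroids. Faiss calls that number nprobe.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], vec))


def train_centroids(train_vectors, n_clusters, n_iter=10):
    """Plain k-means on the training set; this is the 'train' step."""
    centroids = [list(v) for v in train_vectors[:n_clusters]]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_clusters)]
        for vec in train_vectors:
            groups[nearest(centroids, vec)].append(vec)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_lists(centroids, base_vectors):
    """Inverted lists: centroid id -> ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for doc_id, vec in enumerate(base_vectors):
        lists[nearest(centroids, vec)].append(doc_id)
    return lists


def search(centroids, lists, base_vectors, query, k, nprobe=1):
    """Scan only the nprobe closest lists and return the top-k (distance, id)."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(centroids[c], query))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    scored = sorted((sq_dist(base_vectors[i], query), i) for i in candidates)
    return scored[:k]


# Random stand-in data; the real example uses the SIFT vectors instead.
rng = random.Random(0)
dim, n_clusters = 8, 10
train = [[rng.random() for _ in range(dim)] for _ in range(200)]
base = [[rng.random() for _ in range(dim)] for _ in range(500)]
query = [rng.random() for _ in range(dim)]

centroids = train_centroids(train, n_clusters)
lists = build_lists(centroids, base)

approx = search(centroids, lists, base, query, k=5, nprobe=2)
exact = search(centroids, lists, base, query, k=5, nprobe=n_clusters)
brute = sorted((sq_dist(v, query), i) for i, v in enumerate(base))[:5]
print("approximate ids:", [i for _, i in approx])
print("exact ids:      ", [i for _, i in exact])
```

Probing every list gives the same answer as brute force, just with extra bookkeeping. Probing only a few lists is faster but can miss a true neighbor that landed in a list we skipped, and that trade-off is what the recall number at the end of the run measures.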

Show me the results!

Wow, that was a lot of stuff doing stuff. Let’s look at the results again and see what they mean.

If you scroll up in your terminal, you’ll see many query ids, 100 to be exact. Those are the 100 vectors in our query subset. We go through all of them, and for each one you’ll see k results.

The default k is 5, and you can change that in app.py if you want. I set it to 50 for this example.

So in this example, you’ll see 100 queries, with 50 results per query. It will show you the DocId and its Ranking Score.

Test our results

At the bottom you’ll also see the recall@k (recall@50 in my case). This helps us understand how good our predictions were by comparing them to the ground truth we already have.

What is happening here is that among the top-k results, so the top 50 in my case, it checks whether the true nearest neighbor is there. In other words, it counts how often the true nearest neighbor from the ground truth is returned in my top 50 results, or in the top k you specified.
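Here is a small sketch of that check, assuming the definition above: a query counts as a hit if its true nearest neighbor shows up anywhere in its top-k results, and recall is the fraction of queries that hit. The function name and the ids are made up, and this is not the code app.py uses to compute the number.

```python
def recall_at_k(results, true_nearest, k):
    """Fraction of queries whose true nearest neighbor appears in their top-k.

    results: one ranked list of doc ids per query, best match first.
    true_nearest: the ground-truth nearest doc id for each query.
    """
    if len(results) != len(true_nearest):
        raise ValueError("need one ground-truth id per query")
    hits = sum(1 for ranked, truth in zip(results, true_nearest) if truth in ranked[:k])
    return hits / len(results)


# Made-up ids for three queries; not output from the example.
results = [[7, 2, 9], [4, 1, 3], [8, 5, 6]]
true_nearest = [2, 3, 0]

print(recall_at_k(results, true_nearest, k=1))
print(recall_at_k(results, true_nearest, k=3))
```

A bigger k can only keep the recall the same or push it up, since a larger window has more chances to contain the true neighbor. That is one reason a recall at 50 looks nicer than a recall at 5.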

And that’s it! That was a lot of doing stuff and learning stuff, but we managed to build our vector search engine!

So that’s it for today. I hope you enjoyed this. Go check out all the other cool examples on the Jina GitHub.