Building a Search System with Jina and Faiss
So, today let’s create a search system using Jina and Facebook AI Similarity Search, or Faiss for us pals. I’m using the example that is already in the Jina documentation and going a bit deeper into some of the parts that I had trouble getting right.
Prepare the data
Ok, so now you have the code example and the Docker image. You might think this is enough data preparation, but not really, because machine learning REALLY loves preparing data. So now let’s get the dataset we’ll use.
For this example we will use ANN_SIFT10K as our dataset. This one is already divided into 3 parts, so we don’t need to worry about doing that ourselves:
- Vectors to index
- Vectors to train
- Vectors to query
We will create a temporary folder and store the dataset there; you can do all of that with a couple of shell commands.
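The exact commands aren’t reproduced here, but fetching ANN_SIFT10K (the “siftsmall” subset of the TEXMEX corpus) boils down to downloading and unpacking one tarball. Here’s a minimal Python sketch of that step; the URL and the helper name are my assumptions, not part of the Jina example, so double-check them before relying on this:

```python
import tarfile
import tempfile
import urllib.request

# Assumed location of the ANN_SIFT10K ("siftsmall") tarball from the
# TEXMEX corpus; verify this URL is still live before using it.
SIFTSMALL_URL = "ftp://ftp.irisa.fr/local/texmex/corpus/siftsmall.tar.gz"

def fetch_siftsmall(url: str = SIFTSMALL_URL) -> str:
    """Download the dataset tarball and unpack it into a temp folder."""
    workdir = tempfile.mkdtemp(prefix="siftsmall_")
    archive = f"{workdir}/siftsmall.tar.gz"
    urllib.request.urlretrieve(url, archive)    # download the tarball
    with tarfile.open(archive, "r:gz") as tar:  # unpack the .fvecs files
        tar.extractall(workdir)
    return workdir
```

Calling `fetch_siftsmall()` would leave the index, train, and query `.fvecs` files sitting in the returned temporary folder.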
Then we will create a workspace directory, where we will take the dataset we just fetched, convert it into binary format, and use it as training data.
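The “convert it into binary format” part is really about the `.fvecs` file format the dataset ships in: each record is a little-endian int32 dimension followed by that many float32 values. Here’s a small numpy sketch of a reader (my own helper, not part of the Jina example):

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file into an (n, d) float32 numpy array.

    Each record is: int32 dimension d, followed by d float32 values.
    """
    raw = np.fromfile(path, dtype=np.int32)
    d = raw[0]                            # dimension from the first record
    recs = raw.reshape(-1, d + 1)[:, 1:]  # drop the leading dim column
    return recs.view(np.float32)          # reinterpret the bytes as floats
```

With that, the query and training subsets load straight into numpy arrays you can feed to the rest of the pipeline.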
The cool part after the boring part
So we have a lot of vectors now… yay? Now what? The first thing we’ll do is index the data:
python app.py -t index -n $batch_size
You can use whatever you want as the batch size; I used 30 as a reminder that 30s are the new 20s.
So because we passed the flag “index”, the first thing app.py will do is call the indexing part of the example, which runs the index Flow.
Jina uses Flows as a cool way to abstract high-level tasks, such as indexing in this case, and querying later in the example. There’s really nice documentation on how this works here. But even if you haven’t seen it and you’ll “check it out later”, you can still get the gist of the example. Just see a Flow as a high-level task for the moment, so here, an indexing task.
In flow-index.yml we use 3 Pods: a crafter, an encoder, and an indexer.
Don’t worry about the crafter and the encoder for now; let’s check what the indexer does.
indexer.yml takes care of setting the name of the index file (faiss_index.tgz) and the filepath (./workspace), and prepares it as a NumpyIndexer.
Okaaaaaay, that was a lot of stuff, but now we have our index ready.
So now that we have our index ready, it’s time to search for something. Since we used ANN_SIFT10k, which is already divided, we can use the query subset we have ready, so there’s no need to do anything else besides running the command:
python app.py -t query
Uhm, yeah, ok? And what is happening here? What are those results? What is that recall? What’s the meaning of life? you may ask. I’m so glad you asked, thank you.
So what is happening here is that, just as last time we used flow-index.yml, this time we will use the Flow flow-query.yml, and here is finally where we use the Faiss library.
This time we have 4 Pods: the crafter, the encoder, the indexer, and the ranker.
Let’s keep ignoring the crafter, the encoder, and the ranker for now, ¯\_(ツ)_/¯, and let’s see what the indexer is doing, since it’s the interesting part.
Here we see that we are using the Docker image we got from the GitHub link, as well as the query-indexer.yml. If we check the query-indexer, we can see it has a FaissIndexer. WHAT A COINCIDENCE, this is exactly what we’re talking about.
Now, the FaissIndexer has 3 parameters:
index_key: This determines the type of vector index we use. In this case we are using IVF10, which creates an inverted file index with 10 clusters. You can change this and use more complex inverted indices that can lead to better results. We used IVF10 in this example; you can see its details in the Faiss index factory documentation and tweak it if you want to.
train_filepath: This is the path where we have our training data.
ref_indexer: This shows that it uses a NumpyIndexer, since that is the format we used while creating the index in the previous step. And finally, we set the name of the file where the index is stored.
Show me the results!
Wow, that was a lot of stuff doing stuff. Let’s look at the results again and see what they mean.
So if you scroll in your terminal, you’ll see many query ids, 100 to be exact. Those are the 100 vectors we had in our query subset, so we go through all of them, and for each one you’ll see the top-k results. By default k is 5; you can change that in app.py if you want, and I set it to 50 for this example. So you’ll see 100 queries, with 50 results per query, and for each result it shows the DocId and its ranking score.
Test our results
At the bottom you’ll also see the recall@k score, which helps us understand how good our prediction was by comparing it to the ground truth we already have.
What is happening here is that, among the top-k results, so among the top-50 results in my case, it checks whether the true nearest neighbor is actually there. Or in other words, it checks how often the true nearest neighbor we have from the ground truth is returned in my top-50 results, or the top-k you specified.
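That check is simple enough to compute by hand with numpy. A sketch, where gt holds the ground-truth nearest-neighbour id for each query and topk holds the k returned ids per query (both names are mine, not from the example):

```python
import numpy as np

def recall_at_k(gt, topk):
    """Fraction of queries whose true nearest-neighbour id (gt[i])
    appears anywhere among the k returned ids (topk[i])."""
    gt = np.asarray(gt)
    topk = np.asarray(topk)
    hits = (topk == gt[:, None]).any(axis=1)  # per-query hit/miss
    return hits.mean()

# toy example: 3 queries, k = 2; queries 0 and 2 find their true neighbour
print(recall_at_k([7, 1, 4], [[7, 2], [0, 3], [9, 4]]))  # 2 of 3 → 0.666...
```

A recall of 1.0 would mean every query’s true nearest neighbour showed up somewhere in its top-k list.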
And that’s it! That was a lot of doing stuff and learning stuff, but we managed to build our vector search engine!
So that’s it for today. I hope you enjoyed this; go check out all the other cool examples in the Jina GitHub repo.