Adapted from Wikimedia Commons
If you read my previous article on Towards Data Science you’ll know I’m a bit of a Star Trek nerd. There’s only one thing I like more than Star Trek, and that’s building cool new stuff with AI. So I thought I’d combine the two yet again!
In this tutorial we’re going to build our own search engine to search all the lines from Star Trek: The Next Generation. We’ll be using Jina, a neural search framework which uses deep learning to power our NLP search, though we could easily use it for image, audio or video search if we wanted to.
We’ll cover:
If you’re new to AI or search, don’t worry. As long as you have some knowledge of Python and the command line you’ll be fine. If it helps, think of yourself as Lieutenant Commander Data Science.
Before going through the trouble of downloading, configuring and testing your search engine, let’s get an idea of the finished product. In this case, it’s exactly the same as what we’re building, but with lines from South Park instead of Star Trek:
Via Jinabox
Jina has a pre-built Docker image with indexed data from South Park. You can run it with:
docker run -p 45678:45678 jinaai/hub.app.distilbert-southpark
After getting Docker up and running, you can start searching for those South Park lines.
Jinabox is a simple web-based front-end for neural search. You can see it in the graphic at the top of this tutorial.
[http://localhost:45678/api/search](http://localhost:45678/api/search)
Note: If it times out the first time, that’s because the query system is still warming up. Try again in a few seconds!
curl
Alternatively, you can open your shell and check the results via the RESTful API. The matched results are stored in topkResults.
curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:hey, dude"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
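If you prefer Python to curl, the same request can be put together with the standard library. This is only a sketch: the function name build_search_request is made up for illustration, and the payload simply mirrors the fields used in the curl command (top_k, mode, data).

```python
import json
import urllib.request


def build_search_request(query: str, top_k: int = 10, port: int = 45678) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl command."""
    # Same JSON body as the curl example; the query text is prefixed with "text:".
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url=f"http://0.0.0.0:{port}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("hey, dude")

# To actually send it, the Docker container must be running:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything goes over the wire.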
You’ll see the results output in JSON format. Each result looks like:
Now go back to your terminal running Docker and hit Ctrl-C (or Command-C on Mac) a few times to ensure you’ve stopped everything.
Now that you know what we’re building, let’s get started!
You will need:
pip
Let’s get the basic files we need to get moving:
git clone git@github.com:alexcg1/my-first-jina-app.git
cd my-first-jina-app
pip install -U cookiecutter
cookiecutter gh:jina-ai/cookiecutter-jina
We use cookiecutter to spin up a basic Jina app and save you having to do a lot of typing and setup.
For our Star Trek example, use the following settings:
project_name: Star Trek
project_slug: star_trek (default value)
task_type: nlp
index_type: strings
public_port: 65481 (default value)

Just use the defaults for all other fields. After cookiecutter has finished, let’s have a look at the files it created:
cd star_trek
ls
You should see a bunch of files:
app.py - The main Python script where you initialize and pass data into your Flow
Dockerfile - Lets you spin up a Docker instance running your app
flows/ - Folder to hold your Flows
pods/ - Folder to hold your Pods
README.md - An auto-generated README file
requirements.txt - A list of required Python packages

In the flows/ folder we can see index.yml and query.yml - these define the indexing and querying Flows for your app.
In pods/ we see chunk.yml, craft.yml, doc.yml, and encode.yml - these Pods are called from the Flows to process data for indexing or querying.
More on Flows and Pods later!
In your terminal run this command to download and install all the required Python packages:
pip install -r requirements.txt
Our goal is to find out who said what in Star Trek episodes when a user queries a phrase. The Star Trek dataset from Kaggle contains all the scripts and individual character lines from Star Trek: The Original Series all the way through Star Trek: Enterprise.
We’re just using a subset in this example, containing the characters and lines from Star Trek: The Next Generation. This has also been converted from JSON to CSV format, which is more suitable for Jina to process.
Now let’s ensure we’re back in our base folder and download the dataset by running:
Once that’s finished downloading, let’s get back into the star_trek directory and make sure our dataset has everything we want:
cd star_trek
head data/startrek_tng.csv
You should see output consisting of characters (like MCCOY), a separator (!), and the lines spoken by the character (What about my age?):
BAILIFF!The prisoners will all stand.
BAILIFF!All present, stand and make respectful attention to honouredJudge.
BAILIFF!Before this gracious court now appear these prisoners to answer for the multiple and grievous savageries of their species. How plead you, criminal?
BAILIFF!Criminals keep silence!
BAILIFF!You will answer the charges, criminals.
BAILIFF!Criminal, you will read the charges to the court.
BAILIFF!All present, respectfully stand. Q
BAILIFF!This honourable court is adjourned. Stand respectfully. Q
MCCOY!Hold it right there, boy.
MCCOY!What about my age?
Note: Your character lines may be a little different. That’s okay!
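To make the file format concrete, here is a minimal sketch of how one of these rows could be split into a character and a spoken line. The helper parse_line is hypothetical and is not part of the app; it only assumes the CHARACTER!line layout shown in the output of head.

```python
from typing import Tuple


def parse_line(raw: str, separator: str = "!") -> Tuple[str, str]:
    """Split one dataset row into (character, spoken line)."""
    # partition() splits on the FIRST separator only, so exclamation marks
    # inside the spoken line itself are left untouched.
    character, _, line = raw.partition(separator)
    return character.strip(), line.strip()


parsed = parse_line("MCCOY!What about my age?")
shouted = parse_line("BAILIFF!Criminals keep silence!")
```

Splitting on the first separator only matters here, because plenty of Star Trek lines end with an exclamation mark of their own.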
Now we need to pass startrek_tng.csv into app.py so we can index it. app.py is a little too simple out of the box, so let’s make some changes:
Open app.py in your editor and check the index function. We currently have:
As you can see, this indexes just 3 strings. Let’s load up our Star Trek file instead with the filepath parameter. Just replace the last line of the function:
While we’re here, let’s reduce the number of documents we’re indexing, just to speed things up while we’re testing. We don’t want to spend ages indexing only to have issues later on!
In the section above the config function, let’s change:
to:
That should speed up our testing by a factor of 100! Once we’ve verified everything works we can set it back to 50000 to index more of our dataset.
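The idea of capping the document count can be sketched in plain Python. This is not the code from app.py: read_lines and max_docs are hypothetical names, and the value 500 is only inferred from the text (50000 divided by the stated factor of 100).

```python
import itertools
from typing import Iterator


def read_lines(path: str, max_docs: int) -> Iterator[str]:
    """Yield at most max_docs lines from a text file, without loading it all."""
    with open(path, encoding="utf-8") as handle:
        # islice stops reading as soon as the cap is reached.
        for line in itertools.islice(handle, max_docs):
            yield line.rstrip("\n")


# Hypothetical usage while testing:
# for line in read_lines("data/startrek_tng.csv", max_docs=500):
#     print(line)
```

Because the file is read lazily, a small cap keeps test runs short no matter how large the full dataset is.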
Now that we’ve got the code to load our data, we’re going to dive into writing our app and running our Flows! Flows are the different tasks our app performs, like indexing or searching the data.
First up we need to build up an index of our file. We’ll search through this index when we use the query Flow later.
python app.py index
Your app will show a lot of output in the terminal, but you’ll know it’s finished when you see the line:
[email protected][S]:flow is closed and all resources should be released already, current build level is 0
This may take a little while the first time, since Jina needs to download the language model and tokenizer to process the data. You can think of these as the brains behind the neural network that powers the search.
To start search mode run:
python app.py search
After a while you should see the terminal stop scrolling and display output like:
[email protected][S]:flow is started at 0.0.0.0:65481, you can now use client to send request!
⚠️ Be sure to note down the port number. We’ll need it for curl and Jinabox! In our case we’ll assume it’s 65481, and we use that in the examples below. If your port number is different, be sure to use that instead.
ℹ️ python app.py search doesn’t pop up a search interface - for that you’ll need to connect via curl, Jinabox, or another client.
Via Jinabox
[http://localhost:65481/api/search](http://localhost:65481/api/search)
curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:picard to riker"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:65481/api/search'
curl will spit out a lot of information in JSON format - not just the lines you’re searching for, but all sorts of metadata about the search and the lines it returns. Look for the lines starting with "matchDoc" to find the matches, like:
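If you would rather not scan the JSON by eye, a short script can collect every "matchDoc" entry for you. The sketch below deliberately makes no assumption about how the response is nested: it walks the whole structure and gathers whatever sits under a "matchDoc" key. The sample input is a made-up placeholder, not real output from the app.

```python
import json
from typing import Any, List


def find_match_docs(node: Any) -> List[Any]:
    """Recursively collect every value stored under a "matchDoc" key."""
    found: List[Any] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "matchDoc":
                found.append(value)
            # Keep descending in case matches are nested further down.
            found.extend(find_match_docs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_match_docs(item))
    return found


# Placeholder structure for illustration only; a real response has more fields.
sample = json.loads('{"outer": [{"matchDoc": {"text": "placeholder line"}}]}')
matches = find_match_docs(sample)
```

To use it on real output, save the curl response to a file and pass the result of json.load to find_match_docs.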
Congratulations! You’ve just built your very own search engine!
For a more general overview of what neural search is and how it works, check out one of my previous articles. Jina itself is just one way to build a neural search engine, and it has a couple of important concepts, Flows and Pods:
distilbert-base-cased. (Which we can see in pods/encode.yml)
Via Jina 101
Just as a plant manages nutrient flow and growth rate for its branches, a Flow manages the states and context of a group of Pods, orchestrating them to accomplish one task. Whether a Pod is remote or running in Docker, one Flow rules them all!
We define Flows in app.py to index and query the content in our Star Trek dataset.
In this case our Flows are written in YAML format and loaded into app.py with:
It really is that simple! Alternatively you can build Flows in app.py itself without specifying them in YAML.
No matter whether you’re dealing with text, graphics, sound, or video, all datasets need to be indexed and queried, and the steps for doing each (chunking, vector encoding) are more or less the same (even if how you perform each step is different — that’s where Pods come in!)
Every Flow has, well, a flow to it. Different Pods pass data along the Flow, with one Pod’s output becoming another Pod’s input. Look at our indexing Flow as an example:
Via Jina Dashboard
If you look at startrek_tng.csv you’ll see it’s just one big text file. Our Flow processes it into something more suitable for Jina, which is handled by the Pods in the Flow. Each Pod performs a different task.
You can see the following Pods in flows/index.yml:
crafter - Split the Document into Chunks
encoder - Encode each Chunk into a vector
chunk_idx - Build an index of Chunks
doc_idx - Store the Document content
join_all - Join the chunk_idx and doc_idx pathways

The full file is essentially just a list of Pods with parameters and some setup at the top of the file:
Luckily, YAML is pretty human-readable. I regularly thank the Great Bird of the Galaxy it’s not in Klingon, or even worse, XML!
So, is that all of the Pods? Not quite! We always have another Pod working in silence: the gateway pod. Most of the time we can safely ignore it because it basically does all the dirty orchestration work for the Flow.
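To get a feel for how data moves through the indexing Flow, here is a toy version in plain Python. It is not Jina’s API and involves no neural network: the function names just borrow the Pod names from flows/index.yml, the sentence splitting is naive, and the "encoder" returns a made-up two-number vector purely to show one step’s output feeding the next.

```python
from typing import Dict, List, Tuple


def crafter(document: str) -> List[str]:
    """Split a Document into Chunks (here: naive splitting on full stops)."""
    return [part.strip() for part in document.split(".") if part.strip()]


def encoder(chunk: str) -> List[float]:
    """Stand-in for a neural encoder: [character count, word count]."""
    return [float(len(chunk)), float(len(chunk.split()))]


def index_document(
    doc_id: int,
    document: str,
    chunk_idx: Dict[Tuple[int, int], List[float]],
    doc_idx: Dict[int, str],
) -> None:
    """Run one Document through the toy pipeline and fill both indexes."""
    for chunk_id, chunk in enumerate(crafter(document)):
        # The crafter's output becomes the encoder's input.
        chunk_idx[(doc_id, chunk_id)] = encoder(chunk)
    # The original content is stored separately, keyed by Document id.
    doc_idx[doc_id] = document


chunk_idx: Dict[Tuple[int, int], List[float]] = {}
doc_idx: Dict[int, str] = {}
index_document(0, "Make it so. Engage.", chunk_idx, doc_idx)
```

In the real Flow each of these steps is a separate Pod configured in YAML, and the gateway takes care of passing messages between them.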
Via Jina Dashboard
In the query Flow we’ve got the following Pods:
chunk_seg - Segments the user query into meaningful Chunks
tf_encode - Encode each word of the query into a vector
chunk_idx - Build an index for the Chunks for fast lookup
ranker - Sort results list
doc_idx - Store the Document content

Again, flows/query.yml gives some setup options and lists the Pods in order of use:
When we were indexing we broke the Document into Chunks to index it. For querying we do the same, but this time the Document is the query the user types in, not the Star Trek dataset. We’ll use many of the same Pods, but there are a few differences to bear in mind. In the big picture:
And digging into flows/query.yml, we can see it has an extra Pod and some more parameters compared to flows/index.yml:
rest_api: true - Use Jina’s REST API, allowing clients like jinabox and curl to connect
port_expose: $JINA_PORT - The port for connecting to Jina’s API
polling: all - Setting polling to all ensures all workers poll the message
reducing_yaml_path: _merge_topk_chunks - Use _merge_topk_chunks to reduce results from all replicas
ranker: - Rank results by relevance

How does Jina know whether it should be indexing or searching? In our RESTful API we set the mode field in the JSON body and send the request to the corresponding API:

api/index - {"mode": "index"}
api/search - {"mode": "search"}
Via Jina 101
As we discussed above, a Flow tells Jina what task to perform and is made up of Pods. A Pod tells Jina how to perform that task (i.e. what the right tool for the job is). Both Pods and Flows are written in YAML.
Let’s start by looking at a Pod in our indexing Flow, flows/index.yml. Instead of the first Pod crafter, let’s look at encoder which is a bit simpler:
As we can see in the code above, the encoder Pod’s YAML file is stored in pods/encode.yml, and looks like:
The Pod uses the built-in TransformerTorchEncoder as its Executor. Each Pod has a different Executor based on its task, and an Executor represents an algorithm, in this case encoding. The Executor differs based on what’s being encoded. For video or audio you’d use a different one. The with field specifies the parameters passed to TransformerTorchEncoder.
pooling_strategy - Strategy to merge word embeddings into chunk embedding
model_name - Name of the model we’re using
max_length - Maximum length to truncate tokenized sequences to

When the Pod runs, data is passed in from the previous Pod, TransformerTorchEncoder encodes the data, and the Pod passes the data to the next Pod in the Flow.
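Two of those parameters are easier to picture with a small example. The sketch below shows truncation to max_length and one possible pooling strategy, a simple average of the word vectors. It is plain Python with made-up numbers, not the code inside TransformerTorchEncoder, and averaging is only an assumption about what a pooling strategy might do; check pods/encode.yml for the strategy your app actually uses.

```python
from typing import List


def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep at most max_length tokens, dropping the rest."""
    return tokens[:max_length]


def mean_pool(word_vectors: List[List[float]]) -> List[float]:
    """Merge per-word vectors into one chunk vector by averaging each dimension."""
    if not word_vectors:
        raise ValueError("need at least one word vector to pool")
    dimensions = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vector[i] for vector in word_vectors) / count for i in range(dimensions)]


tokens = truncate(["make", "it", "so", "number", "one"], max_length=3)
chunk_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Whatever the strategy, the point is the same: many word-level vectors go in and a single fixed-size vector per Chunk comes out, which is what the indexer stores.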
For a deeper dive on Pods, Flows, Executors and everything else, you can refer to Jina 101.
Be sure to run pip install -r requirements.txt before beginning, and ensure you have lots of RAM/swap and space in your tmp partition (see below issues). This may take a while since there are a lot of prerequisites to install.
If this error keeps popping up, look into the errors that were output onto the terminal to try to find which module is missing, and then run:
pip install <module_name>
Machine learning requires a lot of resources, and if your machine hangs this is often due to running out of memory. To fix this, try creating a swap file if you use Linux. This isn’t such an issue on macOS, since it allocates swap automatically.
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
This is often due to your /tmp partition running out of space so you’ll need to increase its size.
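Before resizing anything, it can help to confirm how much room the temporary directory really has. This is a small sketch using only the standard library; free_gib is a hypothetical helper name, and the temporary directory Python reports may differ from the one pip uses on your system.

```python
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space on the filesystem holding path, in GiB."""
    return shutil.disk_usage(path).free / (1024 ** 3)


tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
```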
command not found
For this error you’ll need to install the relevant software package onto your system. In Ubuntu this can be done with:
sudo apt-get install <package_name>
In this tutorial you’ve learned:
curl and Jinabox
Now that you have a broad understanding of how things work, you can try out some more of our example tutorials to build image or video search, or stay tuned for our next set of tutorials that build upon your Star Trek app.
Got an idea for a tutorial covering Star Trek and/or neural search? My commbadge is out of order right now, but you can leave a comment or note on this article for me to assimilate!
Alex C-G is the Open Source Evangelist at Jina AI, and a massive Star Trek geek.