May the Source Be With You: Understanding Jina through Star Wars

Introduction

You’ve read Jina 101 and got an overview of all the Jina components. Let’s take a look at how they work together in a simple Jina workflow to search text. Actually, let’s go one better, and throw in some Star Wars. Because if there’s ever been a year where we need a New Hope, it’s 2020…

Bear in mind there are lots of ways to wire up a Jina workflow; we’re just looking at a relatively simple use case for now, and an oversimplification at that. In future posts we’ll look at searching different forms of media, see how we can scale up speed and reliability, run Jina remotely, etc.

Prerequisites

Some useful reading before you begin:

Lies! Evil Wicked Lies!

In this post we’re doing what Terry Pratchett calls “lies to children”, meaning we’re skipping a lot of implementation details and simplifying things to make it easier to get a high-level overview without getting bogged down in messy details. We’ll dig into some of these details as we go along, but if you yearn to learn about Jina’s daemon or other low-level things, you should probably check our docs.

We won’t include commands to run the Flows because this is just an over-simplified explanation. The fictional example here lacks a few bits and bobs needed to run in the real world, and aims only to give you an idea of how things are hooked together.

Search as a Black Box

Let’s start by looking at a simple “black box” overview of a search. It has several top-level “components”:

Component | Star Wars example
Data you want to search | All the text of the Star Wars movie scripts
Search query | A single line like Give me a hand Mr Space Wizard
Jina core | The magical box that finds the closest match to your query in all the data
Ranked results | A list of closest matches (for example, Help me Obi-Wan Kenobi)

In this example we’re using CSV files (all the Star Wars scripts) as our dataset, but if we wanted we could also use Jina to search through the movies themselves for soundbites or scenes, based on audio, video, and subtitle data.

First up we’ll focus on indexing, so don’t worry about the querying bit for now. After all, you have to process the plans for the Death Star before you can search out its vulnerability, right?

Indexing

Flows

The first step is to build a Flow and feed in our CSV files. As you’ve read from Jina 101, the Flow is the high-level abstraction Jina uses for any search. Data is fed in, processed, and results spat out the other end. These Flows act as pipelines for indexing and querying:

Note: Since we’re only interested in indexing for now, we’ve left out the results. They’ll come when querying.
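To make the pipeline idea a bit more concrete, here is a minimal plain-Python sketch. It is not Jina's API: run_pipeline and the two toy steps are hypothetical names made up for illustration, and a real Flow passes messages between Pods rather than calling functions directly.

```python
from functools import reduce
from typing import Callable, Iterable, List

# Hypothetical stand-in for a Flow: an ordered list of processing steps
Step = Callable[[List[str]], List[str]]


def run_pipeline(steps: Iterable[Step], data: List[str]) -> List[str]:
    """Feed data through each step in order, like Pods chained in a Flow."""
    return reduce(lambda acc, step: step(acc), steps, data)


def strip_blank_lines(lines: List[str]) -> List[str]:
    # Toy step: drop empty lines
    return [line for line in lines if line.strip()]


def lowercase(lines: List[str]) -> List[str]:
    # Toy step: normalise case
    return [line.lower() for line in lines]


processed = run_pipeline(
    [strip_blank_lines, lowercase],
    ["Help me Obi-Wan Kenobi", "", "You're my only hope"],
)
print(processed)
```

The point is only the shape: data goes in one end, each step hands its output to the next, and something comes out the other end.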

You can build a Flow in Python, YAML, or Jina Dashboard. In our case we’ll use YAML (an elegant weapon for a more civilized age) to create an empty Flow:

!Flow
  # We'll fill this in soon, don't worry!

Let’s save that as flows/index.yml. Don’t worry if it looks as useful as Jar Jar Binks. We’ll flesh it out in the next section.

While we’re at it, let’s add our (currently useless) Flow to app.py and set it to index:

from jina.flow import Flow

def input_fn():
  # Placeholder: a real app would yield the lines of each script file here
  pass

f = Flow().load_config("flows/index.yml")

with f:
  f.index(filepath="scripts/*") # Note this isn't the exact method you'd use. It's a bit simplified

Pods

Each task in the Flow is performed by a different Pod which can be written in YAML or Python. Our Star Wars example indexing Flow looks like:

Earlier we created a Flow using flows/index.yml. Let’s expand that file with a list of Pods:

!Flow
version: '1'
pods:
  - name: crafter
    uses: pods/craft.yml
  - name: encoder
    uses: pods/encode.yml
  - name: chunk_indexer
    uses: pods/chunk_index.yml
  - name: doc_indexer
    uses: pods/doc_index.yml

Pods are configured in their own YAML files (like pods/chunk_index.yml) and may behave differently depending on which Flow they’re in:

Pod | Indexing Flow | Querying Flow
crafter | Break dataset into sub-documents | N/A (user query is already in the same format as the output, i.e. a sentence)
encoder | Encode sub-documents into vectors | Encode user query into vectors
chunk_indexer | Create vector and sub-document indexes | Search vector and sub-document indexes for matches
doc_indexer | Create index of Documents and sub-documents | Search Doc indexes for matching sub-documents
ranker | N/A | Rank results by relevance
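If it helps to see the table above as data, here is a small sketch in plain Python. The POD_ROLES dictionary and the pods_in helper are hypothetical names for illustration only; Jina itself does not configure Pods this way.

```python
from typing import Dict, List, Optional

# Role of each Pod per Flow, copied from the table above (None means "not used")
POD_ROLES: Dict[str, Dict[str, Optional[str]]] = {
    "crafter": {"index": "Break dataset into sub-documents", "query": None},
    "encoder": {
        "index": "Encode sub-documents into vectors",
        "query": "Encode user query into vectors",
    },
    "chunk_indexer": {
        "index": "Create vector and sub-document indexes",
        "query": "Search vector and sub-document indexes for matches",
    },
    "doc_indexer": {
        "index": "Create index of Documents and sub-documents",
        "query": "Search Doc indexes for matching sub-documents",
    },
    "ranker": {"index": None, "query": "Rank results by relevance"},
}


def pods_in(flow: str) -> List[str]:
    """Return the Pods that have a role in the given Flow, in table order."""
    return [name for name, roles in POD_ROLES.items() if roles[flow] is not None]


print(pods_in("index"))
print(pods_in("query"))
```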

We’ll dig more into Pod YAML files in the Executors section below.

What Does Each Pod Do?

Crafter

A Crafter takes our CSV file as input and breaks it down into sub-documents:

To keep things simple, in our example the Crafter is just splitting each line from the input CSVs into individual sentences. But if we wanted to do a more advanced text search, we could split the document (script) into sub-documents at any level of granularity (like line, sentence, or word). Each sub-document comes with its own metadata (like doc_id, parent_id, granularity, etc.).
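As a rough illustration of what that splitting could look like, here is a sketch using only Python's standard library. It is not Jina's Crafter: the craft function, the speaker and line CSV column names, and the naive sentence-splitting regex are all assumptions made for this example.

```python
import csv
import io
import re
from typing import Dict, List


def craft(csv_text: str, parent_id: int) -> List[Dict]:
    """Split the 'line' column of a script CSV into sentence-level sub-documents."""
    sub_docs: List[Dict] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Naive sentence split: break after ., ! or ? when followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", row["line"].strip()):
            if sentence:
                sub_docs.append({
                    "doc_id": len(sub_docs),   # simple running counter
                    "parent_id": parent_id,    # which script this came from
                    "granularity": 1,          # 1 = sentence level in this sketch
                    "text": sentence,
                    "speaker": row["speaker"],
                })
    return sub_docs


sample = 'speaker,line\nLEIA,"Help me Obi-Wan Kenobi. You\'re my only hope."\n'
sub_docs = craft(sample, parent_id=4)
for doc in sub_docs:
    print(doc)
```

Each dictionary carries the sentence text plus the metadata mentioned above, so later steps can trace a match back to its parent script.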

Encoder

An encoder takes a sub-document and converts it to a vector embedding:

Closely related sub-documents have similar embeddings. For example, jedi and space_wizard would have similar embeddings, while death_star and princess would be very different from each other.
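One common way to measure how "similar" two embeddings are is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration (a real encoder produces hundreds of dimensions, and these numbers are not output from any model).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only
embeddings = {
    "jedi": [0.9, 0.8, 0.1],
    "space_wizard": [0.8, 0.9, 0.2],
    "death_star": [-0.7, 0.1, 0.9],
    "princess": [0.2, -0.9, 0.3],
}

close = cosine_similarity(embeddings["jedi"], embeddings["space_wizard"])
far = cosine_similarity(embeddings["death_star"], embeddings["princess"])
print(f"jedi vs space_wizard: {close:.2f}")
print(f"death_star vs princess: {far:.2f}")
```

With these toy numbers the first pair scores close to 1 and the second close to 0, which is the behaviour we'd hope a real encoder gives us.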

Chunk Indexer

The Chunk indexer takes all of the sub-documents and their embeddings as input and creates two indexes as output:

  • Vector index: Stores doc_id and vector embedding of each sub-document
  • Key-value index: Stores doc_id, parent_id and any metadata associated with each sub-document
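Here is a minimal sketch of those two indexes as plain Python dictionaries. It is not how Jina stores things internally: build_chunk_indexes is a hypothetical helper, and the sample sub-documents and vectors are illustrative values.

```python
from typing import Dict, List, Tuple


def build_chunk_indexes(
    sub_docs: List[Dict], vectors: List[List[float]]
) -> Tuple[Dict[int, List[float]], Dict[int, Dict]]:
    """Build a vector index and a key-value index, both keyed by doc_id."""
    vector_index: Dict[int, List[float]] = {}
    key_value_index: Dict[int, Dict] = {}
    for doc, vector in zip(sub_docs, vectors):
        doc_id = doc["doc_id"]
        # Vector index: doc_id -> embedding
        vector_index[doc_id] = vector
        # Key-value index: doc_id -> parent_id plus any other metadata
        key_value_index[doc_id] = {k: v for k, v in doc.items() if k != "doc_id"}
    return vector_index, key_value_index


sub_docs = [
    {"doc_id": 4501, "parent_id": 4, "text": "Help me Obi-Wan Kenobi", "speaker": "LEIA"},
    {"doc_id": 4702, "parent_id": 4,
     "text": "Yes, I was once a Jedi Knight the same as your father", "speaker": "OBIWAN"},
]
vectors = [[1.2, 0.6, -3.1], [-0.3, 1.5, 0.7]]  # illustrative values only

vector_index, key_value_index = build_chunk_indexes(sub_docs, vectors)
print(vector_index)
print(key_value_index)
```

Keeping the vectors separate from the metadata means a search can compare vectors first and only look up the metadata for the few closest matches.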

Document Indexer

The Document indexer creates an index of Documents and their associated doc_ids for each level of granularity specified:

And that’s the last step of the Flow for indexing! But wait, there’s more to learn before we jump into querying!

Executors

Now we’ve processed all that data - but we haven’t even dived into the deep learning behind Jina.

That’s where our Executors come in. Like droids, some of them are “smart” (like the executor in the encoder Pod, which uses artificial intelligence), while some are just more-or-less mindless automatons (like the indexers). Either way, they’re the ones who do all the real work. Everything else is just higher-level wrappers to make them easier to work with.

We’ll use the analogy of Luke flying his X-wing to blow up the Death Star:

X-wing | Pod (the container and “interface” that lets R2 fly)
R2D2 | Executor (the one who actually does all the work)
Luke | Useless bag of flesh who doesn’t do much except “trust the force”, pull the trigger and get the glory

Since each Pod has a different task, each Pod uses a different executor. After all, you wouldn’t ask C3PO to fly an X-wing, and you wouldn’t ask R2D2 to translate languages (or to wander around like a useless gold mannequin who gets lost in the desert). Let’s look at our pods/encode.yml Pod as an example:

!TransformerTorchEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: distilbert-base-cased
  max_length: 96
# More stuff here

We can see that it uses TransformerTorchEncoder to encode our sub-documents into vector embeddings using the distilbert-base-cased language model. For example:

Input | Output Vector
Help me Obi-Wan Kenobi | {1.2, 0.6, -3.1, ...}
You're my only hope | {-0.3, 1.5, 0.7, ...}

In this way, TransformerTorchEncoder is like C3PO. It knows how to translate things to other things, like text to vector embeddings (admittedly, not over 6 million forms of translation, but we only have 3PO’s word for that). However, it is probably equally useless at getting out of a desert.
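The Pod-wraps-Executor relationship from this section can be sketched in a few lines of plain Python. These are not Jina's classes: Pod, LengthEncoder, and their methods are hypothetical, and the toy "encoder" just counts characters and words where a real one would run a language model.

```python
from typing import List


class LengthEncoder:
    """Toy executor (R2D2): turns each text into a tiny 2-number 'vector'."""

    def process(self, texts: List[str]) -> List[List[float]]:
        # [character count, word count] stands in for a real embedding
        return [[float(len(t)), float(len(t.split()))] for t in texts]


class Pod:
    """Toy Pod (the X-wing): a named wrapper that hands work to its executor."""

    def __init__(self, name: str, executor) -> None:
        self.name = name
        self.executor = executor

    def handle(self, texts: List[str]) -> List[List[float]]:
        # The Pod does no processing itself; the executor does the real work
        return self.executor.process(texts)


encoder_pod = Pod("encoder", LengthEncoder())
vectors = encoder_pod.handle(["Help me Obi-Wan Kenobi", "You're my only hope"])
print(vectors)
```

Swapping in a different executor changes what the Pod does without changing how the rest of the Flow talks to it.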

Storage

Phew, that’s a lot of data! But where is it all stored? Jina uses a working directory, usually workspace, to store all of the indexes created by the Pods above.

Note that many of the Pods write nothing to storage. They merely pass their output to the next Pod in the Flow. It’s only when data needs to be available between Flows that data needs to be written to storage (in our case, the indexes created by chunk_indexer and doc_indexer).
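To show why only the indexes need to hit the disk, here is a small sketch of writing an index during one run and reading it back in another. The JSON format, the save_index and load_index helpers, and the file naming are assumptions for illustration; Jina's own on-disk format is different.

```python
import json
import tempfile
from pathlib import Path


def save_index(workspace: Path, name: str, index: dict) -> Path:
    """Write an index to <workspace>/<name>.json, creating the directory if needed."""
    workspace.mkdir(parents=True, exist_ok=True)
    path = workspace / f"{name}.json"
    path.write_text(json.dumps(index))
    return path


def load_index(workspace: Path, name: str) -> dict:
    """Read an index back, as a query Flow would after indexing has finished."""
    return json.loads((workspace / f"{name}.json").read_text())


# A temporary directory keeps this example from touching your real files
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    save_index(workspace, "vector_index", {"4501": [1.2, 0.6, -3.1]})
    restored = load_index(workspace, "vector_index")

print(restored)
```

Note that JSON turns dictionary keys into strings, which is one reason a real system would pick a more suitable storage format.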

Running the Indexing Flow

In the introduction we mentioned this was just meant as an overview and not a “real” Flow (though it could be built into a working example). In short, it won’t run as is. However, if you did want to run it, it’s as simple as:

python app.py index

Querying

At this point the index Flow is out of the picture, so we can safely ignore both the Flow and the original input documents:

The query Flow is very similar to the indexing Flow. It:

  • Takes a user query as input (Give me a hand Mr Space Wizard)
  • Processes it (get its vector)
  • Pulls the indexes from storage
  • Finds the closest vectors in the vector index (like the vector for Help me Obi-Wan Kenobi) and gets the doc_ids
  • Ranks the closest matching Documents
  • Returns the ranked list to the user via REST or gRPC

A human readable version of this ranked list might look like:

Rank | Line | Document ID | Parent Doc | Metadata
1 | Help me Obi-Wan Kenobi | 4501 | 04 | speaker: LEIA
2 | Yes, I was once a Jedi Knight the same as your father | 4702 | 04 | speaker: OBIWAN
3 | I’m looking for a Jedi Master | 5304 | 05 | speaker: LUKE

And so on…
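The query steps listed earlier in this section can be sketched as a brute-force nearest-neighbour search. This is a toy under stated assumptions, not Jina's implementation: the query function is hypothetical, the vectors are made up, and the query vector is written by hand where a real Flow would get it from the encoder.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query(query_vector: Sequence[float],
          vector_index: Dict[int, List[float]],
          key_value_index: Dict[int, Dict],
          top_k: int = 2) -> List[Dict]:
    """Score every indexed vector against the query and return the top_k matches."""
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in vector_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [
        {"rank": rank, "doc_id": doc_id, **key_value_index[doc_id]}
        for rank, (_score, doc_id) in enumerate(scored[:top_k], start=1)
    ]


# Made-up indexes, standing in for what the indexing Flow wrote to storage
vector_index = {4501: [0.9, 0.8, 0.1], 4702: [0.7, 0.9, 0.3], 9001: [-0.7, 0.1, 0.9]}
key_value_index = {
    4501: {"text": "Help me Obi-Wan Kenobi", "speaker": "LEIA"},
    4702: {"text": "Yes, I was once a Jedi Knight the same as your father", "speaker": "OBIWAN"},
    9001: {"text": "That's no moon", "speaker": "OBIWAN"},
}

# Pretend this is the embedding of "Give me a hand Mr Space Wizard"
results = query([0.9, 0.7, 0.1], vector_index, key_value_index)
for hit in results:
    print(hit)
```

Comparing the query against every vector is fine for a handful of lines, but it gets slow quickly as the index grows.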

Like indexing, you’d run the Flow with:

python app.py query
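For the curious, one way app.py could pick between the two Flows from the command line is sketched below with Python's argparse. The index and query functions here are hypothetical stubs that just return a message; a real app would load and run the matching Flow in each.

```python
import argparse
from typing import List, Optional


def index() -> str:
    # Stub: a real app would load flows/index.yml and feed in the scripts
    return "running the indexing Flow"


def query() -> str:
    # Stub: a real app would load the query Flow and wait for user queries
    return "running the query Flow"


def main(argv: Optional[List[str]] = None) -> str:
    """Parse the task name ('index' or 'query') and call the matching function."""
    parser = argparse.ArgumentParser(description="Star Wars search (sketch)")
    parser.add_argument("task", choices=["index", "query"])
    args = parser.parse_args(argv)
    tasks = {"index": index, "query": query}
    return tasks[args.task]()


# Equivalent to running: python app.py index
print(main(["index"]))
```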

Summary

In this post we’ve covered most of the components from Jina 101, what they do, and how they’re hooked together in the context of a simple text-based search.

Hopefully by now you should understand:

  • Jina searches require indexing and querying Flows
  • Flows can be built in YAML, Python, or Jina Dashboard
  • Flows are called from app.py
  • Documents go through a series of steps in each Flow, being passed through Pods which in turn use Executors to take input, process it, and pass it to the next Pod (and sometimes read from or write to storage)
  • There may be different Pods in each Flow (in our example, the query Flow lacks the crafter Pod, and the indexing Flow lacks the ranker)
  • Output is returned as a ranked list of matching Documents, based on similarity of embedding

As we mentioned at the start, the above is over-simplified and certainly not exhaustive. There’s always more to learn and deeper components to dive into.

Next Time

Of course, there’s always more to learn, just as there’s always yet another Death Star to take down. In future posts we’ll cover running Pods and Flows remotely, speeding up performance, improving reliability, searching different kinds of things, and lots more. Let us know which you’d like to learn about via Twitter or our Slack community!

Until then, may the Force be with you…