May the Source Be With You: Understanding Jina through Star Wars

Alex C-G

Introduction

You've read Jina 101 and got an overview of all the Jina components. Let's take a look at how they work together in a simple Jina workflow to search text. Actually, let's go one better and throw in some Star Wars. Because if there's ever been a year where we need a New Hope, it's 2020...

Bear in mind there are lots of ways to wire up a Jina workflow; we're just looking at a relatively simple use case for now, and an oversimplification at that. In future posts we'll look at searching different forms of media, see how we can scale up speed and reliability, run Jina remotely, etc.

Prerequisites

Some useful reading before you begin:

Lies! Evil Wicked Lies!

In this post we're doing what Terry Pratchett calls "lies to children", meaning we're skipping a lot of implementation details and simplifying things to make it easier to get a high-level overview without getting bogged down in messy details. We'll dig into some of these details as we go along, but if you yearn to learn about Jina's daemon or other low-level things, you should probably check our docs.

We won't include commands to run the Flows because this is just an over-simplified explanation. The fictional example here lacks a few bits and bobs needed to run in the real world, and aims only to give you an idea of how things are hooked together.

Search as a Black Box

Let's start by looking at a simple "black box" overview of a search. It has several top-level "components":

| Component | Star Wars example |
| --- | --- |
| Data you want to search | All the text of the Star Wars movie scripts |
| Search query | A single line like Give me a hand Mr Space Wizard |
| Jina core | The magical box that finds the closest match to your query in all the data |
| Ranked results | A list of closest matches (for example, Help me Obi-Wan Kenobi) |

In this example we're using CSV files (all the Star Wars scripts) as our dataset, but if we wanted we could also use Jina to search through the movies themselves for soundbites or scenes, based on audio, video, and subtitle data.

First up we'll focus on indexing, so don't worry about the querying bit for now. After all, you have to process the plans for the Death Star before you can seek out its vulnerability, right?

Indexing

Flows

The first step is to build a Flow and feed in our CSV files. As you've read from Jina 101, the Flow is the high-level abstraction Jina uses for any search. Data is fed in, processed, and results spat out the other end. These Flows act as pipelines for indexing and querying:

Note: Since we're only interested in indexing for now, we've left out the results. They'll come when querying.

You can build a Flow in Python, YAML, or Jina Dashboard. In our case we'll use an elegant weapon for a more civilized age, namely YAML, to create an empty Flow:

!Flow
  # We'll fill this in soon, don't worry!

Let's save that as flows/index.yml. Don't worry if it looks as useful as Jar Jar Binks. We'll flesh it out in the next section.

While we're at it, let's add our (currently useless) Flow to app.py and set it to index:

import glob

from jina.flow import Flow

def input_fn():
  # a (simplified) generator over our data: yield each line of every script
  # (the f.index() call below sidesteps this and just takes a filepath)
  for file_path in glob.glob("scripts/*"):
    with open(file_path) as script:
      yield from script

f = Flow().load_config("flows/index.yml")

with f:
  f.index(filepath="scripts/*") # Note this isn't the exact method you'd use. It's a bit simplified
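
Later we'll kick off indexing with python app.py index (and querying with python app.py query), so in practice app.py tends to wrap these calls in small functions and dispatch on the command-line argument. A rough sketch of that plumbing (the index() and query() helpers are our own naming, nothing Jina-specific):

import sys

from jina.flow import Flow

def index():
  f = Flow().load_config("flows/index.yml")
  with f:
    f.index(filepath="scripts/*")  # simplified, as above

def query():
  pass  # we'll fill this in when we get to querying

if __name__ == "__main__":
  task = sys.argv[1]  # "index" or "query", i.e. `python app.py index`
  if task == "index":
    index()
  elif task == "query":
    query()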

Pods

Each task in the Flow is performed by a different Pod which can be written in YAML or Python. Our Star Wars example indexing Flow looks like:

Earlier we created a Flow using flows/index.yml. Let's expand that file with a list of Pods:

!Flow
version: '1'
pods:
  - name: crafter
    uses: pods/craft.yml
  - name: encoder
    uses: pods/encode.yml
  - name: chunk_indexer
    uses: pods/chunk_index.yml
  - name: doc_indexer
    uses: pods/doc_index.yml
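
As a side note, YAML isn't the only way: the same Flow can be assembled directly in Python with the Flow's add() method, which is handy when you want to build Flows programmatically. Treat this as a sketch rather than the definitive API:

from jina.flow import Flow

f = (Flow()
     .add(name="crafter", uses="pods/craft.yml")
     .add(name="encoder", uses="pods/encode.yml")
     .add(name="chunk_indexer", uses="pods/chunk_index.yml")
     .add(name="doc_indexer", uses="pods/doc_index.yml"))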

Pods are configured in their own YAML files (like pods/chunk_index.yml) and may behave differently depending on which Flow they're in:

| Pod | Indexing Flow | Querying Flow |
| --- | --- | --- |
| crafter | Break dataset into sub-documents | N/A (the user query is already in the same format as the output, i.e. a sentence) |
| encoder | Encode sub-documents into vectors | Encode user query into vectors |
| chunk_indexer | Create vector and sub-document indexes | Search vector and sub-document indexes for matches |
| doc_indexer | Create index of Documents and sub-documents | Search Doc indexes for matching sub-documents |
| ranker | N/A | Rank results by relevance |

We'll dig more into Pod YAML files in the Executor section below.

What Does Each Pod Do?

Crafter

A Crafter takes our CSV file as input and breaks it down into sub-documents:

To keep things simple, in our example the Crafter just splits each line from the input CSVs into individual sentences. But if we wanted a more advanced text search, we could split the document (the script) into any level of granularity (line, sentence, word, etc.); these pieces are the sub-documents. Each sub-document comes with its own metadata (like doc_id, parent_id, granularity, etc).
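
To make that a bit more concrete, here's a toy, Jina-free sketch of what "split a script line into sentence sub-documents with metadata" could look like. It's plain Python for illustration only, not the actual crafter Executor:

import re

def craft(line_text, parent_id):
  """Split one line of dialogue into sentence-level sub-documents with metadata."""
  sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", line_text) if s.strip()]
  return [
    {
      "text": sentence,
      "doc_id": f"{parent_id}-{i}",  # made-up id scheme for the sub-document
      "parent_id": parent_id,        # points back to the line it came from
      "granularity": "sentence",
    }
    for i, sentence in enumerate(sentences)
  ]

# craft("Help me Obi-Wan Kenobi. You're my only hope.", parent_id="4501")
# -> two sentence sub-documents, both with parent_id "4501"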

Encoder

An encoder takes a sub-document and converts it to a vector embedding:

Closely related sub-documents have similar embeddings, for example jedi and space_wizard would have similar embeddings, while death_star and princess would be very different from each other.
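
You can poke at this claim yourself, entirely outside Jina, by computing embeddings with the Hugging Face transformers library (using distilbert-base-cased, the same model our encoder will be configured with below) and comparing them with cosine similarity. A rough sketch that mean-pools the token embeddings, which is one common pooling strategy and not necessarily the exact one the Pod uses:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModel.from_pretrained("distilbert-base-cased")

def embed(text):
  # mean-pool DistilBERT's token embeddings into one vector per text
  inputs = tokenizer(text, return_tensors="pt")
  with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, num_tokens, 768)
  return hidden.mean(dim=1).squeeze(0)          # shape: (768,)

def cosine(a, b):
  return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

print(cosine(embed("jedi"), embed("space wizard")))    # should come out higher...
print(cosine(embed("death star"), embed("princess")))  # ...than this pair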

Chunk Indexer

The Chunk indexer takes all of the sub-documents and their embeddings as input and creates two indexes as output:

  • Vector index: Stores doc_id and vector embedding of each sub-document
  • Key-value index: Stores doc_id, parent_id and any metadata associated with each sub-document
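
Conceptually, these two indexes are nothing more exotic than a table of vectors plus a key-value store of metadata, with "search" boiling down to "find the nearest vectors". A deliberately naive sketch with made-up ids (brute-force lookup; the real Jina indexers are far more capable):

import numpy as np

# Vector index: doc_id -> embedding of each sub-document
vector_index = {
  "4501": np.array([1.2, 0.6, -3.1]),
  "4502": np.array([-0.3, 1.5, 0.7]),
}

# Key-value index: doc_id -> parent_id plus any other metadata
kv_index = {
  "4501": {"parent_id": "04", "text": "Help me Obi-Wan Kenobi", "speaker": "LEIA"},
  "4502": {"parent_id": "04", "text": "You're my only hope", "speaker": "LEIA"},
}

def nearest(query_vec, top_k=1):
  # brute force: sort every stored vector by distance to the query vector
  by_distance = sorted(vector_index.items(),
                       key=lambda item: np.linalg.norm(item[1] - query_vec))
  return [doc_id for doc_id, _ in by_distance[:top_k]]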

Document Indexer

Creates an index of Documents and their associated doc_ids for each level of granularity specified:

And that's the last step of the Flow for indexing! But wait, there's more to learn before we jump into querying!

Executors

Now we've processed all that data - but we haven't even dived into the deep learning behind Jina.

That's where our Executors come in. Like droids, some of them are "smart" (like the Executor in the encoder Pod, which uses artificial intelligence), while others are more-or-less mindless automatons (like the indexers). Either way, they're the ones who do all the real work. Everything else is just higher-level wrappers to make them easier to work with.

We'll use the analogy of Luke flying his X-wing to blow up the Death Star:

| Star Wars | Jina |
| --- | --- |
| X-wing | Pod (the container and "interface" that lets R2 fly) |
| R2D2 | Executor (the one who actually does all the work) |
| Luke | Useless bag of flesh who doesn't do much except "trust the force", pull the trigger and get the glory |

Since each Pod has a different task, each Pod uses a different Executor. After all, you wouldn't ask C3PO to fly an X-wing, and you wouldn't ask R2D2 to translate languages (or to wander around the desert like a useless gold mannequin who keeps getting lost). Let's look at our encoder Pod's pods/encode.yml as an example:

!TransformerTorchEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: distilbert-base-cased
  max_length: 96
# More stuff here

We can see that it uses TransformerTorchEncoder to encode our sub-documents into vector embeddings using the distilbert-base-cased language model. For example:

| Input | Output Vector |
| --- | --- |
| Help me Obi-Wan Kenobi | {1.2, 0.6, -3.1, ...} |
| You're my only hope | {-0.3, 1.5, 0.7, ...} |

In this way, TransformerTorchEncoder is like C3PO. It knows how to translate things to other things, like text to vector embeddings (admittedly, not over 6 million forms of translation, but we only have 3PO's word for that). However, it is probably equally useless at getting out of a desert.

Storage

Phew, that's a lot of data! But where is it all stored? Jina uses a working directory, usually workspace, to store all of the indexes created by the Pods above.

Note that many of the Pods write nothing to storage. They merely pass their output to the next Pod in the Flow. It's only when data needs to be available between Flows that it gets written to storage (in our case, the indexes created by chunk_indexer and doc_indexer).
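
If it helps to have a mental model, "written to storage" here just means: the index Flow serialises its indexes under the workspace, and the query Flow loads them back later. In toy form, continuing the naive indexes sketched above (pickle is purely for illustration; Jina's indexers use their own storage formats):

import os
import pickle

WORKSPACE = "workspace"

def save_indexes(vector_index, kv_index):
  # called at the end of the (toy) index Flow
  os.makedirs(WORKSPACE, exist_ok=True)
  with open(os.path.join(WORKSPACE, "toy_indexes.pkl"), "wb") as f:
    pickle.dump({"vectors": vector_index, "kv": kv_index}, f)

def load_indexes():
  # called at the start of the (toy) query Flow
  with open(os.path.join(WORKSPACE, "toy_indexes.pkl"), "rb") as f:
    data = pickle.load(f)
  return data["vectors"], data["kv"]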

Running the Indexing Flow

In the introduction we mentioned this was just meant as an overview and not a "real" Flow (though it could be built into a working example). In short, it won't run as is. However, if you did want to run it, it's as simple as:

python app.py index

Querying

At this point the index Flow is out of the picture, so we can safely ignore both the Flow and the original input documents:

The query Flow is very similar to the indexing Flow. It:

  • Takes a user query as input (Give me a hand Mr Space Wizard)
  • Processes it (get its vector)
  • Pulls the indexes from storage
  • Finds the closest vectors in the vector index (like the vector for Help me Obi-Wan Kenobi) and gets the doc_ids
  • Ranks the closest matching Documents
  • Returns the ranked list to the user via REST or gRPC
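
Wiring that into app.py, the query() helper we stubbed out earlier might look roughly like this. The flows/query.yml filename, the print_matches callback, and the response attributes are all illustrative; the real Flow search method and response objects differ a bit:

from jina.flow import Flow

def print_matches(response):
  # illustrative only: conceptually the response carries a ranked list of matches
  for doc in response.docs:
    for match in doc.matches:
      print(match.text)

def query():
  f = Flow().load_config("flows/query.yml")
  with f:
    # again, not the exact method signature (simplified)
    f.search(["Give me a hand Mr Space Wizard"], on_done=print_matches)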

A human readable version of this ranked list might look like:

| Rank | Line | Document ID | Parent Doc | Metadata |
| --- | --- | --- | --- | --- |
| 1 | Help me Obi-Wan Kenobi | 4501 | 04 | speaker: LEIA |
| 2 | Yes, I was once a Jedi Knight the same as your father | 4702 | 04 | speaker: OBIWAN |
| 3 | I'm looking for a Jedi Master | 5304 | 05 | speaker: LUKE |

And so on...

Like indexing, you'd run the Flow with:

python app.py query

Summary

In this post we've covered most of the components from Jina 101, what they do, and how they're hooked together in the context of a simple text-based search.

Hopefully by now you should understand:

  • Jina searches require indexing and querying Flows
  • Flows can be built in YAML, Python, or Jina Dashboard
  • Flows are called from app.py
  • Documents go through a series of steps in each Flow, being passed through Pods which in turn use Executors to take input, process it, and pass it to the next Pod (and sometimes read from or write to storage)
  • There may be different Pods in each Flow (in our example, the query Flow lacks a crafter Pod, and the indexing Flow lacks a ranker)
  • Output is returned as a ranked list of matching Documents, based on similarity of embedding

As we mentioned at the start, the above is over-simplified and certainly not exhaustive. There's always more to learn and deeper components to dive into.

Next Time

Of course, there's always more to learn, just as there's always yet another Death Star to take down. In future posts we'll cover running Pods and Flows remotely, speeding up performance, improving reliability, searching different kinds of things, and lots more. Let us know which you'd like to learn about via Twitter or our Slack community!

Until then, may the Force be with you...
