
Multimodal Search Demo in Detail

Susana Guzmán

First things first, I hope you all saw that we released our 1.0 version not so long ago. Yay! And today I want to talk about one of our demos: multimodal document search. I know, even the name sounds luxurious!


It makes sense to first define what we mean by multimodality before going into more fancy terms.

Multimodality, not to be confused with cross-modality, basically means multiple modalities (I know, shocking, right?), and those data types (a.k.a. modalities) can be audio, video, text or images. For example, a PDF file could have text only, images only, or, in most cases, images and text together. In the last case, we would have a file with two modalities: text and images.

So once we get this straight, we can see that it would be very useful to have a way to search through data with multiple modalities. We can use the multimodal demo to see this more clearly:

Let's say you have the image on the left. You want something similar to that, but not exactly that. You want something between the image and the text that is below the image. And this is when multimodal search is useful.
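One common way to search "between" an image and a text is to embed both and take a weighted mix of the two vectors before hitting the index. This is only a conceptual sketch with made-up 3-dimensional embeddings, not the demo's actual encoding pipeline:

```python
# Mix an image embedding and a text embedding into one query vector.
# weight=1.0 means "pure image", weight=0.0 means "pure text".
def mix(img_vec, txt_vec, weight=0.5):
    return [weight * i + (1 - weight) * t for i, t in zip(img_vec, txt_vec)]

# Toy embeddings, purely illustrative:
query = mix([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], weight=0.5)  # -> [0.5, 0.5, 0.0]
```

Sliding `weight` between 0 and 1 is what lets you ask for "something between the image and the text".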

You can play with this demo yourself:

pip install "jina[multimodal]"
jina hello multimodal

How does this black magic work?

Ok, now that I got your attention we can dig into the details.

In Jina we always work with Flows. You can see more details on the basics of Jina here, but if you haven't read it (sure sure, you'll read it later, I know), think of Flows as a way to abstract high-level tasks. And which tasks, you might ask? Well, Indexing and Querying in this case. Let's look at them more closely, starting with the Indexing Flow:

Index Flow

version: '1'
pods:
  - name: segment
    uses: pods/segment.yml
  # first pathway
  - name: filter_text
    uses: pods/filter.yml
    env:
      filter_mime: text/plain
  - name: textEncoder
    uses: pods/encode-text.yml
  - name: textModIndexer
    uses: pods/index-comp.yml
    env:
      indexer_name: text
  # second pathway, in parallel
  - name: filter_image
    uses: pods/filter.yml
    env:
      filter_mime: image/jpeg
    needs: segment
  - name: imageCrafter
    uses: pods/craft-image.yml
  - name: imageEncoder
    uses: pods/encode-image.yml
  - name: imageModIndexer
    uses: pods/index-comp.yml
    env:
      indexer_name: image
  # third pathway, in parallel
  - name: docIndexer
    uses: pods/index-doc.yml
    needs: segment
  # join all parallel works
  - name: joiner
    needs: [docIndexer, imageModIndexer, textModIndexer]

Here's the index.yml file in GitHub if you want to check it out. As you can see, the first thing we define are the Pods, and in those Pods there are some comments marking 3 pathways. This means that our Flow will run those 3 pathways in parallel.
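To make the parallelism concrete, here's a toy scheduler (not Jina's actual one) that resolves the `needs` declarations from the YAML above into stages, where every Pod in a stage can run in parallel. By default a Pod needs the one listed right before it, so the three pathways all hang off `segment`:

```python
# Group pods into stages by their dependencies; pods in the same stage
# have all their needs satisfied and can run in parallel.
def schedule(pods):
    done, stages = set(), []
    remaining = dict(pods)
    while remaining:
        ready = sorted(n for n, needs in remaining.items()
                       if all(d in done for d in needs))
        if not ready:
            raise ValueError("cycle or missing dependency")
        stages.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return stages

# The dependency graph implied by the Flow YAML above:
pods = {
    "segment": [],
    "filter_text": ["segment"],
    "textEncoder": ["filter_text"],
    "textModIndexer": ["textEncoder"],
    "filter_image": ["segment"],
    "imageCrafter": ["filter_image"],
    "imageEncoder": ["imageCrafter"],
    "imageModIndexer": ["imageEncoder"],
    "docIndexer": ["segment"],
    "joiner": ["docIndexer", "imageModIndexer", "textModIndexer"],
}
stages = schedule(pods)
```

Running it, `segment` goes first on its own, the heads of all three pathways run together in the second stage, and `joiner` waits for everything else at the end.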

So the very first thing we need to do is pre-process our data, and for that we use a segmenter (pods/segment.yml). This will divide (segment) our document into smaller parts, and since we have different modalities, it makes sense to segment our original document according to those modality types.

  • Pathway 1: takes care of the text.
  • Pathway 2: takes care of the image.
  • Pathway 3: takes care of the document itself as a whole.

And the last thing will be to join all those different segmented parts back together.
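A toy version of what segmenting by modality looks like, in the spirit of pods/segment.yml (this is purely illustrative; the field names and the dict-based "document" are my own, not Jina's actual data structures):

```python
# Split one multimodal document into chunks, one per modality,
# each tagged with the MIME type the filter Pods will match on later.
def segment(doc):
    chunks = []
    if doc.get("text"):
        chunks.append({"mime_type": "text/plain", "content": doc["text"]})
    for img in doc.get("images", []):
        chunks.append({"mime_type": "image/jpeg", "content": img})
    return chunks

doc = {"text": "a black dress with sleeves", "images": ["dress.jpg"]}
chunks = segment(doc)  # one text chunk + one image chunk
```

Each pathway then only has to deal with the chunks of its own modality.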

Okay okay, that's nice and all, but those pathways still seem very mysterious. We can see that each pathway has several keywords: name, uses and env.

We should remember that the Flow manages Pods, and those keywords refer to those Pods. So if we look at the first pathway, we can see it has 3 Pods: filter_text, textEncoder and textModIndexer:

  - name: filter_text
    uses: pods/filter.yml
    env:
      filter_mime: text/plain
  - name: textEncoder
    uses: pods/encode-text.yml
  - name: textModIndexer
    uses: pods/index-comp.yml
    env:
      indexer_name: text

The first Pod, filter_text, is doing exactly what its name suggests: It filters the text of the Document. If you open that YAML file (pods/filter.yml) you'll see this:

!BaseExecutor
requests:
  use_default: true
  on:
    [IndexRequest, SearchRequest]:
      - !FilterQL
        with:
          lookups:
            mime_type: '${{ENV.filter_mime}}'
          traversal_paths: ['c']

The very first line is telling us which Executor will be used, in this case the BaseExecutor. You can find it in Jina Hub, where we have many Executors that are ready to use and are provided by Jina and the community (just a little ad here to say "come join us! bring your own Executors to the Open Source side!"). You can see the full list here if that's your cup of tea.

Now the next part is this requests keyword. "The what now?" you might ask. Well, in Jina we can do several things, which means several kinds of requests:

  1. Index
  2. Search
  3. Update
  4. Delete
  5. Control

All of them are different, so all of them require different configurations. But of course, if you only need Index and Search, as in this case, it wouldn't make much sense to write out the details for all the rest, right? So we provide a basic configuration that you can enable with:

use_default: true

This uses the basic configuration for all request types, but since we want to focus on Index and Search, we override those two with the `[IndexRequest, SearchRequest]` block. This particular Executor uses QueryLang, but for this example you just need to know that it can filter chunks by their MIME type.
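Conceptually, the filter keeps only the chunks whose MIME type matches the one set via `filter_mime`. A standalone sketch of that idea (FilterQL itself is a Jina driver; this minimal version is mine):

```python
# Keep only the chunks whose MIME type matches the filter,
# like the filter_text / filter_image Pods do for their pathways.
def filter_mime(chunks, mime):
    return [c for c in chunks if c["mime_type"] == mime]

chunks = [
    {"mime_type": "text/plain", "content": "a black dress with sleeves"},
    {"mime_type": "image/jpeg", "content": "dress.jpg"},
]
text_only = filter_mime(chunks, "text/plain")   # just the text chunk
image_only = filter_mime(chunks, "image/jpeg")  # just the image chunk
```

The same pods/filter.yml is reused twice in the Flow; only the `filter_mime` environment variable differs between the text and image pathways.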

If you look at the other Pods, you'll see they have a similar structure. The textEncoder (pods/encode-text.yml) uses TransformerTorchEncoder instead of the BaseExecutor we saw before, but the rest is pretty similar: it uses the basic configuration for all requests, except for Index and Search.

And that's it

That's it! We're done for the day. If you check each part of the other pathways, you'll see they are all similar. If you check the Query Flow, it will also be pretty similar, with some small differences. And if you check the Indexers, you'll see we have two types of them in this example: one for the vectors and one for the meta-information.

BUT those are a lot of ifs and we've done a lot today. You deserve a cake. I deserve to go pet a cat. Go get one (a cake, not a cat) and we'll discuss the BinaryPBIndexer and the NumpyIndexer next time.

In the meantime, you can follow us on Twitter, GitHub, or join our Slack community.
