Using Jina to build a text-to-image search workflow based on OpenAI CLIP

David Buchaca, Bo Wang, Joan Fontanals

If you are not familiar with cross-modal search in Jina, please read this wonderful blog post, which shows how to use Jina to build a text-to-image search system based on VSE.

To start, we need a dataset to retrieve items from. We will use the Flickr8k dataset from Kaggle, so first install the Kaggle command-line tool on your machine. After logging in to Kaggle and setting your Kaggle token on your system (as described here), run:

kaggle datasets download adityajn105/flickr8k
unzip flickr8k.zip 
rm flickr8k.zip
mkdir data
mkdir data/f8k
mv Images data/f8k/images
mv captions.txt data/f8k/captions.txt

Make sure that your data folder has the following structure:

data
└── f8k
    ├── images
    └── captions.txt
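The layout can also be checked programmatically. The following is a minimal sketch, assuming the commands above were run unchanged so that the images live in data/f8k/images and the captions in data/f8k/captions.txt; the helper name missing_paths is hypothetical and not part of the example repository.

```python
from pathlib import Path


def missing_paths(root="data"):
    """Return the expected dataset paths that do not exist under `root`."""
    expected = [
        Path(root) / "f8k" / "images",        # directory holding the Flickr8k images
        Path(root) / "f8k" / "captions.txt",  # captions file moved by the commands above
    ]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("Layout OK" if not missing else f"Missing: {missing}")
```

An empty list means both paths are in place and indexing can start.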


After the data is downloaded, we can place it in the root folder of cross-modal-search. With that folder as the working directory, we are ready to index the data using the CLIP model:

python app.py -t index -n 16000 -d f8k -m clip

Once we've indexed the data we can perform a search by first running:

python app.py -t query

Then, in a browser, open the Jina Box URL, type anything in the search bar, and wait to see the top-k results.

What is this magic? Introduction to CLIP

CLIP is a neural network built to jointly train an image encoder and a text encoder. The learning task for CLIP consists of predicting the correct pairings between text descriptions and images in a batch. The following image summarizes the training of the model:


After training, the model can be used to encode a piece of text or an image into a 512-dimensional vector.

Since both text and images are embedded into the same fixed-size space, the embedded text can be used to retrieve images (and vice versa).

Let phi_t and phi_i be the embeddings learned by CLIP to transform text and images, respectively, into vectors. In particular, phi_t(x_text) and phi_i(x_image) are 512-dimensional vectors.

Given a text query q and N images (x_1, ..., x_N), we can compute the dot products between the embedded query and the N embedded images and store them in the vector s. That is, s = ( phi_t(q) * phi_i(x_1), phi_t(q) * phi_i(x_2), ..., phi_t(q) * phi_i(x_N) ).

Then we can return the top K elements from x_1, ..., x_N in a list top_K using the similarity values in s, where the k-th entry of top_K is the image whose score is the k-th highest in s.
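The scoring and ranking step can be sketched in a few lines of plain Python. This is only an illustration: the 3-dimensional vectors below are made-up stand-ins for the 512-dimensional CLIP embeddings, and the helper names dot and top_k are hypothetical, not part of Jina or CLIP.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, image_vecs, k):
    """Return (indices of the k best-scoring images, best first; all scores)."""
    scores = [dot(query_vec, x) for x in image_vecs]  # the vector s
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy stand-ins for phi_t(q) and phi_i(x_1), ..., phi_i(x_4)
q = [1.0, 0.0, 1.0]
images = [
    [0.9, 0.1, 0.8],  # x_1
    [0.0, 1.0, 0.0],  # x_2
    [0.5, 0.5, 0.5],  # x_3
    [1.0, 0.0, 1.0],  # x_4
]

top, s = top_k(q, images, k=2)
print(top)  # [3, 0] -> zero-based indices, i.e. x_4 first, then x_1
```

Sorting all N scores is fine for a toy example; a real index would typically rely on a dedicated nearest-neighbor indexer instead.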

Quality: CLIP vs VSE++

We can evaluate CLIP using the mean reciprocal rank (MRR), which is a common metric for evaluating information retrieval systems. Given a list of queries query_list = (q_1, ..., q_Q) and a list (of lists) of retrieved items retrieved_list = (r_1, ..., r_Q), the mean reciprocal rank assigned to (query_list, retrieved_list) is defined as the mean of the inverse of the ranks. That is:

MRR(query_list, retrieved_list) = (1 / Q) * sum_{i=1}^{Q} 1 / rank(q_i, r_i)

where rank(q_i, r_i) refers to the position of the first relevant document in r_i.
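As a worked example of the definition, here is a small sketch of an MRR computation. It is not the code in evaluate.py; the function name, the use of sets for relevant items, and the convention that a query with no relevant item retrieved contributes 0 are all assumptions made for illustration.

```python
def mean_reciprocal_rank(relevant_ids, retrieved_list):
    """relevant_ids[i]: set of relevant item ids for query i.
    retrieved_list[i]: ranked list r_i returned for query i."""
    total = 0.0
    for relevant, retrieved in zip(relevant_ids, retrieved_list):
        for position, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / position  # reciprocal of the first relevant rank
                break
        # Assumption: if nothing relevant was retrieved, the query adds 0.
    return total / len(retrieved_list)


relevant = [{"a"}, {"b"}, {"c"}]
retrieved = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]
print(mean_reciprocal_rank(relevant, retrieved))  # (1 + 1/2 + 0) / 3 = 0.5
```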

The provided evaluate.py can be used to compute the MRR of CLIP and VSE++. The results are:

| Encoder | Modality      | Mean reciprocal rank | Terminal command                                             |
| ------- | ------------- | -------------------- | ------------------------------------------------------------ |
| vse     | text to image | 0.3669               | python evaluate.py -e text2image -i 16000 -n 4000 -m vse -s 32 |
| clip    | text to image | 0.4018               | python evaluate.py -e text2image -i 16000 -n 4000 -m clip -s 32 |
| clip    | image to text | 0.4088               | python evaluate.py -e image2text -i 16000 -n 4000 -m clip -s 32 |
| vse     | image to text | 0.3791               | python evaluate.py -e image2text -i 16000 -n 4000 -m vse -s 32 |

Looking inside our app.py

Now that we have seen how to use our Jina application and the benefits of leveraging the CLIP encoder, we can look at the code and the details of the implementation.

A Jina application usually starts by setting several environment variables. The provided .yml files reference variables such as JINA_PARALLEL or JINA_SHARDS, so these must be set before the files are loaded. This is done in the config() function.

import os


def config(model_name):
    # Keep any value already exported in the shell; otherwise fall back to '1'.
    os.environ['JINA_PARALLEL'] = os.environ.get('JINA_PARALLEL', '1')
    os.environ['JINA_SHARDS'] = os.environ.get('JINA_SHARDS', '1')
    os.environ['JINA_PORT'] = '45678'
    os.environ['JINA_USE_REST_API'] = 'true'
    # Select the image and text encoders that match the chosen model.
    if model_name == 'clip':
        os.environ['JINA_IMAGE_ENCODER'] = os.environ.get('JINA_IMAGE_ENCODER', 'docker://jinahub/pod.encoder.clipimageencoder:0.0.1-1.0.7')
        os.environ['JINA_TEXT_ENCODER'] = os.environ.get('JINA_TEXT_ENCODER', 'docker://jinahub/pod.encoder.cliptextencoder:0.0.1-1.0.7')
        os.environ['JINA_TEXT_ENCODER_INTERNAL'] = 'yaml/clip/text-encoder.yml'
    elif model_name == 'vse':
        os.environ['JINA_IMAGE_ENCODER'] = os.environ.get('JINA_IMAGE_ENCODER', 'docker://jinahub/pod.encoder.vseimageencoder:0.0.5-1.0.7')
        os.environ['JINA_TEXT_ENCODER'] = os.environ.get('JINA_TEXT_ENCODER', 'docker://jinahub/pod.encoder.vsetextencoder:0.0.6-1.0.7')
        os.environ['JINA_TEXT_ENCODER_INTERNAL'] = 'yaml/vse/text-encoder.yml'

This has to happen before the Flow is defined, because the Flow is created from flow-index.yml, which expects some of these environment variables to be set already. For example, we can see shards: $JINA_PARALLEL and uses: $JINA_IMAGE_ENCODER.

!head -18 flow-index.yml

!Flow
version: '1'
with:
  prefetch: 10
pods:
  - name: loader
    uses: yaml/image-load.yml
    shards: $JINA_PARALLEL
    read_only: true
  - name: normalizer
    uses: yaml/image-normalize.yml
    shards: $JINA_PARALLEL
    read_only: true
  - name: image_encoder
    uses: $JINA_IMAGE_ENCODER
    shards: $JINA_PARALLEL
    timeout_ready: 600000
    read_only: true
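To see why config() must run before the Flow is loaded, it can help to look at what placeholder substitution does. The sketch below uses Python's standard string.Template and is not how Jina resolves variables internally; the resolve helper and the two-line snippet are hypothetical.

```python
import os
from string import Template


def resolve(yaml_text, env=None):
    """Replace $NAME placeholders from `env`, leaving unknown names untouched."""
    env = os.environ if env is None else env
    return Template(yaml_text).safe_substitute(env)


snippet = "shards: $JINA_PARALLEL\nuses: $JINA_IMAGE_ENCODER"
env = {"JINA_PARALLEL": "1"}  # JINA_IMAGE_ENCODER deliberately left unset

print(resolve(snippet, env))
# shards: 1
# uses: $JINA_IMAGE_ENCODER
```

A placeholder whose variable is missing stays unresolved, which is the situation config() is there to prevent.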

Once config is called we can see that the application selects either index or query mode.
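The mode selection could look roughly like the sketch below. It is an illustration built only from the flags used in the commands earlier (-t, -n, -d, -m); the long option names, the defaults, and the returned strings are assumptions, and the real app.py may be organized differently.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Cross-modal search demo (sketch)")
    parser.add_argument("-t", "--task", choices=["index", "query"], required=True)
    parser.add_argument("-n", "--num-docs", type=int, default=16000)  # assumed default
    parser.add_argument("-d", "--data-set", default="f8k")            # assumed default
    parser.add_argument("-m", "--model-name", choices=["clip", "vse"], default="clip")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # config(args.model_name) would be called here, before any Flow is loaded.
    if args.task == "index":
        return f"index {args.num_docs} docs from {args.data_set} with {args.model_name}"
    return f"serve queries with {args.model_name}"


print(main(["-t", "index", "-n", "16000", "-d", "f8k", "-m", "clip"]))
# index 16000 docs from f8k with clip
```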

Note that config sets environment variables that point to Docker images, such as 'docker://jinahub/pod.encoder.cliptextencoder:0.0.1-1.0.7'. This is needed because these encoders are not part of Jina's core, so we need to specify where to find the Docker image for each of those parts of the Flow that we will construct.

Note that the Pod named loader, defined in the file image-load.yml, does not need any Docker information. This file starts with !ImageReader, which is a crafter already provided in Jina's core.

Index mode

If index is selected, a Flow is created with:

f = Flow().load_config('flow-index.yml')

The Flow can then start indexing with f.index:

with f:
    f.index(input_fn=input_index_data(num_docs, request_size, data_set))

Here f.index receives a generator input_index_data that reads the input data and creates jina.Document objects with the images and the captions.
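The reading half of such a generator might look like the sketch below. It stops at plain (image_path, caption) tuples rather than building jina.Document objects, and it assumes that captions.txt is a CSV file with an "image,caption" header line; the function name iter_image_captions is hypothetical.

```python
import csv
import os


def iter_image_captions(captions_path, image_dir, num_docs):
    """Yield up to num_docs (image_path, caption) pairs from a captions CSV."""
    with open(captions_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed "image,caption" header
        yielded = 0
        for row in reader:
            if yielded >= num_docs:
                break
            if not row:
                continue  # ignore blank lines
            # Rejoin in case an unquoted caption itself contains commas.
            image_name, caption = row[0], ",".join(row[1:]).strip()
            yielded += 1
            yield os.path.join(image_dir, image_name), caption
```

Because it is a generator, the file is read lazily and only as many rows as requested are consumed.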

We can plot a Flow using the f.plot(image_path) function, which produces the following plot:


In the diagram, each blue node corresponds to one of the Pods defined in flow-index.yml. The string inside each node is the name attribute assigned to that Pod in flow-index.yml.

Note that there are two main branches: the middle path transforms images to vectors while the bottom branch transforms text to vectors. Both branches transform data to the same vector space.

Query mode

If query mode is selected, another Flow is created from flow-query.yml:

f = Flow().load_config('flow-query.yml')
with f:
    f.block()  # keep the Flow running so it can serve queries

As before, we can visually inspect the query Flow using f.plot('flow_diagram_index.svg'), which produces the following diagram:

