
Using Jina to build a text-to-image search workflow based on OpenAI CLIP

If you are not familiar with cross-modal search in Jina, please read this wonderful blog post first, which shows how to use Jina to build a text-to-image search system based on VSE++.

To start, we need a dataset from which to retrieve items. First, install the Kaggle CLI on your machine, because we will use the following dataset. After logging in to Kaggle and setting your Kaggle API token on your system (as described here), run:

kaggle datasets download adityajn105/flickr8k
unzip flickr8k.zip
mkdir -p data/f8k
mv Images data/f8k/images
mv captions.txt data/f8k/captions.txt

Make sure that your data folder has the following structure:

data/f8k/images/       <- the Flickr8k images
data/f8k/captions.txt  <- the captions file

After the data is downloaded, add the data folder to the root folder of cross-modal-search. After setting the working directory, we are ready to index the data using the CLIP model:

python app.py -t index -n 16000 -d f8k -m clip

Once we’ve indexed the data we can perform a search by first running:

python app.py -t query

Then, open the Jina Box URL in a browser, type anything in the search bar, and wait to see the top-k results.

What is this magic? Introduction to CLIP

CLIP is a neural network built to jointly train an image encoder and a text encoder. The learning task for CLIP consists of predicting the correct pairings between text descriptions and images within a batch. The following image summarizes the training of the model:


After training, the model can be used to encode a piece of text or an image into a 512-dimensional vector.

Since both text and images are embedded into the same fixed-size space, the embedded text can be used to retrieve images (and vice versa).

Let phi_t and phi_i be the embeddings learned by CLIP that map text and images, respectively, to vectors. In particular, phi_t(x_text) and phi_i(x_image) are 512-dimensional vectors.

Given a text query q and N images (x_1, ..., x_N), we can compute the dot products between q and the N images and store them in the vector s. That is, s = ( phi_t(q) * phi_i(x_1), phi_t(q) * phi_i(x_2), ..., phi_t(q) * phi_i(x_N) ).

Then we can return the top K elements from x_1, ..., x_N in a list top_K using the similarity values in s, where the k'th value in top_K corresponds to the coordinate of s with the k'th highest score.
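The scoring and ranking step above can be sketched in plain Python. Note that the embeddings here are toy 3-dimensional vectors rather than real 512-dimensional CLIP outputs; `dot` and `top_k` are illustrative helpers, not part of Jina or CLIP:

```python
def dot(u, v):
    """Dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_emb, image_embs, k):
    """Return the indices of the k images whose embeddings score
    highest against the query embedding, by descending dot product."""
    s = [dot(query_emb, e) for e in image_embs]  # the similarity vector s
    return sorted(range(len(s)), key=lambda i: -s[i])[:k]

# Toy example: q plays the role of phi_t(q), images of phi_i(x_1..x_3).
q = [1.0, 0.0, 1.0]
images = [[0.9, 0.1, 0.8],   # most similar to q
          [0.0, 1.0, 0.0],
          [0.5, 0.5, 0.5]]
print(top_k(q, images, 2))   # -> [0, 2]
```

In the real application the same computation runs inside Jina's indexer over 512-dimensional CLIP vectors.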

Quality CLIP vs VSE++

We can evaluate CLIP using the mean reciprocal rank (MRR), which is a common metric for evaluating information retrieval systems. Given a list of queries query_list = (q_1, ..., q_Q) and a list (of lists) of retrieved items retrieved_list = (r_1, ..., r_Q), the mean reciprocal rank assigned to (query_list, retrieved_list) is defined as the mean of the inverse ranks. That is:

MRR(query_list, retrieved_list) = (1/Q) * sum_{i=1}^{Q} 1 / rank(q_i, r_i)

where rank(q_i, r_i) refers to the position of the first relevant document in r_i.
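The metric itself is straightforward to compute. A minimal sketch, assuming each entry of `ranks` is the 1-based position of the first relevant item for the corresponding query:

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; each rank is the 1-based
    position of the first relevant item in that query's result list."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose first relevant hit appeared at ranks 1, 2 and 4:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 1/2 + 1/4) / 3 ≈ 0.583
```

A perfect system, which always ranks a relevant item first, would score exactly 1.0.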

The commands provided in the table below can be used to compute the MRR of CLIP and VSE++. The results are:

| Encoder | Modality      | Mean Reciprocal Rank | Terminal command                                           |
| ------- | ------------- | -------------------- | ---------------------------------------------------------- |
| vse     | text to image | 0.3669               | python app.py -e text2image -i 16000 -n 4000 -m vse -s 32  |
| clip    | text to image | 0.4018               | python app.py -e text2image -i 16000 -n 4000 -m clip -s 32 |
| clip    | image to text | 0.4088               | python app.py -e image2text -i 16000 -n 4000 -m clip -s 32 |
| vse     | image to text | 0.3791               | python app.py -e image2text -i 16000 -n 4000 -m vse -s 32  |

Looking inside our application

Now that we have seen how to use CLIP in our Jina application and the benefits of leveraging the CLIP encoder, we can look at the code and the details of the implementation.

A Jina application will usually start by setting several environment variables that are needed to properly configure the provided .yml files, which may reference variables such as JINA_PARALLEL or JINA_SHARDS. These variables are set in the config() function.

import os

def config(model_name):
    os.environ['JINA_PARALLEL'] = os.environ.get('JINA_PARALLEL', '1')
    os.environ['JINA_SHARDS'] = os.environ.get('JINA_SHARDS', '1')
    os.environ['JINA_PORT'] = '45678'
    os.environ['JINA_USE_REST_API'] = 'true'
    if model_name == 'clip':
        os.environ['JINA_IMAGE_ENCODER'] = os.environ.get('JINA_IMAGE_ENCODER', 'docker://jinahub/pod.encoder.clipimageencoder:0.0.1-1.0.7')
        os.environ['JINA_TEXT_ENCODER'] = os.environ.get('JINA_TEXT_ENCODER', 'docker://jinahub/pod.encoder.cliptextencoder:0.0.1-1.0.7')
        os.environ['JINA_TEXT_ENCODER_INTERNAL'] = 'yaml/clip/text-encoder.yml'
    elif model_name == 'vse':
        os.environ['JINA_IMAGE_ENCODER'] = os.environ.get('JINA_IMAGE_ENCODER', 'docker://jinahub/pod.encoder.vseimageencoder:0.0.5-1.0.7')
        os.environ['JINA_TEXT_ENCODER'] = os.environ.get('JINA_TEXT_ENCODER', 'docker://jinahub/pod.encoder.vsetextencoder:0.0.6-1.0.7')
        os.environ['JINA_TEXT_ENCODER_INTERNAL'] = 'yaml/vse/text-encoder.yml'

This is paramount before defining the Flow, because the Flow is created from flow-index.yml, which expects some of these environment variables to already be set. For example, we can see shards: $JINA_PARALLEL and uses: $JINA_IMAGE_ENCODER.

!head -18 flow-index.yml
version: '1'
with:
  prefetch: 10
pods:
  - name: loader
    uses: yaml/image-load.yml
    shards: $JINA_PARALLEL
    read_only: true
  - name: normalizer
    uses: yaml/image-normalize.yml
    shards: $JINA_PARALLEL
    read_only: true
  - name: image_encoder
    uses: $JINA_IMAGE_ENCODER
    shards: $JINA_PARALLEL
    timeout_ready: 600000
    read_only: true

Once config is called we can see that the application selects either index or query mode.

Note that config sets some variables pointing to Docker images, such as docker://jinahub/pod.encoder.cliptextencoder:0.0.1-1.0.7. This is needed because this encoder is not part of Jina's core, so we need to specify where to find the Docker image for that specific part of the Flow we will construct.

Note that the Pod named loader, defined in the file image-load.yml, does not need any Docker information. That file starts with !ImageReader, which is a crafter already provided in Jina's core.

Index mode

If index is selected, a Flow is created with:

f = Flow().load_config('flow-index.yml')

And the Flow can start indexing with f.index:

with f:
    f.index(input_fn=input_index_data(num_docs, request_size, data_set))

Here f.index receives a generator, input_index_data, that reads the input data and creates jina.Document objects holding the images and their captions.
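The exact generator lives in the example repository, but its core idea can be sketched without any Jina imports: walk through captions.txt (a CSV of image,caption rows) and yield up to num_docs (image path, caption) pairs, which are then wrapped into jina.Document objects. The file layout matches the download steps earlier; the helper name `iter_pairs` is illustrative:

```python
import csv
import os

def iter_pairs(captions_file, images_dir, num_docs):
    """Yield up to num_docs (image_path, caption) pairs from a
    Flickr8k-style captions.txt with an 'image,caption' header row."""
    with open(captions_file, newline='') as f:
        reader = csv.DictReader(f)
        for count, row in enumerate(reader):
            if count >= num_docs:
                break
            yield os.path.join(images_dir, row['image']), row['caption']

# Usage (paths match the data layout set up earlier):
# for path, caption in iter_pairs('data/f8k/captions.txt', 'data/f8k/images', 16000):
#     ...build a jina.Document from the image at `path` and its `caption`...
```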

We can plot a Flow using the f.plot(image_path) function. Calling it on the index Flow produces the following plot:


In the diagram, each blue node corresponds to one of the Pods defined in flow-index.yml. The string inside each node corresponds to the name attribute assigned to each Pod in flow-index.yml.

Note that there are two main branches: the middle path transforms images to vectors while the bottom branch transforms text to vectors. Both branches transform data to the same vector space.

Query mode

If query mode is selected, another Flow is created from flow-query.yml:

f = Flow().load_config('flow-query.yml')
with f:
    f.block()  # keep the Flow alive to serve incoming queries
As before, we can visually inspect the query Flow using f.plot('flow_diagram_index.svg'), which produces the following diagram: