Cross-modal Search with Jina

Joan Fontanals Martínez

Cross-modal search

In this post I will explain how we implemented a search engine for cross-modal content in Jina, based on the paper VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.

The result is an application that allows:

  • Searching images, using descriptive captions as input query, or
  • Searching text captions, using an image as input query

The code and instructions to run the application can be found at https://github.com/jina-ai/examples/tree/master/cross-modal-search


First, we need to understand the concept of modality. Given our example, one may think that different modalities correspond to different kinds of data (images and text in this case). However, this is not accurate. For example, one can do cross-modal search by searching images taken from different points of view, or by searching for titles that match a given paragraph of text.

Therefore, one can consider that a modality is related to a given data distribution from which input may come. For this reason, and to have first-class support for cross-modal and multi-modal search, Jina offers modality as an attribute of its Document protobuf definition.

Now that we agree on the concept of modality, we can describe cross-modal search as a set of retrieval applications that try to effectively find relevant documents of modality A by querying with documents from modality B.

Semantic Search

Compared to keyword-based search, the main requirement for content-based search is the ability to extract a meaningful semantic representation of the documents at both index and query time. This implies projecting documents into a high-dimensional vector embedding space, where distances (or similarities) between vectors are used as the measure of relevance between queries and indexed documents.
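As a minimal sketch of this idea, the snippet below ranks a handful of indexed vectors against a query vector by cosine similarity. The vectors are made up for illustration, and the function name cosine_rank is hypothetical; it is not part of Jina.

```python
import numpy as np

def cosine_rank(query_vec, index_vecs, top_k=3):
    """Return the ids and scores of the top_k indexed vectors closest to the query."""
    # Normalize so that the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = m @ q
    # Sort by descending similarity and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy "index" of three document embeddings (illustrative values only).
index_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0]])
query_vec = np.array([0.9, 0.1, 0.0])

top_ids, top_scores = cosine_rank(query_vec, index_vecs, top_k=2)
print(top_ids, top_scores)
```

In a real application the embeddings would come from an encoder model and the index would hold many more vectors, but the relevance computation follows the same pattern.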

With current advances in the performance of Deep Learning methods, even general-purpose models (e.g. CNN models trained on ImageNet) can be used to extract meaningful feature vectors. For example, Jina uses simple feature vectors from MobileNet, pretrained for classification on ImageNet, to build a working Pokemon search application.

However, models trained using Deep Metric Learning are especially suited for retrieval. In contrast to common classification architectures (usually trained using Cross-Entropy Loss), these deep metric models tend to optimize a contrastive loss, which tries to place similar objects close to each other and unrelated objects further apart.

In contrastive loss, the intention is to minimize the distance for positive pairs (y = 1) and to maximize the distance, up to some margin m, for negative pairs (y = 0).
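A minimal sketch of this loss for a single pair is shown below. It assumes the commonly used squared-distance formulation with Euclidean distance; the exact variant and the margin value are assumptions, not something taken from the example code.

```python
import numpy as np

def contrastive_loss(a, b, y, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    y = 1 marks a positive pair, y = 0 a negative pair.
    """
    d = np.linalg.norm(a - b)
    # Positive pairs are penalized by their distance.
    positive_term = y * d ** 2
    # Negative pairs are penalized only while they are closer than the margin.
    negative_term = (1 - y) * max(0.0, margin - d) ** 2
    return positive_term + negative_term

a = np.array([0.0, 0.0])
near = np.array([0.3, 0.4])   # distance 0.5 from a
far = np.array([3.0, 4.0])    # distance 5.0 from a

print(contrastive_loss(a, near, y=1))  # positive pair, small penalty
print(contrastive_loss(a, near, y=0))  # negative pair inside the margin, penalized
print(contrastive_loss(a, far, y=0))   # negative pair beyond the margin, no penalty
```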

Siamese and Triplet Networks

Two very common architectures for Deep Metric Learning are Siamese and Triplet Networks. They both share the idea that different sub-networks (which may or may not share weights) receive different inputs at the same time (positive and negative pairs for Siamese Networks; positive, negative and anchor documents for Triplets), and try to project their own feature vectors onto a common latent space where the contrastive loss is computed and its error propagated to all the sub-networks.

Positive pairs are pairs of objects (images, text, any document) that are semantically related and expected to remain close in the projection space. On the other hand, negative pairs are pairs of documents that should be apart.
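For the triplet case, a common formulation can be sketched as follows. The embeddings and the margin value are illustrative assumptions; in a real network the three vectors would be the outputs of the sub-networks.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # The loss is zero once the negative is further away than the positive by at least the margin.
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
easy_negative = np.array([1.0, 0.0])
hard_negative = np.array([0.2, 0.0])

print(triplet_loss(anchor, positive, easy_negative))  # already well separated
print(triplet_loss(anchor, positive, hard_negative))  # still violates the margin
```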

In the example, the sub-network used to extract image features is a VGG19 architecture with weights pre-trained on ImageNet, while for the text embedding, the output of a hidden layer of a Gated Recurrent Unit (GRU) is used.

Hard Negatives

Besides all the common tricks and techniques to improve the learning of neural networks, a key aspect of performance in Deep Metric Learning is the choice of positive and negative pairs. It is important for the model to see negative pairs that are not easy to separate, which is achieved using Hard Negative Mining. This can impact some evaluation metrics, especially Recall@k with small values of k. Without emphasis on hard negative pairs, the model will be able to extract meaningful neighborhoods but will find it hard to identify the true nearest neighbors, and will therefore underperform when evaluated at very low values of k.
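To make the metric concrete, here is a small sketch of how Recall@k could be computed from a query-by-candidate similarity matrix. It assumes that the correct match for query i is candidate i, and the similarity values are made up; the function name recall_at_k is hypothetical.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose true match appears among the top-k candidates.

    similarity[i, j] is the score between query i and candidate j,
    and candidate i is assumed to be the true match of query i.
    """
    n = similarity.shape[0]
    # Ids of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()

# Query 1 has a hard negative (candidate 0) that scores above its true match.
similarity = np.array([[0.9, 0.8, 0.1],
                       [0.7, 0.6, 0.2],
                       [0.1, 0.2, 0.3]])

print(recall_at_k(similarity, k=1))
print(recall_at_k(similarity, k=2))
```

In this toy matrix the true match of query 1 is in the right neighborhood but not ranked first, which is the situation that hurts the metric at k = 1 while leaving it intact at k = 2.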

The paper on which the example is based (VSE++: Improving Visual-Semantic Embeddings with Hard Negatives) proposes an advanced hard negative mining strategy that increases the probability of sampling hard negatives at training time, thus obtaining a significant boost in Recall@k for both image-to-caption and caption-to-image retrieval.
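As I understand the paper, the core change is to replace the usual sum over all negatives in a mini-batch with only the hardest negative for each query. The sketch below contrasts the two on a toy image-by-caption score matrix. It is a NumPy illustration under that reading, not the training code used in the example, and the margin and scores are made-up values.

```python
import numpy as np

def hinge_costs(scores, margin=0.2):
    """scores[i, j] = similarity of image i and caption j; true pairs lie on the diagonal."""
    pos = np.diag(scores)
    # Violations when retrieving captions for each image (compare along rows).
    cost_captions = np.clip(margin + scores - pos[:, None], 0, None)
    # Violations when retrieving images for each caption (compare along columns).
    cost_images = np.clip(margin + scores - pos[None, :], 0, None)
    # The true pairs themselves are not negatives.
    np.fill_diagonal(cost_captions, 0)
    np.fill_diagonal(cost_images, 0)
    return cost_captions, cost_images

def sum_of_hinges(scores, margin=0.2):
    """Every negative in the batch contributes to the loss."""
    cost_captions, cost_images = hinge_costs(scores, margin)
    return cost_captions.sum() + cost_images.sum()

def max_of_hinges(scores, margin=0.2):
    """Only the hardest negative per image and per caption contributes."""
    cost_captions, cost_images = hinge_costs(scores, margin)
    return cost_captions.max(axis=1).sum() + cost_images.max(axis=0).sum()

scores = np.array([[0.9, 0.8, 0.85],
                   [0.6, 0.7, 0.2],
                   [0.1, 0.2, 0.6]])

print(sum_of_hinges(scores), max_of_hinges(scores))
```

With the hardest-negative variant, the gradient concentrates on the negatives that are most easily confused with the true match instead of being diluted across many easy ones.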

The Search Flow in Jina

Jina is a useful choice for this implementation. It is a framework for developing semantic search applications with first-class support for cross-modality. Plus, it makes it easy to plug in your own models and to distribute them with the use of Docker containers, which leads to a very smooth development experience and reduces the boilerplate of complex dependency management.

Jina allows complex AI-powered search pipelines to be described with simple YAML Flow descriptions. In this example, two Flows are created: one for indexing images and captions, and another for querying.

At index time, images are pre-processed and normalized before being embedded in a vector space. In parallel, images are indexed without any crafting into a Key-Value database so that the user can retrieve and render them.
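The kind of normalization typically applied before feeding images to an ImageNet-pretrained network can be sketched as follows. This assumes the standard ImageNet channel statistics and a channels-first output; the actual crafter used in the example may differ, and resizing and cropping are omitted.

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_image(pixels):
    """Convert an H x W x 3 uint8 image into a normalized 3 x H x W float array."""
    scaled = pixels.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    # Move channels first, as most pretrained CNN implementations expect.
    return normalized.transpose(2, 0, 1)

# A dummy all-black 4 x 4 image stands in for a real picture.
dummy = np.zeros((4, 4, 3), dtype=np.uint8)
out = normalize_image(dummy)
print(out.shape)
```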

On the other branch of the Flow, text does not require any preprocessing before encoding (vocabulary lookup and word embedding are done during encoding), so the text indexer takes care of both vector and key-value indexing.

Query time is where the “cross” in cross-modality shines. The key aspect of the Flow design is that the branch responsible for obtaining semantic embeddings for images is connected to the text embedding index, and vice versa. This way, images are retrieved by providing text as input, and captions are retrieved by providing images as input.
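The wiring can be illustrated with a toy sketch in plain Python. The two dictionaries stand in for the image and text indexes, the vectors are made up, and the names TARGET_INDEX and cross_modal_search are hypothetical; none of this is Jina API.

```python
import numpy as np

# Toy indexes living in the same 2-D embedding space (illustrative values).
image_index = {"dog.jpg": np.array([1.0, 0.1]), "car.jpg": np.array([0.1, 1.0])}
text_index = {"a dog running": np.array([0.9, 0.2]), "a red car": np.array([0.2, 0.9])}

# The "cross" wiring: a query of one modality searches the index of the other.
TARGET_INDEX = {"text": image_index, "image": text_index}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_modal_search(query_embedding, query_modality, top_k=1):
    """Return the keys of the top_k best matches from the opposite modality."""
    index = TARGET_INDEX[query_modality]
    ranked = sorted(index, key=lambda key: cosine(query_embedding, index[key]), reverse=True)
    return ranked[:top_k]

# An embedded caption retrieves images; an embedded image retrieves captions.
print(cross_modal_search(np.array([0.95, 0.05]), "text"))
print(cross_modal_search(np.array([0.0, 1.0]), "image"))
```

This only works because both encoders project into a common space, which is exactly what the jointly trained visual-semantic embedding provides.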

In both cases, there are two branches of the Flow: One will process images, and the other text. This is controlled by a filter applied at the beginning of each branch to select which inputs can be processed.

Filter modality in Flow:

- !FilterQL
     lookups: {'modality': 'image'}
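
Conceptually, the filter keeps only the documents whose modality matches the branch. The plain-Python stand-in below shows the effect; documents are modelled as dictionaries purely for illustration, and filter_by_modality is a hypothetical helper rather than part of Jina.

```python
def filter_by_modality(docs, modality):
    """Keep only the documents tagged with the requested modality."""
    return [doc for doc in docs if doc.get("modality") == modality]

# A mixed batch, as it would arrive at the start of both branches.
docs = [
    {"id": 1, "modality": "image", "content": "dog.jpg"},
    {"id": 2, "modality": "text", "content": "a dog running"},
    {"id": 3, "modality": "image", "content": "car.jpg"},
]

image_branch = filter_by_modality(docs, "image")
text_branch = filter_by_modality(docs, "text")
print([doc["id"] for doc in image_branch], [doc["id"] for doc in text_branch])
```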

Plugging the Visual Semantic Embedding Models into Jina

As stated, Jina makes it easy to plug in different models, which makes it a suitable tool for turning this research into a real-world search application.

To use the model from the paper, two different encoder executors (called VSEImageEmbedding and VSETextEmbedding) were developed. Each of them uses just one branch of the original common embedding network.

Since they rely on pickled weights and models, the main challenge is obtaining the correct weight and vocabulary files so that the models load properly. All this boilerplate is hidden from the user by the Docker images, which make these models easy to deploy.


The example has been run with the Flickr8k dataset with good results, although the models were trained on Flickr30k. This shows the ability of the model to generalize to unseen data and to work on general-purpose datasets. These models can be easily retrained and fine-tuned for specific use cases.

The results are shown using jinabox, which lets you interact with Jina directly from the browser and input multiple kinds of data.


By Joan Fontanals Martínez on October 2, 2020.
