Building In-Video Visual Content Search with Jina

Nan Wang


Videos are one of the most popular ways to consume data today. Whether it's live streams of our favourite music artists or recorded tutorial videos, we watch them all.

Perhaps you came across a music video of your favourite artist while browsing YouTube, but you can't recall the video's title or the name of the song. All you have is an image in mind, say the artist holding a guitar and singing.

Now you can simply describe that image to YouTube's search engine, and it will come up with the video. Today's search systems are intelligent enough to retrieve data from just a hint of information. Even if you enter "XYZ holding a guitar", YouTube's search, powered by state-of-the-art deep learning models, will be able to return some results.

Now imagine being able to create such robust search systems for your internal applications, and in a matter of hours! But how would you do that?

That’s where Jina Hub comes in! It lets you use the best open-source models with a single line of code. Combined with the core Jina framework, that allows you to create magic!

In this blog post, we will see how to create a search system that can search in-video content without relying on supplementary text such as subtitles.

Fig 1. In-video Visual Search Interface

Flow Approach

We don’t have any textual information about the videos in this example, so we cannot match the query directly against text such as subtitles. We need another way of matching text to videos. A video is a sequence of frames, that is, images arranged in order, and we can build our use case on that fact: we find the frames that are most similar to the query text and return the videos containing those frames as the output.

Normally, a video comprises three components: visuals, audio, and text. In this example, we work only with the visuals, that is, the frames. If a video consists only of text or of a single static image, this approach is of little use, since it relies on the content changing from frame to frame. Another limitation of this tutorial is that a query cannot span multiple frames. For example, if you enter “XYZ holding a guitar and then signing an autograph for a boy in a white t-shirt”, the query cannot be captured in a single frame and is therefore beyond the scope of this tutorial.

We want a deep learning model that can encode both query text and video frames in the same semantic space for our use case. Therefore, we will use pre-trained cross-modal models from Jina Hub!

Executors from Jina Hub

To encode video frames and query text into the same space, we will use the pre-trained CLIP Model from OpenAI. We will use the image and text encoding part of the CLIP model to calculate the embeddings for this example application.

What is CLIP?

CLIP stands for Contrastive Language-Image Pre-Training. It is trained to learn visual concepts from natural language, using pairs of images and text snippets collected from the internet. It can perform zero-shot learning by encoding text labels and images into the same semantic space, which gives both modalities a shared embedding.


CLIP works very well for our use case of searching for video content. Let’s say we enter the search text "this is a guitar". The CLIP text model will encode it into a vector. Similarly, the CLIP image model can encode an image of a guitar and an image of a violin into the same vector space. Encoding both the text and the images into the same space allows us to calculate the distance between the text vector and each image vector and so return relevant results. In this example, the distance between the text "this is a guitar" and the image of the guitar will be smaller than the distance between the same text and the image of the violin.
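To make the distance comparison concrete, here is a minimal sketch in plain Python. It assumes cosine distance as the metric (the post does not name one), and the three-dimensional vectors are made up for illustration; they are not real CLIP outputs, which have 512 dimensions in this application.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example only.
text_guitar = [0.9, 0.1, 0.2]   # "this is a guitar"
image_guitar = [0.8, 0.2, 0.1]  # picture of a guitar
image_violin = [0.2, 0.9, 0.3]  # picture of a violin

d_guitar = cosine_distance(text_guitar, image_guitar)
d_violin = cosine_distance(text_guitar, image_violin)

# The guitar image sits closer to the text than the violin image does.
print(d_guitar < d_violin)
```

With real encoders, the two image vectors would come from the CLIP image model and the text vector from the CLIP text model; the comparison itself stays the same.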

For this example, we will use SimpleIndexer as our indexer, as it allows us to store both the vectors and the metadata in one shot. For searching through the indexed data, we will use the built-in match function of DocumentArrayMemmap.

Fig 2. Working of CLIP Model

Building the Flow

Let’s go through each of the steps in the Flow in sequential order to understand what’s happening behind the scenes:

  • frame_extractor: It extracts the frames from the videos, giving us images that can be encoded into the same space as the query text.
  • image_encoder: It uses the CLIP image encoder to encode the extracted frames into the common vector space.
  • text_encoder: It uses the CLIP text encoder to encode the query text into the same vector space where the frames are encoded.
  • indexer: It uses SimpleIndexer to store the encoded frames and, at query time, to retrieve the frames closest to the encoded query text.
  • ranker: It ranks the query results based on the degree of similarity in the vector space.

Fig 3. Application Flow

We have seen how the different Flow components work together to process the query text and generate the response. Now, let’s understand the two types of requests in detail - index and query.


For requests to the /index endpoint, the indexing flow uses three Executors, VideoLoader, CLIPImageEncoder and SimpleIndexer, to pre-process and index the data. The data moves through them sequentially, in the following steps:

  • The input to Flow is Documents with video URIs stored in the uri attribute. These can be files on the cloud or your local machine. After receiving the raw input, control goes to the VideoLoader that extracts the frames from the videos and stores them as image arrays in the blob attribute of the chunks.
  • The processed frames are passed on to the CLIPImageEncoder, which calculates a 512-dimensional embedding vector for each chunk using the CLIP model for images.
  • Finally, control is passed to SimpleIndexer, which stores and indexes all the Documents in a memory map.
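The indexing steps above can be sketched without any framework code. The snippet below is only an illustration of the data flow: `VideoDoc`, `Chunk`, `load_frames`, `toy_encoder` and the file path are hypothetical stand-ins invented for this example, not the actual Jina classes or Executors, and the "frames" are tiny lists instead of decoded images.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Chunk:
    parent_uri: str                 # video the frame came from
    frame_index: int                # position of the frame in the video
    blob: List[float]               # stand-in for the frame's pixel array
    embedding: List[float] = field(default_factory=list)


@dataclass
class VideoDoc:
    uri: str
    chunks: List[Chunk] = field(default_factory=list)


def load_frames(doc: VideoDoc, frames: List[List[float]]) -> VideoDoc:
    """Stand-in for VideoLoader: attach each frame to the Document as a chunk."""
    for i, frame in enumerate(frames):
        doc.chunks.append(Chunk(parent_uri=doc.uri, frame_index=i, blob=frame))
    return doc


def toy_encoder(blob: List[float]) -> List[float]:
    """Hypothetical encoder: L2-normalises the input (real CLIP is a neural network)."""
    norm = math.sqrt(sum(x * x for x in blob))
    return [x / norm for x in blob]


def encode_chunks(doc: VideoDoc, encoder: Callable[[List[float]], List[float]]) -> VideoDoc:
    """Stand-in for CLIPImageEncoder: compute one embedding per chunk."""
    for chunk in doc.chunks:
        chunk.embedding = encoder(chunk.blob)
    return doc


def index_doc(doc: VideoDoc, storage: List[Chunk]) -> None:
    """Stand-in for SimpleIndexer: keep every encoded chunk for later search."""
    storage.extend(doc.chunks)


index: List[Chunk] = []
doc = VideoDoc(uri="videos/concert.mp4")        # hypothetical local file
load_frames(doc, [[3.0, 4.0], [1.0, 0.0]])      # two fake "frames"
encode_chunks(doc, toy_encoder)
index_doc(doc, index)
```

Each indexed chunk keeps a reference to its parent video, which is what lets a frame-level match be traced back to a video at query time.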


For requests to the /search endpoint, also known as the query endpoint, the query flow uses three Executors, CLIPTextEncoder, SimpleIndexer, and SimpleRanker, to pre-process the query text and match it against the indexed data. The data moves through them sequentially, in the following steps:

  • The user input is stored in the text attribute of the Document. Control then goes to CLIPTextEncoder, which converts the query text into an embedding vector.
  • Once the query embedding is available, the SimpleIndexer compares it with the indexed data to retrieve the top-K nearest neighbours.
  • In the end, control goes to the SimpleRanker, which ranks the results and shows the most relevant ones.
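The retrieval and ranking steps can be sketched in the same spirit. This is an assumption-laden illustration, not the SimpleIndexer or SimpleRanker implementation: it assumes cosine similarity as the score and assumes that a video is ranked by its best-matching frame. The function names, file names and two-dimensional vectors are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_frames(query_vec, indexed_frames, k):
    """Return the k (score, video_uri) pairs whose frames are closest to the query."""
    scored = [(cosine_similarity(query_vec, emb), uri) for uri, emb in indexed_frames]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rank_videos(frame_matches):
    """Score each video by its best-matching frame (an assumed aggregation rule)."""
    best = {}
    for score, uri in frame_matches:
        best[uri] = max(score, best.get(uri, -1.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index: (video uri, frame embedding) pairs.
indexed_frames = [
    ("a.mp4", [1.0, 0.0]),
    ("a.mp4", [0.8, 0.6]),
    ("b.mp4", [0.0, 1.0]),
    ("c.mp4", [0.6, 0.8]),
]
query = [1.0, 0.0]  # stand-in for the encoded query text

matches = top_k_frames(query, indexed_frames, k=3)
ranking = rank_videos(matches)
print([uri for uri, _ in ranking])
```

Taking the maximum frame score per video is only one possible design choice; summing or averaging the scores of a video's matched frames would favour videos in which the queried scene appears for longer.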

Tip: You can find more information about CLIPTextEncoder, CLIPImageEncoder and SimpleIndexer on Jina Hub.


In this blog post, we learned how to create an intelligent in-video visual content search system by combining state-of-the-art open-source models with Jina’s framework. This use case can be extended to incorporate audio data alongside video frames to improve the quality of the search results. For example, we could use AudioCLIP, an extension of CLIP that adds audio, to generate embeddings for audio in the same semantic space as images and text.

You can find the application code in the following GitHub Repository

In future posts, we will cover more about building SOTA search applications by leveraging Jina Hub. Stay tuned and happy searching!

© Jina AI 2020-2022. All rights reserved.