Building In-Video Visual Content Search with Jina
Videos are one of the most popular ways to consume content today. Whether it's live streams of our favourite music artists or recorded tutorial videos, we watch them all.
Perhaps you came across a music video by your favourite artist while browsing YouTube. Later, however, you can't recall the video's title or the song's name. All you have in mind is an image, say, of the artist holding a guitar and singing.
Now you can simply describe that image to YouTube's search engine, and it will magically come up with the right video. Today's search systems are intelligent enough to retrieve data from just a hint of information. Even if you enter "XYZ holding a guitar", YouTube's search, powered by state-of-the-art deep learning models, will come up with relevant results.
Now imagine having the capability to create such robust search systems for your internal applications, and in a matter of hours! But how would you do that?
That’s where Jina Hub comes in! It lets you use the best open-source models with a single line of code, and combined with the core Jina framework, it lets you create magic!
In this blog post, we will see how to create a search system capable of searching in-video content without any supplementary text.
Fig 1. In-video Visual Search Interface
We don’t have any textual information about the video in this example, so we cannot match the query directly against text such as subtitles. We need another way of matching text to videos. We know that a video is simply a sequence of frames (images) arranged in order. Using this fact, we can find the frames most similar to the query text and return the videos containing those frames as output.
Normally, every video comprises three components - audio, visuals, and text. In this example, we will work only with the images. If a video consists of only text or a single static frame, our use case would not work, since we are leveraging the sequence of distinct frames. Another limitation of this tutorial is queries that span multiple frames: if you enter "XYZ holding a guitar and then signing an autograph for a boy in a white t-shirt", the query cannot be captured in a single frame and is therefore beyond the scope of this tutorial.
We want a deep learning model that can encode both query text and video frames in the same semantic space for our use case. Therefore, we will use pre-trained cross-modal models from Jina Hub!
Executors from Jina Hub
To encode video frames and query text into the same space, we will use the pre-trained CLIP Model from OpenAI. We will use the image and text encoding part of the CLIP model to calculate the embeddings for this example application.
What is CLIP?
CLIP stands for Contrastive Language-Image Pre-Training. It is trained to learn visual concepts from natural language, using pairs of text snippets and images from the internet. It can perform zero-shot learning by encoding text labels and images in the same semantic space, creating a shared embedding space for both modalities.
CLIP works very well for our use case of searching video content. Let’s say we enter the search text "this is a guitar"; the CLIP text model will encode it into a vector. Similarly, the CLIP image model can encode an image of a guitar and an image of a violin into the same vector space. Encoding both text and images into the same space allows us to calculate the distance between the text vector and each image vector to provide relevant results. In this example, the distance between the text "this is a guitar" and the image of the guitar will be smaller than the distance between the same text and the image of the violin.
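This distance comparison can be sketched with plain NumPy. The vectors below are toy stand-ins, not real CLIP outputs (real CLIP embeddings are 512-dimensional), but the ranking logic is the same:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (higher = closer)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for CLIP embeddings in a shared space.
text_vec = np.array([0.9, 0.1, 0.0])        # "this is a guitar"
guitar_img_vec = np.array([0.8, 0.2, 0.1])  # image of a guitar
violin_img_vec = np.array([0.1, 0.9, 0.2])  # image of a violin

# The guitar image scores higher against the query than the violin image.
print(cosine_similarity(text_vec, guitar_img_vec) >
      cosine_similarity(text_vec, violin_img_vec))  # True
```

Because both modalities live in one space, ranking results is just a nearest-neighbour comparison like this one.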
For this example, we will use SimpleIndexer as our indexer, as it allows us to store both vectors and metadata in one shot. For searching through the indexed data, we will use its built-in match function.
Fig 2. Working of CLIP Model
Building the Flow
Let’s go through each of the steps in the Flow in sequential order to understand what’s happening behind the scenes:
frame_extractor: It extracts frames from the videos, allowing us to work with image data that can be encoded in the same space as the query text.
image_encoder: It uses the CLIP image encoder to encode the extracted frames into the common vector space.
text_encoder: It uses the CLIP text encoder to encode the query text into the same vector space where the frames are encoded.
indexer: It uses SimpleIndexer to index the encoded text and image data for querying.
ranker: It ranks the query results based on the degree of similarity in the vector space.
Fig 3. Application Flow
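The components above map directly onto a declarative Jina Flow. As a sketch under stated assumptions (the jinahub:// executor names are illustrative and may differ from the exact versions used in the repository), the index-side Flow could be declared in YAML like this:

```yaml
jtype: Flow
executors:
  - name: frame_extractor
    uses: jinahub://VideoLoader       # extracts frames into chunks
  - name: image_encoder
    uses: jinahub://CLIPImageEncoder  # 512-d CLIP embedding per frame
  - name: indexer
    uses: jinahub://SimpleIndexer     # stores embeddings plus metadata
```

The query-side Flow mirrors this, swapping the frame extractor for a CLIPTextEncoder and appending a SimpleRanker after the indexer.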
We have seen how the different Flow components work together to process the query text and generate the response. Now, let’s understand the two types of requests in detail -
For requests to the /index endpoint, the indexing Flow uses three different Executors - VideoLoader, CLIPImageEncoder, and SimpleIndexer - to pre-process and index the data. The data flows sequentially through the steps below:
- The input to the Flow is Documents with video URIs stored in the uri attribute. These can be files on the cloud or on your local machine. After receiving the raw input, control goes to the VideoLoader, which extracts the frames from the videos and stores them as image arrays in the blob attribute of the chunks.
- The processed frames are passed on to the CLIPImageEncoder, which calculates a 512-dimensional embedding vector for each chunk using the CLIP image model.
- Finally, control passes to the SimpleIndexer, which stores and indexes all the Documents within a memory map.
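The three indexing steps can be sketched in plain Python with stand-in components. Everything here is a hypothetical stub: extract_frames mimics the VideoLoader's chunking, encode_frame is a random projection standing in for the CLIP image encoder, and a plain list stands in for SimpleIndexer's memory map:

```python
import numpy as np

EMBED_DIM = 512  # CLIP image embeddings are 512-dimensional

rng = np.random.default_rng(seed=0)
projection = rng.standard_normal((3 * 8 * 8, EMBED_DIM))  # stub "model" weights

def extract_frames(video: np.ndarray, every_n: int = 10) -> np.ndarray:
    """VideoLoader stand-in: sample every n-th frame from a (T, H, W, C) array."""
    return video[::every_n]

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """CLIPImageEncoder stand-in: flatten and project a frame to a 512-d vector."""
    return frame.reshape(-1) @ projection

index = []  # SimpleIndexer stand-in: embeddings plus metadata, kept in memory

def index_video(video_id: str, video: np.ndarray) -> None:
    """Index one video: chunk into frames, embed each frame, store with metadata."""
    for i, frame in enumerate(extract_frames(video)):
        index.append({"video": video_id, "frame": i, "embedding": encode_frame(frame)})

# A tiny fake "video": 30 frames of 8x8 RGB noise.
index_video("demo.mp4", rng.random((30, 8, 8, 3)))
print(len(index))  # 3 sampled frames indexed
```

In the real Flow, each bullet above is a separate Executor and the Documents carry the frames as chunks; this sketch only shows the order in which the data is transformed.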
For requests to the /search endpoint, also known as the query endpoint, the query Flow uses three different Executors - CLIPTextEncoder, SimpleIndexer, and SimpleRanker - to pre-process the query text and match it against the indexed data. The data flows sequentially through the steps below:
- The user input gets stored in the text attribute of the Document. After that, control goes to the CLIPTextEncoder, which converts the query text into a vector embedding.
- After getting the embedding for the search query, the SimpleIndexer compares the query embedding with the indexed data to retrieve the top-K nearest neighbours.
- In the end, control goes to the SimpleRanker, which ranks the results and returns the most relevant ones.
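The query-side steps can be sketched the same way, assuming an in-memory index of frame embeddings with metadata. The index contents and query vector below are synthetic stand-ins (the perturbed row plays the role of a CLIPTextEncoder output), and top_k combines what SimpleIndexer and SimpleRanker do, retrieval plus similarity-ordered results:

```python
import numpy as np

# Stand-in index: frame embeddings plus (video, frame) metadata.
rng = np.random.default_rng(seed=42)
index_embeddings = rng.standard_normal((6, 512))
index_meta = [("video_a.mp4", 0), ("video_a.mp4", 1), ("video_a.mp4", 2),
              ("video_b.mp4", 0), ("video_b.mp4", 1), ("video_b.mp4", 2)]

def top_k(query: np.ndarray, k: int = 3):
    """Retrieve the k indexed frames nearest to the query by cosine similarity."""
    sims = index_embeddings @ query / (
        np.linalg.norm(index_embeddings, axis=1) * np.linalg.norm(query))
    best = np.argsort(-sims)[:k]  # highest similarity first, as a ranker would order them
    return [(index_meta[i][0], index_meta[i][1], float(sims[i])) for i in best]

# Pretend this is the text embedding for the user's query; we nudge it close to
# one indexed frame so that frame should come back as the top match.
query_vec = index_embeddings[4] + 0.01 * rng.standard_normal(512)
results = top_k(query_vec)
print(results[0][0], results[0][1])  # video_b.mp4 1
```

Returning the video that owns the best-matching frame is exactly how the frame-level search answers a video-level query.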
In this blog, we learned how to create an intelligent in-video visual content search system by leveraging state-of-the-art open-source models with Jina’s framework. This use case can be extended further to incorporate audio alongside the video frames to improve the quality of search results. For instance, the AudioCLIP model, an extension of CLIP, can generate embeddings for audio in the same semantic space as images and text.
You can find the application code in the following GitHub repository.
In future posts, we will cover more about building state-of-the-art search applications with Jina Hub. Stay tuned, and happy searching!