Advancing Open Domain Question Answering with RocketQA

Nan Wang, Shubham Saboo

Overview

We discussed building an Open Domain Question Answering (ODQA) system with Jina in our previous post. The two-stage pipeline consisting of a retriever and reader is widely used in practice.

Because the reader is relatively mature, most recent research focuses on the retriever, which is still developing. RocketQA is one of the successful attempts in this direction. Until July 2021, it was the top algorithm on the MS MARCO leaderboard. RocketQA recently released its code and models, and we are proud to partner with RocketQA so that you can use its pre-trained models directly from Jina Hub.

In this post, we'll introduce the idea behind RocketQA and show you how to use it to build an ODQA system with Jina.

Where does DPR Fail?

DPR (Dense Passage Retrieval), one of the dense-vector methods, was the first attempt to show that dense vector retrieval can outperform term-based methods with a simple training procedure.

Negative sampling is one of the most important techniques used in DPR. DPR uses the gold passages (correct question-answer pairs) in a mini-batch as positive samples and BM25 to generate the negative samples. More specifically, given a positive sample, DPR uses BM25 to retrieve the best-matching passages that do not contain the right answer and treats them as negative samples. Trained on both positive and negative samples, the dual encoder learns a vector space in which relevant questions and answers are closer together than irrelevant pairs.
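
To make the training objective concrete, here is a minimal sketch of the kind of contrastive loss such a dual encoder minimizes: the negative log-likelihood of the positive passage among all candidates, with the dot product as the similarity. The vectors are toy values and the function names are ours for illustration; they are not taken from the DPR code.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(question, positive, negatives):
    """Negative log-likelihood of the positive passage among all candidates."""
    scores = [dot(question, positive)] + [dot(question, n) for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


# Toy 2-d embeddings (illustrative values only).
question = [1.0, 0.0]
aligned_loss = contrastive_loss(question, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = contrastive_loss(question, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < misaligned_loss)
```

The loss is small when the positive passage is closer to the question than every negative, which is what pushes relevant pairs together in the learned vector space.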

This method works well in general. On closer inspection, however, some of the negative samples turn out to be false negatives because the training data is noisy. For example, "DNA is made up of molecules called nucleotides" is a correct answer to the question "What is DNA made of?", but it is treated as a negative sample.

Another issue with negative sampling in DPR is that the negative samples are drawn from the same batch as the positive samples. This setting is very different from actual use, where both matching and non-matching passages are retrieved from the whole corpus rather than from a small set of selected passages. As a result, the model is trained on relatively easy negative samples but has to distinguish hard negative samples at inference time.

How does RocketQA work?

RocketQA introduced a cross-attention encoder to rerank the retrieved results and a four-step pipeline to improve the training procedure. Let's understand both of them in detail:

Cross Encoder

Besides the dual encoder, which encodes questions and passages independently, RocketQA uses another transformer-based model to learn the cross-correlation between questions and passages. This model, called a cross encoder, is more precise because it applies cross attention over the question and the passage together. However, it requires more computation and can therefore be applied only to a limited number of candidates.

Four-step Training Procedure

RocketQA uses a four-step procedure to train the dual encoder and the cross encoder using an end-to-end pipeline.

  1. RocketQA uses cross-batch sampling to generate hard negative samples. This addresses the limitation of DPR's in-batch sampling, so the model is exposed to negative samples that are hard to distinguish.
  2. RocketQA trains the cross encoder. Instead of using BM25 to generate negative samples as DPR does, RocketQA uses the dual encoder trained in step 1 to filter the false negatives out of the hard negative samples. A further benefit is that the cross encoder is fine-tuned on the data distribution learned by the dual encoder.
  3. RocketQA retrains the dual encoder. To filter out the false negatives, RocketQA uses both the dual encoder from step 1 and the cross encoder from step 2 to further remove noise from the data.
  4. RocketQA uses the cross encoder from step 2 together with the dual encoder from step 3 to filter the noise in the data further. Because both the cross encoder and the dual encoder have now been trained, RocketQA also uses them to generate training data from an unlabeled dataset. This augmented data is combined with the labeled data to fine-tune the dual encoder.
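
Two of the ideas in these steps can be sketched in a few lines of plain Python: collecting negatives across all devices instead of a single batch, and dropping retrieved candidates that a scoring model considers likely positives. This is only an illustration under stated assumptions: the function names, the toy scores, and the 0.5 threshold are hypothetical and do not come from the RocketQA code.

```python
def cross_batch_negatives(device_batches, device_idx, question_idx):
    """Collect negatives for one question from the passages held on every device.

    device_batches[d][i] is the gold passage paired with question i on device d.
    """
    return [
        passage
        for d, batch in enumerate(device_batches)
        for i, passage in enumerate(batch)
        if (d, i) != (device_idx, question_idx)
    ]


def denoise_hard_negatives(question, candidates, gold, cross_score, threshold=0.5):
    """Keep candidates that are neither gold nor scored as likely positives.

    cross_score stands in for a trained cross encoder; threshold is an assumed value.
    """
    return [
        p for p in candidates
        if p not in gold and cross_score(question, p) < threshold
    ]


# Two devices with two question-passage pairs each (toy identifiers).
device_batches = [["p0", "p1"], ["p2", "p3"]]
negatives = cross_batch_negatives(device_batches, device_idx=0, question_idx=0)
print(negatives)

# Toy confidence scores: "b" is unlabeled but looks like a correct answer.
toy_scores = {"a": 0.99, "b": 0.95, "c": 0.10}
kept = denoise_hard_negatives(
    "q", ["a", "b", "c"], gold={"a"},
    cross_score=lambda question, passage: toy_scores[passage],
)
print(kept)
```

With in-batch sampling, the first question would see only one negative from its own device; sharing passages across the two devices gives it three. The denoising step removes the unlabeled but probably correct passage instead of training on it as a negative.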

After training, the dual encoder and the cross encoder are used together to retrieve the passages. The cross encoder returns a confidence score for each question-passage pair.
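
The retrieve-then-rerank pattern can be sketched as follows. The passage vectors and the word-overlap scorer are toy stand-ins for the trained dual encoder and cross encoder, and all names here are hypothetical; the sketch only shows how the two stages fit together.

```python
def retrieve(question_vec, passage_vecs, top_k):
    """Stage 1: rank passages by dot product with the question vector."""
    ranked = sorted(
        passage_vecs.items(),
        key=lambda item: sum(a * b for a, b in zip(question_vec, item[1])),
        reverse=True,
    )
    return [pid for pid, _ in ranked[:top_k]]


def rerank(question, candidate_ids, passages, cross_score):
    """Stage 2: rescore the shortlisted passages and sort by confidence."""
    scored = [(pid, cross_score(question, passages[pid])) for pid in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_cross_score(question, passage):
    """Stand-in for a cross encoder: fraction of question words found in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)


passages = {
    "p1": "dna is made of nucleotides",
    "p2": "dna stores genetic instructions",
    "p3": "rna differs from dna",
}
# Toy 2-d embeddings standing in for dual encoder outputs.
passage_vecs = {"p1": [0.9, 0.1], "p2": [1.0, 0.0], "p3": [0.1, 0.9]}

shortlist = retrieve([1.0, 0.0], passage_vecs, top_k=2)
reranked = rerank("what is dna made of", shortlist, passages, toy_cross_score)
print(shortlist)
print(reranked)
```

The cheap first stage narrows the corpus to a shortlist, so the more expensive pairwise scorer only has to look at a handful of candidates, and it can reorder them when the vector similarity alone picked a less relevant passage first.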

Comparison between Conventional DPR and RocketQA

The following example illustrates the advantage of RocketQA over the conventional DPR model:

Question: What is DNA made of?

DPR:

DNA is what makes us; from the colour of our eyes to the colour of our skin. Without DNA, it would be impossible for us to exist. DNA is made of the nitrogen basses cytosine, guanine, adenine, and thymine, phosphates, the sugar deoxyribose, and is all held together by hydrogen bonds.

DNA contains the instructions for making you. How you look, what blood type you have, even your tendency to get some diseases. It is found inside the nucleus in just about every single cell of your body. In this lab, you'll break away the membrane around the cell and its nucleus so that you can see your very own DNA. Materials. DNA is made up of molecules called nucleotides. Each nucleotide contains a phosphate group, a sugar group and a nitrogen base. The four types of nitrogen bases are adenine (A), thymine (T), guanine (G) and cytosine (C)

DNA is a double helix formed by base pairs attached to a sugar-phosphate backbone. DNA, or deoxyribonucleic acid, is the hereditary material in humans and almost all other organisms. Nearly every cell in a persons body has the same DNA. Most DNA is located in the cell nucleus (where it is called nuclear DNA), but a small amount of DNA can also be found in the mitochondria (where it is called mitochondrial DNA or mtDNA). The information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The structure of the double helix is somewhat like a ladder, with the base pairs forming the ladder rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder. An important property of DNA is that it can replicate, or make copies of itself.

RocketQA:

DNA is made up of molecules called nucleotides. Each nucleotide contains a phosphate group, a sugar group and a nitrogen base. The four types of nitrogen bases are adenine (A), thymine (T), guanine (G) and cytosine (C). The order of these bases is what determines DNA's instructions, or genetic code.

Using RocketQA with Jina

RocketQA is available on Jina Hub and integrates seamlessly with Jina. The code below creates a Flow for indexing Documents. The passage of each Document is stored in the .tags['para'] field. You can also pass .tags['title'] to improve accuracy. In the index Flow, RocketQADualEncoder encodes the passages into vectors, and SimpleIndexer stores them.

from jina import Document, Flow

# Example inputs (placeholder values; replace them with your own data)
title = 'DNA'
para = 'DNA is made up of molecules called nucleotides.'

# Create a Document: the passage goes into tags['para'], the optional title into tags['title']
doc = Document(tags={'title': title, 'para': para})

# Create the indexing Flow with RocketQADualEncoder and SimpleIndexer
flow = (Flow()
        .add(uses='jinahub+docker://RocketQADualEncoder',
             uses_with={'use_cuda': False})
        .add(uses='jinahub://SimpleIndexer',
             uses_metas={'workspace': 'workspace_rocketqa'}))

# Index the Documents using the Flow
with flow:
    flow.post(on='/index', inputs=[doc])

For querying, we create a query Flow as shown below. Besides RocketQADualEncoder, it also uses RocketQAReranker, which implements the cross encoder part of RocketQA, to rerank the results.

from jina import Document, Flow


def print_answers(resp):
    # Minimal illustrative callback (an assumption; the complete source defines its own version).
    # It prints the passage text of every match returned for the question.
    for doc in resp.docs:
        for match in doc.matches:
            print(match.tags['para'])


# Create the query Flow
flow = (Flow(use_cors=True, protocol='http', port_expose=45678)
        .add(uses='jinahub+docker://RocketQADualEncoder',
             uses_with={'use_cuda': False})
        .add(uses='jinahub://SimpleIndexer',
             uses_metas={'workspace': 'workspace_rocketqa'},
             uses_with={'match_args': {'limits': 10}})
        .add(uses='jinahub+docker://RocketQAReranker',
             uses_with={'model': 'v1_marco_ce', 'use_cuda': False}))

# Open the query Flow and answer questions until an empty line is entered
with flow:
    while True:
        question = input('Question?: ')
        if not question:
            break
        flow.post(on='/search', inputs=Document(text=question),
                  on_done=print_answers)

You can find the complete source code here.

Summary

In this post, we have given a short introduction to RocketQA and shown you how to use it with Jina. ODQA is an active research area that attracts more and more researchers. At Jina AI, we are committed to reducing the friction between academic research and its real-world applications by making state-of-the-art frameworks accessible to all.

Stay tuned for more updates, and happy searching!

© Jina AI 2020-2022. All rights reserved.