Building a RAG System with Jina Embeddings and Qdrant

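Retrieval-augmented generation (RAG) is a family of techniques originally conceived as an extension to large language models (LLMs), but it is increasingly recognized as an alternative paradigm that rethinks the role of AI models altogether.

In this article, we show how to build a basic RAG engine using the LlamaIndex framework, the open-source Mistral LLM, Jina Embeddings v2, and Qdrant's AI-ready vector database.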

RAG in Context

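In a RAG system, the AI model's role is still central but more limited. Instead of answering prompts from internal knowledge acquired from its training data, the model is given the relevant information inside the prompt. Its job is to understand that information and integrate it into an output.

This plays to the strengths of AI models and minimizes their weaknesses.

For example, let's give the Mistral-Instruct model the following prompt: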

Answer the query. Please be brief, concise, and complete.
Query: Why is Taylor Swift planning to attend the Superbowl?
Answer:

When I ran this, the model responded:

I don't have real-time information. However, as of my last update, 
Taylor Swift is not officially confirmed to attend or perform at 
the Superbowl. If there are any updates, I recommend checking the 
latest news sources.

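Updating and retraining large language models is very costly, so their internal information can never be kept fully up to date.

Furthermore, large language models are well known to "hallucinate", that is, to produce plausible-sounding answers that may be partly or wholly wrong and that need to be verified. For example: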

Answer the query. Please be brief, concise, and complete.
Query: What was Weird Al Yankovic's first Taylor Swift parody?
Answer:

Weird Al Yankovic's first Taylor Swift parody was "TMZ," a spoof 
of Swift's song "Blank Space." The parody was released in 2014 
as part of Yankovic's album "Mandatory Fun." The song humorously 
addresses the obsession with celebrity gossip and scandals, as 
reported by the tabloid news website TMZ.

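This answer is wrong. Weird Al Yankovic's TMZ was released in 2011 on the album Alpocalypse, and it is a parody of Swift's You Belong with Me.

Because of the nature of neural networks, hallucination is difficult, if not impossible, to prevent completely, and it seriously undermines any plan to use AI models as information sources.

In the words of Sam Altman, CEO of OpenAI:

"I think it's a big mistake to think of these models [LLMs] as a database. [...] It will work sort of, but it's like the world's most expensive, slowest, least accurate database."

RAG minimizes reliance on an AI model's internal knowledge of the world and takes advantage of its greatest strengths: natural language understanding and fluent language generation.

Let's look at another example of a question Mistral-Instruct can't answer: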

Answer the query. Please be brief, concise, and complete.

Query: What is the title of Taylor Swift's next album?
Answer:

As of the time I am providing this answer, Taylor Swift has 
not yet announced the title of her next album. She has released 
several re-recorded albums, but the title of her next original 
album remains unknown. I recommend checking her official social 
media channels or website for the latest updates.

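A RAG system might have a database of recent news articles, or it might search the Internet for texts related to the title of Taylor Swift's next album. It then inserts the texts it finds into the prompt.

Let's assume it finds an AP article relevant to the question and constructs the following prompt: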

Context information is below.
---------------------
February 6, 2024
LOS ANGELES (AP) — Forget her "Reputation." Taylor Swift has 
a whole new album coming out.

Accepting the Grammy for best pop vocal album on Sunday night, 
Taylor said she'd been keeping a secret for two years.

"My brand-new album comes out April 19. It's called 'The Tortured 
Poets Department.' I'm going to go and post the cover right now 
backstage," she announced.

And so she did.

---------------------
Given the context information and not prior knowledge, answer the query. 
Please be brief, concise, and complete.
If the context information does not contain an answer to the query, 
respond with "No information".
Query: What is the title of Taylor Swift's next album?
Answer:

Mistral-Instruct responded with:

The title of Taylor Swift's next album is 'The Tortured Poets Department'.

Prompting in this way also dramatically reduces the likelihood of hallucinations. For example:

Context information is below.
---------------------
Alpocalypse is the thirteenth studio album by American parody 
musician "Weird Al" Yankovic, released on June 21, 2011.

Track List

1. Perform This Way (Parody of "Born This Way" by Lady Gaga)
2. CNR
3. TMZ (Parody of "You Belong with Me" by Taylor Swift)
4. Skipper Dan
5. Polka Face (Polka medley including songs by artists such as 
   Lady Gaga, Justin Bieber, Britney Spears, Ke$ha, and others)
6. Craigslist
7. Party in the CIA (Parody of "Party in the U.S.A." by Miley 
   Cyrus)
8. Ringtone
9. Another Tattoo (Parody of "Nothin' on You" by B.o.B featuring 
   Bruno Mars)
10. If That Isn't Love
11. Whatever You Like (Parody of "Whatever You Like" by T.I.)
12. Stop Forwarding That Crap to Me

---------------------
Given the context information and not prior knowledge, answer the 
query. Please be brief, concise, and complete.
If the context information does not contain an answer to the query, 
respond with "No information".
Query: What was Weird Al Yankovic's first Taylor Swift parody, what 
year was it released, and what song was he parodying?
Answer:

Weird Al Yankovic's first Taylor Swift parody was "TMZ," and it was 
released in 2011. He was parodying "You Belong with Me" by Taylor 
Swift.
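The two prompts above follow the same pattern, which can be sketched as a small helper. This is a minimal illustration, assuming the retrieved texts are already available as plain strings; the function name build_rag_prompt is made up for this sketch and is not part of any library.

```python
def build_rag_prompt(context_chunks, query):
    """Assemble a context-grounded prompt from retrieved text chunks.

    build_rag_prompt is a hypothetical helper for illustration only.
    """
    # Join the retrieved chunks into one context block.
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query. Please be brief, concise, and complete.\n"
        "If the context information does not contain an answer to the query, "
        'respond with "No information".\n'
        f"Query: {query}\n"
        "Answer: "
    )


prompt = build_rag_prompt(
    [
        'Alpocalypse is the thirteenth studio album by American parody '
        'musician "Weird Al" Yankovic, released on June 21, 2011.',
        '3. TMZ (Parody of "You Belong with Me" by Taylor Swift)',
    ],
    "What was Weird Al Yankovic's first Taylor Swift parody?",
)
print(prompt)
```

The model never sees the whole data source, only the chunks that the retrieval step selected, so the quality of the answer depends heavily on what goes into the context block.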

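Anatomy of a RAG System

A RAG system needs, at a minimum, two components:

A generative AI model
An information retrieval system

In this article, we use Mistral-Instruct as the generative AI model and access it through the Hugging Face Inference API. For the information retrieval system, we use the Qdrant vector database together with Jina Embeddings.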
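To make the retrieval component concrete, here is a toy sketch of what a vector-based retriever does: embed the query, compare it to the stored document vectors, and return the closest matches. The embed function below is a bag-of-words stand-in, not Jina Embeddings, and the in-memory list stands in for Qdrant; all names are hypothetical and chosen only for this illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:top_k]


documents = [
    "TMZ is a parody of You Belong with Me by Taylor Swift.",
    "Qdrant is a vector database.",
    "Polka Face is a polka medley.",
]
top = retrieve("Which Taylor Swift song does TMZ parody?", documents, top_k=1)
print(top[0])
```

A real embedding model captures meaning rather than word overlap, and a vector database avoids scanning every document, but the retrieve-by-similarity logic has the same shape.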

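As our data source, we will use the National Artificial Intelligence Research and Development Strategic Plan 2023 Update, published in May 2023 by the National Science and Technology Council. This document of roughly 25,000 words is a policy and strategic planning document covering US AI policy and development.

It is a classic "too long; didn't read" government policy document. Most of it is very dry, and unless you have a stake or a strong interest in this area, you are probably not going to read it. But you might still want to know what it says.

This article walks through the steps for creating a RAG system that can answer questions about the text and its contents.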

Install the Prerequisites

First, make sure all the relevant Python libraries are installed. In your Python environment, run the following:

pip install llama-index qdrant-client 
pip install pdfminer.six llama-index-vector-stores-qdrant
pip install llama-index-llms-openai llama-index-embeddings-jinaai 
pip install llama-index-llms-huggingface "huggingface_hub[inference]"

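We will use LlamaIndex to build a RAG system around the Mistral-Instruct large language model, and pdfminer.six to process the PDF file into a Qdrant vector store.

Connect to Mistral-Instruct via the Hugging Face Inference API

First, set up the LLM. You will need access to the Hugging Face Inference API, including an access token. If you have a Hugging Face account, you can get a token from your account settings page.

If you don't have an account, create one first and then create an access token.

Store the token in a variable for later use: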

hf_inference_api_key = '<your HuggingFace Inference API token>'
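If you would rather not paste the token into your code, one option is to read it from an environment variable. This is a minimal sketch under the assumption that you have exported the token yourself; the helper read_secret and the variable name HF_TOKEN are choices made for this example, not something this tutorial requires.

```python
import os


def read_secret(env_var, placeholder):
    """Return the value of env_var, or the placeholder if it is not set.

    read_secret is a hypothetical helper for illustration only.
    """
    return os.environ.get(env_var, placeholder)


# "HF_TOKEN" is an assumed environment variable name for this sketch.
hf_inference_api_key = read_secret(
    "HF_TOKEN", "<your HuggingFace Inference API token>"
)
```

The same pattern would work for the Jina Embeddings and Qdrant keys used later in this article.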

Next, create a prompt template:

from llama_index.core import PromptTemplate

qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\".\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)

This template instructs the LLM to answer questions using only the context information provided. It has worked well with Mistral-Instruct, but you can experiment with other wordings for your use case.

Finally, create and initialize an object in the LlamaIndex framework that holds the connection to Mistral-Instruct. The code below comes from the previous post about creating RAG systems with LlamaIndex.

from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=hf_inference_api_key
)

This stores the connection to the Mistral LLM in the variable mixtral_llm.

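Information Retrieval with Qdrant and Jina Embeddings

To set up the retrieval system, you need a Jina Embeddings API key. You can get one, with a free allowance of one million tokens, at the Jina Embeddings website.

Store the Jina Embeddings key in a variable for later use: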

jina_emb_api_key = "<your Jina Embeddings API key>"

Then use LlamaIndex to create a connector object for the Jina Embeddings server, selecting the English monolingual model specifically:

from llama_index.embeddings.jinaai import JinaEmbedding

jina_embedding_model = JinaEmbedding(
    api_key=jina_emb_api_key,
    model="jina-embeddings-v2-base-en",
)

Load Text Data

Next, load the document and split it into paragraphs. First, download the PDF from the White House website and store it in the variable pdf_data:

import urllib.request

uri = "https://www.whitehouse.gov/wp-content/uploads/2023/05/National-Artificial-Intelligence-Research-and-Development-Strategic-Plan-2023-Update.pdf"
pdf_data = urllib.request.urlopen(uri).read()

Next, process the PDF page by page, extracting the text and chunking it into paragraphs by splitting on double line breaks. The paragraphs are stored in the list text_paras:

import re
from io import BytesIO, StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

text_paras = []
parser = PDFParser(BytesIO(pdf_data))
doc = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
for page in PDFPage.create_pages(doc):
    output_string = StringIO()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    page_text = output_string.getvalue()
    text_paras.extend(re.split(r'\n\s*\n', page_text))
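To see what the splitting step does in isolation, here is a small sketch that applies the same regular expression to a made-up page of text. Unlike the loop above, the hypothetical helper split_paragraphs also strips whitespace and drops empty pieces, so it is not guaranteed to produce the same paragraph count as the original code.

```python
import re


def split_paragraphs(page_text):
    """Split text on blank lines and drop empty or whitespace-only pieces.

    split_paragraphs is a hypothetical helper for illustration only.
    """
    pieces = re.split(r"\n\s*\n", page_text)
    return [piece.strip() for piece in pieces if piece.strip()]


# A made-up page: a wrapped title, a wrapped paragraph, a separator line
# containing only a space, and a trailing blank line.
sample_page = (
    "NATIONAL AI R&D\nSTRATEGIC PLAN\n\n"
    "First paragraph,\nwrapped over two lines.\n \n"
    "Second paragraph.\n\n"
)
for para in split_paragraphs(sample_page):
    print(repr(para))
```

Single line breaks inside a paragraph are kept, which is why the title of the real document shows up as one multi-line chunk.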

Check that everything was loaded:

assert len(text_paras) == 615

Next, convert this list of short texts into LlamaIndex Document objects:

from llama_index.core.readers import StringIterableReader

rag_docs = StringIterableReader().load_data(text_paras)

You can inspect the text:

print(rag_docs[0].text)

Result:

NATIONAL ARTIFICIAL INTELLIGENCE 
RESEARCH AND DEVELOPMENT 
STRATEGIC PLAN 
2023 UPDATE

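Set Up a Qdrant Vector Database

In this article, we use Qdrant Vector Search Cloud to implement the RAG system's database. Qdrant's free tier includes 1GB of storage, which is more than enough for this tutorial.

Before continuing, you need to create an account on the Qdrant Cloud website.

Once you have an account and are logged in, you need to create a cluster. Follow the "quick start" instructions on the Qdrant website to set up a free cluster and get an API key and the name of the Qdrant host server.

Store the key and hostname in variables: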

qdrant_api_key = "<your API key>"
qdrant_server = "https://<your server>"

Next, import the relevant components from the qdrant_client and llama_index packages and create the client and vector store:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(qdrant_server, api_key=qdrant_api_key)
vector_store = QdrantVectorStore(client=client, collection_name="NTSC")

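This creates a collection named NTSC in your free cluster.

See the Qdrant documentation for information on building a local data store in memory or on disk, and on hosting your own Qdrant server instance in a Docker container via a web API.

Complete the RAG System

Now we use LlamaIndex boilerplate code to assemble these components into a complete RAG system: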

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    get_response_synthesizer,
)

# set up the service and storage contexts
service_context = ServiceContext.from_defaults(
    llm=mixtral_llm, embed_model=jina_embedding_model
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create an index
index = VectorStoreIndex.from_documents(
    rag_docs, storage_context=storage_context, service_context=service_context
)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

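This system uses Jina Embeddings to index the paragraphs of the text source and puts the top two matches into the context of every prompt. To change this to the top three, for example, just change the similarity_top_k parameter in the VectorIndexRetriever initialization above: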

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)

The system is now ready to use.
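The response synthesizer above is configured with response_mode="compact". As far as I understand it, the idea is to concatenate the retrieved chunks so that they fit into as few prompts as possible, which reduces the number of LLM calls. The sketch below illustrates that idea only; it is not LlamaIndex's implementation, the helper pack_chunks is hypothetical, and it budgets by characters where a real system would count tokens.

```python
def pack_chunks(chunks, max_chars, separator="\n\n"):
    """Greedily pack text chunks into as few context strings as fit max_chars.

    pack_chunks is a hypothetical helper. A chunk longer than max_chars
    is kept whole in a context string of its own.
    """
    packed = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if not current or len(candidate) <= max_chars:
            # Either this starts a new context string or the chunk still fits.
            current = candidate
        else:
            # The chunk does not fit: close the current string, start another.
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


contexts = pack_chunks(["aaaa", "bbbb", "cccc"], max_chars=10)
print(len(contexts), "prompt(s) needed")
```

With a larger similarity_top_k, more chunks have to be packed, so the number of prompts and the cost of each query can grow.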

Query the Document

Let's try a straightforward query:

response = query_engine.query("""
What is the Biden Administration's policy with regard to AI? 
""")
print(response.response)

The result (yours may not be identical):

The Biden Administration prioritizes responsible AI systems that 
are ethical, trustworthy, and safe, and serve the public good. 
The Administration has committed to substantial funding for AI 
R&D, and has established the National AI Initiative Office to 
coordinate AI activities across the federal government. 

The Administration's AI policy focuses on developing evaluative
techniques for AI, workforce development, public-private
partnerships, and international collaboration in AI research.

You can also ask something more specific:

response = query_engine.query("""
What protections does the AI Bill of Rights propose to offer?
""")

print(response.response)

The AI Bill of Rights proposes five core protections: Safe and 
Effective Systems; Algorithmic Discrimination Protections; Data 
Privacy; Notice and Explanation; and Human Alternatives, 
Consideration, and Fallback.

Or something even more specific:

response = query_engine.query("Who is Kei Koizumi?")
print(response.response)

Kei Koizumi is the Principal Deputy Director for Policy at the Office 
of Science and Technology Policy (OSTP).

You can also ask more imaginative questions:

response = query_engine.query("""
What rights will AI's receive under President Biden's proposed 
AI Bill of Rights?
""")
print(response.response)

The proposed AI Bill of Rights does not grant rights to AI systems, but 
rather outlines principles to ensure that AI technologies are 
developed and used in a manner that respects human rights and values. 
The principles include safety, transparency, fairness, disclosure, 
data privacy, non-discrimination, and the ability for individuals to 
opt out of certain AI systems.

response = query_engine.query("""
Why is President Biden's proposing an AI Bill of Rights?
Does AI really need rights?
""")
print(response.response)

President Biden's proposed AI Bill of Rights aims to ensure that AI 
technologies are developed and used ethically, legally, and with 
societal considerations in mind. The Bill of Rights is not granting 
"rights" to AI, but rather setting guidelines for the responsible 
development and deployment of AI systems to protect individuals and 
communities from potential negative impacts.

response = query_engine.query("""
Has Donald Trump weighed in on AI?
Will he Make Humans Great Again?
""")
print(response.response)

No information. The context information does not mention Donald 
Trump's views on AI.
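Because the prompt template asks the model to reply with "No information" when the context holds no answer, an application can check for that phrase before showing a response. The sketch below is one simple way to do that; the helper has_answer is hypothetical, and it assumes the model puts the phrase at the start of its reply, as it did in the last example.

```python
NO_INFO_SENTINEL = "No information"


def has_answer(response_text):
    """Return False when the reply starts with the fallback phrase.

    has_answer is a hypothetical helper for illustration only.
    """
    normalized = response_text.strip().lower()
    return not normalized.startswith(NO_INFO_SENTINEL.lower())


reply = (
    "No information. The context information does not mention Donald "
    "Trump's views on AI."
)
if not has_answer(reply):
    print("The document does not cover this question.")
```

Models do not always follow instructions to the letter, so a check like this is a heuristic and not a guarantee.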

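Jina AI and Qdrant

The 8k-token context window of Jina Embeddings makes it especially well suited to retrieval-augmented generation because it can handle much larger blocks of text. Where other RAG applications often have to store individual sentences or sentence pairs, this tutorial could use whole paragraphs without worrying about their size.

This reduces the number of text items that must be stored to carry the same information, which lowers compute costs, saves memory, and makes retrieval faster.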
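A rough back-of-the-envelope calculation shows the effect on storage. It uses the 615 paragraphs from this tutorial and the 768-dimensional vectors produced by jina-embeddings-v2-base-en; the 4 bytes per value (32-bit floats) and the average of four sentences per paragraph are assumptions made purely for illustration.

```python
def index_size_bytes(num_items, dim=768, bytes_per_value=4):
    """Raw size of num_items dense vectors, ignoring index overhead."""
    return num_items * dim * bytes_per_value


paragraphs = 615  # number of chunks in this tutorial
assumed_sentences_per_paragraph = 4  # hypothetical average, not measured

paragraph_bytes = index_size_bytes(paragraphs)
sentence_bytes = index_size_bytes(paragraphs * assumed_sentences_per_paragraph)

print(f"paragraph-level vectors: {paragraph_bytes / 2**20:.1f} MiB")
print(f"sentence-level vectors:  {sentence_bytes / 2**20:.1f} MiB")
```

Under these assumptions, indexing sentences would need four times as many vectors, and four times as many calls to the embedding model, to cover the same text.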

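Qdrant's vector database is simple to configure, fast, and cost-effective, and as you have seen, it takes only a few lines of Python to integrate it into a RAG system.

To learn more about Qdrant's AI-ready database and cloud offerings, visit the Qdrant website.

Get in Touch

Jina AI is committed to bringing reliable, affordable AI technologies to enterprises of every size and type. We would love to hear about your use cases and help you fit AI into your business processes. For more information about Jina AI's offerings and to contact us, visit the Jina AI website or join our community on Discord.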