Retrieval-augmented generation is a rapidly evolving collection of techniques. It was originally designed as an extension to large language models (LLMs), but it is increasingly seen as an alternative paradigm that completely reimagines the role of AI models.
We will show you how to build a basic RAG engine using the LlamaIndex framework, the open-source Mistral LLM, Jina Embeddings v2, and Qdrant's AI-ready vector database.
The Background of RAG
In a RAG system, the AI model's role is still important but more limited. Instead of answering a prompt from internal knowledge acquired during training, the model is presented with relevant information in its prompt. The model's role is then to understand and synthesize that information into an output.
This plays to the strengths of AI models while minimizing their weaknesses.
For example, we might give the Mistral-Instruct model the following prompt:
Answer the query. Please be brief, concise, and complete.
Query: Why is Taylor Swift planning to attend the Superbowl?
Answer:
When I did this, the model responded:
I don't have real-time information. However, as of my last update,
Taylor Swift is not officially confirmed to attend or perform at
the Superbowl. If there are any updates, I recommend checking the
latest news sources.
Updating and retraining large language models is expensive, so their internal information can never be fully up to date.
Furthermore, large language models are notorious for "hallucinating": making up answers that sound plausible but may be partly or completely wrong and need verification.
Answer the query. Please be brief, concise, and complete.
Query: What was Weird Al Yankovic's first Taylor Swift parody?
Answer:
Weird Al Yankovic's first Taylor Swift parody was "TMZ," a spoof
of Swift's song "Blank Space." The parody was released in 2014
as part of Yankovic's album "Mandatory Fun." The song humorously
addresses the obsession with celebrity gossip and scandals, as
reported by the tabloid news website TMZ.
Weird Al Yankovic's "TMZ" was released in 2011 on the album Alpocalypse, and it is a parody of Swift's "You Belong with Me".
The nature of neural networks makes hallucinations hard to avoid entirely, which seriously undermines any plan to use an AI model as a source of information.
In the words of OpenAI CEO Sam Altman:
"I think it's a big mistake when people treat these models [LLMs] like a database. [...] It will work, but only to a limited degree. It's the world's most expensive, slowest, least accurate database."
RAG minimizes reliance on the AI model's internal world knowledge and exploits its greatest strengths: natural language comprehension and fluent language generation.
Let's look at another question that Mistral-Instruct can't answer:
Answer the query. Please be brief, concise, and complete.
Query: What is the title of Taylor Swift's next album?
Answer:
As of the time I am providing this answer, Taylor Swift has
not yet announced the title of her next album. She has released
several re-recorded albums, but the title of her next original
album remains unknown. I recommend checking her official social
media channels or website for the latest updates.
Now, a RAG system might have a database of recent news articles, or it might search the web for texts relevant to the title of Taylor Swift's next album. It would then insert the text it found into the prompt.
Suppose it identified this Associated Press article as relevant to the query and constructed the following prompt:
Context information is below.
---------------------
February 6, 2024
LOS ANGELES (AP) — Forget her "Reputation." Taylor Swift has a
whole new album coming out.
Accepting the Grammy for best pop vocal album on Sunday night,
Taylor said she'd been keeping a secret for two years.
"My brand-new album comes out April 19. It's called 'The Tortured
Poets Department.' I'm going to go and post the cover right now
backstage," she announced.
And so she did.
---------------------
Given the context information and not prior knowledge, answer the query.
Please be brief, concise, and complete.
If the context information does not contain an answer to the query,
respond with "No information".
Query: What is the title of Taylor Swift's next album?
Answer:
Mistral-Instruct responds:
The title of Taylor Swift's next album is 'The Tortured Poets Department'.
This style of prompting also greatly reduces the likelihood of hallucination. For example:
Context information is below.
---------------------
Alpocalypse is the thirteenth studio album by American parody
musician "Weird Al" Yankovic, released on June 21, 2011.
Track List
1. Perform This Way (Parody of "Born This Way" by Lady Gaga)
2. CNR
3. TMZ (Parody of "You Belong with Me" by Taylor Swift)
4. Skipper Dan
5. Polka Face (Polka medley including songs by artists such as
Lady Gaga, Justin Bieber, Britney Spears, Ke$ha, and others)
6. Craigslist
7. Party in the CIA (Parody of "Party in the U.S.A." by Miley
Cyrus)
8. Ringtone
9. Another Tattoo (Parody of "Nothin' on You" by B.o.B featuring
Bruno Mars)
10. If That Isn't Love
11. Whatever You Like (Parody of "Whatever You Like" by T.I.)
12. Stop Forwarding That Crap to Me
---------------------
Given the context information and not prior knowledge, answer the
query. Please be brief, concise, and complete.
If the context information does not contain an answer to the query,
respond with "No information".
Query: What was Weird Al Yankovic's first Taylor Swift parody, what
year was it released, and what song was he parodying?
Answer:
Weird Al Yankovic's first Taylor Swift parody was "TMZ," and it was
released in 2011. He was parodying "You Belong with Me" by Taylor
Swift.
The Structure of a RAG System
A RAG system has at least two components:
- A generative AI model
- An information retrieval system
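The two components interact in a simple loop: retrieve relevant text, wrap it in a prompt, and pass that prompt to the model. Here is a minimal sketch in plain Python, with toy stand-ins for both components — the word-overlap retriever and the document list below are invented purely for illustration, not part of a real RAG stack:

```python
# A minimal sketch of the two-component loop: retrieve text, build a
# prompt around it, and (in a real system) send that prompt to the LLM.
# The word-overlap retriever and the documents below are toy stand-ins.

def retrieve(query, documents, top_k=2):
    # Toy retriever: rank documents by word overlap with the query.
    # A real system would rank by embedding-vector similarity instead.
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(context, query):
    # Wrap the retrieved passages in the prompt shape shown above.
    return (
        "Context information is below.\n"
        "---------------------\n" + "\n".join(context) + "\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        "answer the query.\n"
        f"Query: {query}\nAnswer: "
    )

docs = [
    "Taylor Swift announced her new album at the Grammys.",
    "Weird Al Yankovic released Alpocalypse in 2011.",
]
query = "When was Alpocalypse released?"
prompt = build_prompt(retrieve(query, docs), query)
# `prompt` would now be sent to the generative model.
```

In the rest of this tutorial, the retriever role is played by Jina Embeddings plus Qdrant, and the generation role by Mistral-Instruct.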
In this article, we will use Mistral-Instruct as the generative AI model, accessed through the Hugging Face Inference API. For the information retrieval system, we will use the Qdrant vector database with Jina Embeddings.
As a data source, we will use the 2023 Update to the National Artificial Intelligence Research and Development Strategic Plan, a document published in May 2023 by the National Science and Technology Council, which advises the President of the United States on science and technology matters. This document of roughly 25,000 words lays out policy and strategic planning for AI policy and development in the US.
It is also a typical "too long; didn't read" government policy document. Much of it is dry reading, and unless you have a stake in or a close interest in this field, you probably won't read it. But you might still want to know what it says.
This article walks you through the steps of creating a RAG system that can answer questions about this text and its contents.
Installing Prerequisites
First, make sure you have all the relevant Python libraries installed. Run the following in your Python environment:
pip install llama-index qdrant-client
pip install pdfminer.six llama-index-vector-stores-qdrant
pip install llama-index-llms-openai llama-index-embeddings-jinaai
pip install llama-index-llms-huggingface "huggingface_hub[inference]"
We will use LlamaIndex to build a RAG system around the Mistral-Instruct large language model, and pdfminer.six to process the PDF file into a Qdrant vector store.
Connecting to Mistral-Instruct via the Hugging Face Inference API
First, set up the LLM. You will need access to the Hugging Face Inference API, including an access token. If you have a Hugging Face account, you can get one from your account settings page.
Put your token into a variable for later use:
hf_inference_api_key = ''
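Rather than hard-coding a secret in your script, you might prefer to read it from an environment variable. The name `HF_TOKEN` below is an arbitrary choice for this tutorial, not something the API requires:

```python
import os

# Read the access token from an environment variable instead of
# hard-coding it; "HF_TOKEN" is an arbitrary name for this tutorial.
hf_inference_api_key = os.environ.get("HF_TOKEN", "")
```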
Next, we construct a prompt template:
from llama_index.core import PromptTemplate
qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\".\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)
This template instructs the LLM to answer questions using only the context information provided in the prompt. It works well with Mistral-Instruct, but you can experiment with other phrasings to see what works best for your use case.
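LlamaIndex fills the `{context_str}` and `{query_str}` placeholders at query time. As a quick sanity check, you can see the same substitution with plain Python's `str.format` on a shortened copy of the template — the context and query strings here are made up for illustration:

```python
# A shortened copy of the template, filled with Python's built-in
# str.format; LlamaIndex performs the same substitution at query time.
template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
filled = template.format(
    context_str="Alpocalypse was released on June 21, 2011.",
    query_str="When was Alpocalypse released?",
)
print(filled)
```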
from llama_index.llms.huggingface import HuggingFaceInferenceAPI
mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=hf_inference_api_key
)
Finally, the code above, adapted from a previous article on building RAG systems with LlamaIndex, creates and initializes a LlamaIndex framework object that holds the connection to Mistral-Instruct.
This puts the connection to the Mistral LLM into the variable mixtral_llm.
Information Retrieval with Qdrant and Jina Embeddings
To set up the retrieval system, you will need a Jina Embeddings API key. You can get one with 1 million tokens of free prepaid credit from the Jina Embeddings website.

Put your Jina Embeddings key into a variable for later use:
jina_emb_api_key = ""
Then, use LlamaIndex to create a connector object to the Jina Embeddings server, specifically selecting the English-only monolingual model:
from llama_index.embeddings.jinaai import JinaEmbedding
jina_embedding_model = JinaEmbedding(
    api_key=jina_emb_api_key,
    model="jina-embeddings-v2-base-en",
)
Loading the Text Data
Next, we will load the document and split it into paragraphs. First, download the PDF from the White House website into the variable pdf_data:
import urllib.request
uri = "https://www.whitehouse.gov/wp-content/uploads/2023/05/National-Artificial-Intelligence-Research-and-Development-Strategic-Plan-2023-Update.pdf"
pdf_data = urllib.request.urlopen(uri).read()
Next, we will process the PDF page by page, extracting the text and then dividing it into paragraphs by splitting on double line breaks. These are stored in the list text_paras:
import regex as re
from io import BytesIO, StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
text_paras = []
parser = PDFParser(BytesIO(pdf_data))
doc = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
for page in PDFPage.create_pages(doc):
    output_string = StringIO()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    page_text = output_string.getvalue()
    text_paras.extend(re.split(r'\n\s*\n', page_text))
Check that everything loaded:
assert len(text_paras) == 615
Next, we convert this list of short texts into LlamaIndex Document objects:
from llama_index.core.readers import StringIterableReader
rag_docs = StringIterableReader().load_data(text_paras)
You can inspect the text:
print(rag_docs[0].text)
The result:
NATIONAL ARTIFICIAL INTELLIGENCE
RESEARCH AND DEVELOPMENT
STRATEGIC PLAN
2023 UPDATE
Setting up the Qdrant Vector Database
For this article, we will use Qdrant Vector Search Cloud as the database for our RAG system. Qdrant's free tier includes 1GB of storage, which is more than enough for this tutorial.
Before continuing, you will need to create an account on the Qdrant Cloud website.

Once you have an account and are logged in, you will need to create a cluster. Follow the "Quickstart" instructions on the Qdrant website to set up a free cluster and get an API key and the Qdrant host server name.

Store the key and host name in variables:
qdrant_api_key = ""
qdrant_server = "https://"
Next, we need to import the relevant components from the qdrant_client and llama_index packages:
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(qdrant_server, api_key=qdrant_api_key)
vector_store = QdrantVectorStore(client=client, collection_name="NTSC")
This will establish a collection named NTSC in your free cluster.
You can read the Qdrant documentation to learn how to build a local data store in memory or on disk, and how to host your own Qdrant server instance in a Docker container accessed via a web API.
Completing the RAG System
Now we will assemble these components into a complete RAG system, using boilerplate code from LlamaIndex:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    get_response_synthesizer,
)

# set up the service and storage contexts
service_context = ServiceContext.from_defaults(
    llm=mixtral_llm, embed_model=jina_embedding_model
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create an index
index = VectorStoreIndex.from_documents(
    rag_docs, storage_context=storage_context, service_context=service_context
)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
This system uses Jina Embeddings to index the paragraphs of our text source and places the two best matches into the context of each prompt. To use, say, the three best matches instead, just modify the similarity_top_k parameter in the VectorIndexRetriever initialization above:
# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)
The system is now ready to use.
Querying the Document
Let's try a straightforward query:
response = query_engine.query("""
What is the Biden Administration's policy with regard to AI?
""")
print(response.response)
The result (yours may not be identical):
The Biden Administration prioritizes responsible AI systems that
are ethical, trustworthy, and safe, and serve the public good.
The Administration has committed to substantial funding for AI
R&D, and has established the National AI Initiative Office to
coordinate AI activities across the federal government.
The Administration's AI policy focuses on developing evaluative
techniques for AI, workforce development, public-private
partnerships, and international collaboration in AI research.
Or something a bit more specific:
response = query_engine.query("""
What protections does the AI Bill of Rights propose to offer?
""")
print(response.response)
The AI Bill of Rights proposes five core protections: Safe and
Effective Systems; Algorithmic Discrimination Protections; Data
Privacy; Notice and Explanation; and Human Alternatives,
Consideration, and Fallback.
Or even something very specific:
response = query_engine.query("Who is Kei Koizumi?")
print(response.response)
Kei Koizumi is the Principal Deputy Director for Policy at the Office
of Science and Technology Policy (OSTP).
You can also ask imaginative questions:
response = query_engine.query("""
What rights will AI's receive under President Biden's proposed
AI Bill of Rights?
""")
print(response.response)
The proposed AI Bill of Rights does not grant rights to AI systems, but
rather outlines principles to ensure that AI technologies are
developed and used in a manner that respects human rights and values.
The principles include safety, transparency, fairness, disclosure,
data privacy, non-discrimination, and the ability for individuals to
opt out of certain AI systems.
response = query_engine.query("""
Why is President Biden's proposing an AI Bill of Rights?
Does AI really need rights?
""")
print(response.response)
President Biden's proposed AI Bill of Rights aims to ensure that AI
technologies are developed and used ethically, legally, and with
societal considerations in mind. The Bill of Rights is not granting
"rights" to AI, but rather setting guidelines for the responsible
development and deployment of AI systems to protect individuals and
communities from potential negative impacts.
response = query_engine.query("""
Has Donald Trump weighed in on AI?
Will he Make Humans Great Again?
""")
print(response.response)
No information. The context information does not mention Donald
Trump's views on AI.
Jina AI and Qdrant
Jina Embeddings' 8k-token context window makes it especially well-suited to retrieval-augmented generation because it can handle much larger blocks of text. Where other RAG applications often have to store individual sentences or pairs of sentences, in this tutorial we could use whole paragraphs without worrying about their size.
This reduces the number of text items that must be stored to hold the same information, lowering compute costs, saving memory, and making retrieval faster.
Qdrant's vector database is simple to configure, fast, and cost-effective, and as you have seen, it takes only a few lines of Python to integrate it into a RAG system.
To learn more about Qdrant's AI-ready database and cloud offerings, visit their website.
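As a toy illustration of the difference (the two-paragraph sample text below is invented), chunking the same text by paragraph instead of by sentence yields fewer, larger items to embed, store, and search:

```python
# Invented two-paragraph sample text.
text = ("First sentence. Second sentence.\n\n"
        "Third sentence. Fourth sentence. Fifth sentence.")

# Paragraph-level chunking, which a large context window allows...
paragraphs = text.split("\n\n")
# ...versus naive sentence-level chunking.
sentences = [s for p in paragraphs for s in p.split(". ") if s]

# Fewer items at the paragraph level carry the same information.
print(len(paragraphs), len(sentences))
```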

Contact Us
Jina AI is committed to bringing reliable, affordable AI technology to businesses of every size and kind. We would love to hear about your use case and help integrate AI into your business processes. To learn more about Jina AI's offerings and to get in touch, check out the Jina AI website or join our Discord community.

