Retrieval-augmented generation (RAG) is a rapidly developing collection of techniques. It was originally proposed as an extension to large language models (LLMs), but is increasingly seen as an alternative paradigm that reimagines the role of AI models.
We'll show you how to build a basic RAG engine using the LlamaIndex framework, the open-source Mistral LLM, Jina Embeddings v2, and Qdrant's AI-ready vector database.
Background on RAG
In a RAG system, the AI model's role remains central, but more limited. Instead of answering prompts from internal knowledge acquired during training, the model is presented with the relevant information in the prompt itself. Its job is then to understand and synthesize that information into an output.
This plays to the strengths of AI models while minimizing their weaknesses.
For example, we might give the Mistral-Instruct model the following prompt:
Answer the query. Please be brief, concise, and complete.
Query: Why is Taylor Swift planning to attend the Superbowl?
Answer:
When I did this, the model responded:
I don't have real-time information. However, as of my last update,
Taylor Swift is not officially confirmed to attend or perform at
the Superbowl. If there are any updates, I recommend checking the
latest news sources.
Updating and retraining large language models is expensive, so their internal information can never be fully up to date.
Moreover, LLMs are notorious for "hallucinating": making up plausible-sounding answers that may be partly or entirely wrong and need verification.
Answer the query. Please be brief, concise, and complete.
Query: What was Weird Al Yankovic's first Taylor Swift parody?
Answer:
Weird Al Yankovic's first Taylor Swift parody was "TMZ," a spoof
of Swift's song "Blank Space." The parody was released in 2014
as part of Yankovic's album "Mandatory Fun." The song humorously
addresses the obsession with celebrity gossip and scandals, as
reported by the tabloid news website TMZ.
Weird Al Yankovic's "TMZ" was released in 2011 on the album Alpocalypse and is a parody of Swift's "You Belong with Me".
The nature of neural networks makes hallucinations difficult, if not impossible, to prevent entirely, which seriously undermines any plan to use AI models as sources of information.
In the words of OpenAI CEO Sam Altman:
"I think it's a big mistake for people to treat these models (LLMs) as databases. [...] It will work, sort of. But it's the world's most expensive, slowest, least accurate database."
RAG minimizes reliance on the AI model's internal world knowledge and exploits its greatest strengths: natural language understanding and fluent language generation.
Let's look at another question Mistral-Instruct can't answer:
Answer the query. Please be brief, concise, and complete.
Query: What is the title of Taylor Swift's next album?
Answer:
As of the time I am providing this answer, Taylor Swift has
not yet announced the title of her next album. She has released
several re-recorded albums, but the title of her next original
album remains unknown. I recommend checking her official social
media channels or website for the latest updates.
Now, a RAG system might have a database of recent news articles, or perhaps it would search the internet for text related to the title of Taylor Swift's next album. It would then insert the text it finds into the prompt.
Suppose it identifies this Associated Press article as relevant to the query and constructs the following prompt:
Context information is below.
---------------------
February 6, 2024
LOS ANGELES (AP) — Forget her "Reputation." Taylor Swift has
a whole new album coming out.
Accepting the Grammy for best pop vocal album on Sunday night,
Taylor said she'd been keeping a secret for two years.
"My brand-new album comes out April 19. It's called 'The Tortured
Poets Department.' I'm going to go and post the cover right now
backstage," she announced.
And so she did.
---------------------
Given the context information and not prior knowledge, answer the query.
Please be brief, concise, and complete.
If the context information does not contain an answer to the query,
respond with "No information".
Query: What is the title of Taylor Swift's next album?
Answer:
Mistral-Instruct answers:
The title of Taylor Swift's next album is 'The Tortured Poets Department'.
Prompting this way also greatly reduces the likelihood of hallucinations. For example:
Context information is below.
---------------------
Alpocalypse is the thirteenth studio album by American parody
musician "Weird Al" Yankovic, released on June 21, 2011.
Track List
1. Perform This Way (Parody of "Born This Way" by Lady Gaga)
2. CNR
3. TMZ (Parody of "You Belong with Me" by Taylor Swift)
4. Skipper Dan
5. Polka Face (Polka medley including songs by artists such as
Lady Gaga, Justin Bieber, Britney Spears, Ke$ha, and others)
6. Craigslist
7. Party in the CIA (Parody of "Party in the U.S.A." by Miley
Cyrus)
8. Ringtone
9. Another Tattoo (Parody of "Nothin' on You" by B.o.B featuring
Bruno Mars)
10. If That Isn't Love
11. Whatever You Like (Parody of "Whatever You Like" by T.I.)
12. Stop Forwarding That Crap to Me
---------------------
Given the context information and not prior knowledge, answer the
query. Please be brief, concise, and complete.
If the context information does not contain an answer to the query,
respond with "No information".
Query: What was Weird Al Yankovic's first Taylor Swift parody, what
year was it released, and what song was he parodying?
Answer:
Weird Al Yankovic's first Taylor Swift parody was "TMZ," and it was
released in 2011. He was parodying "You Belong with Me" by Taylor
Swift.
What Makes Up a RAG System
A RAG system consists of at least two components:
- A generative AI model.
- An information retrieval system.
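The interplay of these two components can be sketched as a minimal loop: retrieve relevant text, build a prompt around it, and generate an answer. The sketch below uses hypothetical stub functions in place of real components; the rest of this article replaces them with Qdrant plus Jina Embeddings for retrieval and Mistral-Instruct for generation.

```python
# Minimal sketch of the RAG loop with stubbed-out components.
# 'retrieve' and 'generate' are illustrative placeholders, not real APIs.

def retrieve(query: str) -> list[str]:
    # A real retriever would embed the query and search a vector database.
    # Here we just do keyword lookup in a tiny hard-coded "knowledge base".
    knowledge_base = {
        "album": "Her new album 'The Tortured Poets Department' is out April 19.",
    }
    return [text for key, text in knowledge_base.items() if key in query.lower()]

def generate(prompt: str) -> str:
    # A real system would call an LLM here; this stub just echoes the
    # context section back, to keep the example self-contained.
    return prompt.split("---")[1].strip()

def rag_answer(query: str) -> str:
    # Assemble the prompt: retrieved context first, then the user's query.
    context = "\n".join(retrieve(query))
    prompt = f"Context information is below.\n---\n{context}\n---\nQuery: {query}\nAnswer:"
    return generate(prompt)

print(rag_answer("What is the title of her next album?"))
```

With no matching context, the stub returns an empty answer, which mirrors the "No information" behavior we'll ask the real LLM for below.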
In this article, we'll use Mistral-Instruct as the generative AI model, accessed via the Hugging Face Inference API. For the information retrieval system, we'll use the Qdrant vector database with Jina Embeddings.
As a data source, we'll use the National Artificial Intelligence Research and Development Strategic Plan, 2023 Update, published in May 2023 by the National Science and Technology Council, the body that advises the President of the United States on science and technology matters. This roughly 25,000-word document is a policy and strategic planning document for AI policy and development in the United States.
It's also a classic "too long; didn't read" government policy document. Most of it is dry, and unless you have a stake or a keen interest in this area, you probably won't read it. But you might still want to know what it says.
This article walks you through the steps of creating a RAG system that can answer questions about this text and its contents.
Installing the Prerequisites
First, make sure you have all the relevant Python libraries installed. Run the following in your Python environment:
pip install llama-index qdrant-client
pip install pdfminer.six llama-index-vector-stores-qdrant
pip install llama-index-llms-openai llama-index-embeddings-jinaai
pip install llama-index-llms-huggingface "huggingface_hub[inference]"
We'll use LlamaIndex to build a RAG system around the Mistral-Instruct LLM, and pdfminer.six to process the PDF file into a Qdrant vector store.
Connecting to Mistral-Instruct via the Hugging Face Inference API
First, set up the LLM. You'll need access to the Hugging Face Inference API, including an access token. If you have a Hugging Face account, you can get a token from your account settings page.
Put your token in a variable for later use:
hf_inference_api_key = '<your HuggingFace Inference API token>'
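If you'd rather not hard-code the token in your script, you can read it from an environment variable instead. The variable name HF_INFERENCE_API_KEY below is just a convention for this sketch, not something Hugging Face requires:

```python
import os

# Read the token from the environment, falling back to a placeholder
# so the rest of the script still has a string to work with.
hf_inference_api_key = os.environ.get(
    "HF_INFERENCE_API_KEY", "<your HuggingFace Inference API token>"
)
```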
Next, we construct a prompt template:
from llama_index.core import PromptTemplate

qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\".\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)
This template instructs the LLM to answer questions using only the context information provided in the prompt. It works well with Mistral-Instruct, but you can experiment with other wordings to see what works best for your use case.
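You can preview what the filled-in prompt will look like by substituting sample values. The {context_str} and {query_str} placeholders also work with plain str.format, so this sketch needs nothing beyond the standard library (the sample context and query are made up for illustration):

```python
# The same template text as above, filled in with str.format to preview
# what the LLM will actually see at query time.
qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\".\n"
    "Query: {query_str}\n"
    "Answer: "
)

preview = qa_prompt_tmpl.format(
    context_str="Taylor Swift's new album comes out April 19.",
    query_str="When does Taylor Swift's new album come out?",
)
print(preview)
```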
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=hf_inference_api_key
)
Finally, the code above creates and initializes an object for the LlamaIndex framework that holds the connection to Mistral-Instruct. It comes from an earlier article on creating RAG systems with LlamaIndex.
This puts the connection to the Mistral LLM into the variable mixtral_llm.
Information Retrieval with Qdrant and Jina Embeddings
To set up the retrieval system, you'll need a Jina Embeddings API key. You can get one from the Jina Embeddings website, with a free prepaid quota of one million tokens.

Put your Jina Embeddings key in a variable for later use:
jina_emb_api_key = "<your Jina Embeddings API key>"
Then, use LlamaIndex to create a connector object to the Jina Embeddings server, specifically choosing the English monolingual model:
from llama_index.embeddings.jinaai import JinaEmbedding

jina_embedding_model = JinaEmbedding(
    api_key=jina_emb_api_key,
    model="jina-embeddings-v2-base-en",
)
Loading the Text Data
Next, we'll load the document and split it into paragraphs. First, download the PDF from the White House website into the variable pdf_data:
import urllib.request
uri = "https://www.whitehouse.gov/wp-content/uploads/2023/05/National-Artificial-Intelligence-Research-and-Development-Strategic-Plan-2023-Update.pdf"
pdf_data = urllib.request.urlopen(uri).read()
Next, we'll process the PDF page by page, extracting the text and chunking it into paragraphs by splitting on double newlines. These are all stored in the list text_paras:
import regex as re
from io import BytesIO, StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

text_paras = []
parser = PDFParser(BytesIO(pdf_data))
doc = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
for page in PDFPage.create_pages(doc):
    output_string = StringIO()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    page_text = output_string.getvalue()
    text_paras.extend(re.split(r'\n\s*\n', page_text))
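The double-newline split used above can be checked on a small sample string. The pattern treats any blank or whitespace-only line as a paragraph boundary; the standard library's re module behaves the same as the third-party regex package here:

```python
import re  # the stdlib 're' works the same as 'regex' for this pattern

# A small sample with a plain blank line and a whitespace-only line
# as paragraph separators, plus a single newline inside one paragraph.
sample = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n   \nThird."
paras = re.split(r'\n\s*\n', sample)
print(paras)
```

Note that a single newline inside a paragraph does not trigger a split; only a blank (or whitespace-only) line does.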
Check that everything loaded:
assert len(text_paras) == 615
Next, we convert this list of short texts into LlamaIndex Document objects:
from llama_index.core.readers import StringIterableReader
rag_docs = StringIterableReader().load_data(text_paras)
You can inspect the text:
print(rag_docs[0].text)
Result:
NATIONAL ARTIFICIAL INTELLIGENCE
RESEARCH AND DEVELOPMENT
STRATEGIC PLAN
2023 UPDATE
Setting Up the Qdrant Vector Database
In this article, we'll use Qdrant Vector Search Cloud to implement the database for the RAG system. Qdrant's free tier includes 1 GB of storage, more than enough for this tutorial.
Before proceeding, you'll need to create an account on the Qdrant Cloud website.

Once you have an account and are logged in, you'll need to create a cluster. Follow the "Quickstart" instructions on the Qdrant website to set up a free cluster and get an API key and the name of your Qdrant host server.

Store the key and hostname in variables:
qdrant_api_key = "<your API key>"
qdrant_server = "https://<your server>"
Next, we need to import the relevant components from the qdrant_client and llama_index packages:
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(qdrant_server, api_key=qdrant_api_key)
vector_store = QdrantVectorStore(client=client, collection_name="NTSC")
This creates a collection named NTSC in your free cluster.
You can read the Qdrant documentation to learn how to build a local data store in memory or on disk, and how to host your own Qdrant server instance in a Docker container with a web API.
Completing the RAG System
Now we'll assemble these components into a complete RAG system, using boilerplate code from LlamaIndex:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    get_response_synthesizer,
)

# set up the service and storage contexts
service_context = ServiceContext.from_defaults(
    llm=mixtral_llm, embed_model=jina_embedding_model
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create an index
index = VectorStoreIndex.from_documents(
    rag_docs, storage_context=storage_context, service_context=service_context
)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=service_context,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
This system indexes the paragraphs from the text source with Jina Embeddings and puts the two best matches into the context of each prompt. To change this to, say, the three best matches, just modify the similarity_top_k parameter in the VectorIndexRetriever initialization above:
# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)
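Under the hood, similarity_top_k simply controls how many stored vectors, ranked by similarity to the query embedding, get passed into the prompt. A toy illustration of top-k ranking by cosine similarity, using made-up 3-dimensional vectors in place of real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    # Rank stored (id, vector) pairs by similarity to the query; keep the best k.
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical stored paragraph embeddings (real ones have hundreds of dimensions).
stored = [
    ("para-1", [1.0, 0.0, 0.0]),
    ("para-2", [0.9, 0.1, 0.0]),
    ("para-3", [0.0, 1.0, 0.0]),
]
print(top_k([1.0, 0.05, 0.0], stored, k=2))
```

Raising k trades a longer (and costlier) prompt for a better chance that the relevant paragraph is included.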
The system is now ready.
Querying the Document
Let's try a straightforward query:
response = query_engine.query("""
What is the Biden Administration's policy with regard to AI?
""")
print(response.response)
The result (yours may not be identical):
The Biden Administration prioritizes responsible AI systems that
are ethical, trustworthy, and safe, and serve the public good.
The Administration has committed to substantial funding for AI
R&D, and has established the National AI Initiative Office to
coordinate AI activities across the federal government.
The Administration's AI policy focuses on developing evaluative
techniques for AI, workforce development, public-private
partnerships, and international collaboration in AI research.
Or something more specific:
response = query_engine.query("""
What protections does the AI Bill of Rights propose to offer?
""")
print(response.response)
The AI Bill of Rights proposes five core protections: Safe and
Effective Systems; Algorithmic Discrimination Protections; Data
Privacy; Notice and Explanation; and Human Alternatives,
Consideration, and Fallback.
Or even very specific:
response = query_engine.query("Who is Kei Koizumi?")
print(response.response)
Kei Koizumi is the Principal Deputy Director for Policy at the Office
of Science and Technology Policy (OSTP).
You can also ask imaginative questions:
response = query_engine.query("""
What rights will AIs receive under President Biden's proposed
AI Bill of Rights?
""")
print(response.response)
The proposed AI Bill of Rights does not grant rights to AI systems, but
rather outlines principles to ensure that AI technologies are
developed and used in a manner that respects human rights and values.
The principles include safety, transparency, fairness, disclosure,
data privacy, non-discrimination, and the ability for individuals to
opt out of certain AI systems.
response = query_engine.query("""
Why is President Biden proposing an AI Bill of Rights?
Does AI really need rights?
""")
print(response.response)
President Biden's proposed AI Bill of Rights aims to ensure that AI
technologies are developed and used ethically, legally, and with
societal considerations in mind. The Bill of Rights is not granting
"rights" to AI, but rather setting guidelines for the responsible
development and deployment of AI systems to protect individuals and
communities from potential negative impacts.
response = query_engine.query("""
Has Donald Trump weighed in on AI?
Will he Make Humans Great Again?
""")
print(response.response)
No information. The context information does not mention Donald
Trump's views on AI.
Jina AI and Qdrant
Jina Embeddings' 8k-token context window makes it especially well suited to retrieval-augmented generation, because it can handle larger blocks of text. Where other RAG applications often have to store individual sentences or pairs of sentences, in this tutorial we can use whole paragraphs without worrying about their size.
This reduces the number of text items that must be stored to carry the same information, lowering compute costs, saving memory, and making retrieval faster.
Qdrant's vector database is simple to configure, fast, and affordable, and as you've seen, it can be integrated into a RAG system with just a few lines of Python.
To learn more about Qdrant's AI-ready database and cloud services, visit their website.

Contact Us
Jina AI is committed to providing reliable, affordable AI technology to businesses of all sizes and types. We'd love to hear about your use case and help bring AI into your business processes. To learn more about Jina AI's products and get in touch, visit the Jina AI website or join our Discord community.

