Jina Embeddings v2 雙語模型現已在 Hugging Face 開源

Jina AI 已經在 Hugging Face 上發布了其最先進的開源雙語嵌入模型，包括德語-英語和中文-英語語言對。

在本教程中，我們將介紹一個非常簡單的安裝和使用案例，內容包括：

從 Hugging Face 下載 Jina Embedding 模型。
使用模型獲取德語和英語文本的編碼。
建立一個基本的基於嵌入的跨語言查詢神經搜索引擎。

我們將向您展示如何使用 Jina Embeddings 編寫英語查詢來檢索匹配的德語文本，反之亦然。

本教程同樣適用於中文模型。只需按照標題為Querying in Chinese的章節（位於末尾）中的說明獲取中英雙語模型和中文示例文檔。

tag雙語嵌入模型

雙語嵌入模型是一種將兩種語言的文本（在本教程中是德語和英語，對於中文模型則是中文和英語）映射到相同嵌入空間的模型。而且，它的實現方式是，如果一個德語文本和一個英語文本表達相同的含義，它們對應的嵌入向量將會非常接近。

這種模型非常適合跨語言信息檢索應用，我們將在本教程中展示，但也可以作為基於 RAG 的聊天機器人、多語言文本分類、摘要、情感分析以及任何使用嵌入的其他應用的基礎。通過使用這些模型，您可以將兩種語言的文本視為用同一種語言編寫的。

雖然許多巨型語言模型聲稱支持多種不同語言，但它們對所有語言的支持並不平等。越來越多的人質疑互聯網上英語主導造成的偏見以及機器翻譯文本的廣泛在線發布所導致的輸入源失真。通過專注於兩種語言，我們可以更好地控制兩種語言的嵌入質量，最大限度地減少偏見，同時產生更小的模型，其性能與聲稱可處理數十種語言的巨型模型相似或更高。

Jina Embeddings v2 雙語模型支持 8,192 個輸入上下文 token，使其不僅可以支持兩種語言，還可以與同類模型相比支持相對較大的文本段落。這使得它們非常適合需要處理更多文本信息進行嵌入的複雜用例。

tag在 Google Colab 上跟進學習

本教程有一個配套筆記本，您可以在 Google Colab 上運行，或在您自己的系統上本地運行。

tag安裝prerequisites

確保當前環境已安裝相關庫。您需要最新版本的 transformers，即使已經安裝，也請運行：

pip install -U transformers

本教程將使用 Meta 的 FAISS 庫進行向量搜索和比較。要安裝它，請運行：

pip install faiss-cpu

我們還將使用 Beautiful Soup 處理本教程中的輸入數據，請確保已安裝：

pip install bs4

tag訪問 Hugging Face

您需要訪問 Hugging Face，特別是一個帳戶和訪問令牌來下載模型。

如果您還沒有 Hugging Face 帳戶：

前往 https://huggingface.co/，您應該會在頁面右上角看到"Sign Up"按鈕。點擊並按照指示創建新帳戶。

登入帳戶後：

按照 Hugging Face 網站上的說明獲取訪問令牌。

您需要將此 token 複製到名為 HF_TOKEN 的環境變數中。如果您在筆記本（例如在 Google Colab）中工作或在 Python 程式中內部設定，請使用以下 Python 程式碼：

import os

os.environ['HF_TOKEN'] = "<your token here>"

在您的 shell 中，使用提供的語法來設定環境變數。在 bash 中：

export HF_TOKEN="<your token here>"

tag下載德語和英語的 Jina Embeddings v2

一旦設定好您的 token，您就可以使用 transformers 函式庫下載 Jina Embeddings 德語-英語雙語模型：

from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

第一次執行時可能需要幾分鐘，但模型會在本地端快取，所以之後重新啟動教程時不用擔心。

tag下載英語數據

在本教程中，我們將獲取 Pro Git: Everything You Need to Know About Git 這本書的英語版本。這本書也有中文和德語版本，我們稍後會在本教程中使用。

要下載 EPUB 版本，請執行以下命令：

wget -O progit-en.epub https://open.umn.edu/opentextbooks/formats/3437

這會將書籍複製到本地目錄中名為 progit-en.epub 的檔案。

作者 Scott Chacon 和 Ben Straub 的「Pro Git」紙本版封面。 — 紙本版的封面。

或者，您也可以直接訪問 https://open.umn.edu/opentextbooks/formats/3437 來下載到本地硬碟。此書根據 Creative Commons Attribution Non Commercial Share Alike 3.0 授權發布。

tag處理數據

這篇文本具有階層式章節的內部結構，我們可以通過在底層的 XHTML 數據中尋找 <section> 標籤輕鬆找到。以下程式碼讀取 EPUB 檔案並使用 EPUB 檔案的內部結構和 <section> 標籤進行拆分，然後將每個章節轉換為不帶 XHTML 標籤的純文本。它創建一個 Python 字典，其鍵是表示每個章節在書中位置的字串集合，其值是該章節的純文本內容。

from zipfile import ZipFile
from bs4 import BeautifulSoup
import copy

def decompose_epub(file_name):
    
    def to_top_text(section):
        selected = copy.copy(section)
				while next_section := selected.find("section"):
            next_section.decompose()
        return selected.get_text().strip()

    ret = {}
    with ZipFile(file_name, 'r') as zip:
        for name in zip.namelist():
            if name.endswith(".xhtml"):
                data = zip.read(name)
                doc = BeautifulSoup(data.decode('utf-8'), 'html.parser')
                ret[name + ":top"] = to_top_text(doc)
                for num, sect in enumerate(doc.find_all("section")):
                    ret[name + f"::{num}"] = to_top_text(sect)
    return ret

然後，在您之前下載的 EPUB 檔案上運行 decompose_epub 函數：

book_data = decompose_epub("progit-en.epub")

變數 book_data 現在將包含 583 個章節。例如：

print(book_data['EPUB/ch01-getting-started.xhtml::12'])

結果：

The Command Line
There are a lot of different ways to use Git.
There are the original command-line tools, and there are many graphical user interfaces of varying capabilities.
For this book, we will be using Git on the command line.
For one, the command line is the only place you can run all Git commands — most of the GUIs implement only a partial subset of Git functionality for simplicity.
If you know how to run the command-line version, you can probably also figure out how to run the GUI version, while the opposite is not necessarily true.
Also, while your choice of graphical client is a matter of personal taste, all users will have the command-line tools installed and available.
So we will expect you to know how to open Terminal in macOS or Command Prompt or PowerShell in Windows.
If you don't know what we're talking about here, you may need to stop and research that quickly so that you can follow the rest of the examples and descriptions in this book.

tag使用 Jina Embeddings v2 和 FAISS 生成和索引嵌入

對於這 583 個章節中的每一個，我們將生成一個嵌入並將其存儲在 FAISS 索引中。Jina Embeddings v2 模型接受最多 8192 個 tokens 的輸入，對於這樣的書來說已足夠大，我們不需要進行任何進一步的文本分割或檢查任何章節是否有太多 tokens。本書最長的章節大約有 12,000 個字符，對於普通英語來說，應該遠低於 8k token 的限制。

要生成單個嵌入，可以使用我們下載的模型的 encode 方法。例如：

model.encode([book_data['EPUB/ch01-getting-started.xhtml::12']])

這會返回一個包含單個 768 維向量的陣列：

array([[ 6.11135997e-02,  1.67829826e-01, -1.94809273e-01,
         4.45595086e-02,  3.28837298e-02, -1.33441269e-01,
         1.35364473e-01, -1.23119736e-02,  7.51526654e-02,
        -4.25386652e-02, -6.91794455e-02,  1.03527725e-01,
        -2.90831417e-01, -6.21018047e-03, -2.16205455e-02,
        -2.20803712e-02,  1.50471330e-01, -3.31433356e-01,
        -1.48741454e-01, -2.10959971e-01,  8.80039856e-02,
				....

這就是一個嵌入。

Jina Embeddings 模型支援批次處理。最佳批次大小取決於您運行時使用的硬體。批次太大會有記憶體不足的風險，批次太小則處理時間會更長。

⚠️

設置 batch_size=5 在沒有 GPU 的 Google Colab 免費版上可以運作，生成整套嵌入需要大約一小時。

在生產環境中，我們建議使用更強大的硬體或使用 Jina AI 的 Embedding API 服務。點擊下面的連結瞭解它如何運作以及如何開始免費使用。

以下程式碼生成嵌入並將它們儲存在 FAISS 索引中。根據您的資源適當設置 batch_size 變數。

import faiss

batch_size = 5

vector_data = []
faiss_index = faiss.IndexFlatIP(768)

data = [(key, txt) for key, txt in book_data.items()]
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for ind, batch in enumerate(batches):
    print(f"Processing batch {ind + 1} of {len(batches)}")
    batch_embeddings = model.encode([x[1] for x in batch], normalize_embeddings=True)
    vector_data.extend(batch)
    faiss_index.add(batch_embeddings)

在生產環境中，Python 字典並不是處理文檔和嵌入的適當或高效方式。您應該使用專門設計的向量資料庫，它會有自己的資料插入指引。

tag使用德語查詢英語結果

當我們從這組文本中查詢時，會發生以下情況：

Jina Embeddings 德語-英語模型將為查詢創建一個嵌入。
我們將使用 FAISS 索引（faiss_index）來獲取與查詢嵌入餘弦值最高的儲存嵌入，並返回其在索引中的位置。
我們將在向量數據陣列（vector_data）中查找對應的文本，並列印出餘弦值、文本位置和文本本身。

這就是以下 query 函數所做的事情。

def query(query_str):
    query = model.encode([query_str], normalize_embeddings=True)
    cosine, index = faiss_index.search(query, 1)
    print(f"Cosine: {cosine[0][0]}")
    loc, txt = vector_data[index[0][0]]
    print(f"Location: {loc}\\nText:\\n\\n{txt}")

現在讓我們試試看。

# Translation: "How do I roll back to a previous version?"
query("Wie kann ich auf eine frühere Version zurücksetzen?")

結果：

Cosine: 0.5202275514602661
Location: EPUB/ch02-git-basics-chapter.xhtml::20
Text:

Undoing things with git restore
Git version 2.23.0 introduced a new command: git restore.
It's basically an alternative to git reset which we just covered.
From Git version 2.23.0 onwards, Git will use git restore instead of git reset for many undo operations.
Let's retrace our steps, and undo things with git restore instead of git reset.

這是一個相當不錯的答案。讓我們再試一個：

# Translation: "What does 'version control' mean?"
query("Was bedeutet 'Versionsverwaltung'?")

結果：

Cosine: 0.5001817941665649
Location: EPUB/ch01-getting-started.xhtml::1
Text:

About Version Control

What is "version control", and why should you care?
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.
For the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.
If you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.
It allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.
Using a VCS also generally means that if you screw things up or lose files, you can easily recover.
In addition, you get all this for very little overhead.

你可以用自己的德語問題來試試看效果如何。作為一般做法，在處理文本信息檢索時，你應該要請求三到五個回答而不是只要一個。最佳答案通常不是第一個。

tag角色互換：用英文查詢德文文件

Pro Git: Everything You Need to Know About Git 這本書也有德文版本。我們可以使用相同的模型來做一個語言相反的演示。

下載電子書：

wget -O progit-de.epub https://open.umn.edu/opentextbooks/formats/3454

這會將書本複製到名為 progit-de.epub 的檔案。然後我們用與英文書相同的方式處理它：

book_data = decompose_epub("progit-de.epub")

接著用與之前相同的方式生成 embeddings：

batch_size = 5

vector_data = []
faiss_index = faiss.IndexFlatIP(768)

data = [(key, txt) for key, txt in book_data.items()]
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for ind, batch in enumerate(batches):
    print(f"Processing batch {ind + 1} of {len(batches)}")
    batch_embeddings = model.encode([x[1] for x in batch], normalize_embeddings=True)
    vector_data.extend(batch)
    faiss_index.add(batch_embeddings)

現在我們可以使用相同的 query 函數來用英文搜尋德文答案：

query("What is version control?")

結果：

Cosine: 0.6719034910202026
Location: EPUB/ch01-getting-started.xhtml::1
Text:

Was ist Versionsverwaltung?

Was ist „Versionsverwaltung", und warum sollten Sie sich dafür interessieren?
Versionsverwaltung ist ein System, welches die Änderungen an einer oder einer Reihe von Dateien über die Zeit hinweg protokolliert, sodass man später auf eine bestimmte Version zurückgreifen kann.
Die Dateien, die in den Beispielen in diesem Buch unter Versionsverwaltung gestellt werden, enthalten Quelltext von Software, tatsächlich kann in der Praxis nahezu jede Art von Datei per Versionsverwaltung nachverfolgt werden.
Als Grafik- oder Webdesigner möchte man zum Beispiel in der Lage sein, jede Version eines Bildes oder Layouts nachverfolgen zu können. Als solcher wäre es deshalb ratsam, ein Versionsverwaltungssystem (engl. Version Control System, VCS) einzusetzen.
Ein solches System erlaubt es, einzelne Dateien oder auch ein ganzes Projekt in einen früheren Zustand zurückzuversetzen, nachzuvollziehen, wer zuletzt welche Änderungen vorgenommen hat, die möglicherweise Probleme verursachen, herauszufinden wer eine Änderung ursprünglich vorgenommen hat und viele weitere Dinge.
Ein Versionsverwaltungssystem bietet allgemein die Möglichkeit, jederzeit zu einem vorherigen, funktionierenden Zustand zurückzukehren, auch wenn man einmal Mist gebaut oder aus irgendeinem Grund Dateien verloren hat.
All diese Vorteile erhält man für einen nur sehr geringen, zusätzlichen Aufwand.

這個章節的標題翻譯為「什麼是版本控制？」，所以這是個不錯的回答。

tag使用中文查詢

這些範例使用 Jina Embeddings v2 在中文和英文上會完全相同。要使用中文模型，只需執行以下程式：

from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

要取得 Pro Git: Everything You Need to Know About Git 的中文版：

wget -O progit-zh.epub https://open.umn.edu/opentextbooks/formats/3455

然後，處理中文版書籍：

book_data = decompose_epub("progit-zh.epub")

本教程中的所有其他程式碼都可以同樣使用。

tag未來展望：更多語言，包括程式語言

我們將在近期推出更多雙語模型，其中西班牙語和日語的模型已在開發中，還有一個支援英語和幾種主要程式語言的模型。這些模型特別適合管理多語言信息的國際企業，可以作為 AI 支援的信息檢索和基於 RAG 的生成式語言模型的基石，適用於各種前沿 AI 應用場景。

Jina AI 的模型體積小巧且在同類中表現最佳，證明你不需要最大的模型就能獲得最佳性能。通過專注於雙語表現，我們製作出的模型在這些語言上表現更好，更容易適應，並且比使用未經整理數據訓練的大型模型更具成本效益。

Jina Embeddings 可以從 Hugging Face、AWS marketplace（用於 Sagemaker）以及透過 Jina Embeddings web API 獲得。它們已完全整合到許多 AI 處理框架和向量資料庫中。

欲了解更多信息，請訪問 Jina Embeddings 網站，或聯繫我們討論 Jina AI 的產品如何適應您的業務流程。