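Jina Classifier API: High-Performance Zero-Shot and Few-Shot Classification

Classification is a common downstream task for embeddings. Text embeddings can be used to sort text into predefined labels for spam detection or sentiment analysis. Multimodal embeddings such as jina-clip-v1 can be applied to content-based filtering or tag annotation. More recently, classification has also been used to route queries to an appropriate LLM based on complexity and cost: a simple arithmetic query might go to a small language model, while a complex reasoning task might be directed to a more capable but more expensive LLM.

Today, we are introducing the new Classifier API on Jina AI's Search Foundation. It supports zero-shot and few-shot online classification, built on our latest embedding models such as jina-embeddings-v3 and jina-clip-v1. The Classifier API is built on online passive-aggressive learning, which lets it adapt to new data in real time. Users can start with a zero-shot classifier and use it immediately. They can then update the classifier incrementally by submitting new examples or when concept drift occurs. This enables efficient, scalable classification across content types without a large amount of initial labeled data. Users can also publish their classifiers for public use. When we release new embedding models, such as the upcoming multilingual jina-clip-v2, users can access them immediately through the Classifier API, so classification capabilities stay up to date.

Zero-Shot Classification

The Classifier API offers strong zero-shot classification, letting you classify text or images without first training on labeled data. Every classifier starts with zero-shot capability and can later be enhanced with additional training data or updates, a topic we explore in the few-shot section further down.

Example 1: LLM Request Routing

Here is an example of using the Classifier API to route LLM queries: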

curl https://api.jina.ai/v1/classify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -d '{
    "model": "jina-embeddings-v3",
    "labels": [
      "Simple task",
      "Complex reasoning",
      "Creative writing"
    ],
    "input": [
      "Calculate the compound interest on a principal of $10,000 invested for 5 years at an annual rate of 5%, compounded quarterly.",
      "分析使用CRISPR基因编辑技术在人类胚胎中的伦理影响。考虑潜在的医疗益处和长期社会后果。",
      "AIが自意識を持つディストピアの未来を舞台にした短編小説を書いてください。人間とAIの関係や意識の本質をテーマに探求してください。",
      "Erklären Sie die Unterschiede zwischen Merge-Sort und Quicksort-Algorithmen in Bezug auf Zeitkomplexität, Platzkomplexität und Leistung in der Praxis.",
      "Write a poem about the beauty of nature and its healing power on the human soul.",
      "Translate the following sentence into French: The quick brown fox jumps over the lazy dog."
    ]
  }'

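This example uses jina-embeddings-v3 to route user queries in several languages (English, Chinese, Japanese, and German) into three categories, which correspond to three LLMs of different sizes. The API response looks like this: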

{
  "usage": {"total_tokens": 256, "prompt_tokens": 256},
  "data": [
    {"object": "classification", "index": 0, "prediction": "Simple task", "score": 0.35216382145881653},
    {"object": "classification", "index": 1, "prediction": "Complex reasoning", "score": 0.34310275316238403},
    {"object": "classification", "index": 2, "prediction": "Creative writing", "score": 0.3487184941768646},
    {"object": "classification", "index": 3, "prediction": "Complex reasoning", "score": 0.35207709670066833},
    {"object": "classification", "index": 4, "prediction": "Creative writing", "score": 0.3638903796672821},
    {"object": "classification", "index": 5, "prediction": "Simple task", "score": 0.3561534285545349}
  ]
}

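The response includes:

usage: token usage information.
data: an array of classification results, one per input.
- Each result contains the predicted label (prediction) and a confidence score (score). The score is computed with softmax normalization, so the probabilities across all classes sum to 1. For zero-shot classification, it is based on the cosine similarities between the input and label embeddings under the classification task-LoRA; for few-shot classification, it is based on a learned linear transformation of the input embedding for each class.
- index corresponds to the position of the input in the original request.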
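To make the zero-shot scoring concrete, here is a minimal sketch of softmax over cosine similarities. It is an illustration under stated assumptions, not the service's implementation: the three-dimensional vectors are invented stand-ins for real embeddings, the helper names (cosine, softmax, zero_shot_scores) are hypothetical, and the softmax is assumed to be applied to the raw similarities with no extra scaling.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(input_vec, label_vecs):
    """Map each label to its softmax-normalized cosine similarity."""
    labels = list(label_vecs)
    sims = [cosine(input_vec, label_vecs[name]) for name in labels]
    return dict(zip(labels, softmax(sims)))

# Toy 3-dimensional "embeddings", invented for illustration only.
label_vecs = {
    "Simple task": [0.9, 0.1, 0.0],
    "Complex reasoning": [0.1, 0.9, 0.1],
    "Creative writing": [0.0, 0.2, 0.9],
}
scores = zero_shot_scores([0.8, 0.3, 0.1], label_vecs)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Because cosine similarities lie in [-1, 1], a softmax applied to them directly keeps every probability fairly close to 1 divided by the number of labels. That would be consistent with the scores around 0.35 in the three-label response above, so the ranking of labels is more informative than the absolute score.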

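Example 2: Image and Text Classification

Let's look at a multimodal example using jina-clip-v1. This model can classify both text and images, which makes it well suited to content classification across media types. Consider the following API call: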

curl https://api.jina.ai/v1/classify \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY_HERE" \
  -d '{
    "model": "jina-clip-v1",
    "labels": [
      "Food and Dining",
      "Technology and Gadgets",
      "Nature and Outdoors",
      "Urban and Architecture"
    ],
    "input": [
      {"text": "A sleek smartphone with a high-resolution display and multiple camera lenses"},
      {"text": "Fresh sushi rolls served on a wooden board with wasabi and ginger"},
      {"image": "https://picsum.photos/id/11/367/267"},
      {"image": "https://picsum.photos/id/22/367/267"},
      {"text": "Vibrant autumn leaves in a dense forest with sunlight filtering through"},
      {"image": "https://picsum.photos/id/8/367/267"}
    ]
  }'

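Note how images are passed in the request, here as URLs; you can also represent an image as a base64 string. The API returns the following classification results: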

{
  "usage": {"total_tokens": 12125, "prompt_tokens": 12125},
  "data": [
    {"object": "classification", "index": 0, "prediction": "Technology and Gadgets", "score": 0.30329811573028564},
    {"object": "classification", "index": 1, "prediction": "Food and Dining", "score": 0.2765541970729828},
    {"object": "classification", "index": 2, "prediction": "Nature and Outdoors", "score": 0.29503118991851807},
    {"object": "classification", "index": 3, "prediction": "Urban and Architecture", "score": 0.2648046910762787},
    {"object": "classification", "index": 4, "prediction": "Nature and Outdoors", "score": 0.3133063316345215},
    {"object": "classification", "index": 5, "prediction": "Technology and Gadgets", "score": 0.27474141120910645}
  ]
}
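The request above passes images by URL. For local files, the base64 option mentioned earlier can be sketched as follows. This is an assumption-laden sketch: it assumes the base64 string goes in the same "image" field that carries the URL, and the helper names are hypothetical.

```python
import base64

def image_input_from_bytes(raw_bytes):
    """Wrap raw image bytes as a base64 string in an {"image": ...} input item.

    Assumption: the base64 string goes in the same "image" field
    that carries the URL in the request above.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"image": encoded}

def image_input_from_file(path):
    """Read a local image file and wrap it the same way."""
    with open(path, "rb") as handle:
        return image_input_from_bytes(handle.read())

# Mixed text and image inputs, mirroring the request above.
# The bytes are the 8-byte PNG signature, standing in for a real image file.
fake_image = b"\x89PNG\r\n\x1a\n"
inputs = [
    {"text": "Fresh sushi rolls served on a wooden board with wasabi and ginger"},
    image_input_from_bytes(fake_image),
]
print(inputs[1])
```

The resulting list can be used as the "input" value of a classify request in place of the URL-based items.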

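Example 3: Detecting Whether Jina Reader Fetched Real Content

An interesting application of zero-shot classification is determining website accessibility through Jina Reader. This may look like a simple task, but it is quite complex in practice. Block messages vary from site to site, appear in different languages, and cite various reasons (paywalls, rate limits, server outages). This diversity makes it hard to rely on regular expressions or fixed rules to catch every scenario.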

import requests
import json

response1 = requests.get('https://r.jina.ai/https://jina.ai')

url = 'https://api.jina.ai/v1/classify'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer YOUR_API_KEY_HERE'  # replace with your API key
}
data = {
    'model': 'jina-embeddings-v3',
    'labels': ['Blocked', 'Accessible'],
    'input': [{'text': response1.text[:8000]}]
}
response2 = requests.post(url, headers=headers, data=json.dumps(data))

print(response2.text)

The script fetches content through r.jina.ai and uses the Classifier API to classify it as "Blocked" or "Accessible". For example, https://r.jina.ai/https://www.crunchbase.com/organization/jina-ai may be flagged as "Blocked" because of access restrictions, while https://r.jina.ai/https://jina.ai should be "Accessible".

{"usage":{"total_tokens":185,"prompt_tokens":185},"data":[{"object":"classification","index":0,"prediction":"Blocked","score":0.5392698049545288}]}

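The Classifier API can effectively distinguish real content from blocked results returned by Jina Reader.

This example uses jina-embeddings-v3 to provide a fast, automated way to monitor website accessibility, which is particularly useful for content aggregation or web crawling systems in multilingual settings.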
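In a crawler, the JSON response usually needs to become a yes/no decision. A minimal sketch follows, assuming the response shape shown above; the is_blocked helper and its min_score threshold are hypothetical and not part of the API.

```python
def is_blocked(response_json, min_score=0.5):
    """Return True when the first result is "Blocked" with enough confidence.

    min_score is a hypothetical client-side threshold, not an API parameter.
    """
    result = response_json["data"][0]
    return result["prediction"] == "Blocked" and result["score"] >= min_score

# The sample response shown above, as a Python dict.
sample = {
    "usage": {"total_tokens": 185, "prompt_tokens": 185},
    "data": [{"object": "classification", "index": 0,
              "prediction": "Blocked", "score": 0.5392698049545288}],
}
print(is_blocked(sample))       # 0.539 clears the default threshold
print(is_blocked(sample, 0.6))  # a stricter threshold rejects it
```

With only two labels the winning score is always at least 0.5, so the default threshold accepts every "Blocked" prediction. Raising it trades missed blocks for fewer false alarms, and a score as close to 0.5 as the one above may be worth a retry rather than a hard decision.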

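Example 4: Filtering Statements from Opinions for Grounding

Another interesting application of zero-shot classification is separating statement-like claims from opinions in a long document. Note that the classifier itself cannot determine whether something is true. Instead, it identifies text written in the style of a factual statement, which can then be verified by a more expensive fact-verification API. This two-step pipeline is key to effective fact-checking: first filter out all opinions and feelings, then send the remaining statements for verification.

Consider this paragraph about the Space Race of the 1960s: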

The Space Race of the 1960s was a breathtaking testament to human ingenuity. When the Soviet Union launched Sputnik 1 on October 4, 1957, it sent shockwaves through American society, marking the undeniable start of a new era. The silvery beeping of that simple satellite struck fear into the hearts of millions, as if the very stars had betrayed Western dominance. NASA was founded in 1958 as America's response, and they poured an astounding $28 billion into the Apollo program between 1960 and 1973. While some cynics claimed this was a waste of resources, the technological breakthroughs were absolutely worth every penny spent. On July 20, 1969, Neil Armstrong and Buzz Aldrin achieved the most magnificent triumph in human history by walking on the moon, their footprints marking humanity's destiny among the stars. The Soviet space program, despite its early victories, ultimately couldn't match the superior American engineering and determination. The moon landing was not just a victory for America - it represented the most inspiring moment in human civilization, proving that our species was meant to reach beyond our earthly cradle.

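This text deliberately mixes different styles of writing, from statement-like claims (such as "Sputnik 1 was launched on October 4, 1957") to clear opinions ("a breathtaking testament"), emotional language ("struck fear into the hearts"), and interpretive claims ("marking the undeniable start of a new era").

The zero-shot classifier works purely at the semantic level: it identifies whether a piece of text is written as a statement or as an opinion or interpretation. For example, "The Soviet Union launched Sputnik 1 on October 4, 1957" is written as a statement, while "The Space Race was a breathtaking testament" is clearly written as an opinion.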

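import requests

API_KEY = 'YOUR_API_KEY_HERE'  # replace with your Jina API key
# `text` must hold the Space Race passage quoted above, as a single string

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}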

# Step 1: Split text and classify
chunks = [chunk.strip() for chunk in text.split('.') if chunk.strip()]
labels = [
    "subjective, opinion, feeling, personal experience, creative writing, position",
    "fact"
]

# Classify chunks
classify_response = requests.post(
    'https://api.jina.ai/v1/classify',
    headers=headers,
    json={
        "model": "jina-embeddings-v3",
        "input": [{"text": chunk} for chunk in chunks],
        "labels": labels
    }
)

# Sort chunks
subjective_chunks = []
factual_chunks = []
for chunk, classification in zip(chunks, classify_response.json()['data']):
    if classification['prediction'] == labels[0]:
        subjective_chunks.append(chunk)
    else:
        factual_chunks.append(chunk)

print("\nSubjective statements:", subjective_chunks)
print("\nFactual statements:", factual_chunks)

You will get:

Subjective statements: ['The Space Race of the 1960s was a breathtaking testament to human ingenuity', 'The silvery beeping of that simple satellite struck fear into the hearts of millions, as if the very stars had betrayed Western dominance', 'While some cynics claimed this was a waste of resources, the technological breakthroughs were absolutely worth every penny spent', "The Soviet space program, despite its early victories, ultimately couldn't match the superior American engineering and determination"]

Factual statements: ['When the Soviet Union launched Sputnik 1 on October 4, 1957, it sent shockwaves through American society, marking the undeniable start of a new era', "NASA was founded in 1958 as America's response, and they poured an astounding $28 billion into the Apollo program between 1960 and 1973", "On July 20, 1969, Neil Armstrong and Buzz Aldrin achieved the most magnificent triumph in human history by walking on the moon, their footprints marking humanity's destiny among the stars", 'The moon landing was not just a victory for America - it represented the most inspiring moment in human civilization, proving that our species was meant to reach beyond our earthly cradle']

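Remember that being written as a statement does not make something true. That is why we need the second step: feeding these statement-like claims to the fact-verification API for actual checking. For example, let's use the code below to verify this statement: "NASA was founded in 1958 as America's response, and they poured an astounding $28 billion into the Apollo program between 1960 and 1973".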

from urllib.parse import quote  # URL-encodes the statement for the request path

ground_headers = {
    'Accept': 'application/json',
    'Authorization': f'Bearer {API_KEY}'
}

ground_response = requests.get(
    f'https://g.jina.ai/{quote(factual_chunks[1])}',
    headers=ground_headers
)

print(ground_response.json())

This gives you:

{'code': 200, 'status': 20000, 'data': {'factuality': 1, 'result': True, 'reason': "The statement is supported by multiple references confirming NASA's founding in 1958 and the significant financial investment in the Apollo program. The $28 billion figure aligns with the data provided in the references, which detail NASA's expenditures during the Apollo program from 1960 to 1973. Additionally, the context of NASA's budget peaking during this period further substantiates the claim. Therefore, the statement is factually correct based on the available evidence.", 'references': [{'url': 'https://en.wikipedia.org/wiki/Budget_of_NASA', 'keyQuote': "NASA's budget peaked in 1964–66 when it consumed roughly 4% of all federal spending. The agency was building up to the first Moon landing and the Apollo program was a top national priority, consuming more than half of NASA's budget.", 'isSupportive': True}, {'url': 'https://en.wikipedia.org/wiki/NASA', 'keyQuote': 'Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA)', 'isSupportive': True}, {'url': 'https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo.html', 'keyQuote': 'More details on Apollo lunar landings', 'isSupportive': True}, {'url': 'https://usafacts.org/articles/50-years-after-apollo-11-moon-landing-heres-look-nasas-budget-throughout-its-history/', 'keyQuote': 'NASA has spent its money so far.', 'isSupportive': True}, {'url': 'https://www.nasa.gov/history/', 'keyQuote': 'Discover the history of our human spaceflight, science, technology, and aeronautics programs.', 'isSupportive': True}, {'url': 'https://www.nasa.gov/the-apollo-program/', 'keyQuote': 'Commander for Apollo 11, first to step on the lunar surface.', 'isSupportive': True}, {'url': 'https://www.planetary.org/space-policy/cost-of-apollo', 'keyQuote': 'A rich data set tracking the costs of Project Apollo, free for public use. Includes unprecedented program-by-program cost breakdowns.', 'isSupportive': True}, {'url': 'https://www.statista.com/statistics/1342862/nasa-budget-project-apollo-costs/', 'keyQuote': 'NASA&#x27;s monetary obligations compared to Project Apollo&#x27;s total costs from 1960 to 1973 (in million U.S. dollars)', 'isSupportive': True}], 'usage': {'tokens': 10640}}}

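With a factuality score of 1, the fact-verification API confirms that this statement is well grounded in the historical record. This approach opens up interesting possibilities, from analyzing historical documents to checking news articles in real time. By combining zero-shot classification with fact verification, we get a strong automated pipeline for information analysis: first filter out opinions, then verify the remaining statements against trusted sources.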
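The chunking step in this example splits on every period, which works for the passage above but would cut a sentence in two at a decimal such as "3.5". A slightly more careful splitter can be sketched with a regular expression. The split_sentences helper is hypothetical and is still a rough chunker rather than a full sentence tokenizer.

```python
import re

def split_sentences(text):
    """Split after ., ! or ? only when the punctuation is followed by whitespace.

    This keeps decimals such as "3.5" intact, which a plain
    text.split('.') would cut in two. Abbreviations followed by a
    space (for example "U.S. dollars") are still split.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sample = ("Inflation rose 3.5 percent last year. "
          "That is terrible! NASA was founded in 1958.")
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Unlike the original split, this version keeps the closing punctuation on each chunk, which may or may not matter for the classifier.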

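Notes on Zero-Shot Classification

Use Semantic Labels

When using zero-shot classification, it is essential to use semantically meaningful labels rather than abstract symbols or numbers. For example, "Technology", "Nature", and "Food" are far more effective than "Class1", "Class2", "Class3" or "0", "1", "2". "Positive sentiment" is more effective than "Positive" or "True". Embedding models understand semantic relationships, so descriptive labels let the model draw on its pretrained knowledge for more accurate classification. Our earlier post explores how to craft effective semantic labels for better classification results.

Stateless Nature

Unlike traditional machine learning approaches, zero-shot classification is inherently stateless. Given the same input and model, the results are the same no matter who calls the API or when. The model does not learn or update based on the classifications it performs; each task is independent. This allows immediate use with no setup or training, and gives you the flexibility to change the categories between API calls.

This stateless nature contrasts sharply with the few-shot and online learning approaches we explore next. There, the model can adapt to new examples, and may produce different results over time or across users.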

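Few-Shot Classification

Few-shot classification offers a simple way to create and update classifiers with minimal labeled data. It provides two main endpoints: train and classify.

The train endpoint lets you create or update a classifier with a small number of examples. Your first call to train returns a classifier_id, which you can use for subsequent training whenever you have new data, notice a change in data distribution, or need to add new categories. This flexible approach lets your classifier evolve over time, adapting to new patterns and categories without starting from scratch.

As with zero-shot classification, you use the classify endpoint to make predictions. The main difference is that you include your classifier_id in the request and do not need to provide candidate labels, because they are already part of the trained model.

Example: Training a Support Ticket Assigner

Let's explore these features with an example that classifies customer support tickets and assigns them to different teams at a fast-growing tech startup.

Initial Training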

curl -X 'POST' \
  'https://api.jina.ai/v1/train' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "jina-embeddings-v3",
  "access": "private",
  "input": [
    {
      "text": "I cant log into my account after the latest app update.",
      "label": "team1"
    },
    {
      "text": "My subscription renewal failed due to an expired credit card.",
      "label": "team2"
    },
    {
      "text": "How do I export my data from the platform?",
      "label": "team3"
    }
  ],
  "num_iters": 10
}'

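Note that in few-shot learning we are free to use class labels such as team1 and team2, even though they have no inherent semantic meaning. In the response, you get a classifier_id that represents the newly created classifier.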

{
  "classifier_id": "b36b7b23-a56c-4b52-a7ad-e89e8f5439b6",
  "num_samples": 3
}

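Make a note of this classifier_id; you will need it to refer to the classifier later.

Updating the Classifier to Adapt to a Team Restructuring

As the example company grows, new types of issues appear and the team structure changes. The strength of few-shot classification is that it adapts to such changes quickly. We can update the classifier by providing the classifier_id and new examples, introducing new team categories (for example team4) or reassigning existing issue types to different teams as the organization evolves.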

curl -X 'POST' \
  'https://api.jina.ai/v1/train' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
  "classifier_id": "b36b7b23-a56c-4b52-a7ad-e89e8f5439b6",
  "input": [
    {
      "text": "Im getting a 404 error when trying to access the new AI chatbot feature.",
      "label": "team4"
    },
    {
      "text": "The latest security patch is conflicting with my company firewall.",
      "label": "team1"
    },
    {
      "text": "I need help setting up SSO for my organization account.",
      "label": "team5"
    }
  ],
  "num_iters": 10
}'

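Using the Trained Classifier

At inference time, you only need to provide the input text and the classifier_id. The API handles the mapping between the input and the previously trained categories, returning the most suitable label based on the classifier's current state.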

curl -X 'POST' \
  'https://api.jina.ai/v1/classify' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
  "classifier_id": "b36b7b23-a56c-4b52-a7ad-e89e8f5439b6",
  "input": [
    {
      "text": "The new feature is causing my dashboard to load slowly."
    },
    {
      "text": "I need to update my billing information for tax purposes."
    }
  ]
}'
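The three curl calls above share one pattern: the first train call names a model and access level, later train calls name a classifier_id, and classify calls send a classifier_id with unlabeled inputs. A Python sketch of that pattern is shown below. The builder functions are hypothetical helpers; only the endpoint URLs and field names come from the requests above.

```python
TRAIN_URL = "https://api.jina.ai/v1/train"
CLASSIFY_URL = "https://api.jina.ai/v1/classify"

def build_train_payload(examples, classifier_id=None,
                        model="jina-embeddings-v3", access="private",
                        num_iters=10):
    """Build a /train body from (text, label) pairs.

    Without classifier_id the body asks for a new classifier (model and
    access are included); with classifier_id it updates an existing one.
    """
    payload = {
        "input": [{"text": text, "label": label} for text, label in examples],
        "num_iters": num_iters,
    }
    if classifier_id is None:
        payload["model"] = model
        payload["access"] = access
    else:
        payload["classifier_id"] = classifier_id
    return payload

def build_classify_payload(texts, classifier_id):
    """Build a /classify body for a trained classifier (no labels needed)."""
    return {
        "classifier_id": classifier_id,
        "input": [{"text": text} for text in texts],
    }

new_payload = build_train_payload(
    [("How do I export my data from the platform?", "team3")])
update_payload = build_train_payload(
    [("I need help setting up SSO for my organization account.", "team5")],
    classifier_id="b36b7b23-a56c-4b52-a7ad-e89e8f5439b6")
classify_payload = build_classify_payload(
    ["I need to update my billing information for tax purposes."],
    classifier_id="b36b7b23-a56c-4b52-a7ad-e89e8f5439b6")

# To send a payload (needs the third-party requests package and a valid key):
#   import requests
#   headers = {"Content-Type": "application/json",
#              "Authorization": "Bearer YOUR_API_KEY_HERE"}
#   response = requests.post(TRAIN_URL, headers=headers, json=new_payload)
print(new_payload)
```

Keeping payload construction separate from the HTTP call makes the request bodies easy to inspect and test without spending tokens.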

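Few-shot mode has two unique parameters.

Parameter `num_iters`

The num_iters parameter adjusts how intensively the classifier learns from the training examples. The default value of 10 works well in most cases, but you can tune it strategically based on **your confidence in the training data**. For high-quality examples that are critical to classification, increase num_iters to reinforce their importance. Conversely, for less reliable examples, lower num_iters to minimize their effect on the classifier's performance. This parameter can also be used to implement time-aware learning, where more recent examples get higher iteration counts to adapt to evolving patterns while retaining historical knowledge.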
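One way to approach time-aware learning is to derive num_iters from the age of each example and send one train request per iteration count, since num_iters applies to a whole request. The sketch below assumes an exponential decay with a 30-day half-life; the decay rule, the helper names, and the defaults other than the base value of 10 are illustrative choices, not API behavior.

```python
def iters_for_age(age_days, base_iters=10, half_life_days=30, min_iters=1):
    """Halve the iteration count for every half_life_days of example age."""
    decayed = base_iters * 0.5 ** (age_days / half_life_days)
    return max(min_iters, round(decayed))

def group_by_iters(examples):
    """Group (text, label, age_days) tuples by their num_iters value.

    Each group can then be sent as one train request with that num_iters.
    """
    groups = {}
    for text, label, age_days in examples:
        key = iters_for_age(age_days)
        groups.setdefault(key, []).append({"text": text, "label": label})
    return groups

examples = [
    ("Im getting a 404 error on the new AI chatbot feature.", "team4", 0),
    ("My subscription renewal failed.", "team2", 30),
    ("How do I export my data from the platform?", "team3", 365),
]
groups = group_by_iters(examples)
for num_iters, batch in sorted(groups.items()):
    print(num_iters, [item["label"] for item in batch])
```

Sending the groups in ascending order of num_iters means the most recent examples are also the last ones the classifier sees.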

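Parameter `access`

The access parameter lets you control who can use your classifier. By default, classifiers are private and accessible only to you. Setting access to "public" allows anyone who has your classifier_id to **use it with their own API key and token quota**. This enables classifier sharing while preserving privacy: users cannot see your training data or configuration, and you cannot see their classification requests. This parameter applies only to few-shot classification, because zero-shot classifiers are stateless. There is no need to share a zero-shot classifier, since the same request always produces the same response regardless of who makes it.

Notes on Few-Shot Learning

Few-shot classification in our API has some distinctive characteristics worth noting. Unlike traditional machine learning models, our implementation uses single-pass online learning: training examples are processed to update the classifier's weights but are not stored afterward. This means you cannot retrieve historical training data, but it provides better privacy and resource efficiency.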
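To give a feel for this style of learning, here is a sketch of the textbook multiclass passive-aggressive update (the PA-I variant) over plain Python lists. It is not the API's implementation: the class name, the aggressiveness cap, the toy 2-dimensional inputs, and the way num_iters is mapped to repeated updates are all assumptions made for illustration.

```python
class OnlinePAClassifier:
    """Multiclass passive-aggressive (PA-I) learner over dense vectors."""

    def __init__(self, dim, aggressiveness=1.0):
        self.dim = dim
        self.aggressiveness = aggressiveness  # caps the step size (PA-I)
        self.weights = {}  # label -> weight vector; labels can be added later

    def _score(self, label, x):
        return sum(w * v for w, v in zip(self.weights[label], x))

    def predict(self, x):
        return max(self.weights, key=lambda label: self._score(label, x))

    def update(self, x, label, num_iters=1):
        """Learn from one example; the example itself is not stored."""
        self.weights.setdefault(label, [0.0] * self.dim)
        rivals = [name for name in self.weights if name != label]
        norm_sq = sum(v * v for v in x)
        if not rivals or norm_sq == 0.0:
            return  # nothing to separate from yet
        for _ in range(num_iters):
            rival = max(rivals, key=lambda name: self._score(name, x))
            margin = self._score(label, x) - self._score(rival, x)
            loss = max(0.0, 1.0 - margin)
            if loss == 0.0:
                break  # passive: the margin is already satisfied
            step = min(self.aggressiveness, loss / (2.0 * norm_sq))
            # aggressive: pull the true class toward x, push the rival away
            self.weights[label] = [
                w + step * v for w, v in zip(self.weights[label], x)]
            self.weights[rival] = [
                w - step * v for w, v in zip(self.weights[rival], x)]

clf = OnlinePAClassifier(dim=2)
stream = [([1.0, 0.0], "team1"), ([0.0, 1.0], "team2"),
          ([0.9, 0.1], "team1"), ([0.1, 0.9], "team2")]
for x, label in stream:
    clf.update(x, label, num_iters=10)
print(clf.predict([1.0, 0.0]), clf.predict([0.0, 1.0]))
```

Only the per-label weight vectors persist between updates, which mirrors the point above: once an example has adjusted the weights, it can be discarded.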

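Few-shot learning is powerful, but it does need a warm-up period before it outperforms zero-shot classification. Our benchmarks show that 200-400 training examples are usually enough to see better performance. However, you do not need to provide examples for every category up front; the classifier can expand over time to accommodate new categories. Just keep in mind that newly added categories may go through a brief cold-start period or suffer from class imbalance until enough examples are provided.

Benchmark

In our benchmark analysis, we evaluated the zero-shot and few-shot approaches on a variety of datasets, including text classification tasks such as emotion detection (6 classes) and spam detection (2 classes), and image classification tasks such as CIFAR10 (10 classes). The evaluation framework uses standard train-test splits; zero-shot requires no training data, while few-shot uses part of the training set. We tracked key parameters such as training set size and number of target classes to allow controlled comparisons. For robustness, particularly for few-shot learning, each input went through multiple training iterations. We compared these modern approaches against traditional baselines such as linear SVM and RBF SVM to provide performance context.

The figure shows F1 scores. For the full benchmark setup, see this Google Sheet.

The F1 plots reveal interesting patterns across the three tasks. Unsurprisingly, zero-shot classification performs consistently from the start, regardless of the amount of training data. In contrast, few-shot learning shows a steep learning curve: it starts lower but quickly overtakes zero-shot as training data increases. The two approaches eventually reach comparable accuracy at around 400 samples, with few-shot keeping a slight edge. This pattern holds in both the multi-class and image classification scenarios, which suggests that few-shot learning is particularly advantageous when some training data is available, while zero-shot delivers reliable performance even without any training samples. The table below summarizes the differences between zero-shot and few-shot classification from the API user's perspective.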

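Feature	Zero-Shot	Few-Shot
Primary use case	Default solution for general classification	For data outside the v3/clip-v1 domain or time-sensitive data
Training data required	No	Yes
Labels required in /train	N/A	Yes
Labels required in /classify	Yes	No
Classifier ID required	No	Yes
Semantic labels required	Yes	No
State management	Stateless	Stateful
Continuous model updates	No	Yes
Access control	No	Yes
Maximum classes	256	16
Maximum classifiers	N/A	16
Maximum inputs per request	1,024	1,024
Maximum token length per input	8,192 tokens	8,192 tokens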
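The per-request input limit in the table means larger workloads have to be split across several classify calls. A minimal batching sketch follows; the helper names are hypothetical, and the only value taken from the table is the 1,024-input limit.

```python
MAX_INPUTS_PER_REQUEST = 1024  # from the limits table above

def batched(items, size=MAX_INPUTS_PER_REQUEST):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def global_index(batch_number, local_index, size=MAX_INPUTS_PER_REQUEST):
    """Map a result's per-request index back to its position in the full list."""
    return batch_number * size + local_index

inputs = [{"text": f"ticket {i}"} for i in range(2500)]
batches = list(batched(inputs))
print([len(batch) for batch in batches])
```

Because each response reports index relative to its own request, global_index is what lets results from the second and later batches be matched back to the original list.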

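Conclusion

The Classifier API provides strong zero-shot and few-shot classification for text and image content, powered by advanced embedding models such as jina-embeddings-v3 and jina-clip-v1. Our benchmarks show that zero-shot classification delivers reliable performance without training data, which makes it an excellent starting point for most tasks, and it supports up to 256 classes. Although few-shot learning can achieve slightly higher accuracy with training data, we recommend starting with zero-shot classification because it produces results immediately and is flexible.

The API's versatility supports a range of applications, from routing LLM queries to detecting website accessibility and classifying multilingual content. Whether you start with zero-shot or move to few-shot learning for specific scenarios, the API keeps a consistent interface that integrates smoothly into your pipeline. We are especially looking forward to seeing how developers use this API in their applications, and we will roll out support for new embedding models such as jina-clip-v2 in the future.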