Aquí Se Habla Español: Top-Quality Spanish-English Embeddings and 8k Context

jina-embeddings-v3 was released on Sept. 18, 2024. It is the best multilingual embedding model under 1B parameters.

Jina AI is once again demonstrating its commitment to high-quality multilingual AI models by releasing its Spanish-English bilingual model.

This model provides embedding vectors for texts of up to 8k tokens in Spanish or English, designed so that if texts in the two languages mean the same thing, their embeddings will be geometrically close together. Jina Embeddings v2 for Spanish and English is ideally suited for cross-language information retrieval, bilingual semantic analysis, and bilingual RAG applications.
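As a rough illustration of what "geometrically close" means, the sketch below compares vectors by cosine similarity, a common closeness measure for embeddings. The three 4-dimensional vectors are made-up stand-ins, not real model output (real embeddings from this model have 768 dimensions), and the example sentences are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values, for illustration only).
es_cat = [0.9, 0.1, 0.0, 0.2]  # "El gato duerme en el sofá."
en_cat = [0.8, 0.2, 0.1, 0.2]  # "The cat is sleeping on the sofa."
en_tax = [0.0, 0.9, 0.7, 0.1]  # "Tax returns are due in April."

same_meaning = cosine_similarity(es_cat, en_cat)
different_meaning = cosine_similarity(es_cat, en_tax)

# A bilingual model aims for the translation pair to score much higher
# than the unrelated pair.
print(f"translation pair: {same_meaning:.3f}")
print(f"unrelated pair:   {different_meaning:.3f}")
```

With real embeddings, the same comparison would run unchanged on the model's 768-dimensional vectors.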

This new model, jina-embeddings-v2-base-es, brings to Spanish the same state-of-the-art performance and ground-breaking feature set as Jina AI’s v2 models for English, German, Chinese, and programming languages:

8,192 tokens of input context, a leader among open-source embedding models.
Real bilingualism instead of uneven multilingualism. Jina AI’s bilingual models are trained to give balanced support to both languages, avoiding the biases of “multilingual” models trained on uncurated Internet scrapes.
jina-embeddings-v2-base-es is compact compared to open-source models of comparable performance. The embeddings themselves are 768 dimensions, saving space and run-time in production.
Jina Embeddings v2 models are fully integrated into major vector databases, RAG frameworks, and AI development libraries.
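To make the space saving from 768-dimensional embeddings concrete, here is a back-of-the-envelope estimate of raw index size. It assumes vectors are stored as uncompressed 32-bit floats and ignores any database overhead. The 1024-dimension figure is a hypothetical comparison point, not a number taken from this post.

```python
BYTES_PER_FLOAT32 = 4


def index_size_gib(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for the vectors alone, in GiB (no index overhead)."""
    return num_vectors * dimensions * bytes_per_value / 2**30


# One million documents, one embedding each.
size_768 = index_size_gib(1_000_000, 768)
size_1024 = index_size_gib(1_000_000, 1024)  # hypothetical larger model

print(f"768 dims:  {size_768:.2f} GiB")
print(f"1024 dims: {size_1024:.2f} GiB")
print(f"ratio:     {size_768 / size_1024:.2f}")
```

Distance computations scale with dimensionality in the same way, so the run-time saving per comparison follows the same ratio under these assumptions.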

Jina Embeddings v2 for Spanish and English is accessible via the Jina Embeddings API right now, with one million free tokens, so you pay nothing to try it out.
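A minimal sketch of calling the API from Python using only the standard library. The endpoint URL, the request payload, and the response layout shown here are assumptions about the Embeddings API rather than details given in this post, and `JINA_API_KEY` is a hypothetical environment variable name; check the current API documentation before relying on any of them.

```python
import json
import os
import urllib.request

# Assumed endpoint; the model name is the one announced in this post.
API_URL = "https://api.jina.ai/v1/embeddings"
MODEL = "jina-embeddings-v2-base-es"


def build_request(texts, api_key):
    """Build (but do not send) a POST request for a batch of texts."""
    payload = json.dumps({"model": MODEL, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def embed(texts, api_key):
    """Send the request and return one embedding per input text."""
    with urllib.request.urlopen(build_request(texts, api_key)) as resp:
        body = json.load(resp)
    # Assumed response layout: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in body["data"]]


if __name__ == "__main__":
    key = os.environ.get("JINA_API_KEY")  # hypothetical variable name
    if key:
        vectors = embed(["¿Dónde está la biblioteca?", "Where is the library?"], key)
        print(len(vectors), len(vectors[0]))
```

The network call only runs when a key is present, so the request-building part can be inspected or tested offline.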

Benchmarks

On Spanish benchmarks, Jina Embeddings v2 for Spanish and English outperforms the Multilingual E5 base model and the BGE-M3 model, the only comparable open-source models with Spanish support. The tests below (MTEB-es) are adapted from the Massive Text Embedding Benchmark (MTEB). You can view and run them from this GitHub repository.

[Table: models, sizes, and performance metrics for cross-language, retrieval, and classification tasks]

Jina Embeddings outperforms E5 on all metrics except classification and outperforms BGE-M3 in retrieval, clustering, and cross-language tasks, despite being only 15% to 30% of the size of these larger models. In summary:

Significantly better performance in retrieval tasks (like finding related documents in a database) and clustering (identifying groups of documents that belong together in a collection).
Roughly equal performance with E5 on reranking (ordering documents by semantic similarity) and near-equal performance on text classification in Spanish.
All three models have very similar benchmark scores for cross-language tasks (finding semantically similar texts in English to a Spanish input, or vice-versa), although Jina Embeddings still performs the best.
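The retrieval and cross-language tasks described above reduce to ranking documents by how similar their embeddings are to a query embedding. The sketch below shows that ranking step with made-up 3-dimensional vectors and hypothetical document IDs; it is not the benchmark code, and real systems would use the model's actual embeddings and usually a vector database.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the IDs of the top_k documents most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy stand-ins for embeddings of a mixed English/Spanish collection.
docs = {
    "en_recipe": [0.9, 0.1, 0.1],
    "es_receta": [0.8, 0.2, 0.0],
    "en_football": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]  # stand-in for a Spanish query about cooking

top = retrieve(query, docs)
print(top)
```

Because a bilingual model places translations near each other, the same ranking step serves both same-language and cross-language retrieval.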

Compared with closed-source multilingual models from OpenAI and Cohere, Jina Embeddings’ achievements are even more impressive given its compact size.

[Table: models, vendors, and metrics on Spanish benchmarks and cross-language tasks]

On retrieval tasks in Spanish, Jina Embeddings outperforms the closed-source models offered by OpenAI and Cohere. On cross-language tasks, it outperforms OpenAI and nearly equals Cohere’s performance.

Jina Embeddings: AI for a Multilingual World

Spanish is spoken by well over half a billion people, with official status in more than 20 countries, along with the European Union, the United Nations, the World Trade Organization, and FIFA. Introducing this specialized bilingual model makes clear Jina AI’s commitment to bringing AI technologies to everyone.

In addition to this Spanish-English model and its high-performance monolingual English model, Jina AI currently offers state-of-the-art embedding models for German, Chinese, and programming languages, with more to come.

Jina AI is committed to advancing AI technology for the broadest audience, placing a high value on transparency, accessibility, affordability, privacy, and data protection.

We value your feedback on all our models. Join our community channel to contribute and stay informed about new developments.