This model is deprecated by newer models.

jina-embeddings-v2-base-es

Spanish-English bilingual embeddings with SOTA performance
License: Apache-2.0
Release Date: 2024-02-14
Input: Text
Output: Vector
Model Details
Parameters: 161M
Input Token Length: 8K
Output Dimension: 768
Language Support
🇺🇸 English
🇪🇸 Español
Related Models
jina-embeddings-v2-base-en
jina-embeddings-v2-base-de
jina-embeddings-v2-base-zh
Tags
spanish
bilingual
long-context
8k-context
bert-based
production-ready
semantic-search
cross-lingual
text-embeddings
fine-tunable
Available via
Jina API, AWS SageMaker, Microsoft Azure, Hugging Face
Publications (1)
arXiv
February 26, 2024
Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings

Overview

Jina Embeddings v2 Base Spanish is a bilingual text embedding model for cross-lingual information retrieval and analysis across Spanish and English content. Multilingual models often show bias towards specific languages; this model is designed to deliver balanced performance in both Spanish and English, which makes it well suited to organizations operating in Spanish-speaking markets or handling bilingual content. Its most notable feature is geometrically aligned embeddings: when texts in Spanish and English express the same meaning, their vector representations lie close together in the embedding space, so a query in one language can retrieve documents in the other.
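
To make the idea of aligned embeddings concrete, here is a minimal sketch of cross-language ranking by cosine similarity. The 3-dimensional vectors are invented purely for illustration and are not output from the model (real embeddings have 768 dimensions); the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, made up for this example only (not real model output).
corpus = {
    "El gato duerme en el sofá.": [0.9, 0.1, 0.2],
    "La bolsa cayó un dos por ciento.": [0.1, 0.9, 0.3],
}
# Stand-in vector for the English query "The cat sleeps on the couch."
query_vec = [0.88, 0.15, 0.18]

ranking = rank_by_similarity(query_vec, corpus)
print(ranking[0][0])
```

If the embeddings are aligned as described, the Spanish sentence with the same meaning as the English query should rank first, with no translation step in between.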

Methods

The model is built on symmetric bidirectional ALiBi (Attention with Linear Biases), which lets it process sequences of up to 8,192 tokens without traditional positional embeddings. It uses a modified BERT architecture with 161M parameters, incorporating Gated Linear Units (GLU) and specialized layer normalization techniques. Training follows a three-stage process: pre-training on a massive text corpus, fine-tuning on carefully curated text pairs, and finally hard-negative training to improve discrimination between similar but semantically distinct content. Combined with 768-dimensional output embeddings, this approach allows the model to capture nuanced semantic relationships while maintaining computational efficiency.
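
As a rough illustration of how a symmetric ALiBi bias replaces positional embeddings, the sketch below builds the bias matrix that would be added to attention scores for one head. It assumes the geometric slope schedule from the original ALiBi paper and a power-of-two head count; the head count (8) and sequence length (4) are illustrative values, not this model's configuration, and the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence; assumes num_heads is a power of two."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def symmetric_alibi_bias(seq_len, slope):
    """Bias for query position i and key position j: slope * -|i - j|.

    Using the absolute distance makes the penalty identical in both
    directions, which is what suits a bidirectional encoder.
    """
    return [[slope * -abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)                    # illustrative head count
bias = symmetric_alibi_bias(4, slopes[0])   # illustrative sequence length

for row in bias:
    print(row)
```

Because the bias depends only on the distance between positions, the same rule applies at any sequence length, which is the property that allows long inputs without learned position embeddings.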

Performance

In benchmark evaluations, the model performs particularly well in cross-language retrieval tasks, where it outperforms significantly larger multilingual models such as E5 and BGE-M3 despite being only 15-30% of their size. It achieves strong results in retrieval and clustering tasks and is especially good at matching semantically equivalent content across languages. On the MTEB benchmark, it performs robustly across classification, clustering, and semantic similarity tasks. The 8,192-token context window is especially valuable for long-document processing: performance stays consistent even on documents spanning multiple pages, a capability most competing models lack.

Best Practice

For best performance, run the model on CUDA-capable GPU infrastructure. It integrates with major vector databases and RAG frameworks, including MongoDB, Qdrant, Weaviate, and Haystack, so it can be deployed readily in production environments. It excels in applications such as bilingual document search, content recommendation systems, and cross-language document analysis. The model is optimized for Spanish-English bilingual scenarios and may not be the best choice for monolingual applications or for other language pairs. Input texts should be in Spanish or English; mixed-language content is also handled effectively. The model supports fine-tuning for domain-specific applications, but this requires careful attention to the quality and distribution of the training data.
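
For readers using the hosted Jina API rather than their own GPU, the following is a minimal sketch of requesting embeddings over HTTP with the Python standard library. The endpoint URL, the request payload shape, the OpenAI-style response layout, and the `JINA_API_KEY` environment variable name are all assumptions made for this example; check the API documentation before relying on any of them.

```python
import json
import os
import urllib.request

# Assumed endpoint; verify against the API docs.
API_URL = "https://api.jina.ai/v1/embeddings"
MODEL_NAME = "jina-embeddings-v2-base-es"

def build_request(texts, api_key):
    """Build (but do not send) a POST request for a batch of texts."""
    payload = {"model": MODEL_NAME, "input": list(texts)}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def embed(texts, api_key):
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts, api_key)) as response:
        body = json.load(response)
    # Assumed response layout: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in body["data"]]

if __name__ == "__main__":
    key = os.environ.get("JINA_API_KEY")  # hypothetical variable name
    if key:
        vectors = embed(["¿Dónde está la biblioteca?", "Where is the library?"], key)
        print(len(vectors), len(vectors[0]))
```

Sending Spanish and English texts in the same batch keeps their vectors directly comparable, which is the typical setup for bilingual search.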
Blogs that mention this model
April 29, 2024 • 7 minutes read
Jina Embeddings and Reranker on Azure: Scalable Business-Ready AI Solutions
Jina Embeddings and Rerankers are now available on Azure Marketplace. Enterprises that prioritize privacy and security can now easily integrate Jina AI's state-of-the-art models right in their existing Azure ecosystem.
Susana Guzmán
February 14, 2024 • 4 minutes read
Aquí Se Habla Español: Top-Quality Spanish-English Embeddings and 8k Context
Jina AI's new bilingual Spanish-English embedding model brings the state-of-the-art in AI to half a billion Spanish speakers.
Jina AI