Overview
Jina Embedding B v1 is an English text embedding model that maps text to 768-dimensional vectors that preserve semantic meaning. With 110M parameters, it targets production environments that need a balance between computational efficiency and embedding quality. It is a practical option for teams implementing semantic search, document clustering, or content recommendation systems without extensive computational resources.
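As a rough illustration of how such embeddings support semantic search, the sketch below ranks documents by cosine similarity to a query vector. It is not taken from the model's own tooling: the function names are hypothetical, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings the model would produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real inputs would be model embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_documents(query, docs)
```

In a real system the document vectors would typically be computed once and stored, so only the query needs to be embedded at search time.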
Methods
The model uses a T5 encoder with mean pooling over the token outputs to produce a fixed-length representation. It was trained on the Linnaeus-Clean dataset, which contains 385 million high-quality sentence pairs filtered down from an initial 1.6 billion pairs. Training ran in two phases. The first phase used contrastive learning with an InfoNCE loss on text pairs; the second added triplet training to refine the model's ability to distinguish between similar and dissimilar content. Combined with data filtering that includes language detection and consistency checking, this training is intended to help the model capture nuanced semantic relationships.
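The three ingredients named above (mean pooling, an InfoNCE loss over pairs, and a triplet objective) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' training code: the temperature of 0.05, the margin of 0.2, the use of cosine similarity, the one-directional form of InfoNCE with in-batch negatives, and the margin-based form of the triplet loss are all choices made here for the example.

```python
import math

def mean_pool(token_vecs, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    dim = len(token_vecs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vecs, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / max(count, 1) for t in total]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def info_nce_loss(queries, positives, temperature=0.05):
    """One-directional InfoNCE: positives[i] is the match for queries[i];
    every other positive in the batch acts as a negative.
    temperature=0.05 is an assumed value."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least `margin`
    in cosine similarity. margin=0.2 is an assumed value."""
    return max(0.0, margin + cosine(anchor, negative) - cosine(anchor, positive))

# Toy 2-dimensional data (assumption) to exercise the functions.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The InfoNCE loss is near zero when each query is closest to its own positive and grows when the pairing is wrong, which is the signal the first training phase relies on. The triplet loss adds an explicit negative per example, matching the role the second phase plays in the description above.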
Performance
In evaluations, Jina Embedding B v1 performs particularly well on semantic textual similarity tasks. On STS12 it is reported to reach state-of-the-art performance with a score of 0.751, surpassing established models such as all-mpnet-base-v2 and all-minilm-l6-v2. It also performs strongly across other benchmarks while keeping inference efficient. The model is optimized specifically for English content and may not perform well on multilingual or code-specific tasks. It has since been superseded by jina-embeddings-v2-base-en and jina-embeddings-v3, which offer better performance across a broader range of use cases.
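Semantic textual similarity benchmarks are commonly scored by the Spearman rank correlation between the model's similarity scores and human ratings. The text above does not state which metric the 0.751 figure uses, so the following is only a sketch of that common convention, with made-up toy numbers and hypothetical function names.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (assumption): model cosine similarities vs. human ratings.
model_scores = [0.10, 0.40, 0.35, 0.90]
human_scores = [0.5, 3.0, 2.0, 4.8]
score = spearman(model_scores, human_scores)
```

Because the metric depends only on ranks, a model is rewarded for ordering sentence pairs the way annotators do, not for matching the rating scale itself.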
Best Practice
Deployment requires a CUDA-capable GPU, but the model's moderate size means standard hardware is enough for efficient inference. The model accepts input sequences of up to 512 tokens and suits production environments where consistent, reliable embedding generation is crucial. It performs best on English content and is a good fit for semantic search, document similarity comparison, and content recommendation systems. For new projects, teams should consider the newer v2 or v3 versions, which offer improved performance and broader language support. The model is not recommended for tasks that require multilingual understanding or specialized domain knowledge outside general English text.
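Because inputs are limited to 512 tokens, longer documents have to be truncated or split before embedding. The sketch below shows one way to split a token sequence into overlapping windows. The helper name, the overlap of 64 tokens, and the use of a plain list of integers in place of real tokenizer output are all assumptions for illustration; actual token counts depend on the model's own tokenizer.

```python
MAX_TOKENS = 512  # input limit stated in the text above

def chunk_tokens(tokens, max_tokens=MAX_TOKENS, overlap=64):
    """Split a token list into windows of at most `max_tokens`,
    with consecutive windows sharing `overlap` tokens.
    overlap=64 is an assumed default, not a documented setting."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
    return chunks

# Toy input (assumption): integers stand in for token ids.
long_document = list(range(1200))
windows = chunk_tokens(long_document)
```

Each window can then be embedded separately, and the resulting vectors either stored individually or averaged into one document vector, depending on the application.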