Tech blog
May 28, 2025

Correlations: Vibe-Testing Embeddings in GUI

As serious as we are about MTEB, we also love vibe-testing. Correlations is a simple GUI we use for validating citations in DeepSearch, debugging late chunking, and vibe-testing embeddings. Now it's open-source.
Jina AI • 4 minutes read
GitHub - jina-ai/correlations: Simple UI for debugging correlations of text embeddings

One of the interesting questions people ask us is: "How do you guys vibe-check your embeddings?" Sure, there's MTEB for serious and quantitative evaluation on public benchmarks, but what do you do for open-domain or new problems? Today we want to share a small internal tool we use for debugging and visualization. You can call it our vibe-testing toolkit. We call it Correlations, and it's open source on GitHub.

Design

Correlations generates interactive heatmaps where each cell shows the cosine similarity between two pieces, whether they're chunks from the same or different document collections, modalities, hyperparameters, or models. It supports several interactions:

  • Hover inspection: Original text/image and similarity scores for individual cell pairs
  • Region selection: Interactive area selection for focused analysis of similarity patterns
  • Threshold filtering: Similarity score and text length filters to reduce noise
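
The heatmap cells and the threshold filter described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the function names, the toy 2-d vectors, and the zero-vector convention are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat zero vectors as uncorrelated
    return dot / (norm_u * norm_v)

def similarity_matrix(rows, cols):
    """One heatmap cell per (row chunk, column chunk) pair."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]

def filter_cells(matrix, min_score):
    """Keep only (row, col, score) cells at or above the threshold."""
    return [
        (i, j, score)
        for i, row in enumerate(matrix)
        for j, score in enumerate(row)
        if score >= min_score
    ]

# Toy embeddings standing in for two collections of chunks.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.0], [1.0, 1.0]]
heatmap = similarity_matrix(left, right)
strong = filter_cells(heatmap, 0.9)
```

Passing the same collection as both `rows` and `cols` gives the self-correlation case; passing two different collections gives the cross-correlation case.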

The tool operates via a two-stage pipeline:

  1. npm run embed: Chunks the input and embeds it with the Jina Embeddings API, using a configurable chunking strategy (newline, punctuation, character-based, or regex patterns)
  2. npm run corr: Serves the correlation heatmaps in a browser-based UI with real-time interactivity
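
To make the four chunking strategies concrete, here is a rough sketch of what each one could look like. The function names and the exact splitting rules are assumptions for illustration; the tool's own chunkers may behave differently.

```python
import re

def chunk_by_newline(text):
    """One chunk per non-empty line."""
    return [line.strip() for line in text.split("\n") if line.strip()]

def chunk_by_punctuation(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def chunk_by_chars(text, size):
    """Fixed-width chunks of `size` characters; the last one may be shorter."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_regex(text, pattern):
    """Split on a caller-supplied regular expression, dropping empty pieces."""
    return [p.strip() for p in re.split(pattern, text) if p and p.strip()]
```

Smaller chunks give a finer-grained heatmap but more cells to inspect, so the text length filter becomes more useful as the chunk size shrinks.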

To get started:

npm install
export JINA_API_KEY=your_jina_key_here
npm run embed -- https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model -o v3-blog.jsonl -t retrieval.query
npm run embed -- https://arxiv.org/pdf/2409.10173 -o v3-arxiv.jsonl -t retrieval.passage
npm run corr -- v3-blog.jsonl v3-arxiv.jsonl

JINA_API_KEY is used for embedding and, when necessary, for reading content from a URL; reading from a local text file is of course also supported. You can also bring your own embeddings and run npm run corr for visualization only, in which case you don't need JINA_API_KEY. The tool supports both self-correlation analysis (within a single collection) and cross-correlation analysis (between two collections).

Use Cases

Content Deduplication and Alignment Analysis

We demonstrate the tool's utility by analyzing our jina-embeddings-v3 publications. Comparing the academic paper with the release note, the heatmap shows distinct diagonal patterns, which indicate strong chunk-to-chunk alignment between the two documents. A closer look shows systematic content reuse, particularly in the technical sections describing LoRA task types.
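
A diagonal pattern like this can also be checked numerically. The sketch below is one possible heuristic, not something the tool is stated to do: it takes the best-matching column for each row and measures how often those column indices move forward. The function names and the toy matrix are illustrative assumptions.

```python
def best_matches(matrix, min_score=0.0):
    """For each row, return (row, best column, score) if the score passes the threshold.

    Assumes every row has at least one column.
    """
    matches = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=lambda k: row[k])
        if row[j] >= min_score:
            matches.append((i, j, row[j]))
    return matches

def monotonic_fraction(matches):
    """Fraction of consecutive matches whose column index does not decrease."""
    if len(matches) < 2:
        return 1.0
    ordered = sum(1 for a, b in zip(matches, matches[1:]) if b[1] >= a[1])
    return ordered / (len(matches) - 1)

# Toy cross-correlation matrix with a clear diagonal.
toy = [
    [0.90, 0.10, 0.20],
    [0.20, 0.80, 0.10],
    [0.10, 0.30, 0.95],
]
aligned = best_matches(toy, min_score=0.5)
diagonal_score = monotonic_fraction(aligned)
```

A value of `diagonal_score` close to 1.0 suggests the two documents cover the same material in the same order; lower values suggest reordering or only partial reuse.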

Citation and Reference Validation

The tool is valuable for validating citation accuracy in retrieval-augmented generation systems, where it is critical to verify that retrieved passages genuinely support the generated claims. Similarity-based analysis is a powerful and intuitive way to explore large data sets, for instance by grouping items by similarity to reveal patterns.

Chunking Strategy Exploration

Late chunking and other segmentation strategies can be evaluated by examining how different approaches affect semantic coherence within and between text segments. The visualization helps identify the late chunking effect and the optimal chunk boundaries by revealing similarity patterns that align with semantic structure.
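
For readers unfamiliar with late chunking, the pooling step can be sketched as follows. The idea is that the whole document is encoded once at the token level, and each chunk embedding is then obtained by mean-pooling the token vectors that fall inside the chunk's span. The token vectors and span boundaries below are made up for illustration, and the function names are assumptions rather than any library's API.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool contextualized token embeddings per (start, end) span, end exclusive."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Hypothetical token-level embeddings for a four-token document.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
# Two chunks: tokens 0-1 and tokens 2-3.
chunk_vectors = late_chunk(tokens, [(0, 2), (2, 4)])
```

Embedding the same chunks with and without this step and comparing the two resulting heatmaps is one way to see the late chunking effect mentioned above.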

Cross-Modal Analysis

The tool extends beyond text to support image embeddings via jina-clip-v2, enabling analysis of text-image correlation patterns for multimodal applications.

Related Work in Embedding Visualization

The interpretability challenge is particularly acute when working with high-dimensional embeddings. The landscape of embedding visualization techniques has evolved significantly, and the different approaches can be categorized as:

  • Dimensionality Reduction-Based: Traditional approaches using PCA, t-SNE, UMAP that project high-dimensional spaces to 2D/3D
  • Interactive Exploration-Based: Tools like Parallax and TextEssence that enable direct manipulation and exploration
  • Domain-Specific Solutions: Specialized tools like Clustergrammer for biological data
  • Direct Similarity Visualization: Our approach and similar heatmap-based methods that preserve full relational information

| Method | Approach | Use Cases |
| --- | --- | --- |
| Correlations | Direct pairwise similarity heatmaps | Text similarity debugging, alignment analysis |
| Embedding Projector | PCA, t-SNE and custom linear projections | Interactive visualization and interpretation |
| Parallax | Algebraic formulae for semantic exploration | Understanding semantic relations |
| TextEssence | Comparative corpus analysis | Diachronic analysis, corpus comparison |
| Nomic Atlas | Cloud-based scalable visualization | Large-scale datasets, collaboration |
| Clustergrammer | Interactive heatmap with clustering | High-dimensional biological data |
| t-SNE | Non-linear cluster visualization | Model debugging, confusion identification |
| UMAP | Local and global structure preservation | Medium-large datasets, general analysis |
| PCA | Linear dimensionality reduction | Initial exploration, baseline comparison |

Limitations of Point-Wise Approaches

Existing visualization tools primarily focus on point-wise representations in 2D spaces, which can lose critical information about pairwise relationships. Moreover, most tools are designed for single embedding space analysis rather than comparative evaluation between different sources, modalities or embedding strategies (e.g. late chunking on vs. off).

For example, we recently encountered two use cases at Jina. The first involves cross-checking citations in DeepSearch, where we need to match the generated report to the original excerpts from the reference material. The second is multimodal retrieval, where we need to verify image-text and image-image alignment on new unlabeled data. In both cases, we need to explore the relationships between two collections of embeddings. So we use Correlations to get a sense of how well the matches align and to validate whether the highest correlations consistently correspond to the correct matches.
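
The citation check described above can be approximated in code once the cross-correlation matrix is available. The sketch below assumes rows are report sentences, columns are reference excerpts, and `cited[i]` is the excerpt index that sentence `i` cites; the function name, the data layout, and the 0.5 threshold are all illustrative assumptions rather than DeepSearch internals.

```python
def check_citations(matrix, cited, min_score=0.5):
    """Flag claims whose cited excerpt is not the top match or scores too low.

    Returns (claim index, cited index, top index, cited score) tuples.
    """
    flagged = []
    for i, row in enumerate(matrix):
        top = max(range(len(row)), key=lambda k: row[k])
        score = row[cited[i]]
        if top != cited[i] or score < min_score:
            flagged.append((i, cited[i], top, score))
    return flagged

# Toy matrix: 3 report sentences (rows) against 2 excerpts (columns).
toy = [
    [0.8, 0.2],  # cites excerpt 0, which is also its best match
    [0.3, 0.4],  # cites excerpt 1, best match but weak
    [0.1, 0.9],  # cites excerpt 0, but excerpt 1 matches far better
]
suspicious = check_citations(toy, cited=[0, 1, 0])
```

Flagged rows are candidates for manual inspection in the heatmap, not automatic rejections, since a low cosine similarity does not by itself prove that a citation is wrong.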

Conclusion

Beyond vibe checking, correlations can provide deeper insights into semantic relationships. As a starting point, several key statistics can be extracted from the correlation matrix:

  • Matrix Density: The proportion of correlations above specified thresholds, indicating overall semantic cohesion
  • Eigenvalue Distribution: Principal components analysis reveals the dominant patterns in similarity structure
  • Matrix Rank: Indicates the effective dimensionality of the similarity relationships
  • Condition Number: Measures numerical stability and potential multicollinearity issues
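
As a rough sketch, the four statistics above can be computed for a symmetric self-correlation matrix with the standard library alone. The eigenvalues come from a textbook Jacobi rotation scheme; rank and condition number are then derived from the eigenvalue magnitudes, which is only valid for symmetric matrices (a cross-correlation matrix would need a singular value decomposition instead). The function names, threshold, and tolerances are assumptions chosen for the example.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Angle that zeroes the (p, q) entry after rotation.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def matrix_stats(matrix, threshold=0.5, tol=1e-9):
    """Density, eigenvalues, numerical rank and condition number of a symmetric matrix."""
    n = len(matrix)
    cells = [matrix[i][j] for i in range(n) for j in range(n)]
    # Density counts every cell, including the diagonal.
    density = sum(1 for s in cells if s >= threshold) / len(cells)
    eigenvalues = jacobi_eigenvalues(matrix)
    magnitudes = [abs(x) for x in eigenvalues]
    largest, smallest = max(magnitudes), min(magnitudes)
    rank = sum(1 for m in magnitudes if m > tol * largest)
    condition = largest / smallest if smallest > 0 else math.inf
    return {
        "density": density,
        "eigenvalues": eigenvalues,
        "rank": rank,
        "condition_number": condition,
    }

stats = matrix_stats([[1.0, 0.5], [0.5, 1.0]], threshold=0.6)
```

For larger matrices a numerical library would be the practical choice; the pure-Python version here is only meant to show what each statistic measures.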

Advanced analysis may also involve extracting meaningful submatrices that represent coherent semantic regions. Extracting the k-order maximal-sum principal submatrix from an n-order real matrix is a typical combinatorial optimization problem, and solving it can identify the most highly correlated segments.
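
For small matrices, the k-order maximal-sum principal submatrix can be found by exhaustive search, as in the sketch below. This brute-force version is an illustrative assumption, not a method the tool is stated to use: it enumerates every set of k indices, so its cost grows combinatorially with n and it is only practical for modest sizes. The function name and toy matrix are made up for the example.

```python
from itertools import combinations

def max_sum_principal_submatrix(matrix, k):
    """Return (indices, total) for the k x k principal submatrix with the largest sum.

    A principal submatrix uses the same index set for rows and columns.
    """
    n = len(matrix)
    best_idx, best_sum = None, float("-inf")
    for idx in combinations(range(n), k):
        total = sum(matrix[i][j] for i in idx for j in idx)
        if total > best_sum:
            best_idx, best_sum = idx, total
    return best_idx, best_sum

# Toy self-correlation matrix: chunks 0 and 1 are strongly related.
toy = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
]
indices, total = max_sum_principal_submatrix(toy, 2)
```

Here the search picks chunks 0 and 1, the pair with the highest mutual similarity. For larger n, a greedy or approximate method would be needed in place of the exhaustive loop.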
