July 20, 2023

Read My Pics: SceneXplain Puts OCR in Your Visual Question Answering!

We're turbocharging SceneXplain's visual question answering with an OCR upgrade, making it easier than ever to get answers out of your images
Hand holding a smartphone with a chat app inquiring about a neon 'Berlin' sign, with shelves of bottles in the background
Alex C-G • 5 minutes read

Back in June, we introduced visual question answering in SceneXplain, letting you upload an image and ask questions about it:

View on SceneXplain

Now we're bringing a new model to the table, Glide. Glide reads and understands the text in images too, integrating OCR into the question-answering experience:

View on SceneXplain

How do I use OCR in visual question answering?

The process is exactly the same as for visual question answering in general. It's baked right in, meaning you don't have to toggle any additional options. Just enable "Visual question answer" and you're off to the races!

First, upload your image:

We're using this image from Pexels.com

Then click on the speech bubble to ask your question:

Then click the button to send your data and, voilà, your question is answered!

View on SceneXplain

What's this good for?

  • Accessibility for visually impaired users: SceneXplain can be used to provide descriptive text for images online or in digital documents, helping visually impaired users understand the content more effectively.
  • Search Engine Optimization (SEO): By providing accurate and detailed descriptions for images, SceneXplain can help improve the SEO of websites, as search engines rely heavily on text data. This is especially useful for products like apparel with brand names, slogans, or other text, and for infographics that incorporate text into images. Batching via SceneXplain’s API, which we cover in another post, makes this more convenient still.
  • Education: Teachers can use SceneXplain to create descriptive text for diagrams, illustrations, and other visual aids, making it easier for students who prefer or require reading over visual learning.
  • Brand sentiment analysis: You can use SceneXplain to analyze images and associated text on social media platforms to better understand user sentiment and trends, or to identify inappropriate content. This goes a step beyond regular brand sentiment analysis, since it can read brand logo text that would otherwise have been outside the model’s training set.
  • Translation: When combined with a translation tool, SceneXplain can be used to read and translate text in images from one language to another, aiding in international communication and travel in foreign countries.
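
For use cases like SEO, where you ask the same question about many images, it may help to see what a batched request body could look like. The sketch below only builds and prints a JSON payload; it does not call any service. The function name and the field names ("data", "image", "features", "question", "question_answer") are assumptions made for illustration, not confirmed parts of SceneXplain's API, so check the API documentation for the real schema.

```python
import json


def build_batch_payload(image_urls, question):
    """Build one request body that asks the same question about many images.

    All field names here are hypothetical placeholders, not a confirmed schema.
    """
    return {
        "data": [
            {
                "image": url,                     # hypothetical field name
                "features": ["question_answer"],  # hypothetical feature flag
                "question": question,             # hypothetical field name
            }
            for url in image_urls
        ]
    }


# Example URLs are placeholders, not real product images.
payload = build_batch_payload(
    ["https://example.com/shirt.jpg", "https://example.com/poster.jpg"],
    "What is the text in this image?",
)
print(json.dumps(payload, indent=2))
```

Keeping payload construction in a small pure function like this makes it easy to inspect what would be sent before wiring it up to a real HTTP client.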

OCR in action

The most obvious question you can ask is "What is the text in this image?"

View on SceneXplain

But we're working with a large language model (LLM) here. Why limit ourselves to something so basic?

We can do things like translating graffiti:

View on SceneXplain

...or correcting the translation of one language into another:

View on SceneXplain

...or reading signs:

View on SceneXplain

...or just getting specific information from a sign:

View on SceneXplain

Maybe you're completely out of touch with geek culture and don't understand why something is funny. SceneXplain has got your back:

View on SceneXplain

...and we can even ask it to describe what we're pointing at. Forget point and click - now you've got point and explain!

View on SceneXplain

Get started

OC-are you ready to get started with SceneXplain? Head on over to scenex.jina.ai to create your account, and start uploading your images and asking your questions. Need support or want to share your results? You can do that on our Discord channel.

SceneXplain - Explore image storytelling beyond pixels
Leverage GPT-4 & LLMs for the most advanced image storytelling. Explain visuals for content creators, media, & e-commerce with rich captions, multilingual support, and seamless API integration. Experience the future of image description today.

Read more

Want to dig deeper? We've written a bunch of other posts about what you can do with SceneXplain:

Brand Engagement Reimagined: AI-Powered Sentiment Analysis with SceneXplain
Digging deep into images to uncover sentiment and brand can be a wild ride. Tackle it with SceneXplain, and you’ve got yourself a business power-up, ready to take the market by storm
Making Visuals Vocal: SceneXplain’s Impact on Product Image Accessibility
SceneXplain transforms product images into audio descriptions, ensuring visual content isn’t just seen, but also heard and understood. It’s a step forward in creating an inclusive digital world for everyone
SceneXplain vs. MiniGPT4: A Comprehensive Benchmark of Top 5 Image Captioning Algorithms for Understanding Complex Scenes
Uncover the future of image captioning as SceneXplain and its rivals face off in an epic showdown. Explore their impact on accessibility, SEO, and storytelling, and dive into our intriguing results to witness the cutting-edge capabilities of these algorithms.
Enhancing Digital Accessibility: How SceneXplain Transforms Multimedia Content for Public Sector Organizations
Explore SceneXplain’s impact on digital accessibility, providing exceptional image descriptions and ensuring compliance with European standards for public sector organizations.
SceneXplain: Unleash the Advanced Image Captioning & Storytelling
Uncover the game-changing potential of SceneXplain, an advanced image captioning solution powered by LLMs. Check out the benchmark against Midjourney, CLIP, BLIP2, and other alternatives. Dive into our blog post and experience the revolution firsthand!