September 14, 2023

SceneXplain's Image-to-JSON: Extract Structured Data from Images with Precision

Pushing the boundaries of visual AI, we're thrilled to unveil SceneXplain's Image-to-JSON feature. Dive into a world where images aren't just seen, but deeply understood, translating visuals into structured data with unparalleled precision.
Diagram illustrating JSON schema annotations with a white Dodge Challenger sports car, highlighting attributes like car type
Engineering Group • 6 minutes read

In the ever-evolving world of multimodal AI and computer vision, SceneXplain consistently pushes the boundaries. Today, we're thrilled to introduce a feature that promises to redefine the landscape of image captioning: Image-to-JSON. Let's delve into this innovation and understand its transformative potential.

A demo of SceneXplain's new image-to-JSON feature

SceneXplain's Image Captioning Ability

SceneXplain stands as a beacon in advanced image captioning and video summarization. Thanks to Jina AI's state-of-the-art multimodal algorithms, SceneXplain transcends traditional captioning, offering rich textual narratives from visuals. With an intuitive interface and a robust API, it's designed for both seasoned users and developers.


Understanding JSON Schema

Before delving into the Image-to-JSON feature, it's essential to understand JSON Schema.

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents. Think of it as a blueprint for the structure of your JSON data. It defines the shape of your data, types of data values, and even the range of permissible values. With JSON Schema, you can tailor the data extraction process to your specific needs.

A JSON file:

{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "courses": ["Math", "Science"]
}

The JSON Schema that defines what a valid version of this JSON looks like:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "Full name of the person"
    },
    "age": {
      "type": "number",
      "description": "Age of the person"
    },
    "isStudent": {
      "type": "boolean",
      "description": "Indicates if the person is a student"
    },
    "courses": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "List of courses the person is enrolled in"
    }
  },
  "required": ["name", "age", "isStudent", "courses"]
}
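
To show what "validating against a schema" means in practice, here is a minimal sketch of a validator written with only the Python standard library. It is an illustration, not a full JSON Schema implementation: it only understands the type, enum, properties, required, and items keywords, and the function name validate is our own. The schema is a trimmed copy of the one above with the descriptions omitted.

```python
import json

# Maps JSON Schema type names to the Python types that json.loads produces.
_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms.

    Illustrative subset only: type, enum, properties, required, items.
    """
    errors = []
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for "number".
        if expected == "number" and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not one of {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], subschema, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "isStudent": {"type": "boolean"},
        "courses": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "isStudent", "courses"],
}

person = json.loads(
    '{"name": "John Doe", "age": 30, "isStudent": false, "courses": ["Math", "Science"]}'
)

print(validate(person, person_schema))  # []
for problem in validate({"name": "Jane", "age": "thirty"}, person_schema):
    print(problem)
```

The second call reports the two missing required properties and the wrongly typed age. For production use, a complete validator library is a better choice than this sketch.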

The Image-to-JSON Revolution

In traditional image captioning, the process has been linear: input an image and receive a text description. This approach, while effective, lacked the flexibility to extract specific data or focus on particular areas within an image. Enter SceneXplain's Image-to-JSON feature, our innovative solution to these limitations.

With Image-to-JSON, users upload an image and accompany it with a custom JSON Schema. The result? A structured JSON output tailored to capture specific information, whether it's in enums, lists, strings, booleans, or numbers.

The image we ran through SceneXplain for three different tasks: image captioning, visual question answering, and Image-to-JSON. Results can be found below.
Left: image captioning; Center: visual question answering
Image-to-JSON

From Prompting to Structured Outputs

The concept of prompting, popularized by large language models (LLMs), involves guiding AI responses using specific questions or instructions. For example, prompting an LLM with "Describe the Eiffel Tower" yields a textual description. However, this output, while informative, is unstructured.

Image-to-JSON takes prompting to the next level. The description field in the JSON schema serves as an advanced prompt. Instead of just a textual response, SceneXplain processes the image and structures its output based on the provided schema. This ensures not just relevance but also precision and consistency in the format.

This structured approach is especially crucial for applications that demand consistent data formats. While free-form text outputs offer flexibility, they can be challenging to integrate into systems that require structured data. Image-to-JSON bridges this gap, combining the adaptability of prompting with the reliability of structured outputs.
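
To make that integration point concrete, here is a small sketch of how downstream code might consume a structured result. Everything in it is assumed for illustration: the result string is invented rather than real SceneXplain output, and the COLLECTIONS mapping and recommend function are hypothetical names. The point is that a structured result reduces to a dictionary lookup, where free-form text would need fragile string matching.

```python
import json

# Hypothetical result for a schema with a single "season" enum field.
# The value is invented for illustration; it is not real SceneXplain output.
raw_result = '{"season": "Winter"}'

# Hypothetical mapping from season to a product collection.
COLLECTIONS = {
    "Spring": "rain jackets",
    "Summer": "swimwear",
    "Autumn": "knitwear",
    "Winter": "ski gear",
}


def recommend(raw: str) -> str:
    """Pick a collection from a structured result with a plain dictionary lookup."""
    season = json.loads(raw)["season"]
    if season not in COLLECTIONS:
        # An enum in the schema should prevent this, but fail loudly if it happens.
        raise ValueError(f"unexpected season: {season!r}")
    return COLLECTIONS[season]


print(recommend(raw_result))  # ski gear
```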

In essence, SceneXplain's Image-to-JSON is a testament to the evolution of AI comprehension. It showcases how AI can be both versatile in understanding visuals and precise in delivering structured, actionable data.

How to Use Image-to-JSON in SceneXplain

To harness this feature, users need to upload their image and define a corresponding JSON schema. To do this, click the dropdown button on the right of the input box and then select "Add JSON Schema".

This schema comprises key-value pairs, with two essential keys:

  • type: This determines the result format, such as string, array, boolean, etc.
  • description: This serves as a prompt, guiding the kind of information to extract from the image.

Let's explore this with increasingly complex examples:

Basic Inventory Check:

{
  "type": "object",
  "properties": {
    "brands": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Identify brands on the shelf."
    }
  }
}

Season Identification:

{
  "type": "object",
  "properties": {
    "season": {
      "type": "string",
      "enum": ["Spring", "Summer", "Autumn", "Winter"],
      "description": "Determine the predominant season in the image."
    }
  }
}

Detailed Landscape Analysis:

{
  "type": "object",
  "properties": {
    "flora": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List all visible plant species."
    },
    "fauna": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List all visible animal species."
    },
    "timeOfDay": {
      "type": "string",
      "enum": ["Morning", "Afternoon", "Evening", "Night"],
      "description": "Identify the time of day."
    }
  }
}
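
Schemas like these get repetitive to write by hand, so here is a rough sketch of building one programmatically. The helper names field and object_schema are our own invention, not part of SceneXplain, and the sketch uses the standard JSON Schema type name array (with items) for list-valued fields. It rebuilds the landscape example and prints it as JSON.

```python
import json


def field(type_, description, **extra):
    """Build one property: "type" sets the result format, "description" is the prompt."""
    return {"type": type_, "description": description, **extra}


def object_schema(**properties):
    """Wrap the given properties in a top-level object schema."""
    return {"type": "object", "properties": properties}


landscape = object_schema(
    flora=field("array", "List all visible plant species.", items={"type": "string"}),
    fauna=field("array", "List all visible animal species.", items={"type": "string"}),
    timeOfDay=field(
        "string",
        "Identify the time of day.",
        enum=["Morning", "Afternoon", "Evening", "Night"],
    ),
)

# Serialize to the JSON text you would supply as the schema.
print(json.dumps(landscape, indent=2))
```

Keeping the description next to the type in one call makes it harder to forget the prompt for a field, which is the part that steers what gets extracted.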

Some Examples

Image-to-JSON can also be used as an advanced OCR solution.

Real-World Applications and API Integration

Beyond the user interface, this feature can be seamlessly integrated into systems via our API. For developers looking to harness the power of Image-to-JSON programmatically, our API documentation provides comprehensive guidance.
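
As a rough idea of what programmatic use could look like, the sketch below assembles an HTTP request that pairs an image URL with a JSON Schema. Please treat every specific in it as an assumption: the endpoint URL is a placeholder, and the header and the payload field names (image, json_schema) are hypothetical, so check the API documentation for the real ones. The request is only built, not sent.

```python
import json
import urllib.request

# ASSUMPTIONS: placeholder endpoint and key; not the real SceneXplain API.
API_URL = "https://api.example.com/v1/describe"
API_KEY = "YOUR_API_KEY"


def build_request(image_url: str, schema: dict) -> urllib.request.Request:
    """Assemble a POST request carrying an image URL and a JSON Schema."""
    # Hypothetical payload field names; consult the API docs for the real ones.
    payload = {"image": image_url, "json_schema": schema}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # hypothetical auth scheme
        },
        method="POST",
    )


season_schema = {
    "type": "object",
    "properties": {
        "season": {
            "type": "string",
            "enum": ["Spring", "Summer", "Autumn", "Winter"],
            "description": "Determine the predominant season in the image.",
        }
    },
}

req = build_request("https://example.com/photo.jpg", season_schema)
print(req.get_method(), req.full_url)

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```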

Image-to-JSON vs. VQA & Image Captioning

The table below compares SceneXplain's Image-to-JSON with Visual Question Answering (VQA), traditional image captioning, and conventional OCR across several features.

| Feature | SceneXplain's Image-to-JSON | Visual Question Answering | Traditional Image Captioning | OCR |
| --- | --- | --- | --- | --- |
| Flexibility | Customizable JSON output | Customizable queries | Fixed text description | Extracted text snippets |
| Output Types | Structured: Enums, Lists, Strings, Booleans, Numbers (including nested structures) | Text only | Text only | Text only |
| Granularity of Information | High (detailed structured data) | Medium (depends on the query) | Low (general description) | Low (text without context) |
| User Control | Full via JSON Schema | Limited by precise prompting | None | None |
| Custom Queries | Supported via "description" key | Possible | Not available | Not applicable |
| Integration Complexity | Moderate (due to structured output) | Low (simple text output) | Low (simple text output) | Low (simple text output) |
| Scalability | High (designed for large-scale data processing) | Medium (depends on backend) | Medium (depends on backend) | High (simple text extraction) |

In Conclusion

SceneXplain's Image-to-JSON isn't just an incremental improvement; it's a monumental leap. By offering unparalleled flexibility and precision, we're empowering users to extract the exact insights they seek from images. As we continue our innovation journey, we eagerly await the myriad ways you'll employ this feature to redefine visual comprehension.

Stay connected for more groundbreaking updates from SceneXplain!
