notifications
News
For Power Users
PromptPerfect
Premier tool for prompt engineering
SceneXplain
Leading AI solution for image captions and video summaries
BestBanner
Blog to banner, without the prompts!
JinaChat
More modality, longer memory, less cost
Rationale
Ultimate AI decision-making tools


For Developers
DocArray
The data structure for multimodal data
Jina
Build multimodal AI applications on the cloud
Finetuner
Fine-tune embeddings on domain specific data for better search quality
CLIP-as-service
Embed images and sentences into fixed-length vectors with CLIP
JCloud
Deploy a local project as a cloud service. Radically easy, no nasty surprises.
LangChain-Serve
Langchain apps on production with Jina & FastAPI
VectorDB
A Python vector database you just need - no more, no less
Executor Hub
Share and discover building blocks for multimodal AI applications
DALL-E Flow
A human-in-the-Loop workflow for creating HD images from text
DiscoArt
Create compelling Disco Diffusion artworks in one line of code
ThinkGPT
Agent techniques to augment your LLM and push it beyond its limits
DevGPT
Your virtual development team
RunGPT
An open-source cloud-native of large multimodal models serving framework
Jerboa
An experimental finetuner for open-source LLMs


For Enterprises
Finetuner+
Empower your enterprise with on-premise finetuning solutions
Inference
State-of-the-art multimodal models available for inference
Jina AI Cloud
Cloud hosting platform for multimodal AI applications
Searchscape
Navigate, interact, refine: reimagine product discovery
Semantica
Bridging the semantic gap in your existing search infrastructure


Company
Contact sales
About us
Join us
open_in_new
Open day
Intern program


arrow_circle_leftBack to Newsroom
Tech blog
Inference: How Can Jina AI Offer the Best-in-Class Model-as-a-Service So Affordably?
Jie Fu, Saahil Ognawala
May 25, 2023 • 5 minutes read

In today's rapidly evolving business landscape, enterprises face many challenges. From managing large volumes of unstructured data to delivering personalized customer experiences, businesses need advanced solutions to stay ahead of the curve. Machine learning (ML) has emerged as a powerful tool for addressing these challenges by automating repetitive tasks, processing data efficiently, and generating insights from multimedia content.

However, implementing ML solutions has traditionally been plagued by roadblocks such as costly and time-consuming MLOps steps and the need to distinguish between public and custom neural network models. This is where Jina AI's Inference steps in, offering a comprehensive solution to streamline access to curated, state-of-the-art ML models.

Inference | Effortlessly integrate state-of-the-art ML models into your software | Jina AI
Effortlessly integrate state-of-the-art ML models into your software
Jina AIJina AI

What is Inference?

Inference is a secure, scalable cloud platform that allows enterprises to harness the power of highly curated ML models without the complexities of traditional ML implementation. Jina AI's team of experts researches and selects the best models for various tasks, including text embedding, image embedding, image captioning, super-resolution, visual reasoning, visual question answering, and neural search, ensuring top performance and cost-efficiency for your business needs.

The upcoming release of Inference will also include support for text generation, further expanding its capabilities.

In this article, we're going to work through the technical and academic innovations that our team has leveraged to be able to provide models-as-a-service, which is at least as performant, and most cases more, in terms of queries-per-second, while being significantly more affordable compared to alternatives on the market.

The Art of Model Curation

Jina AI's model curation process involves an in-depth evaluation and comparison of state-of-the-art models, drafting efficient architectures for splitting and hosting models to achieve fast and cost-effective inference, and validating performance and reliability before making them available to the public. By entrusting us with the curation process, businesses can focus on leveraging these robust ML solutions to enhance identified tasks to be automated using machine learning.

The Science of Model Deployment

Jina Executors encapsulate specific functionalities and enable the construction of scalable and efficient inference systems. They are responsible for performing specific tasks within the pipeline, called a Jina Flow.

Generally, the following is the way to wrap a model using a Jina Executor:

from jina import Executor, requests

class MyModelExecutor(Executor):
    def __init__(self):
        super().__init__()
        # Initialize your model here

    @requests
    def encode(self, *args, **kwargs):
        # Implement your model's encoding logic here
        # Use `self.model` to access your model

        # Process inputs and generate outputs
        outputs = ...

        return outputs 

Memory Limits

Usually, calling init() will load the whole model into memory. However, dealing with large models, such as generative models, can consume excessive resources. This poses a challenge, particularly when using consumer-grade GPUs with limited VRAM capacity.

For example, BLIP2 (blip2-flan-t5-xl) takes more than 15GB just to load the model into memory. For some consumer-grade GPUs, such as the RTX 3080, that's all the VRAM they have. The risk of out-of-memory errors is very high when performing inference.

Untitled

A possible compromise is half-precision, i.e., using 16-bit floating point number representation instead of 32-bit. However, this can severely impair the performance of the model.

Below are two examples comparing image-captioning performance with 16-bit (left column) and 32-bit (right column) versions of BLIP2:

Untitled

The qualitative difference should be quite clear: Using the bigger model results in more accurate and concise captions.

Splitting Models across Executors

A practical solution to overcome resource limitations is to split the model into different parts. For instance, the BLIP2 model consists of an image encoder, a query former, and a language model decoder, each of which could, in principle, have its own Executor.

Untitled

With BLIP2, we can intuitively split the model into two Executors. Call them QFormerEncoder and LMDecoder.

QFormerEncoder (the image encoder plus the query former) extracts image features, while LMDecoder (the language model decoder) generates the output based on image features and a text prompt if one is provided. There is no coupling between these two Executors, so we can load and run them entirely asynchronously from each other.

Splitting models into multiple Executors addresses resource constraints and improves scalability. In the case of QFormerEncoder and LMDecoder, their processing times can differ significantly, so there is less idle runtime if they are disconnected from each other. By adjusting replica counts, we can achieve a high level of workload balance. For example, if the QFormerEncoder can process 12 documents per second, while the LMDecoder can only process 6, setting the replica count of LMDecoder to 2 for every replica of QFormerEncoder will ensure balanced processing and guarantee a processing speed of 12 queries per second for each QFormerEncoder.

Inference vs. The Rest

Jina AI stands out from other machine learning models-as-a-service providers in terms of cost-effectiveness, comprehensive enterprise support, and model curation. By choosing Inference from Jina AI, your business gains access to the best ML solutions without breaking the bank.

For comparison, consider the following table:

ProviderModelsPerformanceCost / API call
CohereCurated models for - Embedding, Generation, and ClassificationNo information available publicly~ US$ 0.002
ReplicateNon-curated list of models for - Embedding, Generation, Classification, Captioning …2 seconds per image caption with BLIP2~ US$ 0.005
Jina AICurated models for - Embedding, Captioning, Reasoning, Upscaling, (Generation)0.5 seconds per image caption with BLIP2US$ 0.0001

Get Started with Inference Today

In this article, we learned how businesses can benefit from using models-as-a-service, such as Jina AI's Inference, to automate repetitive tasks with machine learning and accelerate their growth. Jina's technological innovations make it possible to offer this service for a highly competitive price, especially for bulk commercial use cases.

Beginning your AI journey with Inference is simple, thanks to its easy integration, comprehensive documentation, and dedicated support team. Unlock the power of AI to enhance your business operations and stay ahead of the competition by signing up for Inference on Jina AI Cloud.

Get in Touch

Talk to us and keep the conversation going by joining the Jina AI Discord!

Jina AI Discord

Categories:
Tech blog

Learn more
SceneXplain's Image-to-JSON: Extract Structured Data from Images with Precision
Pushing the boundaries of visual AI, we're thrilled to unveil SceneXplain's Image-to-JSON feature. Dive into a world where images aren't just seen, but deeply understood, translating visuals into structured data with unparalleled precision.
Engineering Group
September 14, 2023 • 6 minutes read
Embeddings: The Swiss Army Knife of AI
Embeddings are an essential technology for modern AI. This article explains what they are and how they work.
Scott Martens
September 13, 2023 • 10 minutes read
PromptPerfect's LLM-as-a-Database is Blurring the Boundaries of Search vs. Generation
Diving into PromptPerfect's LLM-as-a-database: a middle ground between few-shot prompts and RAG. Its nuanced approach outshines GPT-4 in complex queries. Evolution in action.
Engineering Group
September 11, 2023 • 7 minutes read
Offices
location_on
Berlin, Germany (HQ)
Ohlauer Str. 43 (1st floor), zone A, 10999 Berlin, Germany
Geschäftsanschrift: Leipziger str. 96, 10117 Berlin, Germany
location_on
Beijing, China
Level 5, Building 6, No.48 Haidian West St. Beijing Haidian, China
location_on
Shenzhen, China
402, Floor 4, Fu'an Technology Building, Shenzhen Nanshan, China
© Jina AI GmbH 2020-2023. All rights reserved.
email
Privacy PolicyTerms and ConditionsPrivacy Settings