Embed images and sentences into fixed-length vectors with CLIP

An easy-to-use, low-latency and highly scalable service that integrates into new and existing solutions.

Use CLIP out of the box with CaS

CLIP is a powerful embedding model that maps images and text into a shared vector space, so the similarity between a sentence and an image can be measured directly. While it delivers great results, the model by itself is not scalable, and integrating it into existing systems takes time, effort and machine-learning knowledge.
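The "similarity" above is typically the cosine similarity between the two embedding vectors. A minimal, dependency-free sketch (the toy 4-dimensional vectors are made up for illustration; real CLIP embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for CLIP embeddings of an image and two captions.
image_vec = [0.1, 0.9, 0.2, 0.4]
caption_vec = [0.2, 0.8, 0.1, 0.5]      # a caption describing the image
unrelated_vec = [0.9, -0.1, 0.7, -0.3]  # an unrelated caption

print(cosine_similarity(image_vec, caption_vec))    # close to 1: good match
print(cosine_similarity(image_vec, unrelated_vec))  # much lower: poor match
```

Because CLIP places both modalities in the same space, the same comparison works text-to-text, image-to-image, or text-to-image.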

CLIP-as-service is an easy-to-use service that is low-latency and highly scalable. It integrates easily into new and existing solutions as a microservice.
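As a microservice, the typical integration is: send a batch of sentences and image URIs, receive fixed-length vectors back. The request-building sketch below is a hypothetical illustration of that shape (`build_encode_request` and the payload layout are assumptions, not the documented API):

```python
import json

def build_encode_request(items):
    """Build a JSON payload mixing sentences and image URIs.

    Hypothetical payload shape for illustration only: anything that
    looks like a URL is treated as an image, everything else as text.
    """
    data = []
    for item in items:
        if item.startswith(("http://", "https://")):
            data.append({"uri": item})
        else:
            data.append({"text": item})
    return json.dumps({"data": data})

payload = build_encode_request([
    "a photo of a dog playing in the snow",
    "https://example.com/dog.jpg",
])
print(payload)
```

The same batch can mix modalities freely, since both end up in the same embedding space.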

Free Tier

  • ViT-L/14-336px hosted completely free
  • 15,000 queries / month
  • Latency of > 500ms
  • 8 embeddings (images or text) per query
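With the free tier capped at 8 embeddings per query, a larger workload is simply split into query-sized batches client-side. A small sketch (the helper name is ours, not part of the service):

```python
def chunk(items, per_query=8):
    """Split a workload into batches of at most `per_query` items.

    8 is the free-tier limit on embeddings per query; premium allows
    up to 128, so the same helper works with per_query=128.
    """
    return [items[i:i + per_query] for i in range(0, len(items), per_query)]

batches = chunk([f"sentence {i}" for i in range(20)])
print(len(batches))           # 3 queries for 20 items
print([len(b) for b in batches])  # 8 + 8 + 4
```

Twenty items thus cost 3 of the 15,000 monthly queries rather than 20.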

Premium Tier

  • Wide model selection
  • Unlimited queries
  • Latency of < 500ms
  • Up to 128 embeddings per query
  • Uptime of > 99.9%


Horizontally scale multiple CLIP models up and down on a single GPU, with automatic load balancing.


No learning curve: minimalist design on both client and server, with an intuitive and consistent API for image and sentence embedding.


Async client support. Easily switch between the gRPC, HTTP and WebSocket protocols, with TLS and compression.
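One common way such protocol switching works is via the URL scheme the client connects with. The sketch below illustrates that idea; the scheme names and the host `example.com` are assumptions for illustration, not the service's documented endpoints:

```python
def server_url(protocol, host, port, tls=False):
    """Compose a connection URL where the scheme selects the protocol.

    Scheme-based switching is an assumption about the client here;
    the plain/secure scheme pairs follow common conventions.
    """
    schemes = {
        "grpc": ("grpc", "grpcs"),
        "http": ("http", "https"),
        "websocket": ("ws", "wss"),
    }
    plain, secure = schemes[protocol]
    return f"{secure if tls else plain}://{host}:{port}"

print(server_url("grpc", "example.com", 2096))            # grpc://example.com:2096
print(server_url("websocket", "example.com", 2096, tls=True))  # wss://example.com:2096
```

Switching protocols then means changing one string, with no other client code touched.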

Ready to get started? It's Free!

Get free access via your personal authentication token.
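Authenticating with a personal token usually amounts to attaching it to every request. A minimal sketch, assuming a conventional `Authorization` header (the header name and placeholder token are assumptions, not the service's documented scheme):

```python
def auth_headers(token):
    """Build request headers carrying a personal access token.

    The Authorization header is a common convention; check the
    service docs for the exact field it expects.
    """
    return {"Authorization": token}

# Placeholder token; replace with the one from your account.
headers = auth_headers("<your-access-token>")
print(headers)
```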