Text Embeddings Inference

The Text Embeddings App transforms raw text into dense, high-dimensional vectors using state-of-the-art embedding models such as BERT, RoBERTa, and others. These embeddings capture semantic meaning and can be used as input for downstream ML tasks or stored in vector databases.

Supported Models

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; the JinaBERT model with ALiBi positions; Mistral, Alibaba GTE, and Qwen2 models with RoPE positions; as well as MPNet and ModernBERT.

A more detailed description can be found in the GitHub repository.

Key Features

  • No model graph compilation step

  • Metal support for local execution on Macs

  • Small docker images and fast boot times. Get ready for true serverless!

  • Token-based dynamic batching (see the sketch after this list)

  • Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt

  • Safetensors weight loading

  • ONNX weight loading

  • Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
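
Token-based dynamic batching means the server groups concurrent requests into batches up to a token budget, so clients do not need to pre-batch inputs themselves. Below is a minimal client-side sketch of what that enables, assuming the endpoint taken from the app's Outputs section and the same request schema used in the Usage section further down; the endpoint placeholder, worker count, and example texts are illustrative only.

# Hypothetical sketch: send several single-text requests concurrently.
# The server's token-based dynamic batching groups them into batches,
# so throughput stays high without client-side pre-batching.
from concurrent.futures import ThreadPoolExecutor

import requests

TEI_ENDPOINT = "https://<YOUR_OUTPUTS_ENDPOINT>"  # taken from the app's Outputs section

texts = [f"Short document number {i}." for i in range(32)]

def embed_one(text: str) -> list[float]:
    resp = requests.post(TEI_ENDPOINT, json={"inputs": text, "normalize": True})
    resp.raise_for_status()
    return resp.json()[0]  # one input -> list containing a single embedding

with ThreadPoolExecutor(max_workers=8) as pool:
    embeddings = list(pool.map(embed_one, texts))

print(f"Got {len(embeddings)} embeddings of dimension {len(embeddings[0])}")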

Apolo deployment

  • Resource Preset: Required. The Apolo preset that defines the resources: CPU, memory, GPU count, and GPU provider. E.g. gpu-xlarge, H100X1, mi210x2.

  • Hugging Face Model: Required. The name of the model to serve, e.g. sentence-transformers/all-mpnet-base-v2. If the model is gated, also provide a Hugging Face token.

  • Enable HTTP Ingress: Exposes the application externally over HTTPS.

Web Console UI

Step 1 - Select the preset you want to use (currently only GPU-accelerated presets are supported)

Step 2 - Select the model from the Hugging Face repositories

Text Embeddings Inference installation process (part 1)
Text Embeddings Inference installation process (part 2)

If the model is gated, provide the Hugging Face token as an Apolo Secret.

Step 3 - Install the app and wait for its outputs to appear in the Outputs section.

Outputs section

Apolo CLI

Below is a streamlined example that deploys the Text Embeddings Inference app to an NVIDIA preset:

apolo app install -f tei.yaml
# Example of tei.yaml

template_name: "text-embeddings-inference"
input:
  preset:
   name: "gpu-l4-x1"
  model:
    model_hf_name: "sentence-transformers/all-mpnet-base-v2"
  ingress_http:
    http_auth: false
    enabled: true

Usage

import requests

# URL of your TEI server (adjust if running locally or behind a proxy)
TEI_ENDPOINT = "https://<YOUR_OUTPUTS_ENDPOINT>"

# Example texts to embed
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world."
]

# Request payload
payload = {
    "inputs": texts,
    "normalize": True  # Optional: normalize vectors to unit length
}

if __name__ == '__main__':

    # Make the request (requests serializes the payload and sets the JSON header)
    response = requests.post(TEI_ENDPOINT, json=payload)

    # Check for errors
    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        exit(1)

    # Parse and print the embeddings
    embeddings = response.json()
    for i, embedding in enumerate(embeddings):
        print(f"Text: {texts[i]}")
        print(f"Embedding: {embedding}")
        print()
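
The embeddings returned above can be compared directly to measure semantic similarity, which is the usual first step before indexing them in a vector database. The following sketch continues the script above and reuses the embeddings list it produced; the cosine_similarity helper is illustrative, not part of the TEI API, and assumes NumPy is installed.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors; with "normalize": True this is just a dot product.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# embeddings[0] and embeddings[1] correspond to the two example texts above
score = cosine_similarity(embeddings[0], embeddings[1])
print(f"Cosine similarity between the two texts: {score:.4f}")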

