Text Embeddings Inference
The Text Embeddings App transforms raw text into dense, high-dimensional vectors using state-of-the-art embedding models such as BERT, RoBERTa, and others. These embeddings capture semantic meaning and can be used as input for downstream ML tasks or stored in vector databases.
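For instance, downstream tasks typically compare two embeddings with cosine similarity. A minimal sketch of the computation (the vectors below are hypothetical placeholders, not real model output; models such as sentence-transformers/all-mpnet-base-v2 produce 768-dimensional vectors):

import numpy as np

# Hypothetical 4-dimensional embeddings standing in for real model output.
vec_a = np.array([0.12, -0.48, 0.33, 0.80])
vec_b = np.array([0.10, -0.45, 0.30, 0.82])

# Cosine similarity: dot product divided by the product of the
# Euclidean norms. Values close to 1.0 indicate semantically similar texts.
cosine = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {cosine:.4f}")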
Supported Models
Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; JinaBERT with ALiBi positions; Mistral, Alibaba GTE, and Qwen2 models with RoPE positions; MPNet; and ModernBERT.
A more detailed description can be found in the GitHub repository.
Key Features
No model graph compilation step
Metal support for local execution on Macs
Small Docker images and fast boot times. Get ready for true serverless!
Token-based dynamic batching
Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
Safetensors weight loading
ONNX weight loading
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics); a quick metrics check is sketched after this list
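As a quick check of the Prometheus integration, you can scrape the metrics endpoint of a running deployment. A minimal sketch, assuming metrics are exposed at the conventional /metrics route on the serving port (consult the TEI documentation for your version):

import requests

# Base URL of a running TEI deployment; replace with your own endpoint.
TEI_ENDPOINT = "https://<YOUR_OUTPUTS_ENDPOINT>"

# TEI publishes metrics in the plain-text Prometheus exposition format.
response = requests.get(f"{TEI_ENDPOINT}/metrics", timeout=10)
response.raise_for_status()

# Print the first few metric lines as a sanity check.
for line in response.text.splitlines()[:10]:
    print(line)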
Apolo Deployment
Resource Preset
Required. Apolo preset for resources, e.g. gpu-xlarge, H100X1, or mi210x2. Sets CPU, memory, GPU count, and GPU provider.
Hugging Face Model
Required. Provide a model name in the specified field, and a Hugging Face token if the model is gated. E.g. sentence-transformers/all-mpnet-base-v2
Enable HTTP Ingress
Exposes the application externally over HTTPS.
Web Console UI
Step 1 - Select the preset you want to use (currently, only GPU-accelerated presets are supported)
Step 2 - Select a model from the Hugging Face repositories


If the model is gated, provide the Hugging Face token as an Apolo Secret.
Step 3 - Install the app and wait for the outputs in the Outputs section of the app

Apolo CLI
Below is a streamlined example that deploys the Text Embeddings Inference app to an NVIDIA GPU preset:
apolo app install -f tei.yaml
# Example of tei.yaml
template_name: "text-embeddings-inference"
input:
  preset:
    name: "gpu-l4-x1"
  model:
    model_hf_name: "sentence-transformers/all-mpnet-base-v2"
  ingress_http:
    http_auth: false
    enabled: true
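After installation, you can verify that the deployment is up before sending embedding requests. A minimal sketch, assuming the standard TEI /health and /info routes and the endpoint reported in the app's Outputs section:

import requests

# Endpoint reported in the app's Outputs section; replace with your own.
TEI_ENDPOINT = "https://<YOUR_OUTPUTS_ENDPOINT>"

# /health returns 200 once the model server is ready to serve requests.
health = requests.get(f"{TEI_ENDPOINT}/health", timeout=10)
print("healthy:", health.status_code == 200)

# /info returns metadata about the loaded model (name, dtype, limits, ...).
info = requests.get(f"{TEI_ENDPOINT}/info", timeout=10)
print(info.json())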
Usage
import requests
import json

# URL of your TEI server (adjust if running locally or behind a proxy)
TEI_ENDPOINT = "https://<YOUR_OUTPUTS_ENDPOINT>"

# Example texts to embed
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
]

# Request payload
payload = {
    "inputs": texts,
    "normalize": True,  # Optional: normalize vectors to unit length
}

if __name__ == "__main__":
    # Make the request
    response = requests.post(
        TEI_ENDPOINT,
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
    )

    # Check for errors
    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        raise SystemExit(1)

    # Parse and print the embeddings
    embeddings = response.json()
    for text, embedding in zip(texts, embeddings):
        print(f"Text: {text}")
        print(f"Embedding: {embedding}")
        print()
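Since the request sets normalize to true, the returned vectors have unit length, so a plain dot product between two of them equals their cosine similarity. A short follow-up sketch (assumes numpy is installed and continues from the script above):

import numpy as np

# Continuing from the example above: `embeddings` is the parsed JSON list.
vectors = np.array(embeddings)

# With normalize=True the vectors are unit-length, so the dot product
# of two rows is their cosine similarity.
similarity = float(vectors[0] @ vectors[1])
print(f"cosine similarity between the two texts: {similarity:.4f}")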