LLM Inference

vLLM is a high-performance and memory-efficient inference engine for large language models. It uses a novel GPU KV cache management strategy to serve transformer-based models at scale, supporting multiple GPUs (including NVIDIA and AMD) with ease. vLLM enables fast decoding and efficient memory utilization, making it suitable for production-level deployments of large LLMs.

Key Features

  • High Throughput Inference: Novel GPU KV caching enables faster token generation compared to traditional implementations.

  • Multi-GPU Support: Scales seamlessly to multiple GPUs, including AMD (MI200s, MI300s) and NVIDIA (V100/H100/A100) resource pools.

  • Easy Model Downloading: Built-in integration with Hugging Face model repositories.

  • Flexible Configuration: Control precision (--dtype), context window size, parallelism (--tensor-parallel-size, --pipeline-parallel-size), and more (see the sketch after this list).

  • Lightweight & Extensible: Minimal overhead for deployment and easy to integrate with existing MLOps or monitoring solutions.
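
For orientation, the precision and parallelism options above correspond to vLLM engine arguments. The minimal sketch below uses vLLM's offline Python API purely to illustrate those options; it assumes the vllm package and a local GPU, whereas on Apolo the LLM Inference app configures the engine for you (see the sections below).

# Minimal sketch of vLLM's offline Python API, shown only to illustrate the
# precision (dtype) and parallelism (tensor_parallel_size) options; on Apolo
# these are configured through the LLM Inference app instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any Hugging Face model id
    dtype="bfloat16",          # corresponds to --dtype
    tensor_parallel_size=1,    # corresponds to --tensor-parallel-size
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)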

Installation and Deployment on Apolo

You can deploy vLLM on the Apolo platform using the vLLM app. Apolo automates resource allocation, persistent storage, ingress, GPU detection, and environment variable injection, so you can focus on model configuration.

Highlights of the Apolo Installation Flow:

  1. Resource Allocation: Choose an Apolo preset (e.g. gpu-xlarge, mi210x2) that specifies CPU, memory, and GPU resources.

  2. GPU Auto-Configuration: If your preset includes multiple GPUs, environment variables (e.g. CUDA_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES) are automatically set, along with a sensible default for parallelization.

  3. Ingress Setup: Enable an ingress to expose vLLM’s HTTP endpoint for external access.

  4. Integration with Hugging Face: You can pass your Hugging Face token via an environment variable to pull private models.


Apolo Deployment

Parameter Descriptions

The following parameters can be set with Apolo’s CLI (apolo run --pass-config ... install ... --set <key>=<value>). Many are optional but can be used to customize your deployment:

  • Resource Preset: Required. The Apolo preset that provides the resources, e.g. gpu-xlarge, H100X1, mi210x2. It sets CPU, memory, GPU count, and GPU provider.

  • Hugging Face Model: Required. Provide the model name in the specified field, e.g. meta-llama/Llama-3.1-8B-Instruct, and a Hugging Face token if the model is gated.

  • Enable HTTP Ingress: Exposes the application externally over HTTPS.

  • Hugging Face Tokenizer Name: Name or path of the Hugging Face tokenizer to use. If unspecified, the model name or path is used.

  • Server Extra Args: Optional. Additional arguments passed to the vLLM server. See https://docs.vllm.ai/en/v0.5.1/models/engine_args.html for the available engine arguments.

  • Cache Config: Optional. Configures the storage cache path used to persist your model, which is important for autoscaling. If not specified, a persistent volume (PV) is created automatically and attached to the application.

Any additional chart values can also be provided through --set flags, but the above are the most common.


Web Console UI

Step 1 - Select the Preset you want to use (currently, only GPU-accelerated presets are supported).

Step 2 - Select the model from the Hugging Face repositories. If the model is gated, provide your Hugging Face token, either as a plain string or as an Apolo Secret.

Step 3 - Install and wait for the outputs in the Outputs section of the app.

Apolo CLI

Below is a streamlined example that deploys vLLM using the app-llm-inference application to an NVIDIA GPU preset:
apolo app install -f llm.yaml --cluster <CLUSTER> --org <ORG> --project <PROJECT>
# Example of llm.yaml

template_name: "llm-inference"
input:
  preset:
    name: "gpu-l4-x1"
  hugging_face_model:
    model_hf_name: "meta-llama/Llama-3.1-8B-Instruct"
    hf_token: <INSERT_HF_TOKEN>
  ingress_http:
    http_auth: false
    enabled: true

Explanation:

  • preset.name: "gpu-l4-x1" requests a single NVIDIA L4 GPU. Apolo automatically sets CUDA_VISIBLE_DEVICES and default parallelization flags unless overridden.

  • model_hf_name: "meta-llama/Llama-3.1-8B-Instruct": The Hugging Face model to load.

  • ingress_http: Creates a public domain (e.g. vllm-large.apps.<YOUR_CLUSTER_NAME>.org.neu.ro) pointing to the vLLM deployment.
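
Once the app reports a running state, you can verify that the endpoint is serving the model. This is a minimal sketch that assumes the app exposes vLLM's OpenAI-compatible HTTP API (the same API used in the Usage section below); replace <APP_HOST> with the hostname shown in the app's Outputs section:

import requests

# GET /v1/models lists the models served by the vLLM deployment
resp = requests.get("<APP_HOST>/v1/models", timeout=30)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should include meta-llama/Llama-3.1-8B-Instruct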

Usage

import requests

API_URL = "<APP_HOST>/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
}

data = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # Must match the model name loaded by vLLM
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that gives concise and clear answers.",
        },
        {
            "role": "user",
            "content": (
                "I'm preparing a presentation for non-technical stakeholders "
                "about the benefits and limitations of using large language models in our customer support workflows. "
                "Can you help me outline the key points I should include, with clear, jargon-free explanations and practical examples?"
            ),
        },
    ]
}


if __name__ == "__main__":
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()

    # The response follows the OpenAI-compatible chat completions format
    reply = response.json()["choices"][0]["message"]["content"]
    print("Assistant:", reply)
    print("Status Code:", response.status_code)

References

  • vLLM Official GitHub Repo
  • app-llm-inference Helm Chart Repository
  • vLLM engine arguments: https://docs.vllm.ai/en/v0.5.1/models/engine_args.html
  • Apolo Documentation (for the usage of apolo run and resource presets)
  • Hugging Face Model Hub (for discovering or hosting models)