vLLM Inference

Overview

vLLM is a high-performance, memory-efficient inference engine for large language models. It uses PagedAttention, a paged GPU KV-cache management strategy, to serve transformer-based models at scale, and it supports multi-GPU deployments on both NVIDIA and AMD hardware. Fast decoding and efficient memory utilization make vLLM well suited to production deployments of LLMs.

Managing the application via Apolo CLI

vLLM can be installed on Apolo via either the CLI or the Web Console. Below are detailed instructions for installing it with the Apolo CLI.

Install via Apolo CLI

Step 1 — Use the CLI command to get the application configuration file template:

apolo app-template get llm-inference > llm.yaml

Step 2 — Customize the application parameters. Below is an example configuration file:

# Application template configuration for: llm-inference
# Fill in the values below to configure your application.
# To use values from another app, use the following format:
# my_param:
#   type: "app-instance-ref"
#   instance_id: "<app-instance-id>"
#   path: "<path-from-get-values-response>"

template_name: llm-inference
template_version: v25.7.0
input:
  # Select the resource preset used per service replica.
  preset:
    # The name of the preset.
    name: <>
  # Enable access to your application over the internet using HTTPS.
  ingress_http:
    # Enable or disable HTTP ingress.
    enabled: true
  # Hugging Face Model Configuration.
  hugging_face_model:
    # The name of the Hugging Face model.
    model_hf_name: <>
    # The Hugging Face API token.
    hf_token: <>

Explanation of configuration parameters (a filled-in example follows the list):

  1. Resource Preset: the compute preset that defines the hardware resources allocated to each replica. Example: H100x1

  2. HTTP Ingress: enable this to expose the application over HTTPS.

  3. Hugging Face Model: the name of the Hugging Face model you want to deploy, together with a Hugging Face API token (required for gated models).
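
For reference, a filled-in configuration might look like the following. The model name is only an illustration, and the preset name must match a preset available in your cluster:

template_name: llm-inference
template_version: v25.7.0
input:
  preset:
    name: H100x1
  ingress_http:
    enabled: true
  hugging_face_model:
    # Illustrative model; substitute the model you actually want to serve.
    model_hf_name: meta-llama/Llama-3.1-8B-Instruct
    hf_token: <your-hf-token>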

Step 3 — Deploy the application in your Apolo project:
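A typical invocation, assuming the standard apolo app subcommands (run apolo app --help to confirm the exact syntax for your CLI version):

apolo app install -f llm.yaml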

Monitor the application status using:
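For example (again assuming the standard apolo app subcommands), list installed apps together with their current statuses:

apolo app list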

To uninstall the application, use:
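For example, using the app instance ID reported in the app list:

apolo app uninstall <app-instance-id>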

If you want to see logs of the application, use:
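For example, with the same app instance ID:

apolo app logs <app-instance-id>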

For instructions on how to access the application, please refer to the Usage section.

Usage

After installation, you can use vLLM for different kinds of workflows:

  1. Go to the Installed Apps tab.

  2. You will see a list of all running apps, including the vLLM app you just installed. To view detailed information or uninstall the app, click the Details button.

Once on the Details page, find the API endpoint URL in the Outputs section and use it in your API calls, as shown in the script below.
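
A minimal sketch of such a call in Python, assuming the deployment exposes the standard vLLM OpenAI-compatible API. The endpoint placeholder and model name are assumptions to be replaced with the values from your deployment; if your ingress requires platform authentication, also pass the appropriate Authorization header.

import requests

# Endpoint from the Outputs section of the app's Details page (placeholder).
API_ENDPOINT = "https://<your-vllm-app-endpoint>"

response = requests.post(
    f"{API_ENDPOINT}/v1/chat/completions",
    json={
        # The Hugging Face model you deployed (illustrative placeholder).
        "model": "<model_hf_name>",
        "messages": [
            {"role": "user", "content": "Explain what vLLM is in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])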

