LLM Inference
Overview
vLLM is a high-performance, memory-efficient inference engine for large language models. It uses a novel GPU KV cache management strategy to serve transformer-based models at scale and supports both NVIDIA and AMD GPUs. vLLM enables fast decoding and efficient memory utilization, making it suitable for production deployments of large LLMs.
Managing the application via Apolo CLI
The LLM Inference (vLLM) application can be installed on Apolo either via the CLI or the Web Console. Below are detailed instructions for installing it using the Apolo CLI.
Install via Apolo CLI
Step 1 — Use the CLI command to get the application configuration file template:
apolo app-template get llm-inference > llm.yaml
Step 2 — Customize the application parameters. Below is an example configuration file:
# Application template configuration for: llm-inference
# Fill in the values below to configure your application.
# To use values from another app, use the following format:
# my_param:
#   type: "app-instance-ref"
#   instance_id: "<app-instance-id>"
#   path: "<path-from-get-values-response>"
template_name: llm-inference
template_version: v25.7.0
input:
  # Select the resource preset used per service replica.
  preset:
    # The name of the preset.
    name: <>
  # Enable access to your application over the internet using HTTPS.
  ingress_http:
    # Enable or disable HTTP ingress.
    enabled: true
  # Hugging Face Model Configuration.
  hugging_face_model:
    # The name of the Hugging Face model.
    model_hf_name: <>
    # The Hugging Face API token.
    hf_token: <>
Explanation of configuration parameters:
Resource Preset: Choose a compute preset that defines the hardware resources allocated to each replica, for example H100x1.
HTTP Ingress: Enable to expose the application over the internet via HTTPS.
Hugging Face Model: Set the Hugging Face model you want to deploy, and provide a Hugging Face API token if the model requires authentication.
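For reference, a completed configuration might look like the following sketch. The preset name, model, and token are example values only; adjust them to the presets available in your cluster and to your own Hugging Face account, and keep the token secret:
template_name: llm-inference
template_version: v25.7.0
input:
  preset:
    name: H100x1
  ingress_http:
    enabled: true
  hugging_face_model:
    model_hf_name: meta-llama/Llama-3.1-8B-Instruct
    hf_token: <your-hf-token>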
Step 3 — Deploy the application in your Apolo project:
apolo app install -f llm.yaml
Monitor the application status using:
apolo app list
To uninstall the application, use:
apolo app uninstall <app-id>
To view the application logs, use:
apolo app logs <app-id>
For instructions on how to access the application, please refer to the Usage section.
Usage
After installation, you can use vLLM for different kinds of workflows:
Go to the Installed Apps tab.
You will see a list of all running apps, including the vLLM app you just installed. Click the Details button to open detailed information about the app or to uninstall it.
On the Details page, scroll down to the Outputs section. Find the HTTP API output with the public domain address, copy it, and use it as the API host in the script below.
import requests

API_URL = "<APP_HOST>/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
}

data = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # Must match the model name loaded by vLLM
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that gives concise and clear answers.",
        },
        {
            "role": "user",
            "content": (
                "I'm preparing a presentation for non-technical stakeholders "
                "about the benefits and limitations of using large language models in our customer support workflows. "
                "Can you help me outline the key points I should include, with clear, jargon-free explanations and practical examples?"
            ),
        },
    ],
}

if __name__ == "__main__":
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    # Extract the assistant's reply from the OpenAI-compatible response body
    reply = response.json()["choices"][0]["message"]["content"]
    print("Assistant:", reply)
    print("Status Code:", response.status_code)
References:
Apolo Documentation (for the usage of apolo run and resource presets)
Hugging Face Model Hub (for discovering or hosting models)