LLaMA 4
LLaMA 4 is a high-performance, scalable family of large language models developed by Meta AI. It is optimized for inference efficiency and supports both instruction-tuned and base variants, making it suitable for a wide range of production NLP tasks and deployments.
The LLaMA 4 app is built on the vLLM application and is designed for instant deployment, minimizing configuration effort.
If you need to customize the deployment, use the configurable application variant. Examples of tunable parameters include server-extra-args, Ingress authentication, and more.
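As an illustration only, the snippet below sketches the kind of values server-extra-args could carry. The flags themselves (--max-model-len, --tensor-parallel-size) are standard vLLM server options, but the exact format the configurable app variant expects for this field is an assumption here.

# Hypothetical contents for server-extra-args (the list format is an
# assumption; the flags are standard vLLM server options).
server_extra_args = [
    "--max-model-len=8192",      # cap the context length to reduce vRAM usage
    "--tensor-parallel-size=2",  # shard the model across two GPUs
]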
Key Features
Minimal Setup: A suitable resource preset is detected automatically based on the chosen model.
High Throughput Inference: GPU KV caching enables faster token generation compared to traditional implementations (see the sketch after this list).
Multi-GPU Support: Scales seamlessly across multiple GPUs, including AMD (MI200s, MI300s) and NVIDIA (V100, A100, H100) resource pools.
Easy Model Downloading: Built-in integration with Hugging Face model repositories.
Lightweight & Extensible: Minimal overhead for deployment and easy to integrate with existing MLOps or monitoring solutions.
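To give a rough intuition for the KV-caching point above, here is a minimal toy sketch of a paged KV cache in Python: token positions map to fixed-size physical blocks allocated on demand, which is the general idea behind vLLM's PagedAttention. This is an illustration, not vLLM's actual implementation.

# Toy sketch of a paged KV cache: logical token slots map to fixed-size
# physical blocks allocated on demand. Illustrative only.
BLOCK_SIZE = 16  # tokens per physical cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_table = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block holding `position` for `seq_id`,
        allocating a new block when a block boundary is crossed."""
        table = self.block_table.setdefault(seq_id, [])
        needed = position // BLOCK_SIZE + 1
        while len(table) < needed:
            table.append(self.free_blocks.pop())  # allocate on demand
        return table[position // BLOCK_SIZE]

cache = PagedKVCache(num_blocks=64)
print(cache.append_token(seq_id=0, position=0))   # first block for sequence 0
print(cache.append_token(seq_id=0, position=17))  # crosses into a second block

Because blocks are allocated only as sequences grow, memory is not reserved up front for the full context window, which is what allows higher batch sizes and throughput.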
Installation and Deployment on Apolo
You can deploy LLaMA 4 on the Apolo platform using the LLaMA 4 app. Apolo automates resource allocation, persistent storage, ingress, GPU detection, and environment variable injection, so you can focus on model configuration.
Highlights of the Apolo Installation Flow:
Integration with Hugging Face: You can pass your Hugging Face token via an environment variable to pull private models (see the sketch after this list).
Preset Auto-Configuration: A resource preset is selected by default based on the Hugging Face model's minimum vRAM requirements.
Ingress Setup: Enabled by default, without authentication.
Autoscaling: Apolo provides built-in support for horizontal pod autoscaling of LLaMA 4 deployments based on incoming request load (defaults: minimum 1 replica, maximum 5 replicas, scaling triggered at 100 requests per second).
Cache: Model caching is enabled by default; the storage path is storage://{cluster_name}/{org_name}/{project_name}/llm_bundles
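Before installing, you can verify that your token actually has access to a gated model. A minimal sketch using the huggingface_hub client; the model id is the one used later on this page, and HF_TOKEN is the standard Hugging Face environment variable name.

import os
from huggingface_hub import HfApi

# Check that the token can see the gated repo before deploying.
# Raises an error (e.g. GatedRepoError) if access is missing.
api = HfApi(token=os.environ["HF_TOKEN"])
info = api.model_info("meta-llama/Llama-4-Scout-17B-16E")
print("Access OK:", info.id)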
Apolo Deployment
Hugging Face Token
Required. Provide a Hugging Face token if the model is gated.
Model
Required. Select the model size from the dropdown.
Web Console UI
Step 1 - Select your Hugging Face token
Step 2 - Choose the size of your model

If the model is gated, make sure your Hugging Face token has access to it.
Step 3 - Install the app and wait for the outputs to appear in the Outputs section of the app
Autoscaling
If you enable autoscaling, the following parameters are applied:
Default Autoscaling Configuration:
Trigger Threshold: 100 requests per second
Replica Range: Minimum 1, Maximum 5 replicas
Behavior: The system monitors request volume and automatically adjusts the number of running replicas to match current demand, ensuring efficient use of GPU resources while maintaining performance.
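As an illustration of what these defaults imply (not Apolo's actual controller code), the desired replica count can be derived from the observed request rate like this:

import math

# Illustrative only: how the default thresholds translate request rate
# into a replica count. Apolo's actual autoscaler logic may differ.
MIN_REPLICAS, MAX_REPLICAS, RPS_PER_REPLICA = 1, 5, 100

def desired_replicas(observed_rps: float) -> int:
    wanted = math.ceil(observed_rps / RPS_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

print(desired_replicas(40))   # 1 -> stays at the minimum
print(desired_replicas(250))  # 3 replicas for ~250 req/s
print(desired_replicas(900))  # capped at 5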
Usage
import requests

API_URL = "<APP_HOST>/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
}

data = {
    "model": "meta-llama/Llama-4-Scout-17B-16E",  # Must match the model name loaded by vLLM
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that gives concise and clear answers.",
        },
        {
            "role": "user",
            "content": (
                "I'm preparing a presentation for non-technical stakeholders "
                "about the benefits and limitations of using large language models in our customer support workflows. "
                "Can you help me outline the key points I should include, with clear, jargon-free explanations and practical examples?"
            ),
        },
    ],
}

if __name__ == "__main__":
    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()
    # Extract the assistant's message from the OpenAI-compatible response body.
    reply = response.json()["choices"][0]["message"]["content"]
    print("Assistant:", reply)
    print("Status Code:", response.status_code)
References
Apolo Documentation (for the usage of apolo run and resource presets)
Hugging Face Model Hub (for discovering or hosting models)