GPT OSS

GPT OSS is a family of open-weight large language models released by OpenAI for broad accessibility and strong performance. The models are optimized for high-quality text generation and reasoning, are available in 20B and 120B parameter sizes, and are suitable for a wide range of NLP workloads such as chat, reasoning, and content creation.

GPT OSS models are available through the vLLM application and are designed for instant deployment, minimizing the need for manual setup. For advanced use cases, you can switch to the configurable application variant, which allows customization of parameters such as server-extra-args, Ingress authentication, and more.

Key Features

  • Minimal Setup: Automatic selection of a suitable resource preset based on the chosen model.

  • High Throughput Inference: vLLM's paged KV caching (PagedAttention) enables faster token generation than traditional implementations.

  • Multi-GPU Support: Scales seamlessly to multiple GPUs, including AMD (MI200s, MI300s) and NVIDIA (V100/H100/A100) resource pools.

  • Easy Model Downloading: Built-in integration with Hugging Face model repositories.

  • Lightweight & Extensible: Minimal overhead for deployment and easy to integrate with existing MLOps or monitoring solutions.


Installation and Deployment on Apolo

You can deploy GPT OSS on the Apolo platform using the OpenAI GPT OSS app. Apolo automates resource allocation, persistent storage, ingress setup, GPU detection, and environment variable injection—allowing you to focus entirely on model selection and configuration.

Highlights of the Apolo Installation Flow:

  1. Integration with Hugging Face: You can pass your Hugging Face token via an environment variable to pull gated or private models (see the token-check sketch after this list).

  2. Preset Auto-Configuration: A resource preset is selected by default according to the Hugging Face model's minimum vRAM requirements.

  3. Ingress Setup: Ingress is enabled by default, without authentication.

  4. Autoscaling: Apolo provides built-in support for horizontal pod autoscaling of GPT OSS deployments based on incoming request load. Default parameters: minimum 1 replica, maximum 5 replicas, scaling triggered at 100 requests per second.

  5. Cache: Model caching is enabled by default; the storage path is

    storage://{cluster_name}/{org_name}/{project_name}/llm_bundles
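
Before installing, you can confirm that your Hugging Face token actually has access to a gated model. The following is a minimal sketch, assuming the huggingface_hub package is installed and the token is exported as HF_TOKEN (both are assumptions for illustration, not app requirements):

import os

from huggingface_hub import HfApi

# Raises an error if the token cannot access the gated repository
api = HfApi(token=os.environ["HF_TOKEN"])
info = api.model_info("openai/gpt-oss-20b")
print("Token can access:", info.id)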

Apolo Deployment

Hugging Face Token

Required. Provide a Hugging Face token if the model is gated.

Model

Required. Select the model size from the dropdown.


Web Console UI

Step 1 - Select your Hugging Face token

Step 2 - Choose the size of your model

If the model is gated, make sure that your Hugging Face token has access to it.

Step 3 - Install and wait for the outputs, shown in the Outputs section of the app

Autoscaling

If you enable autoscaling, the following parameters will be applied:

Default Autoscaling Configuration:

  • Trigger Threshold: 100 requests per second

  • Replica Range: Minimum 1, Maximum 5 replicas

  • Behavior: The system monitors request volume and automatically adjusts the number of running replicas to match current demand, ensuring efficient use of GPU resources while maintaining performance.
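
To see autoscaling in action, you can generate synthetic load against the endpoint. The following is a minimal sketch, not part of the app itself; <APP_HOST> is a placeholder for the hostname shown in the app's Outputs section, and the worker count and request volume are arbitrary values chosen to exceed the 100 requests-per-second trigger:

from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "<APP_HOST>/v1/chat/completions"
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}

def send_one(_):
    # Each worker fires a small chat completion request
    return requests.post(API_URL, json=payload, timeout=60).status_code

with ThreadPoolExecutor(max_workers=200) as pool:
    codes = list(pool.map(send_one, range(2000)))

print("Completed:", len(codes), "OK:", codes.count(200))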

Usage

import requests

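# Replace <APP_HOST> with the hostname shown in the app's Outputs section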
API_URL = "<APP_HOST>/v1/chat/completions"

headers = {
    "Content-Type": "application/json",
}

data = {
    "model": "openai/gpt-oss-20b",  # Must match the model name loaded by vLLM
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant that gives concise and clear answers.",
        },
        {
            "role": "user",
            "content": (
                "I'm preparing a presentation for non-technical stakeholders "
                "about the benefits and limitations of using large language models in our customer support workflows. "
                "Can you help me outline the key points I should include, with clear, jargon-free explanations and practical examples?"
            ),
        },
    ]
}


if __name__ == '__main__':

    response = requests.post(API_URL, headers=headers, json=data)
    response.raise_for_status()

    # Extract the assistant's reply from the OpenAI-compatible response body
    reply = response.json()["choices"][0]["message"]["content"]
    print("Assistant:", reply)
    print("Status Code:", response.status_code)
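
Because the deployment exposes an OpenAI-compatible API, you can also call it with the official openai Python client. This is a sketch under the assumption that the openai package is installed and Ingress authentication is disabled, so the api_key value is only a placeholder:

from openai import OpenAI

# base_url points at the app's /v1 endpoint; the key is a dummy value
client = OpenAI(base_url="<APP_HOST>/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "List three benefits of LLMs in customer support."},
    ],
)

print(completion.choices[0].message.content)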
