DeepSeek-R1 distilled models
Large language models (LLMs) have become essential tools for AI applications, but deploying them efficiently remains a challenge. DeepSeek Distilled models offer a balance between performance and efficiency, making them ideal for various real-world use cases.
In this guide, we will walk through the deployment process of DeepSeek-R1 distilled LLMs using vLLM on the Apolo platform.
Apolo simplifies deployment by handling:
Resource Allocation – Ensuring the requested GPU and CPU resources are available to your workload.
Workload Scheduling – Dynamically managing inference jobs.
Orchestration & Isolation – Serving model endpoints securely in isolated environments.
Ingress & Authentication – Providing built-in API access control and request routing.
Hardware Considerations
Since Apolo abstracts away infrastructure management, you don't need to configure hardware manually. Instead, you operate on resource presets, each describing a collection of resources available to your workload at runtime. For high-performance inference, consider an Apolo resource preset with:
GPU: NVIDIA A100 (40GB), H100, or RTX 4090. VRAM requirements depend on the model and the required context length. A minimum of 24GB VRAM is recommended.
CPU: At least 8 vCPUs for handling requests efficiently.
RAM: 32GB+ for handling larger batch sizes or concurrent requests.
Storage: use Apolo storage backed by high-throughput SSD/NVMe to persist model binaries and speed up startups.
This guide outlines three distinct ways to deploy large language models on the Apolo platform; choose the one that suits you best, or try all three:
Deploy with CLI
Deploy with Apolo Flow
Deploy with Apps GUI
By the end of this guide, you’ll have a fully operational DeepSeek Distilled model deployed on Apolo, serving inference requests efficiently with vLLM.
Let’s get started!
Deploy with Apolo CLI
The easiest way to start the server is to use the Apolo CLI.
First, make sure you've completed the getting-started guide from the main documentation page, since here we assume that you have already installed the Apolo CLI, logged into the platform, and created (or joined) an organization and a project.
Second, pick the resource preset you want to run your server on. For this, check the Apolo cluster settings either in the web console or via the CLI command apolo config show.
In this example, we have the following presets that can do the job:
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
A 70B Llama model requires around 140 GB of VRAM to hold its weights in FP16 (70B parameters × 2 bytes per parameter), plus additional memory for the KV cache. Therefore, we use the H100x2 preset here.
To start the vLLM server, use the following CLI command:
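A minimal sketch of such a command, assembled from the breakdown below. The trailing vLLM arguments and their values (tensor parallelism, context length) are assumptions that you should adjust for your model and use case:

```bash
apolo run -s H100x2 \
  -v storage:hf-cache:/root/.cache/huggingface \
  --http-port 8000 \
  vllm/vllm-openai -- \
  --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```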
The rest will be done by the platform: it allocates the resources, creates the job, starts the server, exposes the endpoint, collects metrics, and so on.
A brief breakdown of the command:
-s H100x2 specifies the name of the preset hardware configuration used to run the job.
-v storage:hf-cache:/root/.cache/huggingface mounts the platform storage folder hf-cache into the job's filesystem under /root/.cache/huggingface. This is where vLLM looks for model binaries at startup.
--http-port 8000 tells Apolo which port from within the job to expose.
vllm/vllm-openai is the name of the container image used to run the job.
The rest of the command consists of vLLM arguments. A full description can be found in the corresponding section of the vLLM documentation.
Upon successful execution of the command, you should see output similar to the following:
Note the 'Http URL' value: this is the endpoint to direct your queries to once the job is fully up and running.
You will see the following line when the server is ready to accept requests:
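The exact wording depends on the vLLM version. As an assumption, for the vllm/vllm-openai image this is typically the standard Uvicorn startup message, along the lines of:

```
INFO:     Application startup complete.
```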
Query the model
When the model is up and running, you can use any standard OpenAI-compatible client to query it. In this example, we use curl to send the request:
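A sketch of such a request against vLLM's OpenAI-compatible chat completions endpoint. The URL and token placeholders are assumptions: substitute the 'Http URL' printed for your job and your own platform token.

```bash
# Hypothetical endpoint and token; replace with your job's Http URL and your Apolo token.
curl -s https://<your-job-http-url>/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_APOLO_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "messages": [
          {"role": "user", "content": "Explain what model distillation is in two sentences."}
        ],
        "max_tokens": 256
      }'
```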
Note that the authorization header ("Authorization: Bearer ...") is required here by default, although you can disable it. See the description of all possible job parameters and configuration options in the Apolo CLI documentation.
To clean up, hit Ctrl+C in the console window where the job was launched. Alternatively, terminate the job with the CLI command apolo kill <job-id>.
That's it! You now have a fully operational DeepSeek distilled model deployed on Apolo, serving inference requests efficiently with vLLM!
Deploy with Apolo Flow
Apolo Flow allows you to template workload configurations in config files. By the end of the getting-started guide, you should be able to create and run a flow from a template. The scenario below combines exposing a DeepSeek distilled model and making it queryable in the browser via the OpenWebUI interface.
For this scenario, we need to create a flow with the following components:
Apolo storage volumes that host model binaries and web server files
A vLLM server that serves inference requests, exactly as in the previous section
An OpenWebUI server that acts as a web interface where you can chat with your model.
Here is the configuration file snippet that implements these components:
You can find the full configuration file in our GitHub repository.
The overall description of the flow configuration syntax can be found on a dedicated documentation page. Let's now start the vllm and web jobs.
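Assuming the jobs in the flow configuration are named vllm and web (the same names used by the commands in this section), starting the vLLM server job looks like this:

```bash
apolo-flow run vllm
```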
You can observe the server startup process by connecting to the log stream using the CLI command apolo-flow logs vllm.
Query the model
In this case, we describe another approach to consuming the service provided by the LLM inference server. Here we start the OpenWebUI web server, which resembles OpenAI's ChatGPT application. It allows you to chat with multiple LLM servers at the same time, including OpenAI-compatible servers (which vLLM is) and Ollama APIs.
To start a web chat with your DeepSeek model, issue the CLI command:
apolo-flow run web
The OpenWebUI job's URL mentioned above will automatically open in your default browser. After you log into the system, you will be able to chat with your model. Just make sure the vllm job has finished its startup.
That's it! You have created and run a flow from a template; your flow has spun up a DeepSeek distilled model using vLLM and exposed its interface in the web browser via the OpenWebUI web server app.
Deploy with Apolo Apps GUI
Apolo applications allow you to deploy entire systems, together with their auxiliary resources, as a single holistic application. This includes LLM inference servers, providing the scaling and reliability required for production-ready projects.
The deployment process itself is quite straightforward: navigate to the Apolo web console, select the LLM Inference application, and start the installation.
The following configuration screenshot resembles the previously discussed case:
After the installation completes, you can find the endpoint for inference in application outputs.
Deploy other DeepSeek-R1 distilled models
To deploy any other model, first estimate the amount of VRAM required to fit it: roughly the number of parameters multiplied by 2 bytes for FP16 weights, plus headroom for the KV cache. Then select a corresponding resource preset and adjust the vLLM CLI arguments; in particular, you might need to tweak the number of tensor-parallel replicas (tensor-parallel-size) as well as the context length (max-model-len).
The rest stays the same: Apolo takes care of operations, leaving you focused on the job.
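For instance, here is a hypothetical sketch for the smaller DeepSeek-R1-Distill-Qwen-14B variant: its FP16 weights take roughly 14B parameters × 2 bytes ≈ 28 GB, so a single 80 GB H100 fits them with room to spare for the KV cache. The preset name and argument values below are assumptions to adapt to the presets available on your cluster:

```bash
# Hypothetical single-GPU deployment; adjust the preset name and vLLM arguments to your cluster.
apolo run -s H100x1 \
  -v storage:hf-cache:/root/.cache/huggingface \
  --http-port 8000 \
  vllm/vllm-openai -- \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
  --tensor-parallel-size 1 \
  --max-model-len 16384
```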