DeepSeek-R1 model deployment
In this guide, we will walk through the deployment process of the DeepSeek-R1 model using vLLM and Ray on the Apolo platform.
DeepSeek-R1 is a large language model that, according to its developers, achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. The model was trained via large-scale reinforcement learning without supervised fine-tuning as a preliminary step.
DeepSeek-R1 far exceeds the memory capacity of a single GPU, or even of an entire DGX (NVIDIA's full-fledged AI machine). The model has 671 billion parameters, so it requires roughly 1.3 TB of vRAM just to host its weights in the original format. It also has a 128K-token context window, which inflates the memory requirements even further.
You can estimate your model's vRAM requirements using any online calculator, for instance ollama-gpu-calculator.
However, for most use cases it is more efficient to host a quantized version of the model, which drastically reduces the vRAM required for the weights and speeds up inference while slightly sacrificing result quality. Therefore, here we are going to deploy an AWQ-quantized version of the DeepSeek-R1 model (cognitivecomputations/DeepSeek-R1-AWQ). We are going to use two NVIDIA DGX nodes, each with 8 x NVIDIA Tesla H100 SXM cards (80 GB of vRAM per card).
We are going to run the vLLM inference server on top of the Apolo platform. Apolo simplifies deployment by handling resource allocation, workload scheduling, orchestration and isolation of processes, and by provisioning ingress and authentication capabilities.
We are also going to leverage Ray, a distributed computing framework that enables multi-GPU and multi-node inference, to serve the model with the full context-length configuration, which is often required for advanced tasks.
Here is a diagram overview of what we are trying to achieve:
Please note, if you don't have access to such a resource preset, a general guideline for building a sufficiently capable Ray cluster is to:
Run all jobs on the same preset, so the hardware stays homogeneous
Set --pipeline-parallel-size to the number of jobs dedicated to your cluster, including the head job.
Set --tensor-parallel-size to the number of GPUs in each job.
The overall vRAM capacity should be around 1.2 TB for inference at the full context length. For example, two DGX nodes with 8 x H100 80 GB cards each provide 16 x 80 GB = 1.28 TB, which corresponds to --pipeline-parallel-size 2 and --tensor-parallel-size 8. You can reduce the requirement by lowering the context length.
The easiest way to start is to use the Apolo CLI.
Before going further, make sure you've completed the getting-started guide from the main documentation page, installed the Apolo CLI, logged into the platform, and created (or joined) an organization and a project.
Please note that the deployment consists of two main components: a Ray head job and a Ray worker job, which together form a static Ray cluster for hosting the vLLM server. Within the Ray worker job we will launch the vLLM server, which will be spread across this virtual cluster.
First, let's start a Ray head job:
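Below is a sketch of the command, reconstructed from the flag breakdown that follows; the preset name and storage folder are the ones used in this guide, but the exact syntax may differ slightly between CLI versions:

```
apolo run -s dgx --life-span 10d \
  -v storage:hf-cache:/root/.cache/huggingface \
  --entrypoint ray \
  vllm/vllm-openai -- start --block --head --port=6379
```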
A brief breakdown of the command:
-s dgx specifies the preset (hardware configuration) name that will be used while running the job.
--life-span 10d configures how long Apolo should keep the job running.
-v storage:hf-cache:/root/.cache/huggingface mounts the platform storage folder hf-cache into the job's filesystem under /root/.cache/huggingface. This is where vLLM looks for model binaries on startup.
--entrypoint ray overrides the container image's default entrypoint with the ray executable.
vllm/vllm-openai is the container image used to run the job.
start --block --head --port=6379 are the command arguments passed to the ray executable.
After scheduling, the Ray head job will start, waiting for incoming connections and serving requests. Leave it running for now.
Before going further, note the output of the following command:
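As a sketch, assuming the id of the head job (printed when it was launched) is substituted for the placeholder:

```
apolo status <ray-head-job-id>
```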
It shows runtime information about the running job on the platform. We are particularly interested in the Internal Hostname Named row. This is a domain name within the cluster, which the worker job needs in order to connect to the head.
Second, let's start a Ray worker job. Open a new console window and copy-paste the following command, substituting the address of your Ray head job accordingly.
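A sketch of what this command could look like, assuming a bash entrypoint that first joins the Ray cluster and then launches vLLM with the parallelism settings discussed above; HTTP ingress options are omitted for brevity, and the head address placeholder must be replaced with the Internal Hostname Named value noted earlier:

```
apolo run -s dgx --life-span 10d \
  -v storage:hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai -- -c "ray start --address=<ray-head-internal-hostname>:6379 && \
    vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
      --tensor-parallel-size 8 \
      --pipeline-parallel-size 2 \
      --distributed-executor-backend ray"
```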
This instructs Apolo to start the Ray worker job on another DGX machine. Within this job, we also launch the vLLM server on top of the running Ray cluster; it will use high-throughput intra-cluster communication (InfiniBand) to serve the model distributed across the two machines. The startup process might take some time, depending on your cluster settings.
You will see the following line when the server is ready to accept requests:
When the model is up and running, you can use any standard client to query it. To find the endpoint for sending requests, check the status of the ray-worker job:
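Again a sketch, with the worker job id substituted for the placeholder:

```
apolo status <ray-worker-job-id>
```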
There you will see the public domain name of the job in the Http URL row. Now, let's use curl to send a request:
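A minimal sketch of an OpenAI-compatible chat request; the base URL placeholder stands for the Http URL value from the status output, and depending on your platform settings an Authorization header with a platform token may also be required:

```
curl <ray-worker-http-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cognitivecomputations/DeepSeek-R1-AWQ",
        "messages": [
          {"role": "user", "content": "Explain the Pythagorean theorem in one paragraph."}
        ],
        "max_tokens": 256
      }'
```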
To clean up, hit Ctrl+C in the console windows where the jobs were launched. Alternatively, use the CLI command apolo kill <job-id> to terminate a job.
This is it! You now have a fully operational, distributed DeepSeek-R1 deployment on Apolo, serving inference requests efficiently with vLLM!
Apolo Flow allows you to template workload configurations into config files. The configuration snippet below arranges the previously discussed deployment process into a couple of Apolo Flow job descriptions. We also extend this scenario with the deployment of an OpenWebUI server that acts as a web interface for chatting with your models.
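A shortened sketch of what such a live flow could look like. The job names match those used below, but the field values (presets, images, commands, environment variables) are illustrative and may differ from the full file in the repository:

```yaml
kind: live
title: deepseek-r1

jobs:
  ray_head:
    image: vllm/vllm-openai
    preset: dgx                  # preset name used in this guide
    life_span: 10d
    volumes:
      - storage:hf-cache:/root/.cache/huggingface
    entrypoint: ray
    cmd: start --block --head --port=6379

  ray_worker:
    image: vllm/vllm-openai
    preset: dgx
    life_span: 10d
    volumes:
      - storage:hf-cache:/root/.cache/huggingface
    entrypoint: bash
    cmd: >-
      -c "ray start --address=<ray-head-internal-hostname>:6379 &&
      vllm serve cognitivecomputations/DeepSeek-R1-AWQ
      --tensor-parallel-size 8 --pipeline-parallel-size 2"

  web:
    image: ghcr.io/open-webui/open-webui:main   # OpenWebUI chat front-end
    preset: cpu-small                           # illustrative preset name
    http_port: 8080
    browse: true
    env:
      # points OpenWebUI at the vLLM OpenAI-compatible endpoint (vLLM's default port is 8000)
      OPENAI_API_BASE_URL: http://<ray-worker-internal-hostname>:8000/v1
```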
You can find the full configuration file in our GitHub repository.
Now start the ray_head, ray_worker, and web jobs. Please give Ray a minute to spin up the head job before continuing with the worker node.
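As a sketch, assuming the jobs are started one by one with the Apolo Flow CLI from the directory containing the flow file:

```
apolo-flow run ray_head
# wait for the head job to come up, then:
apolo-flow run ray_worker
apolo-flow run web
```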
As in the previous case, give vLLM some time to load such a large model into GPU memory.
You can observe the server startup process by connecting to the log stream using the CLI command:
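For example, assuming the worker job's logs are the ones of interest:

```
apolo-flow logs ray_worker
```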
You will see the following line when the server is ready to accept requests:
You can use any client that best fits your needs. You might also notice that a web page with OpenWebUI has opened in your default browser. There you can query your model:
That's it! Now you know how to spin up the DeepSeek-R1 model using vLLM and expose its interface via the web browser using the OpenWebUI web-server app.
Don't forget to terminate the launched jobs when you no longer need them:
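A sketch of the cleanup, assuming the Apolo Flow CLI accepts ALL to target every job of the flow:

```
apolo-flow kill ALL
```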
This will terminate all jobs launched in this flow.