DeepSeek-R1 model deployment
In this guide, we will walk through the deployment process of the DeepSeek-R1 model using vLLM and Ray on the Apolo platform.
DeepSeek-R1 is a large language model that, according to its developers, achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. The model was trained via large-scale reinforcement learning without supervised fine-tuning as a preliminary step.
Hardware Considerations
The DeepSeek-R1 model far exceeds the memory capacity of a single GPU, or even of an entire DGX (NVIDIA's full-fledged AI machine). The model has 671 billion parameters, so it requires ~1.3 TB of vRAM just to host its weights in their original format. It also supports a 128K-token context window, which inflates the memory requirements even further.
You can estimate the model's vRAM requirements using any online calculator, for instance the ollama-gpu-calculator.
For most use cases it is more efficient to host a quantized version of the model, which drastically reduces the memory requirements and speeds up inference at the cost of slightly lower result quality. Here we are going to deploy an AWQ-quantized version of the DeepSeek-R1 model (cognitivecomputations/DeepSeek-R1-AWQ). We are going to use two NVIDIA DGX nodes with 8 x NVIDIA Tesla H100 SXM cards (80 GB of vRAM) each, which in total gives us approximately 1.2 TB of vRAM, more than enough to fit the model at full context length.
If you have access to different resource pools in your cluster, you should first work out how many workers need to be connected to the Ray cluster. It is best to use the same resource preset for all jobs so the load is split evenly. Here is what you need to do:
Identify the amount of vRAM each GPU has, either from the GPU model name or manually:
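For example, a quick manual check is to run `nvidia-smi` in a short-lived job on the preset you plan to use (the preset name below is a placeholder; the image is reused from this guide):

```bash
# Print per-GPU memory on the chosen preset
apolo run -s <your-preset> --entrypoint nvidia-smi vllm/vllm-openai -- \
  --query-gpu=name,memory.total --format=csv
```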
In this case, we have 2 GPUs with 80 GB of vRAM each. This means you will need ~7 workers running on this preset (assuming we need ~1.2 TB of vRAM in total).
Software Considerations
We are going to run the vLLM inference server on top of the Apolo platform. Apolo simplifies deployment by handling resource allocation, workload scheduling, orchestration and isolation of processes, and by provisioning ingress and authentication capabilities.
We are also going to leverage Ray, a distributed computing framework that enables multi-GPU and multi-node inference, to serve the model at its full context length, which is often required for advanced tasks.
Here is a diagram overview of what we are trying to achieve:
Please note, if you don't have access to such a resource preset, a generic guideline for building a sufficiently capable Ray cluster is to:

- Run all jobs on the same preset, so the hardware stays homogeneous.
- Set `--pipeline-parallel-size` to the number of jobs dedicated to your cluster, including the head job.
- Set `--tensor-parallel-size` to the number of GPUs in each job.

The overall vRAM capacity should be around 1.2 TB for inference at the full context length. You can reduce this requirement by lowering the context length.
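For instance, with the two DGX jobs used in this guide (the head plus one worker, 8 GPUs each), the vLLM flags would be set roughly as in the sketch below; the `--distributed-executor-backend ray` flag is our assumption about how vLLM is attached to the Ray cluster:

```bash
# Sketch only: 2 jobs (head + 1 worker) x 8 GPUs per job = 16 GPUs in total
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 8
```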
Deploy with Apolo CLI
The easiest way to start is to use the Apolo CLI.
Before going further, make sure you've completed the getting-started guide from the main documentation page, installed the Apolo CLI, logged into the platform, and created (or joined) an organization and a project.
Please note, the deployment consists of two main components: a Ray head job and Ray worker job(s), which together form a static Ray cluster for hosting the vLLM server. Within one of the Ray worker jobs we will launch the vLLM server, which will be spread across this virtual cluster.
First, let's start a Ray head job:
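A sketch of the head-job command, reconstructed from the breakdown below; the exact syntax may differ slightly between Apolo CLI versions:

```bash
apolo run -s dgx --life-span 10d \
  -v storage:hf-cache:/root/.cache/huggingface \
  --entrypoint ray \
  vllm/vllm-openai -- start --block --head --port=6379
```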
A brief breakdown of the command:

- `-s dgx` specifies the name of the resource preset (hardware configuration) used to run the job.
- `--life-span 10d` configures for how long Apolo should keep the job running.
- `-v storage:hf-cache:/root/.cache/huggingface` mounts the platform storage folder `hf-cache` into the job's filesystem under `/root/.cache/huggingface`. This is where vLLM looks for model binaries on startup.
- `--entrypoint ray` overrides the default entrypoint of the container image with the `ray` executable.
- `vllm/vllm-openai` is the container image used to run the job.
- `start --block --head --port=6379` are the command arguments passed to the `ray` executable.
Once scheduled, the Ray head job will wait for incoming connections and serve requests. Leave it running for now.
Before going further, note the output of the following command:
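For example (substitute the ID or name of your Ray head job; the `status` subcommand form is an assumption about your CLI version):

```bash
apolo status <ray-head-job-id>
```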
It retrieves runtime information about the running job from the platform. We are particularly interested in the `Internal Hostname named` row. This is the domain name within the cluster that the worker jobs need in order to connect to the head.
Second, let's start the Ray worker node(s). If you need more than one worker node, start all of them except one now. If a single worker is enough, proceed to the next step.
Open a new console window and copy-paste the following command, substituting the address of your Ray head job accordingly.
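A sketch of such a worker command, mirroring the head job but pointing `ray start` at the head's internal hostname (the hostname below is a placeholder):

```bash
apolo run -s dgx --life-span 10d \
  -v storage:hf-cache:/root/.cache/huggingface \
  --entrypoint ray \
  vllm/vllm-openai -- start --block --address=<ray-head-internal-hostname>:6379
```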
Third, let's start the last Ray worker, which will also launch the model deployment. The model's endpoint will be hosted by this job, so you should use this job's host URL to send requests.
Run the following command in your console:
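Conceptually, it looks roughly like the heavily simplified sketch below: the job joins the Ray cluster and then launches `vllm serve` with the parallelism settings from the Software Considerations section. The hostname is a placeholder, and both the single `bash -c` wrapper and the `--distributed-executor-backend ray` flag are assumptions on our side:

```bash
apolo run -s dgx --life-span 10d \
  -v storage:hf-cache:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai -- -c \
  "ray start --address=<ray-head-internal-hostname>:6379 && \
   vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
     --distributed-executor-backend ray \
     --pipeline-parallel-size 2 \
     --tensor-parallel-size 8"
```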
This instructs Apolo to start a Ray worker job on another DGX machine (unless you change the preset name). Within this job, we also launch the vLLM server on top of the running Ray cluster; it uses high-throughput intra-cluster communication (InfiniBand) to serve the model distributed across the two machines. The startup process might take some time, depending on your cluster settings.
You will see the following line when the server is ready to accept requests:
Query the model
When the model is up and running, you can use any standard client to query it. To find the endpoint for sending requests, check the status of the last ray-worker job:
There you will see the public domain name of the job in the `Http URL` row. Now, let's use `curl` to send a request:
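For example (the host is a placeholder for your job's `Http URL`; depending on platform settings, an `Authorization: Bearer <token>` header may also be required):

```bash
curl https://<ray-worker-http-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cognitivecomputations/DeepSeek-R1-AWQ",
        "messages": [{"role": "user", "content": "Explain, briefly, why the sky is blue."}]
      }'
```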
To clean up, hit Ctrl+C in the console windows where the jobs were launched. Alternatively, use the CLI commands:

- `apolo ps` to list your running jobs
- `apolo kill <job-id1> <job-id2> ...` to terminate the job(s)
That's it! You now have a fully operational distributed DeepSeek-R1 deployment on Apolo, serving inference requests efficiently with vLLM!
Deploy with Apolo Flow
Apolo Flow allows you to template workload configurations into config files. The configuration file snippet below arranges the previously discussed deployment process into a couple of Apolo Flow job descriptions. We also extend this scenario with the deployment of an OpenWebUI server that acts as a web interface for chatting with your models.
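Here is a trimmed, illustrative sketch of what such a live configuration might look like; the job and volume layout mirrors the CLI section above, while the OpenWebUI image and the `cpu-small` preset are assumptions, and the worker's command is elided:

```yaml
# Sketch only -- see the full, tested configuration in the GitHub repository.
kind: live
title: deepseek-r1

jobs:
  ray_head:
    image: vllm/vllm-openai
    preset: dgx
    life_span: 10d
    entrypoint: ray
    volumes:
      - storage:hf-cache:/root/.cache/huggingface
    cmd: start --block --head --port=6379

  ray_worker:
    image: vllm/vllm-openai
    preset: dgx
    life_span: 10d
    volumes:
      - storage:hf-cache:/root/.cache/huggingface
    # joins the head node and launches `vllm serve`, as in the CLI walkthrough

  web:
    # OpenWebUI front-end pointed at the vLLM OpenAI-compatible endpoint
    image: ghcr.io/open-webui/open-webui:main
    preset: cpu-small   # placeholder preset name
    http_port: 8080
    browse: true
```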
You should also adjust the preset names according to the Hardware Considerations.
You can find the full configuration file in our GitHub repository.
Now start the `ray_head`, `ray_worker`, and `web` jobs. Please give Ray a minute to spin up the head job before continuing with the worker node.
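Assuming the Apolo Flow CLI is installed alongside the main CLI, that would be roughly:

```bash
apolo-flow run ray_head
# give the head a minute to come up, then:
apolo-flow run ray_worker
apolo-flow run web
```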
As in the previous case, give vLLM some time to load such a huge model into GPU memory.
You can observe the server startup process by connecting to the log stream using the CLI command:
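For instance, for the worker job that hosts the vLLM server (command form assumed):

```bash
apolo-flow logs ray_worker
```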
You will see the following line when the server is ready to accept requests:
Query the model
You can use any client that best fits your needs. You may also notice that a web page with OpenWebUI has opened in your default browser. There you can query your model:
That's it! Now you know how to spin up the DeepSeek-R1 model using vLLM and expose its interface in a web browser using the OpenWebUI web application.
Don't forget to terminate the launched jobs when you no longer need them:
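A likely form of the cleanup command (assuming the standard flow CLI):

```bash
apolo-flow kill ALL
```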
This will terminate all jobs launched in this flow.