Multi-GPU Benchmarks Report
This report summarizes the benchmark methodology, environment, metrics, and the conclusions we drew from the collected data.
1. Introduction
We conducted a series of vLLM inference benchmarks to evaluate performance under different GPU presets, model sizes, and parallelization strategies (pipeline vs. tensor). We focused on:
Prompt tokens/s and generation tokens/s (throughput).
Requests in different states (running, swapped, waiting).
KV-cache usage on GPU and CPU.
Average request latencies.
Error counts (e.g., OOM or network-related failures).
These tests spanned a range of context lengths (from 2048 up to 128k tokens) and model sizes (1.5B, 3B, 8B, and 32B parameters). We also compared pipeline parallel with tensor parallel deployments to see how different parallelization strategies impacted throughput.
2. Environment & Setup
2.1 GPU Presets
We used the following GPU presets, each with distinct CPU, memory, and GPU counts/types:
| Preset | vCPUs | Memory | VRAM | GPU Count | GPU Type |
|---|---|---|---|---|---|
| gpu-small | 30.0 | 63.0 GB | 16 GB | 1 | NVIDIA V100 (PCIe 16GB) |
| gpu-medium | 60.0 | 126.0 GB | 32 GB | 2 | NVIDIA V100 (PCIe 16GB) |
| gpu-large | 120.0 | 252.0 GB | 64 GB | 4 | NVIDIA V100 (PCIe 16GB) |
| gpu-xlarge | 120.0 | 504.0 GB | 128 GB | 8 | NVIDIA V100 (PCIe 16GB) |
| mi210x1 | 15 | 65.0 GB | 64 GB | 1 | AMD MI210 |
| mi210x2 | 30 | 130.0 GB | 128 GB | 2 | AMD MI210 |
| H100X1 | 63.0 | 265.0 GB | 80 GB | 1 | NVIDIA H100 (PCIe 80GB) |
| H100X2 | 126.0 | 530.0 GB | 160 GB | 2 | NVIDIA H100 (PCIe 80GB) |
Each preset was deployed via a Helm-based “apolo” flow.
2.2 Models
We tested multiple Hugging Face models:
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (~1.5B params)
meta-llama/Llama-3.2-3B-Instruct (~3B params)
meta-llama/Llama-3.1-8B-Instruct (~8B params)
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B (~32B params)
We then varied context length from 2048 up to 128k tokens to measure how throughput, memory usage, and latencies scaled with input size.
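Memory pressure at long context lengths is dominated by the KV cache rather than the model weights. As a rough, illustrative calculation (not produced by the benchmark script), the sketch below estimates KV-cache size per full-length sequence for an 8B-class model; the layer count, KV-head count, and head dimension are assumptions taken from the published Llama-3.1-8B architecture.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only; the architecture
# values below are assumptions based on the public Llama-3.1-8B config).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 65536, 131072):
    gib = per_token * ctx / 1024**3
    print(f"max-model-len={ctx:>6}: ~{gib:5.2f} GiB of KV cache per full-length sequence")
```

At half precision this demand grows linearly with max-model-len, which helps explain why the long-context runs appear only on the presets with the most VRAM per GPU.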
2.3 Benchmark Script & Methodology
A custom Python script:
Deploys each (preset, model) combination with the following arguments (the model, context length, and tensor-parallel size vary per run):
  --host=0.0.0.0 --port=8000 --model=meta-llama/Llama-3.1-8B-Instruct --code-revision=main --tokenizer= --tensor-parallel-size=<number_of_gpus> --dtype=half --max-model-len=<context_length> --enforce-eager --trust-remote-code
  The context length changes with each run, and tensor-parallel-size is set to the number of GPUs available on the preset.
Waits until the endpoint is ready (/v1/models returns 200).
Sends load to the /v1/completions endpoint: 100 total requests, issued 10 at a time (Section 5 also compares this against a lower-concurrency run).
Uses a fixed prompt: "Let's explore some architecture patterns for microservices".
Configures max_tokens=512/2048 and temperature=0.7.
Polls /metrics every second (see the sketch below) for:
  prompt_tokens_total and generation_tokens_total, used to compute tokens/s
  num_requests_running, num_requests_swapped, num_requests_waiting
  gpu_cache_usage_perc, cpu_cache_usage_perc
Tracks per-request latencies and errors.
Continues polling until all requests are finalized (none pending, running, or swapped).
Writes aggregated metrics (averages) into a CSV file.
Generates bar charts for each metric.
We repeated these steps for pipeline parallel vs. tensor parallel where relevant.
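For reference, a minimal sketch of the polling loop described above is shown below. The endpoint and metric names match the methodology; the helper names, parsing, and error handling are illustrative, not the exact script.

```python
# Minimal sketch of the metrics-polling loop (illustrative: helper names,
# parsing, and error handling are simplified relative to the actual script).
import time
import requests

BASE_URL = "http://localhost:8000"  # host/port from the deploy arguments above

# Metric names as listed in the methodology; vLLM exposes them on /metrics
# with a "vllm:" prefix, so we match on the suffix.
COUNTERS = ("prompt_tokens_total", "generation_tokens_total")
GAUGES = ("num_requests_running", "num_requests_swapped", "num_requests_waiting",
          "gpu_cache_usage_perc", "cpu_cache_usage_perc")

def scrape() -> dict:
    """Parse the Prometheus-format /metrics page into {metric_suffix: value}."""
    out = {}
    for line in requests.get(f"{BASE_URL}/metrics", timeout=5).text.splitlines():
        if line.startswith("#") or " " not in line:
            continue
        name, value = line.rsplit(" ", 1)
        name = name.split("{")[0]
        for suffix in COUNTERS + GAUGES:
            if name.endswith(suffix):
                out[suffix] = float(value)
    return out

def poll(interval: float = 1.0) -> None:
    """Poll once per second; derive tokens/s from counter deltas.

    Assumes the load generator has already started sending requests.
    """
    prev, prev_t = scrape(), time.time()
    while True:
        time.sleep(interval)
        cur, now = scrape(), time.time()
        dt = now - prev_t
        prompt_tps = (cur["prompt_tokens_total"] - prev["prompt_tokens_total"]) / dt
        gen_tps = (cur["generation_tokens_total"] - prev["generation_tokens_total"]) / dt
        in_flight = sum(cur[g] for g in GAUGES[:3])  # running + swapped + waiting
        print(f"prompt {prompt_tps:7.1f} tok/s | gen {gen_tps:7.1f} tok/s | "
              f"in-flight {in_flight:.0f}")
        if in_flight == 0:
            break  # all requests finalized
        prev, prev_t = cur, now
```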
3. Overview of Parallel Strategies
Pipeline Parallel: Splits the model layers into stages across multiple GPUs so that each GPU processes a portion of layers in sequence.
Tensor Parallel: Splits tensors (e.g., weight matrices) across multiple GPUs in a more fine-grained way so the same layers are effectively distributed among GPUs.
In general, tensor parallelism is often more efficient for large or similarly sized GPUs, whereas pipeline parallelism can help in some multi-GPU cases but may introduce significant inter-stage waiting time and memory overhead, especially when the pipeline stages carry unequal computational loads.
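In vLLM the two strategies are selected through different engine arguments. The sketch below shows how a deployment could be parameterized either way; this is an assumed minimal launch wrapper, whereas the actual runs went through the Helm-based apolo flow with the arguments listed in Section 2.3.

```python
# Minimal sketch of selecting the parallelization strategy at deploy time
# (assumed launch wrapper; only one strategy is used per deployment).
import subprocess

COMMON_ARGS = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--host=0.0.0.0", "--port=8000",
    "--model=meta-llama/Llama-3.2-3B-Instruct",
    "--dtype=half", "--max-model-len=2048",
    "--enforce-eager", "--trust-remote-code",
]

def launch(strategy: str, num_gpus: int) -> subprocess.Popen:
    """Start one vLLM server that splits the model across `num_gpus` GPUs."""
    if strategy == "tensor":
        extra = [f"--tensor-parallel-size={num_gpus}"]    # shard weight matrices
    else:
        extra = [f"--pipeline-parallel-size={num_gpus}"]  # split layers into stages
    return subprocess.Popen(COMMON_ARGS + extra)

# Example: the gpu-medium preset exposes 2 GPUs.
server = launch("tensor", num_gpus=2)
```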
Below is a side-by-side comparison of pipeline parallel vs. tensor parallel for Llama-3B at a 2048-token context length, where we have overlapping data. The table shows prompt and generation TPS under each strategy, along with the speedup of tensor over pipeline (computed as (tensor - pipeline) / pipeline). Observations follow the table.
Llama-3.2-3B-Instruct - 2048 context length @10req concurrency for 100 total requests
| Preset | Prompt TPS (Pipeline) | Prompt TPS (Tensor) | Gen TPS (Pipeline) | Gen TPS (Tensor) | Prompt Speedup | Gen Speedup |
|---|---|---|---|---|---|---|
| H100X1 | 47.64 | 47.35 | 564.79 | 573.80 | -0.6% | 1.6% |
| H100X2 | 36.14 | 40.98 | 449.28 | 500.43 | 13.4% | 11.4% |
| gpu-medium | 34.17 | 43.03 | 416.18 | 521.92 | 25.9% | 25.4% |
| gpu-small | 47.22 | 47.80 | 581.37 | 576.27 | 1.2% | -0.9% |
| mi210x1 | 47.79 | 46.29 | 581.30 | 574.77 | -3.1% | -1.1% |
| mi210x2 | 34.20 | 40.59 | 425.94 | 489.30 | 18.7% | 14.9% |
Observations (Llama-3B, 2048 ctx, Pipeline vs. Tensor)
H100X1: ~-0.6% prompt speedup, ~1.6% gen speedup.
H100X2: ~13.4% prompt speedup, ~11.4% gen speedup.
gpu-medium: ~25.9% prompt speedup, ~25.4% gen speedup.
gpu-small: ~1.2% prompt speedup, ~-0.9% gen speedup.
mi210x1: ~-3.1% prompt speedup, ~-1.1% gen speedup.
mi210x2: ~18.7% prompt speedup, ~14.9% gen speedup.
Overall, we notice that:
For small models that fit on a single GPU, splitting them across multiple GPUs doesn't help; it actually slows down inference.
On multi-GPU setups, the tensor parallel split is faster than the pipeline parallel split, which is expected.
4. Results & Observations
Below we break down the runs by context length and highlight model performance on specific configurations.
4.1 2048-Token Context Benchmarks
Below we cover four models: Qwen-1.5B, Llama-3B, Llama-8B, and Qwen-32B.
4.1.1 Qwen-1.5B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.74 | 649.78 | 6.95 | 71.64 | 5.10 | 0 |
| H100X2 | 37.44 | 521.10 | 8.47 | 57.64 | 4.33 | 0 |
| gpu-large | 38.09 | 535.91 | 8.35 | 58.93 | 4.43 | 0 |
| gpu-medium | 39.73 | 552.69 | 7.96 | 61.43 | 4.68 | 0 |
| gpu-small | 46.63 | 676.10 | 6.79 | 74.31 | 5.21 | 0 |
| mi210x1 | 43.73 | 618.48 | 7.24 | 68.46 | 5.06 | 0 |
| mi210x2 | 9.35 | 125.16 | 8.84 | 55.09 | 4.13 | 10 |
Observations (Qwen-1.5B, 2048 ctx):
Highest Prompt TPS: gpu-small with 46.63 tokens/s
Highest Gen TPS: gpu-small with 676.10 tokens/s
Notable errors on: mi210x2 (10 errors)
4.1.2 Llama-3B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 9.12 | 105.12 | 7.95 | 64.26 | 5.36 | 10 |
| H100X2 | 40.23 | 500.80 | 7.83 | 55.76 | 4.55 | 0 |
| gpu-large | 41.73 | 510.79 | 7.37 | 58.00 | 4.84 | 0 |
| gpu-medium | 43.03 | 532.96 | 7.24 | 59.75 | 4.91 | 0 |
| gpu-small | 47.22 | 576.56 | 6.49 | 65.87 | 5.51 | 0 |
| gpu-xlarge | 42.42 | 510.52 | 7.31 | 57.83 | 4.91 | 0 |
| mi210x1 | 45.35 | 570.08 | 6.73 | 65.08 | 5.31 | 0 |
| mi210x2 | 38.17 | 475.52 | 8.30 | 52.58 | 4.30 | 0 |
Observations (Llama-3B, 2048 ctx)
Highest Prompt TPS: gpu-small with 47.22 tokens/s
Highest Gen TPS: gpu-small with 576.56 tokens/s
Notable errors on: H100X1 (10 errors)
4.1.3 Llama-8B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 28.07 | 399.07 | 11.60 | 42.91 | 3.04 | 0 |
| H100X2 | 5.08 | 66.26 | 11.01 | 45.72 | 3.19 | 20 |
| gpu-large | 34.24 | 492.47 | 9.52 | 52.81 | 3.69 | 0 |
| gpu-medium | 34.56 | 491.24 | 9.17 | 54.03 | 3.84 | 1 |
| gpu-xlarge | 34.27 | 488.51 | 9.41 | 52.93 | 3.75 | 0 |
| mi210x1 | 28.09 | 407.72 | 11.66 | 43.22 | 3.02 | 0 |
| mi210x2 | 31.46 | 444.93 | 10.17 | 48.53 | 3.47 | 0 |
Observations (Llama-8B, 2048 ctx):
Highest Prompt TPS: gpu-medium with 34.56 tokens/s
Highest Gen TPS: gpu-large with 492.47 tokens/s
Notable errors on: gpu-medium (1 error), H100X2 (20 errors)
4.1.4 Qwen-32B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 14.44 | 123.72 | 22.27 | 13.42 | 1.95 | 0 |
| H100X2 | 20.39 | 175.25 | 15.88 | 18.87 | 2.70 | 0 |
| gpu-xlarge | 8.20 | 73.99 | 12.06 | 27.64 | 3.53 | 10 |
Observations (Qwen-32B, 2048 ctx):
Highest Prompt TPS: H100X2 with 20.39 tokens/s
Highest Gen TPS: H100X2 with 175.25 tokens/s
Notable errors on: gpu-xlarge (10 errors)
4.2 8192-Token Context Benchmarks
4.2.1 DeepSeek-R1-Distill-Qwen-1.5B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 23.80 | 605.55 | 12.79 | 69.47 | 3.21 | 0 |
| H100X2 | 20.07 | 517.68 | 15.33 | 58.55 | 2.89 | 0 |
| gpu-large | 20.76 | 531.39 | 15.24 | 58.74 | 2.61 | 0 |
| gpu-medium | 22.98 | 551.03 | 13.77 | 61.23 | 2.97 | 0 |
| gpu-small | 27.97 | 666.18 | 11.56 | 71.97 | 3.87 | 0 |
| mi210x1 | 24.30 | 602.95 | 12.67 | 68.55 | 3.41 | 0 |
| mi210x2 | 19.05 | 469.01 | 17.02 | 50.65 | 2.56 | 0 |
Observations
Highest Prompt TPS: gpu-small with 27.97 tokens/s
Highest Gen TPS: gpu-small with 666.18 tokens/s
4.2.2 Llama-3.2-3B-Instruct - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 37.67 | 522.51 | 8.28 | 58.67 | 4.70 | 0 |
| H100X2 | 38.42 | 496.42 | 8.23 | 55.05 | 4.63 | 0 |
| gpu-large | 40.86 | 523.26 | 7.82 | 57.39 | 4.85 | 0 |
| gpu-medium | 38.88 | 531.07 | 8.23 | 58.16 | 4.70 | 0 |
| gpu-small | 38.51 | 521.73 | 7.22 | 65.37 | 5.23 | 0 |
| gpu-xlarge | 36.69 | 512.53 | 8.55 | 56.85 | 4.64 | 0 |
| mi210x1 | 37.78 | 545.31 | 7.66 | 64.91 | 5.36 | 1 |
| mi210x2 | 29.98 | 421.53 | 9.11 | 53.83 | 4.29 | 0 |
Observations
Highest Prompt TPS: gpu-large with 40.86 tokens/s
Highest Gen TPS: mi210x1 with 545.31 tokens/s
Notable errors on: mi210x1 (1 error)
4.2.3 Llama-8B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 8.27 | 369.33 | 23.56 | 39.56 | 2.18 | 63 |
| H100X2 | 7.83 | 382.46 | 26.82 | 41.91 | 2.05 | 64 |
| gpu-large | 10.82 | 493.55 | 30.12 | 52.87 | 1.78 | 0 |
| gpu-medium | 10.18 | 503.87 | 32.15 | 53.54 | 1.53 | 0 |
| gpu-xlarge | 10.79 | 494.30 | 30.44 | 52.60 | 1.74 | 0 |
| mi210x1 | 8.10 | 397.21 | 40.79 | 41.65 | 1.25 | 0 |
| mi210x2 | 10.57 | 455.34 | 30.63 | 48.88 | 1.81 | 0 |
Observations (Llama-8B, 8192 ctx)
Highest Prompt TPS: gpu-large with 10.82 tokens/s
Highest Gen TPS: gpu-medium with 503.87 tokens/s
Notable errors on: H100X1 (63 errors), H100X2 (64 errors)
4.2.4 Qwen-32B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 11.49 | 116.34 | 22.60 | 13.39 | 1.91 | 14 |
| H100X2 | 17.82 | 160.93 | 16.17 | 18.63 | 2.89 | 2 |
| gpu-xlarge | 25.81 | 229.41 | 11.28 | 27.48 | 3.97 | 0 |
Observations
Highest Prompt TPS: gpu-xlarge with 25.81 tokens/s
Highest Gen TPS: gpu-xlarge with 229.41 tokens/s
Notable errors on: H100X1 (14 errors), H100X2 (2 errors)
4.3 64k-Token Context Benchmarks
4.3.1 Qwen-1.5B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 24.50 | 621.06 | 13.04 | 68.12 | 3.35 | 0 |
| H100X2 | 7.58 | 181.16 | 14.96 | 57.85 | 2.85 | 10 |
| mi210x1 | 24.49 | 609.78 | 12.68 | 68.15 | 3.29 | 0 |
| mi210x2 | 18.45 | 502.88 | 17.12 | 54.94 | 2.48 | 0 |
Observations
Highest Prompt TPS: H100X1 with 24.50 tokens/s
Highest Gen TPS: H100X1 with 621.06 tokens/s
Notable errors on: H100X2 (10 errors)
4.3.2 Llama-3.2-3B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.85 | 570.00 | 6.74 | 64.60 | 5.40 | 0 |
| H100X2 | 40.34 | 501.89 | 7.80 | 55.93 | 4.71 | 0 |
| mi210x1 | 40.51 | 595.07 | 7.85 | 65.21 | 5.22 | 0 |
| mi210x2 | 30.43 | 419.55 | 9.04 | 53.19 | 4.30 | 0 |
Observations
Highest Prompt TPS: H100X1 with 45.85 tokens/s
Highest Gen TPS: mi210x1 with 595.07 tokens/s
4.3.3 Llama-3.1-8B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 8.35 | 367.05 | 19.59 | 39.50 | 2.53 | 65 |
| H100X2 | 8.09 | 386.26 | 34.04 | 41.59 | 1.64 | 41 |
| mi210x2 | 8.65 | 451.61 | 38.24 | 47.71 | 1.26 | 0 |
Observations
Highest Prompt TPS: mi210x2 with 8.65 tokens/s
Highest Gen TPS: mi210x2 with 451.61 tokens/s
Notable errors on: H100X1 (65 errors), H100X2 (41 errors)
4.3.4 DeepSeek-R1-Distill-Qwen-32B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 | 19.32 | 176.10 | 15.99 | 19.06 | 2.91 | 2 |
Observations
Highest Prompt TPS: H100X2 with 19.32 tokens/s
Highest Gen TPS: H100X2 with 176.10 tokens/s
Notable errors on: H100X2 (2 errors)
4.4 128k-Token Context Benchmarks
4.4.1 Qwen-1.5B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 26.45 | 642.54 | 11.98 | 70.88 | 3.45 | 0 |
| H100X2 | 21.76 | 536.27 | 14.77 | 58.37 | 2.87 | 0 |
| mi210x1 | 24.13 | 618.17 | 13.10 | 68.26 | 3.22 | 0 |
| mi210x2 | 17.80 | 494.41 | 17.49 | 54.64 | 2.40 | 0 |
Observations
Highest Prompt TPS: H100X1 with 26.45 tokens/s
Highest Gen TPS: H100X1 with 642.54 tokens/s
4.4.2 Llama-3.2-3B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.53 | 574.22 | 6.84 | 64.60 | 5.30 | 0 |
| H100X2 | 30.05 | 451.14 | 9.50 | 55.15 | 4.41 | 0 |
| mi210x1 | 44.17 | 574.07 | 7.08 | 64.16 | 5.30 | 0 |
| mi210x2 | 32.06 | 444.00 | 9.42 | 51.15 | 4.19 | 0 |
Observations
Highest Prompt TPS: H100X1 with 45.53 tokens/s
Highest Gen TPS: H100X1 with 574.22 tokens/s
4.4.3 Llama-3.1-8B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 7.10 | 360.40 | 27.75 | 40.10 | 2.00 | 80 |
| H100X2 | 8.38 | 381.71 | 28.39 | 41.35 | 1.95 | 49 |
| mi210x2 | 10.08 | 467.43 | 32.70 | 49.46 | 1.62 | 0 |
Observations
Highest Prompt TPS: mi210x2 with 10.08 tokens/s
Highest Gen TPS: mi210x2 with 467.43 tokens/s
Notable errors on: H100X1 (80 errors), H100X2 (49 errors)
4.4.4 DeepSeek-R1-Distill-Qwen-32B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 | 18.96 | 168.05 | 16.02 | 18.98 | 2.90 | 1 |
Observations
Highest Prompt TPS: H100X2 with 18.96 tokens/s
Highest Gen TPS: H100X2 with 168.05 tokens/s
Notable errors on: H100X2 (1 error)
5. Overview of Concurrency Strategies
Concurrency significantly impacts overall system throughput: per-request tokens per second (TPS) drops as concurrency rises, while total system throughput increases thanks to vLLM's PagedAttention memory management and continuous batching. Concurrency can therefore be tuned to trade off total system throughput, per-request TPS, and error rates. The table below contrasts the DeepSeek-R1-Distill-Qwen-32B deployment on H100X2 at a 128k context length (the 10-concurrent-request run from Section 4.4.4) with the same deployment driven one request at a time.
| Preset (Concurrency) | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 (10 concurrent) | 18.96 | 168.05 | 16.02 | 18.98 | 2.90 | 1 |
| H100X2 (1 concurrent) | 3.62 | 29.36 | 10.20 | 29.45 | 3.62 | 0 |
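For reference, the concurrency knob amounts to limiting how many requests are in flight at once. Below is a minimal load-generation sketch with an assumed payload shape matching the /v1/completions calls in Section 2.3; the real benchmark script additionally records per-request latencies and error counts.

```python
# Minimal sketch of concurrency-limited load generation against /v1/completions
# (assumed payload shape; latency and error tracking omitted for brevity).
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"
PROMPT = "Let's explore some architecture patterns for microservices"

async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore) -> None:
    payload = {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "prompt": PROMPT,
        "max_tokens": 2048,
        "temperature": 0.7,
    }
    async with sem:  # at most `concurrency` requests in flight at once
        async with session.post(URL, json=payload) as resp:
            await resp.json()

async def run(total: int = 100, concurrency: int = 10) -> None:
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_request(session, sem) for _ in range(total)))

if __name__ == "__main__":
    asyncio.run(run(total=100, concurrency=10))  # concurrency=1 reproduces the second row
```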
Important notes
The script uses a short, fixed prompt, so generation dominates each request and generation throughput looks correspondingly high.
It issues batches of concurrent requests back-to-back, which keeps GPU utilization high.
The throughput figures should be read in the context of the powerful hardware used.
You can squeeze considerably more out of these presets by running multiple model instances on the same GPUs, as sketched below.
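One way to do this (a sketch under assumed settings, not something exercised in the benchmarks above) is to start several server processes on the same GPU and cap each one's VRAM share with vLLM's --gpu-memory-utilization flag:

```python
# Sketch: two model instances sharing one GPU by splitting its VRAM between
# them. Ports and the 0.45 memory fraction are illustrative assumptions.
import subprocess

def start_instance(port: int, memory_fraction: float) -> subprocess.Popen:
    return subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", f"--port={port}",
        "--model=meta-llama/Llama-3.2-3B-Instruct",
        "--dtype=half", "--max-model-len=2048",
        f"--gpu-memory-utilization={memory_fraction}",  # share of VRAM this instance may claim
    ])

servers = [start_instance(8000, 0.45), start_instance(8001, 0.45)]
```

A simple load balancer, or client-side round-robin over the two ports, can then spread request batches across the instances.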