Multi-GPU Benchmarks Report
This report summarizes the benchmark methodology, environment, metrics, and the conclusions we drew from the collected data.
1. Introduction
We conducted a series of vLLM inference benchmarks to evaluate performance under different GPU presets, model sizes, and parallelization strategies (pipeline vs. tensor). We focused on:
Prompt tokens/s and generation tokens/s (throughput).
Requests in different states (running, swapped, waiting).
KV-cache usage on GPU and CPU.
Average request latencies.
Error counts (e.g., OOM or network-related failures).
These tests spanned a range of context lengths (from 2048 up to 128k tokens) and model sizes (1.5B, 3B, 8B, and 32B parameters). We also compared pipeline parallel with tensor parallel deployments to see how different parallelization strategies impacted throughput.
2. Environment & Setup
2.1 GPU Presets
We used the following GPU presets, each with distinct CPU, memory, and GPU counts/types:
| Preset | vCPUs | Memory | VRAM | GPU Count | GPU Type |
|---|---|---|---|---|---|
| gpu-small | 30.0 | 63.0 GB | 16 GB | 1 | NVIDIA V100 (PCIe 16GB) |
| gpu-medium | 60.0 | 126.0 GB | 32 GB | 2 | NVIDIA V100 (PCIe 16GB) |
| gpu-large | 120.0 | 252.0 GB | 64 GB | 4 | NVIDIA V100 (PCIe 16GB) |
| gpu-xlarge | 120.0 | 504.0 GB | 128 GB | 8 | NVIDIA V100 (PCIe 16GB) |
| mi210x1 | 15 | 65.0 GB | 64 GB | 1 | AMD MI210 |
| mi210x2 | 30 | 130.0 GB | 128 GB | 2 | AMD MI210 |
| H100X1 | 63.0 | 265.0 GB | 80 GB | 1 | NVIDIA H100 (PCIe 80GB) |
| H100X2 | 126.0 | 530.0 GB | 160 GB | 2 | NVIDIA H100 (PCIe 80GB) |
Each preset was deployed via a Helm-based “apolo” flow.
2.2 Models
We tested multiple Hugging Face models:
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (~1.5B params)
meta-llama/Llama-3.2-3B-Instruct (~3B params)
meta-llama/Llama-3.1-8B-Instruct (~8B params)
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B (~32B params)
We then varied context length from 2048 up to 128k tokens to measure how throughput, memory usage, and latencies scaled with input size.
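Memory pressure at long context lengths is dominated by the KV cache rather than the model weights. As a rough, illustrative calculation (not produced by the benchmark script), the sketch below estimates KV-cache size per full-length sequence for an 8B-class model; the layer count, KV-head count, and head dimension are assumptions taken from the published Llama-3.1-8B architecture.

```python
# Back-of-the-envelope KV-cache sizing (illustrative only; the architecture
# values below are assumptions based on the public Llama-3.1-8B config).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 65536, 131072):
    gib = per_token * ctx / 1024**3
    print(f"max-model-len={ctx:>6}: ~{gib:5.2f} GiB of KV cache per full-length sequence")
```

At half precision this demand grows linearly with max-model-len, which helps explain why the long-context runs appear only on the presets with the most VRAM per GPU.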
2.3 Benchmark Script & Methodology
A custom Python script:
Deploys each (preset, model) combination with the following arguments (the model, context length, and tensor-parallel size vary per run):
  --host=0.0.0.0 --port=8000 --model=meta-llama/Llama-3.1-8B-Instruct --code-revision=main --tokenizer= --tensor-parallel-size=<number_of_gpus> --dtype=half --max-model-len=<context_length> --enforce-eager --trust-remote-code
  The context length changes with each run, and tensor-parallel-size is set to the number of GPUs available on the preset.
Waits until the endpoint is ready (/v1/models returns 200).
Sends load to the /v1/completions endpoint: 100 total requests, issued 10 at a time (Section 5 also compares this against a lower-concurrency run).
Uses a fixed prompt: "Let's explore some architecture patterns for microservices".
Configures max_tokens=512/2048 and temperature=0.7.
Polls /metrics every second (see the sketch below) for:
  prompt_tokens_total and generation_tokens_total, used to compute tokens/s
  num_requests_running, num_requests_swapped, num_requests_waiting
  gpu_cache_usage_perc, cpu_cache_usage_perc
Tracks per-request latencies and errors.
Continues polling until all requests are finalized (none pending, running, or swapped).
Writes aggregated metrics (averages) into a CSV file.
Generates bar charts for each metric.
We repeated these steps for pipeline parallel vs. tensor parallel where relevant.
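For reference, a minimal sketch of the polling loop described above is shown below. The endpoint and metric names match the methodology; the helper names, parsing, and error handling are illustrative, not the exact script.

```python
# Minimal sketch of the metrics-polling loop (illustrative: helper names,
# parsing, and error handling are simplified relative to the actual script).
import time
import requests

BASE_URL = "http://localhost:8000"  # host/port from the deploy arguments above

# Metric names as listed in the methodology; vLLM exposes them on /metrics
# with a "vllm:" prefix, so we match on the suffix.
COUNTERS = ("prompt_tokens_total", "generation_tokens_total")
GAUGES = ("num_requests_running", "num_requests_swapped", "num_requests_waiting",
          "gpu_cache_usage_perc", "cpu_cache_usage_perc")

def scrape() -> dict:
    """Parse the Prometheus-format /metrics page into {metric_suffix: value}."""
    out = {}
    for line in requests.get(f"{BASE_URL}/metrics", timeout=5).text.splitlines():
        if line.startswith("#") or " " not in line:
            continue
        name, value = line.rsplit(" ", 1)
        name = name.split("{")[0]
        for suffix in COUNTERS + GAUGES:
            if name.endswith(suffix):
                out[suffix] = float(value)
    return out

def poll(interval: float = 1.0) -> None:
    """Poll once per second; derive tokens/s from counter deltas.

    Assumes the load generator has already started sending requests.
    """
    prev, prev_t = scrape(), time.time()
    while True:
        time.sleep(interval)
        cur, now = scrape(), time.time()
        dt = now - prev_t
        prompt_tps = (cur["prompt_tokens_total"] - prev["prompt_tokens_total"]) / dt
        gen_tps = (cur["generation_tokens_total"] - prev["generation_tokens_total"]) / dt
        in_flight = sum(cur[g] for g in GAUGES[:3])  # running + swapped + waiting
        print(f"prompt {prompt_tps:7.1f} tok/s | gen {gen_tps:7.1f} tok/s | "
              f"in-flight {in_flight:.0f}")
        if in_flight == 0:
            break  # all requests finalized
        prev, prev_t = cur, now
```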
3. Overview of Parallel Strategies
Pipeline Parallel: Splits the model layers into stages across multiple GPUs so that each GPU processes a portion of layers in sequence.
Tensor Parallel: Splits tensors (e.g., weight matrices) across multiple GPUs in a more fine-grained way so the same layers are effectively distributed among GPUs.
In general, tensor parallelism is often more efficient for large or similarly sized GPUs, whereas pipeline parallelism can help in some multi-GPU cases but may introduce significant inter-stage waiting time and memory overhead, especially when the pipeline stages carry unequal computational loads.
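In vLLM the two strategies are selected through different engine arguments. The sketch below shows how a deployment could be parameterized either way; this is an assumed minimal launch wrapper, whereas the actual runs went through the Helm-based apolo flow with the arguments listed in Section 2.3.

```python
# Minimal sketch of selecting the parallelization strategy at deploy time
# (assumed launch wrapper; only one strategy is used per deployment).
import subprocess

COMMON_ARGS = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--host=0.0.0.0", "--port=8000",
    "--model=meta-llama/Llama-3.2-3B-Instruct",
    "--dtype=half", "--max-model-len=2048",
    "--enforce-eager", "--trust-remote-code",
]

def launch(strategy: str, num_gpus: int) -> subprocess.Popen:
    """Start one vLLM server that splits the model across `num_gpus` GPUs."""
    if strategy == "tensor":
        extra = [f"--tensor-parallel-size={num_gpus}"]    # shard weight matrices
    else:
        extra = [f"--pipeline-parallel-size={num_gpus}"]  # split layers into stages
    return subprocess.Popen(COMMON_ARGS + extra)

# Example: the gpu-medium preset exposes 2 GPUs.
server = launch("tensor", num_gpus=2)
```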
Below is a side-by-side comparison of pipeline parallel vs. tensor parallel for Llama-3B at a 2048-token context length, where we have overlapping data. The table shows prompt and generation TPS under each strategy, along with the speedup of tensor over pipeline (computed as (tensor - pipeline) / pipeline). Observations follow the table.
Llama-3.2-3B-Instruct - 2048 context length @10req concurrency for 100 total requests
| Preset | Prompt TPS (Pipeline) | Prompt TPS (Tensor) | Gen TPS (Pipeline) | Gen TPS (Tensor) | Prompt Speedup | Gen Speedup |
|---|---|---|---|---|---|---|
| H100X1 | 47.64 | 47.35 | 564.79 | 573.80 | -0.6% | 1.6% |
| H100X2 | 36.14 | 40.98 | 449.28 | 500.43 | 13.4% | 11.4% |
| gpu-medium | 34.17 | 43.03 | 416.18 | 521.92 | 25.9% | 25.4% |
| gpu-small | 47.22 | 47.80 | 581.37 | 576.27 | 1.2% | -0.9% |
| mi210x1 | 47.79 | 46.29 | 581.30 | 574.77 | -3.1% | -1.1% |
| mi210x2 | 34.20 | 40.59 | 425.94 | 489.30 | 18.7% | 14.9% |
Observations (Llama-3B, 2048 ctx, Pipeline vs. Tensor)
H100X1: ~-0.6% prompt speedup, ~1.6% gen speedup.
H100X2: ~13.4% prompt speedup, ~11.4% gen speedup.
gpu-medium: ~25.9% prompt speedup, ~25.4% gen speedup.
gpu-small: ~1.2% prompt speedup, ~-0.9% gen speedup.
mi210x1: ~-3.1% prompt speedup, ~-1.1% gen speedup.
mi210x2: ~18.7% prompt speedup, ~14.9% gen speedup.
Overall, we notice that:
For small models that fit on a single GPU, splitting them across multiple GPUs doesn't help; it actually slows down inference.
On multi-GPU setups, the tensor parallel split is faster than the pipeline parallel split, which is expected.
4. Results & Observations
Below we break down the runs by context length and highlight model performance on specific configurations.
4.1 2048-Token Context Benchmarks
Below we cover four models: Qwen-1.5B, Llama-3B, Llama-8B, and Qwen-32B.
4.1.1 Qwen-1.5B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.74 | 649.78 | 6.95 | 71.64 | 5.10 | 0 |
| H100X2 | 37.44 | 521.10 | 8.47 | 57.64 | 4.33 | 0 |
| gpu-large | 38.09 | 535.91 | 8.35 | 58.93 | 4.43 | 0 |
| gpu-medium | 39.73 | 552.69 | 7.96 | 61.43 | 4.68 | 0 |
| gpu-small | 46.63 | 676.10 | 6.79 | 74.31 | 5.21 | 0 |
| mi210x1 | 43.73 | 618.48 | 7.24 | 68.46 | 5.06 | 0 |
| mi210x2 | 9.35 | 125.16 | 8.84 | 55.09 | 4.13 | 10 |
Observations (Qwen-1.5B, 2048 ctx):
Highest Prompt TPS: gpu-small with 46.63 tokens/s
Highest Gen TPS: gpu-small with 676.10 tokens/s
Notable errors on: mi210x2 (10 errors)
4.1.2 Llama-3B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 9.12 | 105.12 | 7.95 | 64.26 | 5.36 | 10 |
| H100X2 | 40.23 | 500.80 | 7.83 | 55.76 | 4.55 | 0 |
| gpu-large | 41.73 | 510.79 | 7.37 | 58.00 | 4.84 | 0 |
| gpu-medium | 43.03 | 532.96 | 7.24 | 59.75 | 4.91 | 0 |
| gpu-small | 47.22 | 576.56 | 6.49 | 65.87 | 5.51 | 0 |
| gpu-xlarge | 42.42 | 510.52 | 7.31 | 57.83 | 4.91 | 0 |
| mi210x1 | 45.35 | 570.08 | 6.73 | 65.08 | 5.31 | 0 |
| mi210x2 | 38.17 | 475.52 | 8.30 | 52.58 | 4.30 | 0 |
Observations (Llama-3B, 2048 ctx)
Highest Prompt TPS: gpu-small with 47.22 tokens/s
Highest Gen TPS: gpu-small with 576.56 tokens/s
Notable errors on: H100X1 (10 errors)
4.1.3 Llama-8B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 28.07 | 399.07 | 11.60 | 42.91 | 3.04 | 0 |
| H100X2 | 5.08 | 66.26 | 11.01 | 45.72 | 3.19 | 20 |
| gpu-large | 34.24 | 492.47 | 9.52 | 52.81 | 3.69 | 0 |
| gpu-medium | 34.56 | 491.24 | 9.17 | 54.03 | 3.84 | 1 |
| gpu-xlarge | 34.27 | 488.51 | 9.41 | 52.93 | 3.75 | 0 |
| mi210x1 | 28.09 | 407.72 | 11.66 | 43.22 | 3.02 | 0 |
| mi210x2 | 31.46 | 444.93 | 10.17 | 48.53 | 3.47 | 0 |
Observations (Llama-8B, 2048 ctx):
Highest Prompt TPS: gpu-medium with 34.56 tokens/s
Highest Gen TPS: gpu-large with 492.47 tokens/s
Notable errors on: gpu-medium (1 error), H100X2 (20 errors)
4.1.4 Qwen-32B - 2048 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 14.44 | 123.72 | 22.27 | 13.42 | 1.95 | 0 |
| H100X2 | 20.39 | 175.25 | 15.88 | 18.87 | 2.70 | 0 |
| gpu-xlarge | 8.20 | 73.99 | 12.06 | 27.64 | 3.53 | 10 |
Observations (Qwen-32B, 2048 ctx):
Highest Prompt TPS: H100X2 with 20.39 tokens/s
Highest Gen TPS: H100X2 with 175.25 tokens/s
Notable errors on: gpu-xlarge (10 errors)
4.2 8192-Token Context Benchmarks
4.2.1 DeepSeek-R1-Distill-Qwen-1.5B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 23.80 | 605.55 | 12.79 | 69.47 | 3.21 | 0 |
| H100X2 | 20.07 | 517.68 | 15.33 | 58.55 | 2.89 | 0 |
| gpu-large | 20.76 | 531.39 | 15.24 | 58.74 | 2.61 | 0 |
| gpu-medium | 22.98 | 551.03 | 13.77 | 61.23 | 2.97 | 0 |
| gpu-small | 27.97 | 666.18 | 11.56 | 71.97 | 3.87 | 0 |
| mi210x1 | 24.30 | 602.95 | 12.67 | 68.55 | 3.41 | 0 |
| mi210x2 | 19.05 | 469.01 | 17.02 | 50.65 | 2.56 | 0 |
Observations
Highest Prompt TPS: gpu-small with 27.97 tokens/s
Highest Gen TPS: gpu-small with 666.18 tokens/s
4.2.2 Llama-3.2-3B-Instruct - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 37.67 | 522.51 | 8.28 | 58.67 | 4.70 | 0 |
| H100X2 | 38.42 | 496.42 | 8.23 | 55.05 | 4.63 | 0 |
| gpu-large | 40.86 | 523.26 | 7.82 | 57.39 | 4.85 | 0 |
| gpu-medium | 38.88 | 531.07 | 8.23 | 58.16 | 4.70 | 0 |
| gpu-small | 38.51 | 521.73 | 7.22 | 65.37 | 5.23 | 0 |
| gpu-xlarge | 36.69 | 512.53 | 8.55 | 56.85 | 4.64 | 0 |
| mi210x1 | 37.78 | 545.31 | 7.66 | 64.91 | 5.36 | 1 |
| mi210x2 | 29.98 | 421.53 | 9.11 | 53.83 | 4.29 | 0 |
Observations
Highest Prompt TPS: gpu-large with 40.86 tokens/s
Highest Gen TPS: mi210x1 with 545.31 tokens/s
Notable errors on: mi210x1 (1 error)
4.2.3 Llama-8B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 8.27 | 369.33 | 23.56 | 39.56 | 2.18 | 63 |
| H100X2 | 7.83 | 382.46 | 26.82 | 41.91 | 2.05 | 64 |
| gpu-large | 10.82 | 493.55 | 30.12 | 52.87 | 1.78 | 0 |
| gpu-medium | 10.18 | 503.87 | 32.15 | 53.54 | 1.53 | 0 |
| gpu-xlarge | 10.79 | 494.30 | 30.44 | 52.60 | 1.74 | 0 |
| mi210x1 | 8.10 | 397.21 | 40.79 | 41.65 | 1.25 | 0 |
| mi210x2 | 10.57 | 455.34 | 30.63 | 48.88 | 1.81 | 0 |
Observations (Llama-8B, 8192 ctx)
Highest Prompt TPS: gpu-large with 10.82 tokens/s
Highest Gen TPS: gpu-medium with 503.87 tokens/s
Notable errors on: H100X1 (63 errors), H100X2 (64 errors)
4.2.4 Qwen-32B - 8192 context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 11.49 | 116.34 | 22.60 | 13.39 | 1.91 | 14 |
| H100X2 | 17.82 | 160.93 | 16.17 | 18.63 | 2.89 | 2 |
| gpu-xlarge | 25.81 | 229.41 | 11.28 | 27.48 | 3.97 | 0 |
Observations
Highest Prompt TPS: gpu-xlarge with 25.81 tokens/s
Highest Gen TPS: gpu-xlarge with 229.41 tokens/s
Notable errors on: H100X1 (14 errors), H100X2 (2 errors)
4.3 64k-Token Context Benchmarks
4.3.1 Qwen-1.5B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 24.50 | 621.06 | 13.04 | 68.12 | 3.35 | 0 |
| H100X2 | 7.58 | 181.16 | 14.96 | 57.85 | 2.85 | 10 |
| mi210x1 | 24.49 | 609.78 | 12.68 | 68.15 | 3.29 | 0 |
| mi210x2 | 18.45 | 502.88 | 17.12 | 54.94 | 2.48 | 0 |
Observations
Highest Prompt TPS: H100X1 with 24.50 tokens/s
Highest Gen TPS: H100X1 with 621.06 tokens/s
Notable errors on: H100X2 (10 errors)
4.3.2 Llama-3.2-3B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.85 | 570.00 | 6.74 | 64.60 | 5.40 | 0 |
| H100X2 | 40.34 | 501.89 | 7.80 | 55.93 | 4.71 | 0 |
| mi210x1 | 40.51 | 595.07 | 7.85 | 65.21 | 5.22 | 0 |
| mi210x2 | 30.43 | 419.55 | 9.04 | 53.19 | 4.30 | 0 |
Observations
Highest Prompt TPS: H100X1 with 45.85 tokens/s
Highest Gen TPS: mi210x1 with 595.07 tokens/s
4.3.3 Llama-3.1-8B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 8.35 | 367.05 | 19.59 | 39.50 | 2.53 | 65 |
| H100X2 | 8.09 | 386.26 | 34.04 | 41.59 | 1.64 | 41 |
| mi210x2 | 8.65 | 451.61 | 38.24 | 47.71 | 1.26 | 0 |
Observations
Highest Prompt TPS: mi210x2 with 8.65 tokens/s
Highest Gen TPS: mi210x2 with 451.61 tokens/s
Notable errors on: H100X1 (65 errors), H100X2 (41 errors)
4.3.4 DeepSeek-R1-Distill-Qwen-32B - 64k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 | 19.32 | 176.10 | 15.99 | 19.06 | 2.91 | 2 |
Observations
Highest Prompt TPS: H100X2 with 19.32 tokens/s
Highest Gen TPS: H100X2 with 176.10 tokens/s
Notable errors on: H100X2 (2 errors)
4.4 128k-Token Context Benchmarks
4.4.1 Qwen-1.5B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 26.45 | 642.54 | 11.98 | 70.88 | 3.45 | 0 |
| H100X2 | 21.76 | 536.27 | 14.77 | 58.37 | 2.87 | 0 |
| mi210x1 | 24.13 | 618.17 | 13.10 | 68.26 | 3.22 | 0 |
| mi210x2 | 17.80 | 494.41 | 17.49 | 54.64 | 2.40 | 0 |
Observations
Highest Prompt TPS: H100X1 with 26.45 tokens/s
Highest Gen TPS: H100X1 with 642.54 tokens/s
4.4.2 Llama-3.2-3B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 45.53 | 574.22 | 6.84 | 64.60 | 5.30 | 0 |
| H100X2 | 30.05 | 451.14 | 9.50 | 55.15 | 4.41 | 0 |
| mi210x1 | 44.17 | 574.07 | 7.08 | 64.16 | 5.30 | 0 |
| mi210x2 | 32.06 | 444.00 | 9.42 | 51.15 | 4.19 | 0 |
Observations
Highest Prompt TPS: H100X1 with 45.53 tokens/s
Highest Gen TPS: H100X1 with 574.22 tokens/s
4.4.3 Llama-3.1-8B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X1 | 7.10 | 360.40 | 27.75 | 40.10 | 2.00 | 80 |
| H100X2 | 8.38 | 381.71 | 28.39 | 41.35 | 1.95 | 49 |
| mi210x2 | 10.08 | 467.43 | 32.70 | 49.46 | 1.62 | 0 |
Observations
Highest Prompt TPS: mi210x2 with 10.08 tokens/s
Highest Gen TPS: mi210x2 with 467.43 tokens/s
Notable errors on: H100X1 (80 errors), H100X2 (49 errors)
4.4.4 DeepSeek-R1-Distill-Qwen-32B - 128k context length @10req at a time for 100 total requests
| Preset | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 | 18.96 | 168.05 | 16.02 | 18.98 | 2.90 | 1 |
Observations
Highest Prompt TPS: H100X2 with 18.96 tokens/s
Highest Gen TPS: H100X2 with 168.05 tokens/s
Notable errors on: H100X2 (1 error)
5. Overview of Concurrency Strategies
Concurrency significantly impacts overall system throughput: per-request tokens per second (TPS) drops as concurrency rises, while total system throughput increases thanks to vLLM's PagedAttention memory management and continuous batching. Concurrency can therefore be tuned to trade off total system throughput, per-request TPS, and error rates. The table below contrasts the DeepSeek-R1-Distill-Qwen-32B deployment on H100X2 at a 128k context length (the 10-concurrent-request run from Section 4.4.4) with the same deployment driven one request at a time.
| Preset (Concurrency) | Prompt TPS | Gen TPS | Avg Latency (s) | Gen TPS per Request | Prompt TPS per Request | Errors |
|---|---|---|---|---|---|---|
| H100X2 (10 concurrent) | 18.96 | 168.05 | 16.02 | 18.98 | 2.90 | 1 |
| H100X2 (1 concurrent) | 3.62 | 29.36 | 10.20 | 29.45 | 3.62 | 0 |
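For reference, the concurrency knob amounts to limiting how many requests are in flight at once. Below is a minimal load-generation sketch with an assumed payload shape matching the /v1/completions calls in Section 2.3; the real benchmark script additionally records per-request latencies and error counts.

```python
# Minimal sketch of concurrency-limited load generation against /v1/completions
# (assumed payload shape; latency and error tracking omitted for brevity).
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"
PROMPT = "Let's explore some architecture patterns for microservices"

async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore) -> None:
    payload = {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "prompt": PROMPT,
        "max_tokens": 2048,
        "temperature": 0.7,
    }
    async with sem:  # at most `concurrency` requests in flight at once
        async with session.post(URL, json=payload) as resp:
            await resp.json()

async def run(total: int = 100, concurrency: int = 10) -> None:
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(one_request(session, sem) for _ in range(total)))

if __name__ == "__main__":
    asyncio.run(run(total=100, concurrency=10))  # concurrency=1 reproduces the second row
```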
Important notes
The script uses a short, fixed prompt, so generation dominates each request and generation throughput looks correspondingly high.
It issues batches of concurrent requests back-to-back, which keeps GPU utilization high.
The throughput figures should be read in the context of the powerful hardware used.
You can squeeze considerably more out of these presets by running multiple model instances on the same GPUs, as sketched below.
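One way to do this (a sketch under assumed settings, not something exercised in the benchmarks above) is to start several server processes on the same GPU and cap each one's VRAM share with vLLM's --gpu-memory-utilization flag:

```python
# Sketch: two model instances sharing one GPU by splitting its VRAM between
# them. Ports and the 0.45 memory fraction are illustrative assumptions.
import subprocess

def start_instance(port: int, memory_fraction: float) -> subprocess.Popen:
    return subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--host=0.0.0.0", f"--port={port}",
        "--model=meta-llama/Llama-3.2-3B-Instruct",
        "--dtype=half", "--max-model-len=2048",
        f"--gpu-memory-utilization={memory_fraction}",  # share of VRAM this instance may claim
    ])

servers = [start_instance(8000, 0.45), start_instance(8001, 0.45)]
```

A simple load balancer, or client-side round-robin over the two ports, can then spread request batches across the instances.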