# vLLM Inference details

### Under the Hood

When you specify a multi-GPU preset (like a preset with multiple Nvidia and AMD GPUs), **LLM Inference**:

1. **Determines GPU Provider & Count**
   * AMD MI210 → `gpuProvider=amd`, 2 GPUs.
   * NVIDIA → `gpuProvider=nvidia`, 2 GPUs.
2. **Sets GPU Visibility**
   * On AMD: `HIP_VISIBLE_DEVICES=0,1`, `ROCR_VISIBLE_DEVICES=0,1` (if 2 GPUs).
   * On NVIDIA: `CUDA_VISIBLE_DEVICES=0,1`.
3. **Applies a Default Parallel** arguments (e.g. `--tensor-parallel-size=2`) if the user hasn’t already done so.

#### Environment Variables for AMD

By default, if you select a preset with AMD GPU cards, the chart’s logic sets:

* **`HIP_VISIBLE_DEVICES=0,1...`** and **`ROCR_VISIBLE_DEVICES=0,1...`**` `` ``depending on number of available GPUs: ` Tells ROCm which GPUs are accessible.
* **`TORCH_USE_HIP_DSA=1`**: Enables direct storage access for HIP.
* **`HSA_FORCE_FINE_GRAIN_PCIE=1`** & **`HSA_ENABLE_SDMA=1`**: Improves GPU ↔ Host & GPU ↔ GPU memory transfers.
* **`ROCM_DISABLE_CU_MASK=0`**: All compute units remain active.
* **`VLLM_WORKER_MULTIPROC_METHOD=spawn`**: Avoids “fork” issues on AMD.
* **`NCCL_P2P_DISABLE=0`**: By default, we assume your cluster has correct kernel parameters for GPU–GPU direct memory access. If not, you can pass `--set envAmd.NCCL_P2P_DISABLE=1` to forcibly disable P2P.

### Final Notes

* **NCCL / RCCL** logs will appear in the vLLM container logs. Look for lines referencing peer-to-peer if you see a hang.
* On AMD, if you do have persistent hangs, append `--set "envAmd.NCCL_P2P_DISABLE=1"` to your Apolo command to force fallback GPU communication.
* For the best performance, we keep ROCm version 6.2+ or 6.3+ in sync with your Docker image (`rocm/vllm-ci`)

By combining the right Apolo preset, environment variables, you can reliably run **vLLM** on multiple GPUs—be they AMD or NVIDIA —and get high token throughput for large LLMs.
