GPU & Workload Guide

NVIDIA GPUs for ML Inference
GPU Types
B200
Architecture: Blackwell
Released: 2024
Memory: 192GB HBM3e
HBM3e is the latest generation of high-bandwidth memory, stacked on the GPU package for maximum speed, and the most expensive memory type. It delivers 8 TB/s on the B200.
Bandwidth: 8 TB/s
Precision: FP4 (4-bit float)
FP4 is an ultra-low-precision, Blackwell-exclusive format for extreme performance. Weights are 4x smaller than FP16, enabling very high throughput for quantized inference with specialized model formats.
TFLOPS: 9,000+ (FP4)
Next-gen flagship. Massive memory for trillion-parameter models. 2.5x H100 performance.
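The precision column above maps directly to bytes per parameter, and therefore to how much GPU memory a model's weights need. A minimal sketch in Python (weights only; KV cache, activations, and runtime overhead are ignored, and the byte counts are the usual rule-of-thumb values):

# Rough weights-only memory footprint at different precisions.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "FP8": 1.0, "INT8": 1.0, "FP4": 0.5}

def weight_footprint_gb(n_params: float, precision: str) -> float:
    """Approximate GB needed just to hold the model weights."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP16", "FP8", "FP4"):
    print(f"70B @ {precision}: {weight_footprint_gb(70e9, precision):.0f} GB")
# 70B @ FP16: 140 GB; @ FP8: 70 GB; @ FP4: 35 GB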
B100
Architecture: Blackwell
Released: 2024
Memory: 192GB HBM3e
Bandwidth: 8 TB/s
Precision: FP4 (4-bit float)
TFLOPS: 7,000+ (FP4)
Blackwell mainstream. Same memory as the B200 with slightly lower compute. Best price/performance for the largest models.
H200
Architecture: Hopper
Released: Late 2023
Memory: 141GB HBM3e
Bandwidth: 4.8 TB/s (1.4x the H100's HBM3)
Precision: FP8 (8-bit float)
FP8 is a low-precision format for fast inference, introduced with Hopper and also supported on Ada Lovelace and Blackwell. Weights are 2x smaller than FP16, which makes it ideal for quantized LLM inference with minimal quality loss; it is the most common format for modern inference.
TFLOPS: 3,958 (FP8)
An H100 with more memory. Better for 70B-405B models.
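Memory bandwidth matters so much because autoregressive decoding is usually bandwidth-bound: each generated token streams roughly all of the model weights through the GPU. A back-of-the-envelope ceiling, a sketch that assumes weight traffic dominates and ignores KV cache, batching, and compute:

def decode_ceiling_tokens_per_sec(n_params: float, bytes_per_param: float,
                                  bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: bandwidth / bytes read per token."""
    return bandwidth_gb_s * 1e9 / (n_params * bytes_per_param)

# 70B-class model quantized to FP8 (~70 GB of weights)
print(f"H100 (3,350 GB/s): ~{decode_ceiling_tokens_per_sec(70e9, 1.0, 3350):.0f} tok/s")
print(f"H200 (4,800 GB/s): ~{decode_ceiling_tokens_per_sec(70e9, 1.0, 4800):.0f} tok/s")

Real throughput lands well below these ceilings, but the H200's 1.4x bandwidth advantage carries over roughly proportionally for memory-bound decoding.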
H100
Architecture: Hopper
Released: Sep 2022
Memory: 80GB HBM3
HBM3 is third-generation high-bandwidth memory, stacked directly on the GPU package for ultra-high speeds. It delivers 3.35 TB/s on the H100, versus 1.6-2.0 TB/s for the A100's HBM2e.
Bandwidth: 3.35 TB/s
Precision: FP8 (8-bit float)
TFLOPS: 3,958 (FP8)
Hopper flagship for the largest models and fastest inference: Llama 405B and GPT-4-scale models.
A100
Architecture: Ampere
Released: May 2020
Memory: 40GB / 80GB HBM2e
HBM2e is the previous generation of high-end stacked memory, before HBM3, delivering 1.6-2.0 TB/s depending on the variant. Still excellent for most workloads.
Bandwidth: 1.56 / 2.04 TB/s
Precision: FP16 (16-bit float)
FP16 (half precision) is the standard format for modern inference: a good balance of speed and quality, 2x smaller than FP32, and widely supported across all frameworks.
TFLOPS: 312 (FP16)
Workhorse for LLMs: Llama 70B, Stable Diffusion XL, and most production workloads.
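Weights are only part of the memory budget: serving also needs room for the KV cache, which grows with context length and batch size. A rough estimate, assuming Llama-70B-like shapes (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache); check the actual model config for exact values:

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(f"{kv_cache_gb(80, 8, 128, seq_len=4096, batch=8):.1f} GB")  # ~10.7 GB

On an 80GB A100 that cache has to fit alongside the (typically quantized or sharded) weights, which is why context length and batch size drive GPU choice as much as parameter count does.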
L40S
Architecture: Ada Lovelace
Released: Oct 2023
Memory: 48GB GDDR6
GDDR6 is standard graphics memory: much cheaper than HBM but slower, delivering 300-864 GB/s depending on configuration. A good balance of cost and performance for most workloads.
Bandwidth: 864 GB/s
Precision: FP8 (8-bit float)
TFLOPS: 733 (FP8)
High-memory inference: quantized Llama 70B on a single GPU, large context windows, vision-language models.
L40
Architecture: Ada Lovelace
Released: Oct 2022
Memory: 48GB GDDR6
Bandwidth: 864 GB/s
Precision: FP8 (8-bit float)
TFLOPS: 362 (FP8)
Large memory capacity: multi-modal models, large-batch inference, video generation.
L4
Architecture: Ada Lovelace
Released: Mar 2023
Memory: 24GB GDDR6
Bandwidth: 300 GB/s
Precision: INT8 (8-bit integer)
INT8 integer quantization is very efficient for inference: 4x smaller than FP32 and common for cost-effective deployments, with some quality trade-off versus floating point.
TOPS: 242 (INT8)
Cost-effective inference: Llama 7B/13B, Whisper, smaller vision models, embeddings.
A10G
Architecture: Ampere
Released: 2021
Memory: 24GB GDDR6
Bandwidth: 600 GB/s
Precision: FP16 (16-bit float)
TFLOPS: 125 (FP16)
Graphics-heavy workloads: ControlNet, image generation, video processing.
T4
Architecture: Turing
Released: Sep 2018
Memory: 16GB GDDR6
Bandwidth: 300 GB/s
Precision: INT8 (8-bit integer)
TOPS: 130 (INT8)
Entry-level inference: small language models, embeddings, classification, lightweight tasks.
V100
Architecture: Volta
Released: May 2017
Memory: 16GB / 32GB HBM2
HBM2 is an older generation of stacked memory: slower than HBM2e/HBM3 but still faster than GDDR6, delivering 900 GB/s on the V100.
Bandwidth: 900 GB/s
Precision: FP16 (16-bit float)
TFLOPS: 125 (FP16)
Older generation, but still viable for many inference workloads at lower cost.
Workload Patterns
Single GPU
Most common inference pattern. Model fits entirely in one GPU's memory. Simple, fast, cost-effective.
Examples: Llama 7B/13B, SDXL, Whisper, CLIP, CodeLlama 13B, Mistral 7B
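A quick feasibility check for this pattern, a sketch that combines a weights estimate with a KV-cache allowance and leaves a safety margin for activations and framework overhead (the 0.9 usable fraction is an assumption, not a fixed rule):

def fits_single_gpu(weight_gb: float, kv_cache_gb: float,
                    gpu_mem_gb: float, usable_fraction: float = 0.9) -> bool:
    """True if weights plus KV cache fit within the usable part of GPU memory."""
    return weight_gb + kv_cache_gb <= gpu_mem_gb * usable_fraction

# Mistral 7B in FP16 (~14 GB weights) with a few GB of KV cache on a 24GB L4
print(fits_single_gpu(weight_gb=14, kv_cache_gb=4, gpu_mem_gb=24))  # True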
Multi-GPU (2x)
Tensor parallelism splits the model across two GPUs. Needed when the model exceeds a single GPU's memory. Requires a fast interconnect.
Examples: Llama 70B, Falcon 40B, MPT 30B, large vision-language models
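In practice tensor parallelism is usually a serving-engine option rather than hand-written sharding. A sketch using vLLM; the model ID and memory setting are illustrative, and any engine with tensor-parallel support works along the same lines:

from vllm import LLM, SamplingParams

# Shard a 70B-class model across 2 GPUs with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative checkpoint
    tensor_parallel_size=2,        # split each layer's weights across 2 GPUs
    gpu_memory_utilization=0.90,   # leave headroom for activations and KV cache
)
outputs = llm.generate(["Briefly explain NVLink."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)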
Multi-GPU (4x+)
Large-scale tensor parallelism for massive models. 4-8 GPUs with NVLink. High complexity, high cost, highest capability.
Examples: Llama 405B, GPT-4-class models, Mixtral 8x22B
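Before picking a pattern, a rough GPU-count estimate helps: divide the total memory need by per-GPU capacity. A sketch with an assumed 20% headroom factor for cache and activations:

import math

def min_gpus(n_params: float, bytes_per_param: float, gpu_mem_gb: float,
             headroom: float = 1.2) -> int:
    """Minimum GPU count to hold the weights plus ~20% headroom."""
    total_gb = n_params * bytes_per_param / 1e9 * headroom
    return math.ceil(total_gb / gpu_mem_gb)

print(min_gpus(405e9, 2.0, 80))   # Llama 405B, FP16, H100 80GB  -> 13
print(min_gpus(405e9, 1.0, 141))  # Llama 405B, FP8,  H200 141GB -> 4

In practice counts are usually rounded up to a power of two (16 and 4 here) so the tensor-parallel degree divides the attention heads evenly.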
GPU Connectivity
NVLink
Bandwidth (A100): 600 GB/s
Bandwidth (H100): 900 GB/s
Latency: ~1 μs
Configuration: direct GPU-to-GPU links
When it matters: Multi-GPU inference for 70B+ models. NVLink makes tensor parallelism efficient by speeding up the per-layer activation exchanges (all-reduces) between GPUs. Critical for maintaining low latency at scale.
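The reason interconnect bandwidth dominates tensor parallelism is that every transformer layer ends in all-reduces of activations across the GPUs, so per-token communication scales with hidden size and layer count. A rough comparison, a sketch with assumed Llama-70B-like shapes (hidden size 8192, 80 layers, two all-reduces per layer, FP16 activations) that ignores per-operation latency and communication/compute overlap:

def per_token_allreduce_ms(hidden: int, n_layers: int, n_gpus: int,
                           link_gb_s: float, bytes_per_elem: int = 2,
                           allreduces_per_layer: int = 2) -> float:
    """Very rough per-token tensor-parallel communication time (single stream).
    A ring all-reduce moves about 2*(n-1)/n of the payload per GPU."""
    payload = hidden * bytes_per_elem * allreduces_per_layer * n_layers
    payload *= 2 * (n_gpus - 1) / n_gpus
    return payload / (link_gb_s * 1e9) * 1e3

print(f"NVLink (900 GB/s):   {per_token_allreduce_ms(8192, 80, 4, 900):.3f} ms/token")
print(f"PCIe Gen4 (32 GB/s): {per_token_allreduce_ms(8192, 80, 4, 32):.3f} ms/token")

On top of this, each of the ~160 all-reduces per token also pays the link latency, which is where PCIe's ~10 μs hurts far more than NVLink's ~1 μs.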
PCIe Gen4
Bandwidth (x16): 32 GB/s
Latency: ~10 μs
Configuration: through the CPU/chipset
When it matters: Fine for single-GPU inference and model loading. Becomes a bottleneck for multi-GPU inference. Acceptable for pipeline parallelism but not ideal for tensor parallelism.
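PCIe bandwidth also bounds how quickly weights can be copied to the GPU at startup. A sketch of the lower bound, assuming the weights are already in host RAM (loading from disk is often slower):

def load_time_s(weight_gb: float, link_gb_s: float) -> float:
    """Lower bound on host-to-GPU weight transfer time."""
    return weight_gb / link_gb_s

print(f"70B FP16 (140 GB) over PCIe Gen4 x16: ~{load_time_s(140, 32):.0f} s")
print(f"7B FP16 (14 GB) over PCIe Gen4 x16:   ~{load_time_s(14, 32):.1f} s")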
NVSwitch
Topology: all-to-all mesh
Bandwidth per GPU: 900 GB/s (H100)
Configuration: 8-GPU DGX systems
When it matters: 8-GPU configurations for largest models. Enables full bisection bandwidth between any GPU pair. Required for Llama 405B and similar scale. Expensive but necessary for frontier models.
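To check which links actually connect the GPUs on a given machine, nvidia-smi topo -m prints the interconnect matrix (NV# entries mean NVLink hops, while PIX/PXB/PHB/SYS mean PCIe paths of increasing distance). A small sketch that shells out to it:

import subprocess

# Print the GPU interconnect topology matrix reported by the NVIDIA driver.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True, check=True)
print(result.stdout)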