B200
Blackwell
Released: 2024
Memory
192GB HBM3e
HBM3e (Enhanced): Latest generation high bandwidth memory. 8 TB/s on B200. Stacked on the GPU package for maximum speed. Most expensive memory type.
Bandwidth
8 TB/s
FP4 TFLOPS
9,000+
FP4 (4-bit Float): Ultra-low precision for extreme performance. Blackwell-exclusive. 4x smaller than FP16. Enables massive throughput for quantized inference with specialized model formats.
Next-gen flagship. Massive memory for trillion-parameter models. 2.5x H100 performance.
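The "massive memory for trillion-parameter models" claim follows from simple arithmetic: weight footprint is roughly parameter count times bytes per parameter. A minimal Python sketch of that estimate (illustrative only; KV cache, activations, and runtime overhead are ignored):

```python
# Approximate weight memory by precision (weights only, sketch).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """GPU memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "fp8", "fp4"):
    gb = weight_memory_gb(1e12, precision)  # a 1-trillion-parameter model
    print(f"1T params @ {precision}: ~{gb:,.0f} GB of weights")
```

Even at FP4 a trillion-parameter model is roughly 500 GB of weights, so it still spans several 192 GB GPUs; the point of lower precision is how far it cuts that multiple.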
B100
Blackwell
Released: 2024
Memory
192GB HBM3e
HBM3e (Enhanced): Latest generation high bandwidth memory. 8 TB/s on B100. Stacked on the GPU package for maximum speed. Most expensive memory type.
Bandwidth
8 TB/s
FP4 TFLOPS
7,000+
FP4 (4-bit Float): Ultra-low precision for extreme performance. Blackwell-exclusive. 4x smaller than FP16. Enables massive throughput for quantized inference with specialized model formats.
Blackwell mainstream. Same memory as B200, slightly lower compute. Best price/performance for largest models.
H200
Hopper
Released: Late 2023
Memory
141GB HBM3e
HBM3e (Enhanced): Latest generation high bandwidth memory. 4.8 TB/s on H200. 1.4x faster than H100's HBM3. Stacked on the GPU package.
Bandwidth
4.8 TB/s
FP8 TFLOPS
3,958
FP8 (8-bit Float): Low precision for fast inference. Introduced with Hopper. 2x smaller than FP16. Ideal for quantized LLM inference with minimal quality loss. Most common for modern inference.
H100 with more memory. Better for 70B-405B models. 1.4x H100 bandwidth.
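The "1.4x H100 bandwidth" figure translates directly into a decode-speed ceiling when single-stream generation is memory-bound: each new token streams the weights once, so tokens/sec cannot exceed bandwidth divided by weight bytes. A rough Python sketch (the 70 GB figure assumes FP8 weights for a 70B model; KV-cache traffic, batching, and kernel efficiency are ignored):

```python
# Memory-bandwidth ceiling on single-stream decode: each generated token
# reads the full set of weights once, so throughput <= bandwidth / weights.
def decode_ceiling_tok_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    return (bandwidth_tb_s * 1e12) / (weight_gb * 1e9)

WEIGHTS_70B_FP8_GB = 70  # ~1 byte per parameter for a 70B model

print(f"H100 @ 3.35 TB/s: ~{decode_ceiling_tok_s(3.35, WEIGHTS_70B_FP8_GB):.0f} tok/s ceiling")
print(f"H200 @ 4.80 TB/s: ~{decode_ceiling_tok_s(4.80, WEIGHTS_70B_FP8_GB):.0f} tok/s ceiling")
```

The ratio of the two ceilings is exactly the bandwidth ratio, which is why the extra bandwidth matters as much as the extra capacity for large-model inference.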
H100
Hopper
Released: Sep 2022
Memory
80GB HBM3
HBM3: High bandwidth memory generation 3. 3.35 TB/s on H100. Stacked directly on the GPU package for ultra-high speeds. Roughly 2x the bandwidth of A100's HBM2e.
Bandwidth
3.35 TB/s
FP8 TFLOPS
3,958
FP8 (8-bit Float): Low precision for fast inference. Introduced with Hopper. 2x smaller than FP16. Ideal for quantized LLM inference with minimal quality loss. Most common for modern inference.
Largest models, fastest inference. Llama 405B, GPT-4 scale models. Hopper flagship.
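The TFLOPS figure is the compute side of the same back-of-the-envelope math: prompt processing (prefill) costs roughly 2 FLOPs per parameter per token for a dense transformer. A hedged Python sketch (the 40% sustained-efficiency factor and the 70B / 8k-token example are assumptions, not measurements):

```python
# Rough prefill-time estimate for a dense transformer:
#   FLOPs ~= 2 * params * prompt_tokens, divided by sustained throughput.
def prefill_seconds(params: float, prompt_tokens: int,
                    peak_tflops: float, efficiency: float = 0.4) -> float:
    flops_needed = 2 * params * prompt_tokens
    flops_per_sec = peak_tflops * 1e12 * efficiency  # sustained, not peak
    return flops_needed / flops_per_sec

# Illustrative: 70B-parameter model, 8k-token prompt, FP8 on one H100.
print(f"~{prefill_seconds(70e9, 8192, 3958):.2f} s to prefill")
```

Prefill is typically compute-bound and decode is typically bandwidth-bound, which is why both the TFLOPS and the TB/s rows matter.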
A100
Ampere
Released: May 2020
Memory
40GB / 80GB HBM2e
HBM2e: High bandwidth memory 2 enhanced. 1.6-2.0 TB/s depending on variant. Previous-generation high-end memory before HBM3. Still excellent for most workloads.
Bandwidth
1.56 / 2.04 TB/s
FP16 TFLOPS
312
FP16 (16-bit Float): Half precision. Standard for modern inference. Good balance of speed and quality. 2x smaller than FP32. Widely supported across all frameworks.
Workhorse for LLMs. Llama 70B, Stable Diffusion XL, most production workloads.
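Whether a given model "fits" follows from the same bytes-per-parameter arithmetic: a 70B model is roughly 140 GB of weights at FP16 but about 70 GB at 8-bit, which is why quantization (or a second GPU) is the usual route on an 80GB A100. A small Python sketch (the 10% overhead allowance for KV cache and activations is an assumption and depends heavily on batch size and context length):

```python
# Quick "does it fit?" check: weights at a given precision plus a fudge
# factor for KV cache, activations, and framework overhead (assumed 10%).
def fits(params_b: float, bytes_per_param: float, vram_gb: float,
         overhead: float = 0.10) -> bool:
    weights_gb = params_b * bytes_per_param
    return weights_gb * (1 + overhead) <= vram_gb

for label, bpp in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    verdict = "fits" if fits(70, bpp, 80) else "does not fit"
    print(f"70B @ {label} on 80GB A100: {verdict}")
```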
L40S
Ada Lovelace
Released: Oct 2023
Memory
48GB GDDR6
GDDR6: Graphics DDR6. Standard GPU memory. 300-864 GB/s depending on configuration. Much cheaper than HBM but slower. Good balance of cost and performance for most workloads.
Bandwidth
864 GB/s
FP8 TFLOPS
733
FP8 (8-bit Float): Low precision for fast inference. Supported on Ada Lovelace Tensor Cores. 2x smaller than FP16. Good for quantized models with minimal accuracy loss.
High-memory inference. Llama 70B single GPU, large context windows, vision-language models.
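The "Llama 70B single GPU, large context windows" claim is worth unpacking: 4-bit weights take roughly 35 GB, and most of the remaining 48 GB goes to the KV cache, which grows linearly with context length. A rough Python sketch using assumed Llama-70B-like dimensions (80 layers, 8 grouped-query KV heads, head size 128, FP16 cache); exact numbers depend on the model and serving stack:

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
LAYERS, KV_HEADS, HEAD_DIM, CACHE_BYTES = 80, 8, 128, 2  # assumed 70B-like dims

kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * CACHE_BYTES
weights_gb = 70e9 * 0.5 / 1e9          # ~35 GB for 4-bit weights
free_gb = 48 - weights_gb - 2          # leave ~2 GB for activations/overhead
max_context = int(free_gb * 1e9 / kv_bytes_per_token)

print(f"KV cache: ~{kv_bytes_per_token / 1e6:.2f} MB per token")
print(f"~{max_context:,} tokens of context fit after the 4-bit weights")
```

That headroom after the quantized weights is the sense in which 48 GB buys "large context windows" on a single GPU.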
L40
Ada Lovelace
Released: Oct 2022
Memory
48GB GDDR6
GDDR6: Graphics DDR6. Standard GPU memory. 300-864 GB/s depending on configuration. Much cheaper than HBM but slower. Good balance of cost and performance for most workloads.
Bandwidth
864 GB/s
FP8 TFLOPS
362
FP8 (8-bit Float): Low precision for fast inference. Supported on Ada Lovelace Tensor Cores. 2x smaller than FP16. Good for quantized models with minimal accuracy loss.
Large memory capacity. Multi-modal models, large batch inference, video generation.
L4
Ada Lovelace
Released: Mar 2023
Memory
24GB GDDR6
GDDR6: Graphics DDR6. Standard GPU memory. 300-864 GB/s depending on configuration. Much cheaper than HBM but slower. Good balance of cost and performance for most workloads.
Bandwidth
300 GB/s
INT8 TOPS
242
INT8 (8-bit Integer): Integer quantization. Very efficient for inference. 4x smaller than FP32. Common for cost-effective deployments. Some quality trade-offs vs floating point.
Cost-effective inference. Llama 7B/13B, Whisper, smaller vision models, embeddings.
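The INT8 tooltip above compresses weight quantization into one line; the simplest form is symmetric per-tensor quantization, where a single scale maps the largest weight to 127. A minimal NumPy sketch of that idea (real deployments add per-channel scales and calibration, which this omits):

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: one scale for the whole tensor.
def quantize_int8(w: np.ndarray):
    scale = float(np.abs(w).max()) / 127.0        # largest weight maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", float(np.abs(w - dequantize(q, scale)).max()))
```

The reconstruction error is bounded by half a quantization step, which grows with the tensor's dynamic range; that is where the "some quality trade-offs" come from.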
A10G
Ampere
Released: 2021
Memory
24GB GDDR6
GDDR6: Graphics DDR6. Standard GPU memory. 300-864 GB/s depending on configuration. Much cheaper than HBM but slower. Good balance of cost and performance for most workloads.
Bandwidth
600 GB/s
FP16 TFLOPS
125
FP16 (16-bit Float): Half precision. Standard for modern inference. Good balance of speed and quality. 2x smaller than FP32. Widely supported across all frameworks.
Graphics-heavy workloads. ControlNet, image generation, video processing.
T4
Turing
Released: Sep 2018
Memory
16GB GDDR6
GDDR6: Graphics DDR6. Standard GPU memory. 300-864 GB/s depending on configuration. Much cheaper than HBM but slower. Good balance of cost and performance for most workloads.
Bandwidth
300 GB/s
INT8 TOPS
130
INT8 (8-bit Integer): Integer quantization. Very efficient for inference. 4x smaller than FP32. Common for cost-effective deployments. Some quality trade-offs vs floating point.
Entry-level inference. Small language models, embeddings, classification, lightweight tasks.
V100
Volta
Released: May 2017
Memory
16GB / 32GB HBM2
HBM2: High bandwidth memory generation 2. 900 GB/s on V100. Older generation of stacked memory. Slower than HBM2e/HBM3 but still faster than GDDR6.
Bandwidth
900 GB/s
FP16 TFLOPS
125
FP16 (16-bit Float): Half precision. Standard for modern inference. Good balance of speed and quality. 2x smaller than FP32. Widely supported across all frameworks.
Older generation. Still viable for many inference workloads at lower cost.