Running a large language model in production feels like balancing on a tightrope. You need speed to keep users engaged, but you also need to watch your cloud bill closely. If you pick the wrong hardware, you either burn cash or frustrate customers with slow responses. The choice isn't just about raw power anymore; it’s about memory bandwidth, precision formats, and how well the software talks to the silicon.
In 2026, the landscape has shifted dramatically. The NVIDIA A100, a high-performance GPU based on the Ampere architecture that dominated early LLM deployments, is still everywhere, but it’s aging. The NVIDIA H100, its successor built on the Hopper architecture and designed specifically for transformer workloads with FP8 support, has become the new standard for serious inference. Meanwhile, CPU offloading, a technique that moves model weights between CPU RAM and GPU memory to run large models on limited hardware, remains a viable option for budget-conscious teams, though with significant trade-offs. Let’s break down exactly which path fits your specific use case.
The Performance Gap: Why H100 Dominates Inference
If you are running models larger than 13 billion parameters, the H100 is no longer a luxury; it’s often a necessity for cost efficiency. The difference between the A100 and H100 isn’t just about having more cores. It’s about how they handle data.
The H100 introduces the Transformer Engine, which supports FP8 (8-bit floating point) precision. This allows the GPU to choose dynamically between FP8 and 16-bit precision, layer by layer, during inference. For LLMs, this means you can process tokens faster while using less memory bandwidth. In real-world tests with the Llama 3.1 70B model using the vLLM engine, the H100 SXM5 generated 3,311 tokens per second compared to the A100 NVLink’s 1,148 tokens per second. That is roughly a 2.9x throughput advantage.
Memory bandwidth is the silent killer of inference performance. The A100 offers 2.0 TB/s of HBM2e bandwidth. The H100 jumps to 3.35 TB/s with HBM3 memory. When you are streaming tokens one by one, every millisecond counts. The H100’s wider pipeline means less waiting for data to arrive at the compute units. For smaller models (13B-70B range), you see consistent 1.5x to 2x speed improvements on the H100. But for optimized transformers leveraging FP8, those gains can stretch to 3.3x.
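To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes single-stream decoding in which every generated token requires reading all model weights from GPU memory once, and it ignores KV-cache traffic, batching, and kernel overhead, so the result is a ceiling rather than a benchmark. The function name and the 13B example model are illustrative choices, not figures from the tests cited above.

```python
def max_decode_tokens_per_sec(bandwidth_tb_s: float, params_billions: float,
                              bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    # Assumption: all weights are read from memory once per generated token.
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token


if __name__ == "__main__":
    # Bandwidth figures come from the text; 13B parameters at FP16 (2 bytes) is an assumed example.
    for name, bandwidth in [("A100", 2.0), ("H100", 3.35)]:
        ceiling = max_decode_tokens_per_sec(bandwidth, 13, 2)
        print(f"{name}: ~{ceiling:.0f} tokens/s ceiling at FP16")
    # Halving the bytes per parameter (FP8 on the H100) doubles the ceiling.
    print(f"H100 FP8: ~{max_decode_tokens_per_sec(3.35, 13, 1):.0f} tokens/s ceiling")
```

Under these assumptions, bandwidth alone gives the H100 a 1.675x ceiling over the A100, which sits inside the 1.5x to 2x range; anything beyond that has to come from FP8 or other optimizations.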
| Feature | NVIDIA A100 (Ampere) | NVIDIA H100 (Hopper) |
|---|---|---|
| Architecture | Ampere | Hopper |
| Memory Type | HBM2e | HBM3 |
| Memory Bandwidth | 2.0 TB/s | 3.35 TB/s |
| CUDA Cores | 6,912 | 16,896 (SXM5) / 14,592 (PCIe) |
| Precision Support | FP16, BF16, INT8 | FP8, FP16, BF16, INT8 |
| NVLink Speed | 600 GB/s | 900 GB/s |
| Typical Cloud Cost (Hourly) | $3.00 - $4.50 | $4.00 - $6.00 |
Cost Efficiency: Is the H100 Actually Cheaper?
It sounds counterintuitive. The H100 costs more per hour in the cloud. So why do experts say it’s more cost-effective? Because you pay for time, not just hardware. If the H100 finishes the job twice as fast, you save money.
Let’s look at a concrete example from AWS pricing trends in mid-2025. An A100 instance might cost $0.75 per hour, delivering 112 tokens per second for Mistral 7B. An H100 instance might cost $1.20 per hour, delivering 247 tokens per second.
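- A100 Cost per Token: ($0.75 / 3600 seconds) / 112 tokens per second ≈ $0.00000186 per token
- H100 Cost per Token: ($1.20 / 3600 seconds) / 247 tokens per second ≈ $0.00000135 per token
Despite a 60% higher hourly rate, the H100 comes out roughly 27% cheaper per token, provided you can keep it busy.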
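The same arithmetic can be sketched as a small helper, which is convenient for plugging in your own prices and measured throughput. The function name is illustrative, and the calculation assumes the instance sustains the quoted throughput for the full billed hour.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


a100 = cost_per_million_tokens(0.75, 112)  # example A100 figures from the text
h100 = cost_per_million_tokens(1.20, 247)  # example H100 figures from the text
savings = 1 - h100 / a100

print(f"A100: ${a100:.2f} per 1M tokens")
print(f"H100: ${h100:.2f} per 1M tokens")
print(f"H100 saves {savings:.0%} per token")
```

Expressed per million tokens, the example works out to about $1.86 on the A100 versus $1.35 on the H100. Idle time changes the picture: an underused H100 loses this advantage quickly.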
When CPU Offloading Makes Sense
Not everyone has a budget for enterprise GPUs. CPU offloading allows you to run massive models like Llama 3 70B on consumer hardware or low-cost servers with plenty of RAM. Two tools make this possible by splitting or swapping model weights between CPU RAM and GPU VRAM: vLLM, an open-source library that provides high-throughput LLM serving with PagedAttention, and llama.cpp, a C++ implementation of Meta's LLaMA that enables efficient inference on CPUs.
However, you must understand the penalty. Memory bandwidth on even the best server CPUs (like AMD EPYC 9654) is a fraction of what GPUs offer. In MLPerf benchmarks, CPU offloading increased latency from 200-500ms on an H100 to 2-5 seconds per token. Throughput drops to 1-5 tokens per second. This approach is suitable for:
- Development and testing environments where speed doesn’t matter.
- Batch processing jobs where results can be queued overnight.
- Edge cases with extremely low concurrency (1-2 users).
- Prototyping new models before committing to GPU infrastructure.
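Before trying an offloaded setup, it helps to estimate how much of the model can stay on the GPU. The sketch below is a rough planning aid, not how any particular tool decides: it assumes weights are spread evenly across layers, and the `reserve_gb` headroom for KV cache and activations is a guess you should tune. The 35 GB model size assumes about 0.5 bytes per parameter (a 4-bit quantization) for a 70B model with 80 transformer layers. The resulting layer count is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option.

```python
def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 4.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu) for a simple even-split estimate."""
    per_layer_gb = model_gb / n_layers          # assumption: every layer is the same size
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep headroom for KV cache and activations
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


if __name__ == "__main__":
    # Hypothetical setup: 70B model at ~4 bits (35 GB), 80 layers, one 24 GB consumer GPU.
    gpu_layers, cpu_layers = gpu_layer_split(model_gb=35, n_layers=80, vram_gb=24)
    print(f"GPU layers: {gpu_layers}, CPU layers: {cpu_layers}")
```

Every layer left on the CPU is read over the much slower system memory path, which is where the latency penalty described above comes from.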
Implementation Complexity and Tooling
Hardware is only half the battle. Getting the most out of these chips requires software optimization. The A100 benefits from mature tooling. Frameworks like TensorRT-LLM and DeepSpeed have had years to refine their support for Ampere architecture. You can often get good performance out-of-the-box within 1-3 days of setup.
The H100 requires more engineering effort to unlock its full potential. To leverage FP8 precision, you need to ensure your entire stack supports it. NVIDIA reports that full benefit realization in complex inference pipelines can take 2-4 weeks of tuning. However, the payoff is significant: if you skip FP8 and run the H100 in FP16 mode, you leave much of its performance on the table.
For CPU offloading, the barrier to entry is low, but reaching stable operation takes real effort. Developers report spending 5-7 days just to achieve stable performance with 70B models on llama.cpp, managing memory swapping manually to avoid crashes. Documentation quality varies wildly here, making debugging a frustrating experience.
Decision Framework: Which Should You Choose?
Your choice depends on three factors: Model Size, Concurrency, and Latency Requirements.
Choose NVIDIA H100 if:
- You are deploying models >13B parameters.
- You need sub-second latency for real-time interactions.
- You have high concurrency (>10 simultaneous users).
- You want long-term viability (2026-2028) without re-architecting.
Choose NVIDIA A100 if:
- You are running smaller models (<13B parameters).
- Your concurrency is low to moderate.
- You are constrained by immediate budget and cannot justify H100 premiums.
- You rely on legacy tooling that hasn’t been updated for Hopper architecture.
Choose CPU offloading if:
- You are prototyping or testing locally.
- You have zero GPU budget.
- Your application is batch-oriented, not real-time.
- You are targeting edge devices with limited power budgets.
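The checklist in this section can be condensed into a small function. This is one possible reading of the criteria, not a definitive rule: the function name, its parameters, and the order in which the conditions are checked are assumptions made for illustration, and only the 13B and 10-user thresholds come from the text.

```python
def recommend_hardware(params_billions: float, concurrent_users: int,
                       needs_realtime: bool, has_gpu_budget: bool) -> str:
    """Map the decision criteria onto a single suggestion (a simplification)."""
    if not has_gpu_budget:
        return "CPU offloading"  # zero GPU budget leaves no other option
    if not needs_realtime and concurrent_users <= 2:
        return "CPU offloading"  # batch jobs or prototyping at very low concurrency
    if params_billions > 13 or concurrent_users > 10:
        return "H100"            # large models or high concurrency
    return "A100"                # smaller models with low to moderate concurrency


if __name__ == "__main__":
    print(recommend_hardware(70, 50, needs_realtime=True, has_gpu_budget=True))
    print(recommend_hardware(7, 5, needs_realtime=True, has_gpu_budget=True))
    print(recommend_hardware(70, 1, needs_realtime=False, has_gpu_budget=True))
```

Real decisions involve softer factors such as tooling maturity and hardware availability, so treat the output as a starting point for discussion.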
Future-Proofing Your Infrastructure
The industry is moving toward specialized inference accelerators. While AMD’s MI300X offers competitive pricing, benchmarks show it still lags behind the H100 in transformer-specific efficiency. Google’s TPU v5p provides strong throughput but lacks broad framework support. Analysts predict H100-class GPUs will maintain over 75% market share for production LLM inference through 2027. Organizations standardizing on H100 now will likely see relevance for 3-5 years. A100 deployments face obsolescence risks as models grow beyond 1 trillion parameters, where memory bandwidth becomes the critical bottleneck. Planning your migration strategy today saves headaches tomorrow.
Is the NVIDIA H100 worth the extra cost over A100 for small models?
For models under 13B parameters with low concurrency, the A100 often provides better price-to-performance due to lower hourly rates and wider availability. The H100's advantages shine with larger models and high-throughput requirements where its memory bandwidth and FP8 support reduce cost-per-token significantly.
Can I run Llama 3 70B on a CPU-only server?
Yes, using tools like llama.cpp or vLLM with CPU offloading. However, expect very high latency (2-5 seconds per token). This is suitable for development or batch jobs but impractical for real-time user-facing applications due to poor responsiveness.
What is FP8 precision and why does it matter for LLMs?
FP8 (8-bit floating point) is a data format supported by NVIDIA H100's Transformer Engine. It reduces memory usage and increases computational throughput compared to FP16/BF16 without significant loss in model accuracy. This allows faster inference and higher concurrency on the same hardware.
How does memory bandwidth affect LLM inference speed?
The token-by-token generation phase of LLM inference is typically memory-bound, not compute-bound, especially at small batch sizes. The GPU spends most of its time waiting for model weights to load from memory into the processor. Higher bandwidth (like H100's 3.35 TB/s vs A100's 2.0 TB/s) means more data processed per second, which translates directly into higher tokens-per-second output.
When should I consider AMD MI300X instead of NVIDIA GPUs?
AMD MI300X is a strong alternative if you need large memory capacity (192GB) and want to diversify vendors. However, current benchmarks show it still lagging behind the H100 in transformer-specific efficiency even at roughly 85% of the cost, meaning NVIDIA still holds an efficiency lead for transformer workloads. Consider MI300X if ecosystem lock-in is a major concern.