Checkpoint Averaging and EMA: How to Stabilize Large Language Model Training

Posted 8 Aug by JAMIUL ISLAM

Training a large language model (LLM) isn’t just about writing code and hitting run. It’s a months-long process that can cost millions, and even then, the final model might not perform as expected. Why? Because training isn’t smooth. Loss curves bounce, gradients spike, and models drift into unstable zones. One of the most effective, low-cost fixes? Checkpoint averaging and its cousin, exponential moving average (EMA).

What’s the problem with standard LLM training?

Most LLMs are trained using stochastic gradient descent (SGD) or variants like Adam. These methods update weights step-by-step, trying to find the lowest point in the loss landscape. But here’s the catch: the path to that minimum isn’t straight. It’s full of wiggles. Early in training, the model learns basic patterns. Later, it fine-tunes them. But because the optimizer is noisy, especially with large batch sizes and high learning rates, the final weights you save might be stuck in a shallow, unstable valley. That’s fine for a single run. But if you retrain the same model 10 times, you’ll get 10 different results. Performance varies by up to 3 points on benchmarks just from random seed differences.

That’s where checkpoint averaging comes in. Instead of trusting the last saved model, you take snapshots (checkpoints) at regular intervals during training and average their weights. This smooths out the noise and lands you in a flatter, more reliable region of the loss surface. Think of it like taking multiple photos of a moving object and blending them to get a sharp image. The model doesn’t need more training time. It just needs better sampling.

Checkpoint averaging: How it works

Checkpoint averaging is simple in concept. Every few thousand training steps, you save the full model weights. After training finishes, you load the last N checkpoints and compute their arithmetic mean. For a 7B-parameter model at roughly 8 bytes per parameter, that’s about 56GB per checkpoint. If you average the last 10, you’re storing 560GB of extra data. That’s expensive, but it’s cheaper than retraining for weeks.
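
As a minimal sketch (assuming each checkpoint is a plain PyTorch state dict saved with torch.save, and the file names are hypothetical), the averaging step is just an element-wise mean, accumulated one checkpoint at a time so you never hold more than one full copy in memory:

```python
# Average the last N checkpoints into a single state dict.
# Assumes each file was written with torch.save(model.state_dict(), path).
import torch

def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            # Accumulate in FP32 to avoid precision loss while summing.
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical checkpoint files saved every 2,000 steps:
paths = [f"checkpoint_{step}.pt" for step in range(90_000, 110_000, 2_000)]
# averaged = average_checkpoints(paths)
# torch.save(averaged, "averaged_model.pt")
```

One caveat: non-float entries in a state dict (integer buffers such as step counters) usually shouldn’t be averaged; carry them over from the latest checkpoint instead.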

The key is timing. Save too early, and you’re averaging random noise. Save too late, and you’re just averaging the final few steps-no real benefit. Research shows the sweet spot is saving every 2,000 to 5,000 steps during the stable phase of training. For models like Llama-2-13B or GPT-5B, users on Reddit and Hugging Face forums report best results using the last 8 to 20 checkpoints. The exact number depends on model size and training length, but the rule is: more checkpoints = better smoothing, up to a point.

A 2023 OpenReview paper showed that models trained with high learning rates (0.001-0.003) saw the biggest gains from averaging. Why? Because high learning rates create more movement in the weight space. Averaging pulls those swings back toward center. In one case, a team using checkpoint averaging on a 70B model improved their MMLU score from 68.2 to 69.7 without any extra training. That’s a 1.5-point gain, free.

Exponential Moving Average (EMA): The smarter cousin

Checkpoint averaging treats all saved models equally. But what if recent weights matter more? That’s where EMA comes in.

Instead of saving multiple checkpoints, EMA runs alongside training. It keeps a second copy of the model weights and updates it slowly: ema_weights = decay * ema_weights + (1 - decay) * current_weights. The decay rate controls how much weight is given to the past. A decay of 0.999 means the EMA model changes very slowly. A decay of 0.2 means it follows the current model closely.
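
Here is what that update looks like as a hand-rolled PyTorch helper; the class name and decay value are illustrative, not part of any particular library, and buffers are left untouched in this sketch:

```python
# Minimal EMA sketch: keep a shadow copy of the weights and blend it
# toward the live weights after every optimizer step.
import copy
import torch

class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy that holds the smoothed weights.
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # ema_weights = decay * ema_weights + (1 - decay) * current_weights
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage inside a training loop (model and optimizer are your own objects):
#   ema = EMA(model, decay=0.999)
#   ...
#   optimizer.step()
#   ema.update(model)
# Evaluate ema.shadow on the validation set alongside the live model.
```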

Here’s the twist: EMA isn’t just a backup. It often performs better than the main model. Why? Because it’s less affected by short-term noise. A 2025 analysis by Emergent Mind found that EMA with a decay of 0.2, applied over the last 6 checkpoints, restored curriculum learning benefits lost during long training runs. In other words, EMA remembered what the model learned early on, even as later updates tried to overwrite it.

But EMA isn’t magic. Use a decay too close to 1 (like 0.9999), and the model becomes sluggish. One user on GitHub reported their 13B model collapsed with 12.4% higher perplexity after using EMA with decay=0.9999. The model forgot how to generate coherent text because it was stuck in an outdated state. Optimal decay values usually fall between 0.1 and 0.99, depending on training dynamics. Start with 0.99 and adjust based on validation performance.


When does it work? When doesn’t it?

Checkpoint averaging and EMA are powerful, but they’re not universal fixes.

They work best during pre-training on massive datasets (billions of tokens), with high learning rates and large batch sizes. That’s when the model is exploring the loss landscape and needs stabilization. The arXiv paper from May 2025 showed a 3.3% average improvement across 12 NLP benchmarks using Pre-trained Model Averaging (PMA).

They fail during fine-tuning. If you’re training on a small dataset (say, 10,000 medical texts), averaging checkpoints increases overfitting risk by 18-22%. The model starts memorizing noise instead of generalizing. In that case, stick to standard fine-tuning with early stopping.

They also fail if training is unstable. If your loss spikes wildly (say, from 2.1 to 4.5 in a few steps), averaging those checkpoints will pull the model toward a bad average. You need to detect and skip bad checkpoints. Some newer tools, like NVIDIA’s NeMo 2.0 (coming in 2025), will automatically filter out checkpoints with high gradient variance.
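
Until that kind of tooling is widely available, a simple stopgap is to log the training loss next to each checkpoint and drop outliers before averaging. The sketch below is a hypothetical illustration; the file names, loss values, and the 20% threshold are made up, not recommendations from this post.

```python
# Sketch: skip checkpoints recorded during loss spikes before averaging.
# `checkpoints` pairs each path with the training loss logged at save time.
checkpoints = [
    ("step_90000.pt", 2.10),
    ("step_92000.pt", 2.08),
    ("step_94000.pt", 4.50),   # loss spike: likely a bad state to average in
    ("step_96000.pt", 2.07),
]

losses = [loss for _, loss in checkpoints]
median_loss = sorted(losses)[len(losses) // 2]

# Keep only checkpoints whose loss is within 20% of the median.
stable = [path for path, loss in checkpoints if loss <= 1.2 * median_loss]
print(stable)   # ['step_90000.pt', 'step_92000.pt', 'step_96000.pt']
```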

Implementation: What you need to know

You don’t need to build this entirely from scratch. Several libraries ship EMA helpers: PyTorch’s torch.optim.swa_utils.AveragedModel can maintain an EMA copy via a custom averaging function, timm provides a ModelEmaV2 wrapper, and Hugging Face’s diffusers library includes an EMAModel utility. Check whether your training stack already exposes an EMA option before writing your own.
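
For example, here is a minimal sketch with AveragedModel; the tiny model, dummy loss, and step count are throwaway stand-ins for a real pre-training loop, and the decay value is illustrative.

```python
# EMA via PyTorch's torch.optim.swa_utils.AveragedModel.
import torch
from torch.optim.swa_utils import AveragedModel

decay = 0.999  # illustrative; tune on validation data

def ema_avg(avg_param, new_param, num_averaged):
    # ema = decay * ema + (1 - decay) * current, applied per parameter
    return decay * avg_param + (1.0 - decay) * new_param

model = torch.nn.Linear(256, 256)           # stand-in for your LLM
ema_model = AveragedModel(model, avg_fn=ema_avg)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):                     # your real loop goes here
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).pow(2).mean()   # dummy loss
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)      # refresh the EMA copy each step

# At evaluation time, use ema_model.module instead of model.
```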

For custom training loops in PyTorch or JAX, here’s the basic flow (a minimal sketch follows the list):

  1. Save model weights every 2,000-5,000 steps during stable training (after the warmup phase).
  2. Store at least 5-20 checkpoints, depending on model size.
  3. After training ends, load the selected checkpoints and average their state dicts.
  4. Save the averaged model as your final checkpoint.
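
That flow, for steps 1 and 2, might look like the sketch below; the tiny model, dummy loss, and specific step counts are placeholders rather than recommendations.

```python
# Sketch of periodic checkpoint saving in a custom PyTorch loop (steps 1-2).
import os
import torch

model = torch.nn.Linear(256, 256)            # stand-in for your LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps, warmup_steps, save_every, keep_last = 20_000, 2_000, 5_000, 12
os.makedirs("checkpoints", exist_ok=True)
saved_paths = []

for step in range(1, total_steps + 1):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 256)).pow(2).mean()   # stand-in training step
    loss.backward()
    optimizer.step()

    # Save after warmup, every few thousand steps.
    if step > warmup_steps and step % save_every == 0:
        path = f"checkpoints/step_{step}.pt"
        torch.save(model.state_dict(), path)
        saved_paths.append(path)
        saved_paths = saved_paths[-keep_last:]  # remember the most recent few

# Steps 3-4: feed saved_paths into an averaging routine like the one sketched earlier.
```
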
Storage is the biggest cost. A 70B model checkpoint is roughly 560GB (assuming 8 bytes per parameter). Even at 2 bytes per parameter (BF16 weights only), a trillion-parameter checkpoint is about 2 terabytes. That’s not just storage; it’s I/O. DDN’s 2024 whitepaper found checkpoint writes consumed 40% of peak bandwidth on some systems. If you’re training at scale, consider using fast NVMe storage or distributed file systems.
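
A quick back-of-envelope helper makes it easy to budget this up front; the calls below just re-derive the figures quoted above, and the byte-per-parameter values are assumptions you should adjust for your own checkpoint format.

```python
# Back-of-envelope checkpoint storage budget (values are illustrative).
def checkpoint_tb(n_params: float, bytes_per_param: int, n_checkpoints: int = 1) -> float:
    """Return storage in terabytes for n_checkpoints checkpoints."""
    return n_params * bytes_per_param * n_checkpoints / 1e12

print(checkpoint_tb(70e9, 8))        # 70B model, 8 B/param  -> ~0.56 TB per checkpoint
print(checkpoint_tb(70e9, 8, 10))    # keep the last 10       -> ~5.6 TB total
print(checkpoint_tb(1e12, 2))        # 1T model, BF16 weights -> ~2 TB per checkpoint
```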


Real-world results: What teams are seeing

Organizations training models above 1B parameters are adopting checkpoint averaging at an 87% rate (Papers With Code, 2024). Here’s what they’re reporting:

  • A team at a major AI lab improved their GSM8K math reasoning score by 2.1 points by averaging the last 12 checkpoints of their 34B model.
  • A startup cut their training time by 17% for a GPT-5B model because they could stop earlier and still get better performance.
  • One user recovered from a training crash by averaging three checkpoints from before the loss spike, saving three weeks of work.

But there are warnings too. Dr. Percy Liang from Stanford cautions that checkpoint averaging can mask deeper instability. If your loss keeps spiking, don’t just average; you need to fix the learning rate, batch size, or data pipeline. Averaging is a bandage, not a cure.

Future: Where this is headed

The next wave is intelligent averaging. Instead of averaging the last N checkpoints, systems will pick the best ones. NVIDIA’s NeMo 2.0 will use gradient similarity metrics to select only those checkpoints that contribute meaningful learning. Others are experimenting with dynamic EMA decay that changes based on loss variance.

Professor Yann LeCun predicted at ICLR 2025 that by 2027, 95% of LLM training will use adaptive checkpoint merging. That’s not hype; it’s inevitability. As models grow past a trillion parameters, the cost of retraining becomes astronomical. Averaging is the cheapest way to squeeze out performance.

Final advice: Start small, measure, then scale

If you’re training a model larger than 7B parameters, you should be using checkpoint averaging or EMA. Here’s how to start:

  1. Use an existing EMA helper if your training stack provides one (PyTorch’s torch.optim.swa_utils is one option) rather than rolling your own.
  2. For custom code, save checkpoints every 5,000 steps after the warmup phase.
  3. Average the last 8-12 checkpoints.
  4. Test on a validation set. If performance improves by 0.5+ points, keep it.
  5. If you see degradation, reduce the number of checkpoints or lower the EMA decay rate.

Don’t treat this as a black box. Understand your training curve. If loss is stable, averaging helps. If it’s chaotic, fix the root cause first. Checkpoint averaging doesn’t make bad training good; it makes good training better. And in LLM training, that’s often the difference between a usable model and a wasted million dollars.

What’s the difference between checkpoint averaging and EMA?

Checkpoint averaging computes the mean of multiple saved model states after training ends. EMA runs during training, maintaining a slowly updated copy of the model weights using a decay factor. EMA gives geometrically more weight to recent updates, while checkpoint averaging treats all selected checkpoints equally. EMA is lighter on storage but harder to tune; checkpoint averaging is more predictable but requires more disk space.

Can I use EMA during fine-tuning?

It’s risky. Fine-tuning uses small datasets, so the model is more likely to overfit. EMA can amplify this by locking in noisy patterns. If you use EMA during fine-tuning, use a lower decay rate (0.9-0.95) and monitor validation loss closely. For most fine-tuning tasks, standard early stopping works better.

How many checkpoints should I save for a 70B model?

For models over 50B parameters, save the last 15-20 checkpoints. Each checkpoint for a 70B model is about 560GB. So you’ll need 8-11TB of storage. If storage is limited, prioritize saving every 5,000-10,000 steps during the last 30-40% of training. Avoid saving during the first 10%, when weights are still mostly random.

Does checkpoint averaging work with quantized models?

Yes, but only if you average the full-precision weights before quantization. If you save quantized checkpoints, averaging introduces rounding errors that hurt performance. Always average in FP16 or BF16, then quantize the final averaged model. This is standard practice in Hugging Face and NVIDIA’s NeMo framework.

Why does EMA sometimes make performance worse?

EMA can lag too far behind the main model if the decay rate is too high (e.g., 0.9999). This causes the EMA weights to become outdated, especially if training has a sudden shift in loss. It can also smooth out useful learning signals. Start with decay=0.99 and test decay=0.95, 0.9, and 0.8. The best value is the one that gives the lowest validation loss, not the highest.
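
In practice that tuning can be a small scripted sweep; evaluate_with_decay below is a hypothetical placeholder for your own routine that builds (or loads) the EMA weights at a given decay and scores them on the validation set.

```python
# Sketch of a decay sweep: score each candidate decay on validation loss
# and keep the lowest. evaluate_with_decay() is a hypothetical placeholder.
candidate_decays = [0.99, 0.95, 0.9, 0.8]

def evaluate_with_decay(decay: float) -> float:
    # Replace this with: build/load EMA weights at `decay`, run validation.
    return abs(decay - 0.95)  # dummy value so the sketch runs end to end

results = {d: evaluate_with_decay(d) for d in candidate_decays}
best_decay = min(results, key=results.get)  # lowest validation loss wins
print(best_decay, results[best_decay])
```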

Is checkpoint averaging worth the storage cost?

For models over 1B parameters, yes. Training a 70B model costs around $1.8 million. Saving 10 extra checkpoints might cost $10,000 in storage, but it can improve performance by 1-2%, which often means the difference between a model that works and one that doesn’t. In enterprise settings, the ROI is clear: a 17% reduction in training time, like the one reported above, translates to millions saved.

Comments (2)
  • Kathy Yip

    December 9, 2025 at 22:27

    Man, I’ve been averaging checkpoints for my 13B model and honestly? It’s like night and day. Used to get 67.8 on MMLU, now I’m hitting 69.1 without changing a single hyperparameter. Just saved every 5k steps after warmup and averaged the last 10. Storage is a pain, but way cheaper than retraining.

    Still not sure why some people use EMA with decay=0.9999 though. That’s like trying to drive a car while holding the steering wheel in a fixed position. You’re not adapting-you’re just frozen.

  • Bridget Kutsche

    December 10, 2025 at 08:42

    For anyone new to this-start with Hugging Face’s built-in EMA. Seriously. Just set ema_decay=0.99 and go. No need to overcomplicate it. I used it on a 7B fine-tune and got a 0.9 point bump on BLEU without touching anything else. And no extra storage! It’s magic if you let it be.

    Also, don’t panic if validation loss dips a little at first. EMA takes time to catch up. Just let it ride.
