One-sentence summary: Scaling laws reveal that language model loss falls predictably as a power function of model size, dataset size, and compute — which means you can estimate the cost of a 70B training run on the back of a napkin, and you should.
A.1 What Is a Scaling Law?
A.1.1 The counterintuitive finding
In 2020, OpenAI published what became known as the Scaling Laws paper. The central result surprised people who assumed that architectural cleverness would drive progress:
Language model loss (measured by cross-entropy on held-out text) follows a power-law relationship with three quantities: parameter count N, dataset size D, and compute C.
This is not a claim that larger is always better. It is a claim that the relationship is predictable. Given a fixed budget, you can calculate, before training, roughly what loss you will get. That prediction is useful.
A.1.2 The three power-law equations
The core equations from the OpenAI paper:

L(N) = (N_c / N)^α_N
L(D) = (D_c / D)^α_D
L(C) = (C_c / C)^α_C

Where:
- L is the model's cross-entropy loss (lower is better)
- N is the number of parameters, D is the training token count, C is total compute in FLOPs
- N_c, D_c, C_c are fitted constants
- α is the power-law exponent for each quantity
The empirically fitted exponents:

α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050
The exponents are small, which means you need to scale aggressively to see large loss improvements. Doubling parameters does not halve loss — it reduces it by roughly 5%.
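That 5% figure follows directly from the exponent. A minimal sketch, using the exponents quoted in this appendix, makes the arithmetic explicit:

```python
# How much does loss fall when a scaled quantity doubles under a power law?
# The loss ratio after doubling is 2^(-exponent), so the reduction is 1 - 2^(-exponent).
def loss_reduction_from_doubling(exponent):
    return 1 - 2 ** (-exponent)

for name, alpha in [("N (parameters)", 0.076), ("D (tokens)", 0.095), ("C (compute)", 0.050)]:
    print(f"doubling {name}: loss falls by {loss_reduction_from_doubling(alpha):.1%}")
```

Doubling parameters buys about 5.1%, doubling data about 6.4%, doubling compute about 3.4% — which is why budgets scale in orders of magnitude, not factors of two.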
A.1.3 What the curve looks like
On a log-log plot, each relationship is approximately a straight line with a negative slope equal to the exponent. That straight line in log-log space is the power law:
Loss
| \
| \
| \
| \____
| \____
| \____
+-----------------------------> log(N or D or C)
The lines do not flatten at any scale observed so far. That is what keeps frontier labs scaling.
A.2 Parameter Count Estimation
A.2.1 Where the parameters live in a Transformer
A standard dense Transformer's parameters come from three places:
Embedding layer
Token embeddings: vocab_size × d_model
Position embeddings: max_seq_len × d_model (if learned; omitted for RoPE)
Each Transformer block (the bulk of the model)
Multi-Head Attention:
W_Q: d_model × d_model
W_K: d_model × d_model
W_V: d_model × d_model
W_O: d_model × d_model
Subtotal: 4 × d_model²
Feed Forward Network (FFN, expansion ratio 4):
W_1: d_model × 4·d_model
W_2: 4·d_model × d_model
Subtotal: 8 × d_model²
LayerNorm (×2 per block):
γ, β for each of the two norms: 2 × 2 × d_model = 4 × d_model (negligible)
Per block total: ≈ 12 × d_model²
Output head
LM head: d_model × vocab_size
(usually tied to the embedding matrix — no extra parameters)
A.2.2 The simplified estimation formula
For a model with L layers and hidden dimension d_model, the dominant term is:

N ≈ 12 × L × d_model²
The embedding and output layers are subdominant for large models and can be ignored in rough estimates.
Quick sanity check on GPT-3 175B: L = 96, d_model = 12288 → N ≈ 12 × 96 × 12288² ≈ 173B, within about 1% of the reported 175B.
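The same estimate as a two-line function — a sketch, with the function name ours and configs taken from the published model cards:

```python
# Dominant-term parameter estimate: ignores embeddings and the LM head.
def estimate_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

print(f"GPT-3 175B: {estimate_params(96, 12288) / 1e9:.0f}B")  # ≈ 174B
print(f"GPT-2 XL:   {estimate_params(48, 1600) / 1e9:.2f}B")   # ≈ 1.47B
```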
A.2.3 Verification table across six models
| Model | L | d_model | Predicted N | Actual N | Match |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 85M | 117M | ~73% |
| GPT-2 Medium | 24 | 1024 | 302M | 345M | ~88% |
| GPT-2 Large | 36 | 1280 | 709M | 762M | ~93% |
| GPT-2 XL | 48 | 1600 | 1.47B | 1.5B | ~98% |
| GPT-3 Small | 12 | 768 | 85M | 125M | ~68% |
| GPT-3 175B | 96 | 12288 | 173B | 175B | ~99% |
The formula is less accurate for small models where embedding parameters are a larger fraction of the total. For anything above 1B parameters the estimate is within a few percent. This is the regime where scaling law predictions are also most reliable.
A.3 Compute Estimation
A.3.1 What a FLOP is
FLOPs (floating-point operations) count the floating-point additions and multiplications a computation requires.
The most important primitive: a matrix multiply of an (m × k) matrix by a (k × n) matrix costs exactly 2 × m × k × n FLOPs. There are m × n output elements, and each requires k multiply-accumulate pairs (one multiply and one add).
A.3.2 Training compute formula
The standard training cost estimate is:

C_train ≈ 6 × N × D

Where:
- N is the number of model parameters
- D is the number of training tokens
- The constant 6 accounts for forward and backward passes:
  - Forward pass: each parameter participates in one multiply-add per token → 2N FLOPs per token
  - Backward pass: gradient of loss w.r.t. activations plus gradient w.r.t. weights → 4N FLOPs per token
  - Total per token: 2N + 4N = 6N
Worked example, LLaMA-7B trained on 1 trillion tokens:

C ≈ 6 × (7×10⁹) × 10¹² = 4.2×10²² FLOPs
That is 42 zettaFLOPs. Written out: 42,000,000,000,000,000,000,000 operations.
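The C ≈ 6ND rule as a sanity-checkable helper (a sketch, nothing more):

```python
# Training compute: 2N FLOPs/token forward + 4N FLOPs/token backward = 6N per token.
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

print(f"{training_flops(7e9, 1e12):.1e}")  # 4.2e+22
```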
A.3.3 Inference compute formula
Inference only runs the forward pass and generates one token at a time:

C_inference ≈ 2 × N FLOPs per generated token

Example, LLaMA-7B generating 100 tokens:

C ≈ 2 × (7×10⁹) × 100 = 1.4×10¹² FLOPs
A single H100 can do that in under 2 milliseconds at peak throughput. The bottleneck in autoregressive generation is not compute — it is memory bandwidth, which is why KV caching (Chapter 22) and quantization (Chapter 27) matter so much for serving latency.
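A rough latency estimate from these numbers — a sketch, using the 990 TFLOPs H100 peak from the spec table in the next subsection, which real serving never reaches:

```python
# Forward-pass-only inference cost: ≈ 2N FLOPs per generated token (prompt excluded).
def inference_flops(n_params, n_new_tokens):
    return 2 * n_params * n_new_tokens

H100_PEAK_FLOPS = 990e12  # BF16 Tensor Core peak
c = inference_flops(7e9, 100)
print(f"{c:.1e} FLOPs -> {c / H100_PEAK_FLOPS * 1e3:.2f} ms at peak")
```

The compute-bound lower bound is under 2 ms; observed latency is far higher because the weights must stream from HBM for every token.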
A.3.4 GPU specs and training time
Representative GPU specs (FP16 / BF16 Tensor Core throughput):
| GPU | Peak TFLOPs (precision noted) | VRAM | Notes |
|---|---|---|---|
| RTX 3090 | 35 TFLOPs (FP32) / 142 (FP16 sparse) | 24 GB | Consumer; training on this is slow |
| RTX 4090 | 83 TFLOPs (FP32) / 330 (FP16 sparse) | 24 GB | Best consumer card; fine-tuning only at scale |
| A100 80GB | 312 TFLOPs (BF16) | 80 GB | Datacenter standard; two-year backbone of most labs |
| H100 80GB | 990 TFLOPs (BF16) | 80 GB | 3× A100 for training throughput |
| H200 141GB | 990 TFLOPs (BF16) | 141 GB | Same compute as H100; extra memory for larger batch sizes and KV cache |
The sparse FP16 numbers for consumer GPUs assume structured sparsity that most training workloads cannot exploit. Use the dense FP32 numbers as the practical ceiling for unoptimized training.
Training time formula:

time (seconds) ≈ C_train / (GPU count × peak FLOPs/s per GPU × utilization)
Realistic utilization is 0.4 to 0.5 — communication overhead, data loading, occasional restarts, and checkpointing all cut into peak throughput.
Example, LLaMA-7B on 1000 A100 80GB GPUs:

time ≈ 4.2×10²² / (1000 × 3.12×10¹⁴ × 0.45) ≈ 3.0×10⁵ seconds ≈ 3.5 days
At spot pricing around $1.30/GPU-hour, that run costs roughly:

1000 GPUs × 83 hours × $1.30/GPU-hour ≈ $108,000
GPT-3 at 175B parameters with 300B tokens is closer to $4–5M on 2020 hardware — which explains why only a handful of organizations trained it.
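Putting the time and cost formulas together — a sketch in which the utilization and spot price are the assumptions stated above:

```python
# Wall-clock time and dollar cost for a run, given cluster size and utilization.
def training_time_and_cost(total_flops, n_gpus, peak_flops_per_gpu,
                           utilization=0.45, dollars_per_gpu_hour=1.30):
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
    gpu_hours = n_gpus * seconds / 3600
    return seconds / 86400, gpu_hours * dollars_per_gpu_hour

days, cost = training_time_and_cost(4.2e22, 1000, 312e12)  # LLaMA-7B on A100s
print(f"{days:.1f} days, ${cost:,.0f}")
```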
A.4 Per-Operation FLOPs Quick Reference
Before estimating a full Transformer block, you need the cost of each primitive.
| Operation | FLOPs |
|---|---|
| Matrix multiply (m × k) · (k × n) | 2 × m × k × n |
| Vector dot product (length n) | 2 × n |
| Softmax along an axis of length n | ≈ 5 per element (≈ 5 × n total) |
| LayerNorm over a vector of size n | ≈ 8 × n |
| GELU activation (per element) | ≈ 8 FLOPs |
Softmax costs roughly 5 operations per element: subtract max, exponentiate, sum, divide, plus bookkeeping. LayerNorm costs roughly 8: mean, variance, normalize, scale, shift, plus a few bookkeeping ops. These are small compared to matrix multiplies in typical configurations, but they add up at long context lengths.
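Since everything in the next section reduces to the matmul rule, it is worth having as a helper (a sketch; the function name is ours):

```python
# (m × k) @ (k × n) matmul: m·n output elements, each a length-k multiply-accumulate chain.
def matmul_flops(m, k, n):
    return 2 * m * k * n

# One token's QKV projections in a d = 4096 model:
# three (1 × d) @ (d × d) matmuls at 2·d² FLOPs each.
d = 4096
print(3 * matmul_flops(1, d, d))  # = 6·d²
```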
A.5 Transformer Block FLOPs Breakdown
A.5.1 The formula
For a single Transformer block processing a sequence of length s with hidden dimension d (and FFN dimension 4d), the FLOPs per token during the forward pass are:

FLOPs per token ≈ 24 × d² + 4 × s × d

The two terms reflect two different scaling regimes:
- 24 × d²: matrix multiplications, which scale quadratically with d
- 4 × s × d: Attention score and weighted-value computation, which scales quadratically with sequence length across the full sequence
A.5.2 Component breakdown
| Component | FLOPs per token | Formula |
|---|---|---|
| Attention QKV projections | 6 × d² | Three d × d matmuls |
| Attention scores (QKᵀ) | 2 × s × d | s dot products of length d |
| Attention-weighted values (A · V) | 2 × s × d | Weighted sum of s value vectors |
| Attention output projection | 2 × d² | One d × d matmul |
| FFN (two linear layers) | 8 × d² + 8 × d² | d → 4d, then 4d → d |
| Total | 24 × d² + 4 × s × d | |
Softmax, LayerNorm, and GELU add a few percent on top, but they are omitted from the leading-order formula.
A.5.3 When does Attention dominate?
At short sequences the FFN dominates: it costs 16 × d² per token versus Attention's 4 × s × d for the score and weighted-value computation. The crossover is at:

4 × s × d = 16 × d² → s = 4 × d

For d = 4096 (LLaMA-7B), the crossover is at s = 16,384 tokens. Below that, FFN dominates. Above that, the quadratic Attention cost takes over, which is exactly why long-context inference is expensive and why architectures like linear Attention, sliding-window Attention, and sparse Attention exist.
A.5.4 Worked example — LLaMA-7B single forward pass
LLaMA-7B: d = 4096, L = 32, s = 2048
Per block, per token:

24 × 4096² + 4 × 2048 × 4096 ≈ 4.03×10⁸ + 0.34×10⁸ ≈ 4.4×10⁸ FLOPs

Across all 32 blocks:

32 × 4.4×10⁸ ≈ 1.4×10¹⁰ FLOPs per token ≈ 2N for N = 7×10⁹

Compare that to the 6ND estimate for 1 trillion training tokens: 6 × (7×10⁹) × 10¹² = 4.2×10²². There are 10¹² / 2048 ≈ 4.9×10⁸ sequences, and 4.9×10⁸ × 2048 × 3 × 1.4×10¹⁰ ≈ 4.2×10²² (the factor of 3 accounts for forward + backward). The two estimates agree, which is a good sign.
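The agreement is easy to reproduce — a sketch under the LLaMA-7B config, ignoring softmax, norms, and head bookkeeping:

```python
# Leading-order forward FLOPs per token for one Transformer block: 24 d² + 4 s d.
def block_flops_per_token(d_model, seq_len):
    return 24 * d_model ** 2 + 4 * seq_len * d_model

per_token = 32 * block_flops_per_token(4096, 2048)  # all 32 LLaMA-7B blocks
print(f"{per_token:.2e}")  # ≈ 1.4e10, i.e. the 2N rule for N = 7e9
```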
A.6 Chinchilla Optimality
A.6.1 What DeepMind found
In 2022, DeepMind published the Chinchilla paper, which ran a more systematic study over a wider range of model sizes and token counts than the 2020 OpenAI paper. Their finding revised the conventional wisdom:
The optimal allocation of a fixed compute budget is to scale parameters and data in equal proportion.
The GPT-3 approach — train a very large model on a relatively small dataset — is compute-suboptimal. You would get lower loss by training a smaller model on more data with the same compute.
A.6.2 The Chinchilla formula
D_opt ≈ 20 × N_opt

Training token count should be roughly 20 times the parameter count. This is a rule of thumb derived from fitting the Chinchilla scaling curves, not a hard physical constant.
A.6.3 Who is undertrained, who is overtrained, and why it matters
| Model | N | D (tokens trained) | Chinchilla optimal D | Status |
|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | 3.5T | Undertrained per Chinchilla |
| Chinchilla 70B | 70B | 1.4T | 1.4T | Optimal by definition |
| LLaMA-1 7B | 7B | 1T | 140B | Overtrained (intentionally) |
| LLaMA-2 70B | 70B | 2T | 1.4T | Overtrained (intentionally) |
"Overtrained" is not a bug in the LLaMA case — it is a design choice. If your goal is the best possible inference quality at a fixed serving cost, you want to train a smaller model longer. A 7B model running at 100 tokens/second is more practical than a 70B model running at 12 tokens/second, even if the 70B is nominally Chinchilla-optimal.
This distinction — train-compute optimality vs inference efficiency — is one of the most important engineering tradeoffs in large model deployment.
A.7 Practical Resource Planning
A.7.1 Starting from a compute budget
Given a compute budget in FLOPs, the Chinchilla-optimal allocation is:
```python
# Pseudocode: Chinchilla-optimal resource planning
def plan_training(budget_flops):
    # From C ≈ 6ND and D ≈ 20N:
    # C ≈ 6N × 20N = 120N²
    # N = sqrt(C / 120)
    N = (budget_flops / 120) ** 0.5  # parameter count
    D = 20 * N                       # token count
    return N, D
```
At C = 10²³ FLOPs (roughly ten A100-years of peak BF16 throughput):

N = sqrt(10²³ / 120) ≈ 2.9×10¹⁰ parameters ≈ 29B
D = 20 × N ≈ 5.8×10¹¹ tokens ≈ 580B
That is roughly the regime of Mistral-22B or early Falcon-40B — frontier-adjacent but not GPT-4 scale.
A.7.2 Loss prediction
The OpenAI paper also provides an empirical prediction formula for loss given compute:

L(C) ≈ (C_c / C)^0.050, with C measured in PF-days and fitted constant C_c ≈ 3.1×10⁸
This lets you predict final loss before training. The predictions are not exact — they assume you are near the efficiency frontier — but they are useful for go/no-go decisions before committing a cluster.
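As code — with the caveat that the fitted constant is the value commonly quoted from Kaplan et al. and should be treated as an assumption here:

```python
# Predicted loss at the compute-efficient frontier: L(C) ≈ (C_c / C)^0.050,
# with C in PF-days and C_c ≈ 3.1e8 (fitted constant, assumed from the paper).
PF_DAY = 1e15 * 86400  # FLOPs in one petaFLOP-day

def predicted_loss(compute_flops, c_c=3.1e8, alpha=0.050):
    return (c_c / (compute_flops / PF_DAY)) ** alpha

# GPT-3-scale compute: 6 × 175e9 × 300e9 ≈ 3.15e23 FLOPs (N and D from the table in A.6.3)
print(f"{predicted_loss(3.15e23):.2f}")
```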
A.7.3 Small-model vs large-model strategies
| Strategy | Strength | Weakness |
|---|---|---|
| Large model, less data | High capability ceiling | Expensive inference, undertrained quality |
| Small model, more data | Fast, cheap serving | Lower capability ceiling |
| Chinchilla-optimal | Best loss per FLOP during training | May not be best for deployment |
| Intentionally overtrained small model | Best quality per serving FLOP | More expensive to train per parameter |
The right choice depends on whether you optimize for training cost or serving cost. At scale, serving cost usually dominates because you run inference continuously but train once.
A.8 Key Results Summary
- Scaling Laws: loss follows power laws in N, D, and C with exponents 0.076, 0.095, and 0.050 respectively.
- Parameter estimation: N ≈ 12 × L × d_model², accurate to within a few percent for models above 1B.
- Training compute: C ≈ 6 × N × D; factor of 6 from 2N forward + 4N backward.
- Inference compute: C ≈ 2 × N per generated token.
- Chinchilla optimality: D ≈ 20 × N, but small models trained on more data often beat this at serving time.
- Training time: time ≈ C / (GPU count × peak FLOPs/s × utilization).
- Block FLOPs: 24 × d² + 4 × s × d per block per token; FFN-dominated at short context, Attention-dominated above s = 4 × d.
Further Reading
- Scaling Laws for Neural Language Models — Kaplan et al., OpenAI, 2020. The original paper.
- Training Compute-Optimal Large Language Models — Hoffmann et al., DeepMind, 2022. The Chinchilla paper that revised the field.
- Scaling Laws for Autoregressive Generative Modeling — Henighan et al., OpenAI, 2020. Extended scaling studies across modalities.
If you can estimate the training cost of a 70B run without reaching for a calculator, you have internalized scaling laws. Appendix B dives into decoding strategies.