One-sentence summary: Scaling laws reveal that language model loss falls predictably as a power function of model size, dataset size, and compute — which means you can estimate the cost of a 70B training run on the back of a napkin, and you should.


A.1 What Is a Scaling Law?

A.1.1 The counterintuitive finding

In 2020, OpenAI published what became known as the Scaling Laws paper. The central result surprised people who assumed that architectural cleverness would drive progress:

Language model loss (measured by cross-entropy on held-out text) follows a power-law relationship with three quantities: parameter count N, dataset size D, and compute C.

This is not a claim that larger is always better. It is a claim that the relationship is predictable. Given a fixed budget, you can calculate, before training, roughly what loss you will get. That prediction is useful.

A.1.2 The three power-law equations

The core equations from the OpenAI paper:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

Where:

  • L is the model's cross-entropy loss (lower is better)
  • N is the number of parameters, D is the training token count, C is total compute in FLOPs
  • N_c, D_c, C_c are fitted constants
  • α_N, α_D, α_C are the power-law exponents

The empirically fitted exponents:

  • α_N ≈ 0.076
  • α_D ≈ 0.095
  • α_C ≈ 0.050

The exponents are small, which means you need to scale aggressively to see large loss improvements. Doubling parameters does not halve loss — it reduces it by roughly 5%.
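A quick numeric check of that claim, using the α_N exponent above (the script is illustrative, not from the paper):

```python
# Relative loss after doubling parameter count N, under the power law
# L(N) = (N_c / N)^alpha_N. The fitted constant N_c cancels in the ratio.
alpha_N = 0.076

ratio = 2 ** (-alpha_N)   # L(2N) / L(N)
improvement = 1 - ratio   # fractional loss reduction from doubling N

print(f"L(2N)/L(N) = {ratio:.4f} -> about {improvement:.1%} lower loss")
```

The same arithmetic shows that a 10× scale-up in N cuts loss by only about 16% (10^(-0.076) ≈ 0.84).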

A.1.3 What the curve looks like

On a log-log plot, each relationship is approximately a straight line with a negative slope equal to the exponent. That straight line in log-log space is the power law:

Loss
 |  \
 |   \
 |    \
 |     \____
 |          \____
 |               \____
 +-----------------------------> log(N  or  D  or  C)

The lines do not flatten at any scale observed so far. That is what keeps frontier labs scaling.
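To see why "straight line in log-log space" and "power law" are the same statement, take logs: log L = α_N(log N_c − log N), which is linear in log N with slope −α_N. A minimal sketch (the N_c value is assumed from the Kaplan et al. fit; the measured slope does not depend on it):

```python
import math

alpha_N, N_c = 0.076, 8.8e13  # exponent from the text; N_c assumed from the fit

def L(N):
    # Power-law loss as a function of parameter count
    return (N_c / N) ** alpha_N

# Slope of log L vs log N between arbitrary pairs of points
slopes = []
for N1, N2 in [(1e6, 1e8), (1e8, 1e10), (1e10, 1e12)]:
    slope = (math.log(L(N2)) - math.log(L(N1))) / (math.log(N2) - math.log(N1))
    slopes.append(slope)

print(slopes)  # every pair gives the same slope, -alpha_N
```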


A.2 Parameter Count Estimation

A.2.1 Where the parameters live in a Transformer

A standard dense Transformer's parameters come from three places:

Embedding layer

Token embeddings:    vocab_size × d_model
Position embeddings: max_seq_len × d_model   (if learned; omitted for RoPE)

Each Transformer block (the bulk of the model)

Multi-Head Attention:
  W_Q: d_model × d_model
  W_K: d_model × d_model
  W_V: d_model × d_model
  W_O: d_model × d_model
  Subtotal: 4 × d_model²

Feed Forward Network (FFN, expansion ratio 4):
  W_1: d_model × 4·d_model
  W_2: 4·d_model × d_model
  Subtotal: 8 × d_model²

LayerNorm (×2):
  γ, β per layer: 4 × d_model  (negligible)

Per block total:  12 × d_model²

Output head

LM head: d_model × vocab_size
(usually tied to the embedding matrix, so no extra parameters)

A.2.2 The simplified estimation formula

For a model with L layers and hidden dimension d_model, the dominant term is:

N \approx 12 \times L \times d_{\text{model}}^2

The embedding and output layers are subdominant for large models and can be ignored in rough estimates.

Quick sanity check on GPT-3 175B: L = 96, d_model = 12288

12 \times 96 \times 12288^2 = 12 \times 96 \times 150{,}994{,}944 \approx 174\text{B} \checkmark
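The same recipe as a helper function (a sketch; `estimate_params` and its signature are mine, not a published API):

```python
def estimate_params(n_layers, d_model, vocab_size=0, tied_head=True):
    # 12 * d_model^2 per block: 4*d^2 attention + 8*d^2 FFN,
    # plus embeddings; the LM head is free when tied to the embedding.
    blocks = 12 * n_layers * d_model ** 2
    embed = vocab_size * d_model
    head = 0 if tied_head else vocab_size * d_model
    return blocks + embed + head

# GPT-3 175B: 96 layers, d_model = 12288, ignoring embeddings
print(f"{estimate_params(96, 12288) / 1e9:.0f}B")  # 174B vs the actual 175B
```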

A.2.3 Verification table across six models

Model          L    d_model   Predicted N   Actual N   Match
GPT-2 Small    12   768       85M           117M       ~73%
GPT-2 Medium   24   1024      302M          345M       ~88%
GPT-2 Large    36   1280      709M          762M       ~93%
GPT-2 XL       48   1600      1.47B         1.5B       ~98%
GPT-3 Small    12   768       85M           125M       ~68%
GPT-3 175B     96   12288     174B          175B       ~99%

The formula is less accurate for small models where embedding parameters are a larger fraction of the total. For anything above 1B parameters the estimate is within a few percent. This is the regime where scaling law predictions are also most reliable.


A.3 Compute Estimation

A.3.1 What a FLOP is

FLOPs (Floating Point Operations) counts the number of floating-point additions and multiplications a computation requires.

The most important primitive: a matrix multiply of shape A[m,k] × B[k,n] costs exactly 2mnk FLOPs. There are m × n output elements, and each requires k multiply-accumulate pairs, i.e. k multiplies and k adds.
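The 2mnk rule can be confirmed by brute force with a naive triple-loop matmul that counts each multiply and add as it goes (illustrative code, mine):

```python
def matmul_flops(m, k, n):
    # m*n output elements, k multiply-accumulate pairs each, 2 FLOPs per pair
    return 2 * m * n * k

def naive_matmul(A, B):
    # Plain triple loop that also tallies every multiply and add it performs
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    ops = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]  # one multiply + one add
                ops += 2
    return C, ops

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3x2
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]    # 2x3
C, ops = naive_matmul(A, B)
print(ops, matmul_flops(3, 2, 3))  # 36 36
```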

A.3.2 Training compute formula

The standard training cost estimate is:

C_\text{train} \approx 6 \times N \times D

Where:

  • N is the number of model parameters
  • D is the number of training tokens
  • The constant 6 accounts for forward and backward passes:
    • Forward pass: each parameter participates in one multiply-add per token → 2N FLOPs per token
    • Backward pass: gradients w.r.t. activations plus gradients w.r.t. weights → 4N FLOPs per token
    • Total per token: 2N + 4N = 6N

Worked example — LLaMA-7B trained on 1 trillion tokens:

C = 6 \times 7 \times 10^9 \times 10^{12} = 4.2 \times 10^{22} \text{ FLOPs}

That is 42 zettaFLOPs. Written out: 42,000,000,000,000,000,000,000 operations.
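As code (function name mine):

```python
def train_flops(n_params, n_tokens):
    # 6 FLOPs per parameter per token: 2N forward + 4N backward
    return 6 * n_params * n_tokens

c = train_flops(7e9, 1e12)  # LLaMA-7B on 1T tokens
print(f"{c:.1e} FLOPs")     # 4.2e+22
```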

A.3.3 Inference compute formula

Inference only runs the forward pass and generates one token at a time:

C_\text{inference} \approx 2 \times N \times T_\text{generated}

Example — LLaMA-7B generating 100 tokens:

C = 2 \times 7 \times 10^9 \times 100 = 1.4 \times 10^{12} \text{ FLOPs} = 1.4 \text{ TFLOPs}

A single H100 can do that in under 2 milliseconds at peak throughput. The bottleneck in autoregressive generation is not compute — it is memory bandwidth, which is why KV caching (Chapter 22) and quantization (Chapter 27) matter so much for serving latency.
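A sketch of the compute-only latency bound (names and the peak-throughput assumption are mine; real decoding is slower because it is bandwidth-bound, as noted above):

```python
def inference_flops(n_params, tokens_generated):
    # Forward pass only: ~2N FLOPs per generated token
    return 2 * n_params * tokens_generated

H100_BF16 = 990e12  # peak FLOP/s, per the spec cited in A.3.4

flops = inference_flops(7e9, 100)
print(f"{flops:.1e} FLOPs, {flops / H100_BF16 * 1e3:.2f} ms at peak")
```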

A.3.4 GPU specs and training time

Representative GPU specs (FP16 / BF16 Tensor Core throughput):

GPU          Throughput                                VRAM     Notes
RTX 3090     35 TFLOPs FP32 / 142 TFLOPs FP16 sparse   24 GB    Consumer; training on this is slow
RTX 4090     83 TFLOPs FP32 / 330 TFLOPs FP16 sparse   24 GB    Best consumer card; fine-tuning only, not pretraining
A100 80GB    312 TFLOPs BF16                           80 GB    Datacenter standard; backbone of most labs for years
H100 80GB    990 TFLOPs BF16                           80 GB    Roughly 3× A100 training throughput
H200 141GB   990 TFLOPs BF16                           141 GB   Same compute as H100; extra memory for larger batches and KV cache

The sparse FP16 numbers for consumer GPUs assume structured sparsity that most training workloads cannot exploit. Use the dense FP32 numbers as the practical ceiling for unoptimized training.

Training time formula:

\text{Days} = \frac{C_\text{train}}{\text{GPUs} \times \text{FLOP/s per GPU} \times \text{utilization} \times 86400}

Realistic utilization is 0.4 to 0.5 — communication overhead, data loading, occasional restarts, and checkpointing all cut into peak throughput.

Example — LLaMA-7B on 1000 A100 80GB GPUs:

\text{Days} = \frac{4.2 \times 10^{22}}{1000 \times 312 \times 10^{12} \times 0.4 \times 86400} \approx \frac{4.2 \times 10^{22}}{1.08 \times 10^{22}} \approx 3.9 \text{ days}

At spot pricing around $1.30/GPU-hour, that run costs roughly:

1000 \times 3.9 \times 24 \times \$1.30 \approx \$122{,}000

GPT-3 at 175B parameters with 300B tokens is closer to $4–5M on 2020 hardware — which explains why only a handful of organizations trained it.
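The training-time and cost arithmetic above, packaged as helpers (the function names and the $1.30/GPU-hour rate are the assumptions from this section):

```python
def training_days(total_flops, n_gpus, flops_per_gpu, utilization=0.4):
    # Wall-clock days = total FLOPs / (effective cluster FLOP/s * 86400 s/day)
    effective = n_gpus * flops_per_gpu * utilization
    return total_flops / (effective * 86400)

def training_cost_usd(days, n_gpus, usd_per_gpu_hour=1.30):
    return n_gpus * days * 24 * usd_per_gpu_hour

days = training_days(4.2e22, n_gpus=1000, flops_per_gpu=312e12)  # A100 BF16
cost = training_cost_usd(days, n_gpus=1000)
print(f"{days:.1f} days, ~${cost:,.0f}")
```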


A.4 Per-Operation FLOPs Quick Reference

Before estimating a full Transformer block, you need the cost of each primitive.

Operation                               FLOPs
Matrix multiply A[m,k] × B[k,n]         2mnk
Vector dot product (length n)           2n
Softmax over a vector of length L       ≈ 5L (about 5 per element)
LayerNorm over a vector of size d       ≈ 8d
GELU activation                         ≈ 4 per element

Softmax costs roughly 5 operations per element: subtract max, exponentiate, sum, divide, plus bookkeeping. LayerNorm costs roughly 8: mean, variance, normalize, scale, shift, plus a few bookkeeping ops. These are small compared to matrix multiplies in typical configurations, but they add up at long context lengths.


A.5 Transformer Block FLOPs Breakdown

A.5.1 The formula

For a single Transformer block processing a sequence of length s with hidden dimension d (and FFN dimension 4d), the forward-pass FLOPs for the full sequence are:

\text{FLOPs per block per sequence} \approx 24sd^2 + 4s^2 d

The two terms reflect two different scaling regimes:

  • 24sd²: matrix multiplications (projections and FFN), which scale quadratically with d
  • 4s²d: Attention score computation and value aggregation, which scale quadratically with sequence length s

A.5.2 Component breakdown

Component                      FLOPs per sequence   Formula
Attention QKV projections      6sd²                 Three [s,d] × [d,d] matmuls
Attention scores (QKᵀ)         2s²d                 [s,d] × [d,s]
Attention over values (A·V)    2s²d                 [s,s] × [s,d]
Attention output projection    2sd²                 [s,d] × [d,d]
FFN (two linear layers)        16sd²                [s,d] × [d,4d] + [s,4d] × [4d,d]
Total                          24sd² + 4s²d

Softmax, LayerNorm, and GELU add a few percent on top, but they are omitted from the leading-order formula.

A.5.3 When does Attention dominate?

At short sequences the FFN dominates: it costs 16sd² versus 4s²d for the quadratic Attention terms. The crossover is at:

s ≈ 4d

For d_model = 4096 (LLaMA-7B), the crossover is at s ≈ 16,384 tokens. Below that, the FFN dominates. Above that, the quadratic Attention cost takes over, which is exactly why long-context inference is expensive and why architectures like linear Attention, sliding-window Attention, and sparse Attention exist.
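The crossover can be checked numerically (a sketch; `block_flops` is my name for the leading-order formula above):

```python
def block_flops(s, d):
    # Leading-order forward FLOPs for one block over a length-s sequence
    matmul_term = 24 * s * d ** 2     # QKV, output projection, FFN
    attention_term = 4 * s ** 2 * d   # scores + attention-weighted values
    return matmul_term, attention_term

d = 4096  # LLaMA-7B hidden size
for s in (2048, 16384, 65536):
    m, a = block_flops(s, d)
    print(f"s={s:>6}: quadratic Attention term is {a / (m + a):.0%} of the block")
```

At s = 4d the Attention term is 40% of the block, not 50%, because the crossover compares it against the 16sd² FFN alone rather than all 24sd² of matmuls.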

A.5.4 Worked example — LLaMA-7B single forward pass

LLaMA-7B: L = 32, d = 4096, s = 2048

Per block, for one 2048-token sequence:

24 \times 2048 \times 4096^2 + 4 \times 2048^2 \times 4096
= 24 \times 2048 \times 16{,}777{,}216 + 4 \times 4{,}194{,}304 \times 4096
\approx 8.25 \times 10^{11} + 6.87 \times 10^{10} \approx 8.94 \times 10^{11} \text{ FLOPs per block}

Across all 32 blocks:

32 \times 8.94 \times 10^{11} \approx 2.86 \times 10^{13} \text{ FLOPs for one 2048-token sequence}

Compare that to the C = 6ND estimate for 1 trillion training tokens: 4.2 × 10^22. There are 10^12 / 2048 ≈ 4.9 × 10^8 sequences, and 3 × 2.86 × 10^13 × 4.9 × 10^8 ≈ 4.2 × 10^22 (the factor of 3 converts the forward-only count to forward + backward). The two estimates agree, which is a good sign.
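The cross-check as a script (variable names mine):

```python
# Detailed per-block count vs the C = 6ND shortcut, for LLaMA-7B
L, d, s = 32, 4096, 2048   # layers, hidden size, sequence length
N, D = 7e9, 1e12           # parameters, training tokens

per_block = 24 * s * d ** 2 + 4 * s ** 2 * d
forward_per_seq = L * per_block                # ~2.86e13 FLOPs per sequence
n_sequences = D / s                            # ~4.9e8 sequences in 1T tokens
detailed = 3 * forward_per_seq * n_sequences   # x3: forward + backward

shortcut = 6 * N * D
print(f"detailed {detailed:.2e} vs 6ND {shortcut:.2e}")  # both ~4.2e22
```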


A.6 Chinchilla Optimality

A.6.1 What DeepMind found

In 2022, DeepMind published the Chinchilla paper, which ran a more systematic study over a wider range of model sizes and token counts than the 2020 OpenAI paper. Their finding revised the conventional wisdom:

The optimal allocation of a fixed compute budget is to scale parameters and data in equal proportion.

The GPT-3 approach — train a very large model on a relatively small dataset — is compute-suboptimal. You would get lower loss by training a smaller model on more data with the same compute.

A.6.2 The Chinchilla formula

D_\text{optimal} \approx 20 \times N

Training token count should be roughly 20 times the parameter count. This is a rule of thumb derived from fitting the Chinchilla scaling curves, not a hard physical constant.
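A small helper makes the rule of thumb concrete (the function names and the ±20% tolerance band are my choices, not from the paper):

```python
def chinchilla_tokens(n_params):
    # Compute-optimal training tokens: roughly 20 per parameter
    return 20 * n_params

def training_status(n_params, tokens_trained, tolerance=0.2):
    # Classify a run relative to the 20-tokens-per-parameter rule
    optimal = chinchilla_tokens(n_params)
    if tokens_trained < (1 - tolerance) * optimal:
        return "undertrained"
    if tokens_trained > (1 + tolerance) * optimal:
        return "overtrained"
    return "near-optimal"

print(training_status(175e9, 300e9))   # GPT-3
print(training_status(70e9, 1.4e12))   # Chinchilla
print(training_status(7e9, 1e12))      # LLaMA-1 7B
```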

A.6.3 Who is undertrained, who is overtrained, and why it matters

Model            N      D (tokens trained)   Chinchilla-optimal D   Status
GPT-3 175B       175B   300B                 3.5T                   Undertrained per Chinchilla
Chinchilla 70B   70B    1.4T                 1.4T                   Optimal by definition
LLaMA-1 7B       7B     1T                   140B                   Overtrained (intentionally)
LLaMA-2 70B      70B    2T                   1.4T                   Overtrained (intentionally)

"Overtrained" is not a bug in the LLaMA case — it is a design choice. If your goal is the best possible inference quality at a fixed serving cost, you want to train a smaller model longer. A 7B model running at 100 tokens/second is more practical than a 70B model running at 12 tokens/second, even if the 70B is nominally Chinchilla-optimal.

This distinction — train-compute optimality vs inference efficiency — is one of the most important engineering tradeoffs in large model deployment.


A.7 Practical Resource Planning

A.7.1 Starting from a compute budget

Given a compute budget CbudgetC_\text{budget} in FLOPs, the Chinchilla-optimal allocation is:

# Pseudocode: Chinchilla-optimal resource planning
def plan_training(budget_flops):
    # From C ≈ 6ND and D ≈ 20N:
    #   C ≈ 6N × 20N = 120N², so N ≈ sqrt(C / 120)
    N = (budget_flops / 120) ** 0.5  # parameter count
    D = 20 * N                       # token count
    return N, D

At C = 10^23 FLOPs (roughly 1000 A100s for ten days at 40% utilization):

N \approx \sqrt{10^{23} / 120} \approx 2.9 \times 10^{10} \approx 29\text{B parameters}
D \approx 20 \times 29\text{B} = 580\text{B tokens}

That is roughly the regime of Mistral-22B or early Falcon-40B — frontier-adjacent but not GPT-4 scale.

A.7.2 Loss prediction

The OpenAI paper also provides an empirical prediction formula for loss given compute:

L(C) \approx 1.69 \times C^{-0.048}

This lets you predict final loss before training. The predictions are not exact — they assume you are near the efficiency frontier — but they are useful for go/no-go decisions before committing a cluster.

A.7.3 Small-model vs large-model strategies

Strategy                                Strength                             Weakness
Large model, less data                  High capability ceiling              Expensive inference; undertrained quality
Small model, more data                  Fast, cheap serving                  Lower capability ceiling
Chinchilla-optimal                      Best loss per FLOP during training   May not be best for deployment
Intentionally overtrained small model   Best quality per serving FLOP        More expensive to train per parameter

The right choice depends on whether you optimize for training cost or serving cost. At scale, serving cost usually dominates because you run inference continuously but train once.


A.8 Key Results Summary

  1. Scaling Laws: loss follows power laws in N, D, and C with exponents ≈0.076, 0.095, and 0.050 respectively.

  2. Parameter estimation: N ≈ 12 × L × d_model², accurate to within a few percent for models above 1B.

  3. Training compute: C_train ≈ 6ND; the factor of 6 is 2N forward + 4N backward per token.

  4. Inference compute: C_inference ≈ 2N × T_generated.

  5. Chinchilla optimality: D_optimal ≈ 20N, but small models trained on more data often beat this at serving time.

  6. Training time: Days = C / (GPUs × FLOP/s per GPU × utilization × 86400), with realistic utilization 0.4 to 0.5.

  7. Block FLOPs: 24sd² + 4s²d per block per sequence; FFN-dominated at short context, Attention-dominated above s ≈ 4d.


Further Reading

  • Scaling Laws for Neural Language Models — Kaplan et al., OpenAI, 2020. The original paper.
  • Training Compute-Optimal Large Language Models — Hoffmann et al., DeepMind, 2022. The Chinchilla paper that revised the field.
  • Scaling Laws for Autoregressive Generative Modeling — Henighan et al., OpenAI, 2020. Extended scaling studies across modalities.

If you can estimate the training cost of a 70B run without reaching for a calculator, you have internalized scaling laws. Appendix B dives into decoding strategies.

Cite this page
Zhang, Wayland (2026). Appendix A: Scaling Laws and Compute. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/appendix-a-scaling-laws-compute
@incollection{zhang2026transformer_appendix_a_scaling_laws_compute,
  author = {Zhang, Wayland},
  title = {Appendix A: Scaling Laws and Compute},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/appendix-a-scaling-laws-compute}
}