One-sentence summary: Scaling laws reveal that language model loss falls predictably as a power function of model size, dataset size, and compute — which means you can estimate the cost of a 70B training run on the back of a napkin, and you should.
A.1 What Is a Scaling Law?
A.1.1 The counterintuitive finding
In 2020, OpenAI published what became known as the Scaling Laws paper. The central result surprised people who assumed that architectural cleverness would drive progress:
Language model loss (measured by cross-entropy on held-out text) follows a power-law relationship with three quantities: parameter count N, dataset size D, and compute C.
This is not a claim that larger is always better. It is a claim that the relationship is predictable. Given a fixed budget, you can calculate, before training, roughly what loss you will get. That prediction is useful.
A.1.2 The three power-law equations
The core equations from the OpenAI paper:

L(N) = (N_c / N)^α_N
L(D) = (D_c / D)^α_D
L(C) = (C_c / C)^α_C

Where:
- L is the model's cross-entropy loss (lower is better)
- N is the number of parameters, D is the training token count, C is total compute in FLOPs
- N_c, D_c, C_c are fitted constants
- α is the power-law exponent for each quantity
The empirically fitted exponents:

α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050
The exponents are small, which means you need to scale aggressively to see large loss improvements. Doubling parameters does not halve loss — it reduces it by roughly 5%.
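That 5% figure follows directly from the exponent. A minimal sketch, using the exponents quoted in this appendix, makes the arithmetic explicit:

```python
# How much does loss fall when a scaled quantity doubles under a power law?
# The loss ratio after doubling is 2^(-exponent), so the reduction is 1 - 2^(-exponent).
def loss_reduction_from_doubling(exponent):
    return 1 - 2 ** (-exponent)

for name, alpha in [("N (parameters)", 0.076), ("D (tokens)", 0.095), ("C (compute)", 0.050)]:
    print(f"doubling {name}: loss falls by {loss_reduction_from_doubling(alpha):.1%}")
```

Doubling parameters buys about 5.1%, doubling data about 6.4%, doubling compute about 3.4% — which is why budgets scale in orders of magnitude, not factors of two.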
A.1.3 What the curve looks like
On a log-log plot, each relationship is approximately a straight line with a negative slope equal to the exponent. That straight line in log-log space is the power law:
Loss
| \
| \
| \
| \____
| \____
| \____
+-----------------------------> log(N or D or C)
The lines do not flatten at any scale observed so far. That is what keeps frontier labs scaling.
A.2 Parameter Count Estimation
A.2.1 Where the parameters live in a Transformer
A standard dense Transformer's parameters come from three places:
Embedding layer
Token embeddings: vocab_size × d_model
Position embeddings: max_seq_len × d_model (if learned; omitted for RoPE)
Each Transformer block (the bulk of the model)
Multi-Head Attention:
W_Q: d_model × d_model
W_K: d_model × d_model
W_V: d_model × d_model
W_O: d_model × d_model
Subtotal: 4 × d_model²
Feed Forward Network (FFN, expansion ratio 4):
W_1: d_model × 4·d_model
W_2: 4·d_model × d_model
Subtotal: 8 × d_model²
LayerNorm (×2 per block):
γ, β for each of the two norms: 2 × 2 × d_model = 4 × d_model (negligible)
Per block total: ≈ 12 × d_model²
Output head
LM head: d_model × vocab_size
(usually tied to the embedding matrix — no extra parameters)
A.2.2 The simplified estimation formula
For a model with L layers and hidden dimension d_model, the dominant term is:

N ≈ 12 × L × d_model²
The embedding and output layers are subdominant for large models and can be ignored in rough estimates.
Quick sanity check on GPT-3 175B: L = 96, d_model = 12288 → N ≈ 12 × 96 × 12288² ≈ 173B, within about 1% of the reported 175B.
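The same estimate as a two-line function — a sketch, with the function name ours and configs taken from the published model cards:

```python
# Dominant-term parameter estimate: ignores embeddings and the LM head.
def estimate_params(n_layers, d_model):
    return 12 * n_layers * d_model ** 2

print(f"GPT-3 175B: {estimate_params(96, 12288) / 1e9:.0f}B")  # ≈ 174B
print(f"GPT-2 XL:   {estimate_params(48, 1600) / 1e9:.2f}B")   # ≈ 1.47B
```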
A.2.3 Verification table across six models
| Model | L | d_model | Predicted N | Actual N | Match |
|---|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 85M | 117M | ~73% |
| GPT-2 Medium | 24 | 1024 | 302M | 345M | ~88% |
| GPT-2 Large | 36 | 1280 | 709M | 762M | ~93% |
| GPT-2 XL | 48 | 1600 | 1.47B | 1.5B | ~98% |
| GPT-3 Small | 12 | 768 | 85M | 125M | ~68% |
| GPT-3 175B | 96 | 12288 | 173B | 175B | ~99% |
The formula is less accurate for small models where embedding parameters are a larger fraction of the total. For anything above 1B parameters the estimate is within a few percent. This is the regime where scaling law predictions are also most reliable.
A.3 Compute Estimation
A.3.1 What a FLOP is
FLOPs (floating-point operations) count the floating-point additions and multiplications a computation requires.
The most important primitive: a matrix multiply of an (m × k) matrix by a (k × n) matrix costs exactly 2 × m × k × n FLOPs. There are m × n output elements, and each requires k multiply-accumulate pairs (one multiply and one add).
A.3.2 Training compute formula
The standard training cost estimate is:

C_train ≈ 6 × N × D

Where:
- N is the number of model parameters
- D is the number of training tokens
- The constant 6 accounts for forward and backward passes:
  - Forward pass: each parameter participates in one multiply-add per token → 2N FLOPs per token
  - Backward pass: gradient of loss w.r.t. activations plus gradient w.r.t. weights → 4N FLOPs per token
  - Total per token: 2N + 4N = 6N
Worked example, LLaMA-7B trained on 1 trillion tokens:

C ≈ 6 × (7×10⁹) × 10¹² = 4.2×10²² FLOPs
That is 42 zettaFLOPs. Written out: 42,000,000,000,000,000,000,000 operations.
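The C ≈ 6ND rule as a sanity-checkable helper (a sketch, nothing more):

```python
# Training compute: 2N FLOPs/token forward + 4N FLOPs/token backward = 6N per token.
def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

print(f"{training_flops(7e9, 1e12):.1e}")  # 4.2e+22
```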
A.3.3 Inference compute formula
Inference only runs the forward pass and generates one token at a time:

C_inference ≈ 2 × N FLOPs per generated token

Example, LLaMA-7B generating 100 tokens:

C ≈ 2 × (7×10⁹) × 100 = 1.4×10¹² FLOPs
A single H100 can do that in under 2 milliseconds at peak throughput. The bottleneck in autoregressive generation is not compute — it is memory bandwidth, which is why KV caching (Chapter 22) and quantization (Chapter 27) matter so much for serving latency.
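A rough latency estimate from these numbers — a sketch, using the 990 TFLOPs H100 peak from the spec table in the next subsection, which real serving never reaches:

```python
# Forward-pass-only inference cost: ≈ 2N FLOPs per generated token (prompt excluded).
def inference_flops(n_params, n_new_tokens):
    return 2 * n_params * n_new_tokens

H100_PEAK_FLOPS = 990e12  # BF16 Tensor Core peak
c = inference_flops(7e9, 100)
print(f"{c:.1e} FLOPs -> {c / H100_PEAK_FLOPS * 1e3:.2f} ms at peak")
```

The compute-bound lower bound is under 2 ms; observed latency is far higher because the weights must stream from HBM for every token.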
A.3.4 GPU specs and training time
Representative GPU specs (FP16 / BF16 Tensor Core throughput):
| GPU | Peak TFLOPs (precision noted) | VRAM | Notes |
|---|---|---|---|
| RTX 3090 | 35 TFLOPs (FP32) / 142 (FP16 sparse) | 24 GB | Consumer; training on this is slow |
| RTX 4090 | 83 TFLOPs (FP32) / 330 (FP16 sparse) | 24 GB | Best consumer card; fine-tuning only at scale |
| A100 80GB | 312 TFLOPs (BF16) | 80 GB | Datacenter standard; two-year backbone of most labs |
| H100 80GB | 990 TFLOPs (BF16) | 80 GB | 3× A100 for training throughput |
| H200 141GB | 990 TFLOPs (BF16) | 141 GB | Same compute as H100; extra memory for larger batch sizes and KV cache |
The sparse FP16 numbers for consumer GPUs assume structured sparsity that most training workloads cannot exploit. Use the dense FP32 numbers as the practical ceiling for unoptimized training.
Training time formula:

time (seconds) ≈ C_train / (GPU count × peak FLOPs/s per GPU × utilization)
Realistic utilization is 0.4 to 0.5 — communication overhead, data loading, occasional restarts, and checkpointing all cut into peak throughput.
Example, LLaMA-7B on 1000 A100 80GB GPUs:

time ≈ 4.2×10²² / (1000 × 3.12×10¹⁴ × 0.45) ≈ 3.0×10⁵ seconds ≈ 3.5 days
At spot pricing around $1.30/GPU-hour, that run costs roughly:

1000 GPUs × 83 hours × $1.30/GPU-hour ≈ $108,000
GPT-3 at 175B parameters with 300B tokens is closer to $4–5M on 2020 hardware — which explains why only a handful of organizations trained it.
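Putting the time and cost formulas together — a sketch in which the utilization and spot price are the assumptions stated above:

```python
# Wall-clock time and dollar cost for a run, given cluster size and utilization.
def training_time_and_cost(total_flops, n_gpus, peak_flops_per_gpu,
                           utilization=0.45, dollars_per_gpu_hour=1.30):
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * utilization)
    gpu_hours = n_gpus * seconds / 3600
    return seconds / 86400, gpu_hours * dollars_per_gpu_hour

days, cost = training_time_and_cost(4.2e22, 1000, 312e12)  # LLaMA-7B on A100s
print(f"{days:.1f} days, ${cost:,.0f}")
```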
A.4 Per-Operation FLOPs Quick Reference
Before estimating a full Transformer block, you need the cost of each primitive.
| Operation | FLOPs |
|---|---|
| Matrix multiply (m × k) · (k × n) | 2 × m × k × n |
| Vector dot product (length n) | 2 × n |
| Softmax along an axis of length n | ≈ 5 per element (≈ 5 × n total) |
| LayerNorm over a vector of size n | ≈ 8 × n |
| GELU activation (per element) | ≈ 8 FLOPs |
Softmax costs roughly 5 operations per element: subtract max, exponentiate, sum, divide, plus bookkeeping. LayerNorm costs roughly 8: mean, variance, normalize, scale, shift, plus a few bookkeeping ops. These are small compared to matrix multiplies in typical configurations, but they add up at long context lengths.
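Since everything in the next section reduces to the matmul rule, it is worth having as a helper (a sketch; the function name is ours):

```python
# (m × k) @ (k × n) matmul: m·n output elements, each a length-k multiply-accumulate chain.
def matmul_flops(m, k, n):
    return 2 * m * k * n

# One token's QKV projections in a d = 4096 model:
# three (1 × d) @ (d × d) matmuls at 2·d² FLOPs each.
d = 4096
print(3 * matmul_flops(1, d, d))  # = 6·d²
```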
A.5 Transformer Block FLOPs Breakdown
A.5.1 The formula
For a single Transformer block processing a sequence of length s with hidden dimension d (and FFN dimension 4d), the FLOPs per token during the forward pass are:

FLOPs per token ≈ 24 × d² + 4 × s × d

The two terms reflect two different scaling regimes:
- 24 × d²: matrix multiplications, which scale quadratically with d
- 4 × s × d: Attention score and weighted-value computation, which scales quadratically with sequence length across the full sequence
A.5.2 Component breakdown
| Component | FLOPs per token | Formula |
|---|---|---|
| Attention QKV projections | 6 × d² | Three d × d matmuls |
| Attention scores (QKᵀ) | 2 × s × d | s dot products of length d |
| Attention-weighted values (A · V) | 2 × s × d | Weighted sum of s value vectors |
| Attention output projection | 2 × d² | One d × d matmul |
| FFN (two linear layers) | 8 × d² + 8 × d² | d → 4d, then 4d → d |
| Total | 24 × d² + 4 × s × d | |
Softmax, LayerNorm, and GELU add a few percent on top, but they are omitted from the leading-order formula.
A.5.3 When does Attention dominate?
At short sequences the FFN dominates: it costs 16 × d² per token versus Attention's 4 × s × d for the score and weighted-value computation. The crossover is at:

4 × s × d = 16 × d² → s = 4 × d

For d = 4096 (LLaMA-7B), the crossover is at s = 16,384 tokens. Below that, FFN dominates. Above that, the quadratic Attention cost takes over, which is exactly why long-context inference is expensive and why architectures like linear Attention, sliding-window Attention, and sparse Attention exist.
A.5.4 Worked example — LLaMA-7B single forward pass
LLaMA-7B: d = 4096, L = 32, s = 2048
Per block, per token:

24 × 4096² + 4 × 2048 × 4096 ≈ 4.03×10⁸ + 0.34×10⁸ ≈ 4.4×10⁸ FLOPs

Across all 32 blocks:

32 × 4.4×10⁸ ≈ 1.4×10¹⁰ FLOPs per token ≈ 2N for N = 7×10⁹

Compare that to the 6ND estimate for 1 trillion training tokens: 6 × (7×10⁹) × 10¹² = 4.2×10²². There are 10¹² / 2048 ≈ 4.9×10⁸ sequences, and 4.9×10⁸ × 2048 × 3 × 1.4×10¹⁰ ≈ 4.2×10²² (the factor of 3 accounts for forward + backward). The two estimates agree, which is a good sign.
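The agreement is easy to reproduce — a sketch under the LLaMA-7B config, ignoring softmax, norms, and head bookkeeping:

```python
# Leading-order forward FLOPs per token for one Transformer block: 24 d² + 4 s d.
def block_flops_per_token(d_model, seq_len):
    return 24 * d_model ** 2 + 4 * seq_len * d_model

per_token = 32 * block_flops_per_token(4096, 2048)  # all 32 LLaMA-7B blocks
print(f"{per_token:.2e}")  # ≈ 1.4e10, i.e. the 2N rule for N = 7e9
```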
A.6 Chinchilla Optimality
A.6.1 What DeepMind found
In 2022, DeepMind published the Chinchilla paper, which ran a more systematic study over a wider range of model sizes and token counts than the 2020 OpenAI paper. Their finding revised the conventional wisdom:
The optimal allocation of a fixed compute budget is to scale parameters and data in equal proportion.
The GPT-3 approach — train a very large model on a relatively small dataset — is compute-suboptimal. You would get lower loss by training a smaller model on more data with the same compute.
A.6.2 The Chinchilla formula
D_opt ≈ 20 × N_opt

Training token count should be roughly 20 times the parameter count. This is a rule of thumb derived from fitting the Chinchilla scaling curves, not a hard physical constant.
A.6.3 Who is undertrained, who is overtrained, and why it matters
| Model | N | D (tokens trained) | Chinchilla optimal D | Status |
|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | 3.5T | Undertrained per Chinchilla |
| Chinchilla 70B | 70B | 1.4T | 1.4T | Optimal by definition |
| LLaMA-1 7B | 7B | 1T | 140B | Overtrained (intentionally) |
| LLaMA-2 70B | 70B | 2T | 1.4T | Overtrained (intentionally) |
"Overtrained" is not a bug in the LLaMA case — it is a design choice. If your goal is the best possible inference quality at a fixed serving cost, you want to train a smaller model longer. A 7B model running at 100 tokens/second is more practical than a 70B model running at 12 tokens/second, even if the 70B is nominally Chinchilla-optimal.
This distinction — train-compute optimality vs inference efficiency — is one of the most important engineering tradeoffs in large model deployment.
A.7 Practical Resource Planning
A.7.1 Starting from a compute budget
Given a compute budget in FLOPs, the Chinchilla-optimal allocation is:
```python
# Pseudocode: Chinchilla-optimal resource planning
def plan_training(budget_flops):
    # From C ≈ 6ND and D ≈ 20N:
    # C ≈ 6N × 20N = 120N²
    # N = sqrt(C / 120)
    N = (budget_flops / 120) ** 0.5  # parameter count
    D = 20 * N                       # token count
    return N, D
```
At C = 10²³ FLOPs (roughly ten A100-years of peak BF16 throughput):

N = sqrt(10²³ / 120) ≈ 2.9×10¹⁰ parameters ≈ 29B
D = 20 × N ≈ 5.8×10¹¹ tokens ≈ 580B
That is roughly the regime of Mistral-22B or early Falcon-40B — frontier-adjacent but not GPT-4 scale.
A.7.2 Loss prediction
The OpenAI paper also provides an empirical prediction formula for loss given compute:

L(C) ≈ (C_c / C)^0.050, with C measured in PF-days and fitted constant C_c ≈ 3.1×10⁸
This lets you predict final loss before training. The predictions are not exact — they assume you are near the efficiency frontier — but they are useful for go/no-go decisions before committing a cluster.
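As code — with the caveat that the fitted constant is the value commonly quoted from Kaplan et al. and should be treated as an assumption here:

```python
# Predicted loss at the compute-efficient frontier: L(C) ≈ (C_c / C)^0.050,
# with C in PF-days and C_c ≈ 3.1e8 (fitted constant, assumed from the paper).
PF_DAY = 1e15 * 86400  # FLOPs in one petaFLOP-day

def predicted_loss(compute_flops, c_c=3.1e8, alpha=0.050):
    return (c_c / (compute_flops / PF_DAY)) ** alpha

# GPT-3-scale compute: 6 × 175e9 × 300e9 ≈ 3.15e23 FLOPs (N and D from the table in A.6.3)
print(f"{predicted_loss(3.15e23):.2f}")
```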
A.7.3 Small-model vs large-model strategies
| Strategy | Strength | Weakness |
|---|---|---|
| Large model, less data | High capability ceiling | Expensive inference, undertrained quality |
| Small model, more data | Fast, cheap serving | Lower capability ceiling |
| Chinchilla-optimal | Best loss per FLOP during training | May not be best for deployment |
| Intentionally overtrained small model | Best quality per serving FLOP | More expensive to train per parameter |
The right choice depends on whether you optimize for training cost or serving cost. At scale, serving cost usually dominates because you run inference continuously but train once.
A.8 Key Results Summary
- Scaling Laws: loss follows power laws in N, D, and C with exponents 0.076, 0.095, and 0.050 respectively.
- Parameter estimation: N ≈ 12 × L × d_model², accurate to within a few percent for models above 1B.
- Training compute: C ≈ 6 × N × D; factor of 6 from 2N forward + 4N backward.
- Inference compute: C ≈ 2 × N per generated token.
- Chinchilla optimality: D ≈ 20 × N, but small models trained on more data often beat this at serving time.
- Training time: time ≈ C / (GPU count × peak FLOPs/s × utilization).
- Block FLOPs: 24 × d² + 4 × s × d per block per token; FFN-dominated at short context, Attention-dominated above s = 4 × d.
Further Reading
- Scaling Laws for Neural Language Models — Kaplan et al., OpenAI, 2020. The original paper.
- Training Compute-Optimal Large Language Models — Hoffmann et al., DeepMind, 2022. The Chinchilla paper that revised the field.
- Scaling Laws for Autoregressive Generative Modeling — Henighan et al., OpenAI, 2020. Extended scaling studies across modalities.
If you can estimate the training cost of a 70B run without reaching for a calculator, you have internalized scaling laws. Appendix B dives into decoding strategies.