One-sentence summary: LoRA trains a low-rank decomposition of the weight update instead of all weights; QLoRA quantizes the frozen base model to 4-bit so you can run the same trick on an RTX 3090.
26.1 Why Efficient Fine-Tuning Exists
26.1.1 The full fine-tuning math
Suppose you want to fine-tune LLaMA-7B into a specialized code-review assistant. The naive approach---full parameter fine-tuning---updates all 7 billion weights.
Here is what that actually costs in GPU memory:
- Model weights (fp16): 7B × 2 bytes = 14 GB
- Gradients (same size as weights): 14 GB
- Optimizer state (Adam stores two moments per parameter): 28 GB
Total: 56 GB before you touch activations or batch data. You need an A100 80GB just to get started. If you want to try several hyperparameter settings in parallel, you need several A100 80GBs.
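A quick back-of-envelope script makes the accounting concrete. It mirrors the chapter's byte counts (note the 28 GB figure implies 2 bytes per Adam moment; standard fp32 Adam moments would double that term):

```python
# Full fine-tuning memory estimate for a 7B-parameter model,
# using the chapter's per-parameter byte counts.
n_params = 7e9
weights_gb = n_params * 2 / 1e9          # fp16 weights: 14 GB
grads_gb = n_params * 2 / 1e9            # fp16 gradients: 14 GB
optimizer_gb = n_params * 2 * 2 / 1e9    # two Adam moments at 2 bytes each: 28 GB
total_gb = weights_gb + grads_gb + optimizer_gb
print(total_gb)  # 56.0
```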
26.1.2 The storage problem
Every fine-tuned variant produces a full copy of the weights:
- Code-review assistant: 14 GB
- Security-audit assistant: 14 GB
- Documentation assistant: 14 GB
Three variants, 42 GB. A team running 20 downstream applications holds 280 GB of nearly identical parameter files. You can see the absurdity.
26.1.3 The key insight
Researchers studying fine-tuning dynamics noticed something consistent: the weight change ΔW during fine-tuning is low-rank. The model is not relearning everything. It is adjusting a small subspace of each weight matrix.
If ΔW is inherently low-rank, you do not need to parameterize it with a full N×D matrix. You can approximate it with two small matrices whose product has that rank. That observation is LoRA.
26.2 LoRA Core Idea: Low-Rank Decomposition
26.2.1 The math
Any matrix can be approximated as the product of two smaller matrices:
ΔW (N × D) ≈ B (N × r) @ A (r × D)
where r is the rank and r ≪ min(N, D).
Parameter count comparison for N=1024, D=512, r=32:
- Original: 1024 × 512 = 524,288
- Low-rank: 32 × (1024 + 512) = 49,152 (about 9.4%)
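The comparison is easy to verify in a couple of lines of plain Python:

```python
# Parameter count: full update matrix vs. rank-r factorization.
N, D, r = 1024, 512, 32
full = N * D                 # 524,288
low_rank = r * (N + D)       # 49,152
print(full, low_rank, f"{low_rank / full:.1%}")  # 524288 49152 9.4%
```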
26.2.2 The LoRA training setup
LoRA does not modify the pretrained weight W. It freezes W and learns the update separately:
W_new = W_original + B @ A
During training:
- W_original is frozen---no gradients, no optimizer state
- Only B and A are updated
During inference:
- Merge once: W_merged = W_original + (α/r) × B @ A
- Use W_merged exactly like the original model
- Zero inference overhead after merging
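The merge identity is easy to sanity-check numerically. A minimal NumPy sketch, with toy dimensions and an arbitrary α chosen for illustration:

```python
import numpy as np

# Verify: merging the scaled adapter into W reproduces the adapter
# forward pass, so inference needs no extra matmuls.
rng = np.random.default_rng(0)
N, D, r, alpha = 16, 8, 4, 8
W = rng.standard_normal((N, D))
B = rng.standard_normal((N, r))
A = rng.standard_normal((r, D))

W_merged = W + (alpha / r) * (B @ A)

x = rng.standard_normal(D)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, y_adapter)
```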
26.2.3 Initialization strategy
Matrix B: initialized to zero
B = zeros(N, r)
Matrix A: initialized randomly (Kaiming, scaled by the fan-in, which for A is the input dimension D)
A = randn(r, D) * sqrt(2 / D)
At the start of training, B @ A = zeros @ randn = 0, so W_new = W_original. The model starts from an exact copy of the pretrained model. No training shock, no instability. This is a carefully designed property.
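The no-op property can be checked directly. A NumPy sketch with toy dimensions:

```python
import numpy as np

# At step 0, B is all zeros, so B @ (A @ x) vanishes and the
# adapted model matches the pretrained one exactly.
rng = np.random.default_rng(0)
N, D, r = 16, 8, 4
W = rng.standard_normal((N, D))
B = np.zeros((N, r))                               # zero-init
A = rng.standard_normal((r, D)) * np.sqrt(2 / D)   # Kaiming-style scale

x = rng.standard_normal(D)
y_base = W @ x
y_lora = W @ x + B @ (A @ x)
assert (y_base == y_lora).all()  # exact equality, not just "close"
```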
26.2.4 The scaling factor alpha
The complete forward pass with LoRA:
y = W @ x + (α/r) × B @ (A @ x)
The α/r term keeps learning dynamics stable across different rank choices. As r grows, the product B @ A naturally grows in magnitude; α/r counterbalances this. In practice:
- Set α = r for a neutral scale (α/r = 1)
- Set α = 2r for stronger LoRA contribution
- HuggingFace PEFT defaults to α = 8
26.3 Choosing Rank
26.3.1 What rank controls
Rank r determines how many "degrees of freedom" the update has:
- r = 1-4: very few trainable parameters, suitable for simple style or format adaptation
- r = 8-16: the community sweet spot for most instruction-following fine-tunes
- r = 32-64: more capacity for complex domain adaptation
- r > 64: rarely needed; may overfit on small datasets
26.3.2 Empirical data: rank vs. quality
The original LoRA paper reports this on a text generation benchmark (val_loss and downstream metrics):
| Rank r | val_loss | BLEU | NIST | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|
| 1 | 1.23 | 68.72 | 8.7215 | 0.4565 | 0.7052 | 2.4329 |
| 2 | 1.21 | 69.17 | 8.7413 | 0.4590 | 0.7052 | 2.4639 |
| 4 | 1.18 | 70.38 | 8.8439 | 0.4689 | 0.7186 | 2.5349 |
| 8 | 1.17 | 69.57 | 8.7457 | 0.4636 | 0.7196 | 2.5196 |
| 16 | 1.16 | 69.61 | 8.7483 | 0.4629 | 0.7177 | 2.4985 |
| 32 | 1.16 | 69.33 | 8.7736 | 0.4642 | 0.7105 | 2.5255 |
| 64 | 1.16 | 69.24 | 8.7174 | 0.4651 | 0.7180 | 2.5070 |
| 128 | 1.16 | 68.73 | 8.6718 | 0.4628 | 0.7127 | 2.5030 |
| 256 | 1.16 | 68.92 | 8.6982 | 0.4629 | 0.7128 | 2.5012 |
| 512 | 1.16 | 68.78 | 8.6857 | 0.4637 | 0.7128 | 2.5025 |
| 1024 | 1.17 | 69.37 | 8.7495 | 0.4659 | 0.7149 | 2.5090 |
Three things to notice:
- r=4 wins on several generation metrics. Bigger is not always better.
- val_loss stops improving after r=16.
- r=512 performs no better than r=4 on generation quality while training much slower.
26.3.3 Practical rank selection
| Task | Recommended rank | Reasoning |
|---|---|---|
| Simple instruction following | 4-8 | Task is low-complexity |
| Summarization, translation | 8-16 | Moderate adaptation needed |
| Code generation, complex reasoning | 16-64 | Higher capacity required |
| Not sure | 8 | Safe default, matches community practice |
The rule I use: start at r=8. If the eval metric plateaus too early or never converges, try r=16. I have rarely needed to go above r=64 for practical fine-tunes.
26.4 Where to Apply LoRA
26.4.1 Attention projections
The original LoRA paper studies adapters on the Q, K, V, and output projections of each attention layer (its main experiments adapt only the Q and V projections):
import torch
import torch.nn as nn
# For each attention layer, add LoRA to the Q, K, V projections
# (d = hidden size, r = rank; B is zero so the update starts as a no-op):
lora_query_B = nn.Parameter(torch.zeros(d, r))
lora_query_A = nn.Parameter(torch.randn(r, d) * (2 / d) ** 0.5)
lora_key_B = nn.Parameter(torch.zeros(d, r))
lora_key_A = nn.Parameter(torch.randn(r, d) * (2 / d) ** 0.5)
lora_value_B = nn.Parameter(torch.zeros(d, r))
lora_value_A = nn.Parameter(torch.randn(r, d) * (2 / d) ** 0.5)
# Effective updates (all zero at initialization):
lora_Wq = lora_query_B @ lora_query_A
lora_Wk = lora_key_B @ lora_key_A
lora_Wv = lora_value_B @ lora_value_A
26.4.2 Target modules by model family
| Model | Common LoRA targets |
|---|---|
| LLaMA / Mistral | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| GPT-2 | c_attn, c_proj, c_fc |
| BERT | query, key, value, dense |
- Conservative starting point: only the Q and V projections. Works well for most tasks.
- Recommended extension: add K and the output projection. Marginal cost, often better results.
- Aggressive option: include all FFN linear layers. Maximum capacity, higher risk of overfitting on small data.
HuggingFace PEFT configuration:
from peft import LoraConfig, TaskType
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
26.4.3 Trainable parameter fraction
For a typical 7B model with the above config, roughly 0.3% of parameters are trainable:
- Frozen (base model): ~6.7B parameters, no gradients
- Trainable (LoRA): ~20M parameters, full gradient flow
That is what moves the memory requirement from 56 GB to something manageable.
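The trainable fraction can be reproduced from the published LLaMA-7B shapes (4096 hidden size, 11008 FFN size, 32 layers), using the r × (N + D) formula per adapted matrix:

```python
# Trainable LoRA parameters for r=8 on all seven LLaMA-7B projections.
# Shapes assumed from the published LLaMA-7B architecture: 4096 hidden,
# 11008 FFN, 32 transformer layers (no grouped-query attention here).
r, layers, d, ffn = 8, 32, 4096, 11008
attn = 4 * r * (d + d)                            # q_proj, k_proj, v_proj, o_proj
ffn_adapters = 2 * r * (d + ffn) + r * (ffn + d)  # gate_proj, up_proj, down_proj
trainable = layers * (attn + ffn_adapters)
print(f"{trainable:,}", f"{trainable / 6.7e9:.2%}")  # 19,988,480 0.30%
```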
26.5 Full LoRA Code
26.5.1 LoRA fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
# 1. Load base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 2. Configure LoRA
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
# 3. Apply LoRA
model = get_peft_model(model, lora_config)
# 4. Inspect parameter count
model.print_trainable_parameters()
# trainable params: ~20M || all params: ~6.7B || trainable%: ~0.30%
# (for r=8 on all seven target projections)
26.5.2 Training loop
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./lora-output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
fp16=True,
save_steps=100,
logging_steps=10,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
tokenizer=tokenizer,
)
trainer.train()
# Save only the adapter weights (tens of MB, not 14 GB)
model.save_pretrained("./lora-weights")
26.5.3 Loading and merging
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load adapter on top of base
model = PeftModel.from_pretrained(base_model, "./lora-weights")
# Merge adapter into base weights for zero-overhead inference
model = model.merge_and_unload()
26.6 QLoRA: Fine-Tuning on Consumer Hardware
26.6.1 The remaining bottleneck
LoRA cuts trainable parameters to a fraction of a percent. But the frozen base model still sits in GPU memory. A 7B model in fp16 needs 14 GB, and that is before the LoRA adapter or any optimizer state.
For an RTX 3090 with 24 GB, 14 GB just for the base leaves little headroom. For anything larger---13B, 70B---it is simply impossible.
QLoRA's answer: quantize the frozen base model to 4-bit, then apply LoRA adapters at full precision.
QLoRA = frozen base (4-bit NF4) + trainable adapters (bf16/fp32)
26.6.2 Memory comparison
| Method | 7B memory | Practical GPU |
|---|---|---|
| Full fine-tuning (fp16) | 80 GB+ | A100 80GB |
| LoRA (fp16 base) | 16-20 GB | A100 40GB, RTX 4090 |
| QLoRA (4-bit base) | 6-8 GB | RTX 3090, RTX 4080 |
This is the number that matters for most practitioners. QLoRA put serious fine-tuning within reach of a single 24 GB GPU.
26.6.3 The three QLoRA innovations
The original QLoRA paper (Dettmers et al., 2023) introduced:
1. NF4 (NormalFloat4): a 4-bit data type designed for the specific distribution of neural network weights. Standard int4 assumes uniform distribution. NF4 spaces the 16 quantization levels to minimize expected error under a normal distribution, matching what weight matrices actually look like.
2. Double quantization: quantize the quantization constants themselves. Each block of 64 weights shares a scale factor (a float32). NF4 is stored as int4 (0.5 bytes per weight) but the scale factor costs extra. Double quantization compresses scale factors to 8-bit, saving about 0.37 bits per weight on average.
3. Paged optimizers: optimizer states (Adam's first and second moments) can overflow GPU memory during long sequences. QLoRA uses NVIDIA unified memory to page those states to CPU RAM automatically rather than crashing.
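The double-quantization saving is simple arithmetic. A sketch using the block sizes reported in the QLoRA paper (64 weights per first-level scale, 256 scales per second-level fp32 constant):

```python
# Per-weight overhead of quantization constants, before and after
# double quantization (block sizes assumed from the QLoRA paper).
first_block = 64      # weights sharing one scale factor
second_block = 256    # scales sharing one second-level fp32 constant

single = 32 / first_block                                     # fp32 scales: 0.5 bits/weight
double = 8 / first_block + 32 / (first_block * second_block)  # 8-bit scales + fp32 constants
print(round(single - double, 3))  # 0.373
```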
26.6.4 QLoRA code
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 1. Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_compute_dtype=torch.float16, # compute in fp16 for speed
bnb_4bit_use_double_quant=True, # double quantization
)
# 2. Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# 3. Prepare for k-bit training (casts layer norms, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# 4. Apply LoRA adapters at full precision
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Training proceeds identically to plain LoRA
26.7 LoRA vs Full Fine-Tuning
26.7.1 Quality comparison
| Metric | Full fine-tuning | LoRA (r=8) | LoRA (r=16) |
|---|---|---|---|
| Task quality | 100% (baseline) | 95-98% | 97-99% |
| Trainable parameters | 100% | ~0.1% | ~0.2% |
| GPU memory | very high | low | low |
| Training time | slow | fast | fast |
The 2-5% quality gap closes significantly with better data quality and careful module selection. For most production use cases, the gap is negligible.
26.7.2 When to use which
Use LoRA / QLoRA when:
- Consumer or mid-tier GPUs (24 GB, 40 GB)
- Rapid iteration across hyperparameter configurations
- Multiple downstream variants sharing one base model
- Task similarity to the pretrained distribution is reasonable
Consider full fine-tuning when:
- Target domain diverges strongly from pretraining (very specialized vocabulary, radically different format)
- You have abundant A100 or H100 GPU time
- Final production quality justifies the cost, and the few percent gap matters
26.7.3 The adapter ecosystem advantage
With LoRA, one base model hosts many adapters. Switch between a code-review adapter, a documentation adapter, and a security-audit adapter at runtime by swapping a few hundred MB of adapter weights. The 14 GB base stays loaded. This is the architecture pattern that makes multi-tenant fine-tuning economically practical.
26.8 Common Questions and Best Practices
26.8.1 How do I pick rank?
Start at r=8. If val loss plateaus too early, try r=16 or r=32. I have almost never needed more than r=64 in practice.
26.8.2 How do I set alpha?
- α = r is the neutral choice
- α = 2r gives LoRA more influence over the frozen weights
- Default values from PEFT are fine for most tasks
26.8.3 Which modules should I target?
Conservative: Q and V only. Recommended: Q, K, V, and output projection. Aggressive: all linear layers including FFN.
26.8.4 LoRA or QLoRA by GPU size
| GPU (VRAM) | Model | Recommendation |
|---|---|---|
| RTX 3090 / 4090 (24 GB) | 7B | QLoRA |
| RTX 3090 / 4090 (24 GB) | 13B | QLoRA |
| A100 (40 GB) | 7B | LoRA |
| A100 (40 GB) | 13B | QLoRA |
| A100 (80 GB) | 7B-13B | LoRA |
| A100 (80 GB) | 70B | QLoRA |
26.8.5 Training is unstable. What now?
In order of things to try first:
- Lower learning rate (2e-4 → 1e-4)
- Add warmup steps
- Reduce batch size and increase gradient accumulation to compensate
- Audit data quality---noisy labels destabilize LoRA faster than full fine-tuning because the adapter has less capacity to average them away
26.9 Chapter Summary
26.9.1 Key concepts
| Concept | Description | Key formula |
|---|---|---|
| LoRA | Low-Rank Adaptation; trains only update | W = W_orig + (α/r) × B @ A |
| Rank (r) | Capacity of the adapter; controls parameter count | r in range 4-64 |
| Alpha (α) | Scale factor; keeps learning dynamics consistent | typically r or 2r |
| QLoRA | 4-bit base + full-precision LoRA adapters | 6-8 GB for a 7B model |
26.9.2 Parameter count formula
For a weight matrix of dimension N × D:
- Original: N × D parameters
- LoRA adapter: N × r + r × D = r × (N + D) parameters
- Compression ratio: r × (N + D) / (N × D)
For N=1024, D=512, r=8: the adapter holds 12,288 parameters, about 2.3% of the original count.
26.9.3 Initialization recap
- B matrix: zero-initialized so the adapter starts as a no-op
- A matrix: random-initialized so gradients flow from step one
- Result: training begins from an exact copy of the pretrained model
26.9.4 Core takeaway
LoRA is built on one empirical observation: fine-tuning weight changes are low-rank. Training two small matrices B and A instead of the full ΔW preserves 95%+ of the quality at 0.1% of the parameter update cost. QLoRA extends this to consumer hardware by storing the frozen base at 4-bit. Together they democratized serious fine-tuning beyond the data-center tier.
Chapter Checklist
After this chapter, you should be able to:
- Explain why full fine-tuning of a 7B model requires 56 GB+ of GPU memory.
- Describe the LoRA weight formula and explain why B is zero-initialized.
- Explain what rank controls and recommend a starting value for a new task.
- Name the three innovations in QLoRA (NF4, double quantization, paged optimizers).
- Write a minimal PEFT LoRA and QLoRA setup using HuggingFace libraries.
- Choose LoRA vs QLoRA based on available GPU memory and model size.
See You in the Next Chapter
That covers how to adapt a large model without retraining all of it.
The next question is simpler to state and harder to implement: how do you run the model cheaply after training? Chapter 27 covers GPTQ, AWQ, and GGUF---the quantization formats that determine whether a model fits on your GPU, your laptop, or your cloud bill.