One-sentence summary: computers do not process text directly, so tokenization converts text into token IDs, which are then mapped into vectors.


4.1 Why Tokenization Exists

In the previous chapter, the first step in the Transformer map was:

text -> token IDs

This chapter explains that step.

4.1.1 Computers Need Numbers

A computer does not see a sentence the way we do. It does not know that:

The agent opened a pull request.

is made of meaningful words. It needs numeric units.

Tokenization is the process that turns text into a sequence of numbers. Each numeric unit is called a token ID.

4.1.2 Where It Sits in the Architecture

(Figure: tokenization's position in the Transformer pipeline)

Tokenization is the entry point:

raw text -> token IDs -> embeddings -> positional encoding -> Transformer blocks

Without tokenization, the rest of the model has nothing to process.


4.2 Two Ways to Tokenize

(Figure: two tokenization methods)

The simplest idea is to assign a number to every character. Real LLMs usually do something smarter.

4.2.1 Method One: Character IDs

For an English sentence, a naive character-level tokenizer might assign:

T -> 1
h -> 2
e -> 3
space -> 4
a -> 5
g -> 6
n -> 7
t -> 8
...

This is easy to understand. Every character becomes a number.

But it has problems:

  1. Too many tokens: one word becomes many characters.
  2. Weak semantic units: pull request is split into letters even though it is one meaningful phrase.
  3. Inefficient context use: long text consumes context length quickly.

Character tokenization is not wrong, but it is rarely the best choice for modern LLMs.
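The naive scheme above is simple enough to sketch in a few lines. This is a toy illustration only, not any real model's tokenizer; it just assigns each new character the next free ID, reproducing the mapping shown earlier:

```python
def build_char_vocab(text):
    # Assign each previously unseen character the next free ID, starting at 1.
    vocab = {}
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab) + 1
    return vocab

text = "The agent opened a pull request."
vocab = build_char_vocab(text)
ids = [vocab[ch] for ch in text]

# One token per character: the ID sequence is as long as the text itself.
print(f"{len(text)} characters -> {len(ids)} tokens")
```

Running this reproduces T -> 1, h -> 2, e -> 3, and so on, and makes problem 1 visible immediately: the ID sequence is exactly as long as the character count.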

4.2.2 Method Two: BPE and Word Pieces

Most GPT-style tokenizers use a subword strategy such as BPE (Byte Pair Encoding); BERT-style models use the closely related WordPiece algorithm.

The idea is:

  • common chunks become single tokens
  • rare words can still be split into smaller pieces
  • the vocabulary stays finite
  • the model can handle unseen text
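The core BPE training move is: count adjacent token pairs, then merge the most frequent pair into a new token, and repeat. Here is a toy sketch of that loop (an illustration of the idea only, not the code that built cl100k_base):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair in the sequence and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from individual characters
for _ in range(3):                  # three merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few rounds, the common chunk "low" has become a single token while rarer suffixes like "est" remain split into smaller pieces, which is exactly the behavior described in the bullets above.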

Using OpenAI's cl100k_base tokenizer, this text:

The agent opened a pull request.

becomes:

[791, 8479, 9107, 264, 6958, 1715, 13]

The token pieces are:

791   -> "The"
8479  -> " agent"
9107  -> " opened"
264   -> " a"
6958  -> " pull"
1715  -> " request"
13    -> "."

Notice that spaces often become part of the token. That is normal.

4.2.3 Context Length

Context length is the number of tokens the model can process at once.

If a model supports 128,000 tokens, that does not mean 128,000 English words. It means 128,000 tokenizer units.

Context lengths have grown substantially across model generations:

Model                       Context length
GPT-3                       4,096 tokens
GPT-4 (original)            8,192 tokens
GPT-4 Turbo / 32k variant   32,768 – 128,000 tokens
GPT-5                       400,000+ tokens
Claude (Sonnet 4.5)         200,000 tokens
Gemini 2.5 Pro              1,000,000+ tokens

Different languages and writing systems have different token efficiency. English words often tokenize into familiar chunks. Chinese text is notably less efficient: Chinese characters often require 2–3 tokens per character using the cl100k_base tokenizer, meaning a Chinese document of equivalent reading length may consume two to three times as many tokens as the same content in English.

That is why LLM APIs charge by token instead of by word.
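The word-versus-token gap is easy to see with the example sentence, using the cl100k_base token IDs quoted earlier:

```python
text = "The agent opened a pull request."
token_ids = [791, 8479, 9107, 264, 6958, 1715, 13]  # cl100k_base IDs from above

words = text.split()
# An API would bill for 7 tokens here, not 6 words.
print(f"{len(words)} words, {len(token_ids)} tokens")
```

Six words, seven tokens: the trailing period costs its own token. For other languages or unusual text, the gap can be much larger.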


4.3 From Token to Embedding

Token IDs are still not enough. The model must convert each ID into a vector.

This is called Embedding.

4.3.1 Embedding Lookup Table

(Figure: embedding lookup table)

The model contains a large table:

[vocab_size, d_model]

Where:

  • vocab_size is the number of token IDs the tokenizer knows.
  • d_model is the vector width used by the model.

For example, if:

vocab_size = 100256
d_model = 64

then the embedding table contains:

100256 x 64 = 6,416,384 numbers

Those numbers are trainable parameters.
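A minimal sketch of such a table, using the toy sizes above (random initialization stands in for training; a real model learns these values):

```python
import numpy as np

vocab_size, d_model = 100256, 64

# One trainable row per token ID; values start random and are updated by training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

print(embedding_table.shape)  # (100256, 64)
print(embedding_table.size)   # 6416384 parameters
```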

4.3.2 Lookup Process

(Figure: token IDs are looked up as vectors)

Take the sentence:

The agent opened a pull request.

Tokenization gives:

[791, 8479, 9107, 264, 6958, 1715, 13]

Then the model performs table lookup:

token 791   -> row 791   -> vector
token 8479  -> row 8479  -> vector
token 9107  -> row 9107  -> vector
...

The result is a matrix:

[context_length, d_model]

If the sentence has 7 tokens and d_model = 64, the matrix shape is:

[7, 64]

This matrix is the numeric representation sent into the Transformer blocks.
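In code, the entire lookup is a single indexing operation. Assuming a random [100256, 64] table as in the sketch above:

```python
import numpy as np

vocab_size, d_model = 100256, 64
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = [791, 8479, 9107, 264, 6958, 1715, 13]

# Row lookup: each token ID selects one row of the table.
x = embedding_table[token_ids]

print(x.shape)  # (7, 64)
```

No arithmetic happens here; the model simply reads out the rows the token IDs point to.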

4.3.3 Why Use Vectors?

Why not use token IDs directly?

Because IDs have no geometry. Token ID 791 is not "closer" to token ID 792 in a meaningful semantic way.

Vectors solve that. They can encode relationships:

  • agent, tool, and workflow can occupy a nearby region
  • pull request and code review can be related
  • pull request and pull tab can separate based on context

Embedding vectors make language available to matrix math without pretending that token IDs themselves have meaning.
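Geometry can be made concrete with cosine similarity. The 3-dimensional vectors below are hand-picked for illustration; real embeddings are hundreds of dimensions wide and learned, not assigned:

```python
import math

# Toy vectors: "agent" and "tool" point in similar directions; "banana" does not.
vectors = {
    "agent":  [0.9, 0.8, 0.1],
    "tool":   [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["agent"], vectors["tool"]))    # close to 1.0
print(cosine(vectors["agent"], vectors["banana"]))  # much smaller
```

Raw token IDs support no such comparison; vectors do, and that is the entire point of the embedding layer.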


4.4 Try It With tiktoken

You can inspect tokenization with OpenAI's tokenizer library:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The agent opened a pull request."
tokens = enc.encode(text)

print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {enc.decode(tokens)}")

for token_id in tokens:
    print(f"{token_id} -> {enc.decode([token_id])!r}")

Expected shape of the result:

Token IDs: [791, 8479, 9107, 264, 6958, 1715, 13]
Token count: 7
Decoded: The agent opened a pull request.
791 -> 'The'
8479 -> ' agent'
...

This small experiment is worth doing. Tokenization becomes much less abstract once you see the pieces.


4.5 Parameter Count in the Embedding Layer

The embedding layer can hold a meaningful number of parameters.

4.5.1 Formula

embedding parameters = vocab_size x d_model

4.5.2 Examples

Model         vocab    width    params
GPT-2 Small   50,257   768      about 38.6M
GPT-2 Large   50,257   1,280    about 64.3M
GPT-3         50,257   12,288   about 618M
LLaMA-2-7B    32,000   4,096    about 131M

Embedding is not a tiny pre-processing detail. It is a learned parameter table that matters.
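The figures above follow directly from the formula:

```python
# (vocab_size, d_model) pairs from the table above.
models = {
    "GPT-2 Small": (50257, 768),
    "GPT-2 Large": (50257, 1280),
    "GPT-3":       (50257, 12288),
    "LLaMA-2-7B":  (32000, 4096),
}

# embedding parameters = vocab_size x d_model
params = {name: v * d for name, (v, d) in models.items()}

for name, p in params.items():
    print(f"{name}: {p / 1e6:.1f}M embedding parameters")
```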


4.6 Chapter Summary

4.6.1 Key Concepts

Concept          Meaning
Tokenization     converts text into tokenizer units
Token            a model-readable text fragment
Token ID         the numeric ID for a token
Vocab size       the number of known token IDs
Embedding        maps token IDs to vectors
d_model          the width of the model's internal vectors
Context length   the maximum token count processed at once

4.6.2 Flow

"The agent opened a pull request."
        |
        | Tokenization
        v
[791, 8479, 9107, 264, ...]
        |
        | Embedding lookup
        v
[context_length, d_model] matrix

4.6.3 Core Takeaway

Tokenization plus embedding is how text enters the Transformer. Tokenization cuts text into model-readable units; embedding turns those units into vectors that can participate in matrix computation.


Chapter Checklist

After this chapter, you should be able to:

  • Explain why tokenization is needed.
  • Describe the difference between character tokenization and BPE-style tokenization.
  • Explain what vocab_size, d_model, and context_length mean.
  • Explain why token IDs are converted into vectors.
  • Calculate the parameter count of an embedding table.

See You in the Next Chapter

That is it for Tokenization. The next time an API charges you by token, you should know exactly what it is counting.

Now text has become vectors. But one key thing is still missing: position.

The sentences:

The agent tagged the reviewer.
The reviewer tagged the agent.

contain nearly the same words but mean different things. Chapter 5 explains how the model knows where each token sits in the sequence.

Cite this page
Zhang, Wayland (2026). Chapter 4: Tokenization - How Text Becomes Numbers. In Transformer Architecture: From Intuition to Implementation. https://waylandz.com/llm-transformer-book-en/chapter-04-tokenization
@incollection{zhang2026transformer_chapter_04_tokenization,
  author = {Zhang, Wayland},
  title = {Chapter 4: Tokenization - How Text Becomes Numbers},
  booktitle = {Transformer Architecture: From Intuition to Implementation},
  year = {2026},
  url = {https://waylandz.com/llm-transformer-book-en/chapter-04-tokenization}
}