TL;DR
In the last post, I wrote about custom LLMs and where they outperform general-purpose LLMs. Today we will present the Transformer architecture, the heart of every LLM.
The Transformer
The next question, which should help us reinforce the knowledge from chapter 1 of Build LLM from scratch, is the following:
What is the significance of the "transformer" architecture in LLMs?
It allows the model to selectively focus on different parts of the input text when making predictions.
I asked ChatGPT the following prompt:
Please explain how the Transformer architecture allows the model to selectively focus on different parts of the input text when making predictions. Provide examples with explanations.
1️⃣ Why “selective focus” matters in language
Human language is context-dependent.
Example: I gave the book to Mark because he asked for it.
To understand “he”, you must focus on Mark, not the book.
A language model must do the same.
2️⃣ The key idea: self-attention
The Transformer architecture introduces self-attention, which allows every word (token) to:
- Look at all other words in the sentence
- Decide which ones matter most
- Weigh them differently depending on context
This is what “selectively focus” means.
3️⃣ How self-attention works (conceptually, no math)
For each token, the model asks three questions:
- Query – What am I looking for?
- Key – What information does each other token offer?
- Value – What information should I actually take?
Each token compares its query with the keys of all other tokens and assigns attention weights.
Tokens with higher weights influence the prediction more.
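To make this less abstract before moving on, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. This is my own illustration, not part of ChatGPT's answer: the sentence, the embeddings, and the projection matrices are random placeholders rather than anything a trained model would produce.

```python
import numpy as np

np.random.seed(0)

# Toy sentence: 4 tokens, each represented by an 8-dimensional embedding.
# In a real model these embeddings come from a trained embedding layer;
# here they are random numbers, purely for illustration.
tokens = ["I", "gave", "the", "book"]
d_model = 8
X = np.random.randn(len(tokens), d_model)

# Learned projection matrices (random here) turn each embedding into
# a query, a key and a value vector.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every query is compared with every key; the scaled dot product
# measures how relevant each other token is.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns the scores into attention weights that sum to 1 per token.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted mix of all value vectors.
output = weights @ V

print(weights.round(2))  # how much each token "focuses" on every other token
print(output.shape)      # (4, 8): one context-enriched vector per token
```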
I did not understand point 3 at all, so I asked the following prompt:
I do not understand point three. What are the answers for query, key and value? Provide examples.
This is a very common sticking point, so you’re asking exactly the right question.
Let’s slow it down and make Query, Key, Value (QKV) feel concrete.
I’ll avoid math and start with intuition, then give a worked language example.
Big picture first (very important)
Query, Key, and Value are not questions with literal answers.
They are vectors (lists of numbers) that the model learns.
Think of them as roles a word can play when interacting with other words.
The core intuition (one sentence)
A token asks a question (Query), other tokens advertise what they contain (Keys), and the token takes useful information (Values).
Step 1: A concrete sentence
Let’s use this sentence:
The trophy didn’t fit in the suitcase because it was too big.
We focus on the word:
it
The model must decide:
Does “it” refer to trophy or suitcase?
Step 2: What is the Query?
Query = “What am I looking for?”
For the token “it”, the query represents something like:
“I am a pronoun. I’m looking for a noun that I can refer to.”
This is not text, but a learned vector encoding:
- pronoun-ness
- grammatical role
- position in sentence
So conceptually:
Query(it) ≈ “find the thing I refer to”
Step 3: What is the Key?
Key = “What do I offer?”
Every token creates a key describing what kind of thing it is.
For example:
Key(trophy) ≈ “I am a concrete object, singular noun”
Key(suitcase) ≈ “I am a container, singular noun”
Key(big) ≈ “I am an adjective”
Again, these are vectors, not labels.
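To underline that these are vectors rather than labels, here is a tiny sketch of how one token embedding is turned into a query, a key, and a value. The weight matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned parameters, and the word "it" is only an example.

```python
import numpy as np

np.random.seed(1)
d_model = 8

# One token embedding, e.g. for the word "it" (random here, for illustration).
x_it = np.random.randn(d_model)

# Three different learned weight matrices give the same token three
# different roles: query, key and value (random stand-ins for trained weights).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

query_it = x_it @ W_q   # "what am I looking for?"
key_it   = x_it @ W_k   # "what do I offer?"
value_it = x_it @ W_v   # "what information do I carry?"

print(query_it.shape, key_it.shape, value_it.shape)  # (8,) (8,) (8,)
```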
Step 4: Matching Query with Keys (the important part)
The model compares:
Query(it) ↔ Key(trophy)
Query(it) ↔ Key(suitcase)
Query(it) ↔ Key(big)
It learns (from data) that:
- Pronouns often refer to concrete nouns
- “Too big” usually describes the thing that doesn’t fit
So the similarity scores might look like:
| Token | Similarity with Query(it) |
|---|---|
| trophy | 🔥 high |
| suitcase | medium |
| big | low |
This comparison produces attention weights.
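Here is a small numeric sketch of that matching step. The similarity scores are invented to mirror the table above (they are not outputs of a real model); the softmax then converts them into attention weights.

```python
import numpy as np

# Invented similarity scores between Query(it) and the keys of the other
# tokens, chosen to mirror the table above (not real model outputs).
tokens = ["trophy", "suitcase", "big"]
scores = np.array([4.0, 2.0, 0.5])  # trophy: high, suitcase: medium, big: low

# Softmax turns raw similarity scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

for token, weight in zip(tokens, weights):
    print(f"{token:<9} {weight:.2f}")
# trophy    0.86
# suitcase  0.12
# big       0.03
```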
Step 5: What is the Value?
Value = “What information should be passed along?”
Once attention weights are decided, the model pulls information from the Values.
For example:
Value(trophy) = semantic info about “trophy”
Value(suitcase) = semantic info about “suitcase”
Because trophy got the highest attention weight, most of its Value is passed to “it”.
So “it” becomes enriched with the meaning of “trophy”.
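And a matching sketch for this step: given the attention weights, the new representation of "it" is just a weighted sum of the value vectors. The value vectors here are random placeholders, and the weights are the rounded numbers from the previous sketch.

```python
import numpy as np

np.random.seed(2)
d_model = 4  # tiny dimension just for readability

# Hypothetical value vectors for the candidate tokens (random placeholders).
value_trophy   = np.random.randn(d_model)
value_suitcase = np.random.randn(d_model)
value_big      = np.random.randn(d_model)
values = np.stack([value_trophy, value_suitcase, value_big])

# Rounded attention weights from the previous sketch (trophy dominates).
weights = np.array([0.86, 0.12, 0.03])

# The new representation of "it" is mostly the value of "trophy".
it_enriched = weights @ values

print(it_enriched)  # a vector dominated by the information from "trophy"
```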
Step 6: Why we need all three
Here’s the key separation:
| Component | Purpose |
|---|---|
| Query | What I’m looking for |
| Key | What I represent |
| Value | The information I carry |
This separation allows:
- Flexible matching (Query ↔ Key)
- Clean information transfer (Value)
Another example (simpler)
Sentence:
Alice gave the book to Bob because he asked for it.
For token “he”:
- Query(he) → “find a male person”
- Key(Alice) → “female person”
- Key(Bob) → “male person”

Attention:
- Bob → high
- Alice → low
Value from Bob flows into “he”.
Why this is powerful
Because:
- The same word can ask different queries in different contexts
- The same word can provide different values depending on context
- Everything is learned automatically
No rules like:
“pronouns refer to nouns”
The model discovers this statistically.
Conclusion
Attention is the core concept of the LLM Transformer architecture. I needed two prompts to grasp the concept at a high level. The real implementation is math, but more on that as we move further through this book.