TL;DR
In the last post, I wrote about custom LLMs and where they outperform general-purpose LLMs. Today we will present the Transformer architecture, the heart of every LLM.
The Transformer
The next question, which should help us reinforce the knowledge from chapter 1 of Build LLM from scratch, is the following:
What is the significance of the "transformer" architecture in LLMs?
It allows the model to selectively focus on different parts of the input text when making predictions.
I asked ChatGPT the following prompt:
Please explain how the Transformer architecture allows the model to selectively focus on different parts of the input text when making predictions. Provide examples with explanations.
1️⃣ Why “selective focus” matters in language
Human language is context-dependent.
Example: I gave the book to Mark because he asked for it.
To understand “he”, you must focus on Mark, not the book.
A language model must do the same.
2️⃣ The key idea: self-attention
The Transformer architecture introduces self-attention, which allows every word (token) to:
- Look at all other words in the sentence
- Decide which ones matter most
- Weigh them differently depending on context
This is what “selectively focus” means.
3️⃣ How self-attention works (conceptually, no math)
For each token, the model asks three questions:
- Query – What am I looking for?
- Key – What information does each other token offer?
- Value – What information should I actually take?
Each token compares its query with the keys of all other tokens and assigns attention weights.
Tokens with higher weights influence the prediction more.
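To make this less abstract before moving on, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. This is my own illustration, not part of ChatGPT's answer: the sentence, the embeddings, and the projection matrices are random placeholders rather than anything a trained model would produce.

```python
import numpy as np

np.random.seed(0)

# Toy sentence: 4 tokens, each represented by an 8-dimensional embedding.
# In a real model these embeddings come from a trained embedding layer;
# here they are random numbers, purely for illustration.
tokens = ["I", "gave", "the", "book"]
d_model = 8
X = np.random.randn(len(tokens), d_model)

# Learned projection matrices (random here) turn each embedding into
# a query, a key and a value vector.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every query is compared with every key; the scaled dot product
# measures how relevant each other token is.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns the scores into attention weights that sum to 1 per token.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted mix of all value vectors.
output = weights @ V

print(weights.round(2))  # how much each token "focuses" on every other token
print(output.shape)      # (4, 8): one context-enriched vector per token
```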
I did not understand point 3 at all, so I asked the following prompt:
I do not understand point three. What are the answers for query, key and value? Provide examples.
This is a very common sticking point, so you’re asking exactly the right question.
Let’s slow it down and make Query, Key, Value (QKV) feel concrete.
I’ll avoid math and start with intuition, then give a worked language example.
Big picture first (very important)
Query, Key, and Value are not questions with literal answers.
They are vectors (lists of numbers) that the model learns.
Think of them as roles a word can play when interacting with other words.
The core intuition (one sentence)
A token asks a question (Query), other tokens advertise what they contain (Keys), and the token takes useful information (Values).
Step 1: A concrete sentence
Let’s use this sentence:
The trophy didn’t fit in the suitcase because it was too big.
We focus on the word:
it
The model must decide:
Does “it” refer to trophy or suitcase?
Step 2: What is the Query?
Query = “What am I looking for?”
For the token “it”, the query represents something like:
“I am a pronoun. I’m looking for a noun that I can refer to.”
This is not text, but a learned vector encoding:
- pronoun-ness
- grammatical role
- position in sentence
So conceptually:
Query(it) ≈ “find the thing I refer to”
Step 3: What is the Key?
Key = “What do I offer?”
Every token creates a key describing what kind of thing it is.
For example:
Key(trophy) ≈ “I am a concrete object, singular noun”
Key(suitcase) ≈ “I am a container, singular noun”
Key(big) ≈ “I am an adjective”
Again, these are vectors, not labels.
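To underline that these are vectors rather than labels, here is a tiny sketch of how one token embedding is turned into a query, a key, and a value. The weight matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned parameters, and the word "it" is only an example.

```python
import numpy as np

np.random.seed(1)
d_model = 8

# One token embedding, e.g. for the word "it" (random here, for illustration).
x_it = np.random.randn(d_model)

# Three different learned weight matrices give the same token three
# different roles: query, key and value (random stand-ins for trained weights).
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

query_it = x_it @ W_q   # "what am I looking for?"
key_it   = x_it @ W_k   # "what do I offer?"
value_it = x_it @ W_v   # "what information do I carry?"

print(query_it.shape, key_it.shape, value_it.shape)  # (8,) (8,) (8,)
```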
Step 4: Matching Query with Keys (the important part)
The model compares:
Query(it) ↔ Key(trophy)
Query(it) ↔ Key(suitcase)
Query(it) ↔ Key(big)
It learns (from data) that:
- Pronouns often refer to concrete nouns
- “Too big” usually describes the thing that doesn’t fit
So the similarity scores might look like:
| Token | Similarity with Query(it) |
|---|---|
| trophy | 🔥 high |
| suitcase | medium |
| big | low |
This comparison produces attention weights.
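Here is a small numeric sketch of that matching step. The similarity scores are invented to mirror the table above (they are not outputs of a real model); the softmax then converts them into attention weights.

```python
import numpy as np

# Invented similarity scores between Query(it) and the keys of the other
# tokens, chosen to mirror the table above (not real model outputs).
tokens = ["trophy", "suitcase", "big"]
scores = np.array([4.0, 2.0, 0.5])  # trophy: high, suitcase: medium, big: low

# Softmax turns raw similarity scores into attention weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

for token, weight in zip(tokens, weights):
    print(f"{token:<9} {weight:.2f}")
# trophy    0.86
# suitcase  0.12
# big       0.03
```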
Step 5: What is the Value?
Value = “What information should be passed along?”
Once attention weights are decided, the model pulls information from the Values.
For example:
Value(trophy) = semantic info about “trophy”
Value(suitcase) = semantic info about “suitcase”
Because trophy got the highest attention weight, most of its Value is passed to “it”.
So “it” becomes enriched with the meaning of “trophy”.
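And a matching sketch for this step: given the attention weights, the new representation of "it" is just a weighted sum of the value vectors. The value vectors here are random placeholders, and the weights are the rounded numbers from the previous sketch.

```python
import numpy as np

np.random.seed(2)
d_model = 4  # tiny dimension just for readability

# Hypothetical value vectors for the candidate tokens (random placeholders).
value_trophy   = np.random.randn(d_model)
value_suitcase = np.random.randn(d_model)
value_big      = np.random.randn(d_model)
values = np.stack([value_trophy, value_suitcase, value_big])

# Rounded attention weights from the previous sketch (trophy dominates).
weights = np.array([0.86, 0.12, 0.03])

# The new representation of "it" is mostly the value of "trophy".
it_enriched = weights @ values

print(it_enriched)  # a vector dominated by the information from "trophy"
```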
Step 6: Why we need all three
Here’s the key separation:
| Component | Purpose |
|---|---|
| Query | What I’m looking for |
| Key | What I represent |
| Value | The information I carry |
This separation allows:
- Flexible matching (Query ↔ Key)
- Clean information transfer (Value)
Another example (simpler)
Sentence:
Alice gave the book to Bob because he asked for it.
For token “he”:
- Query(he) → “find a male person”
- Key(Alice) → “female person”
- Key(Bob) → “male person”

Attention:
- Bob → high
- Alice → low
Value from Bob flows into “he”.
Why this is powerful
Because:
- The same word can ask different queries in different contexts
- The same word can provide different values depending on context
- Everything is learned automatically
No rules like:
“pronouns refer to nouns”
The model discovers this statistically.
Conclusion
Attention is the core concept of the LLM Transformer architecture. I needed two prompts to grasp the concept at a high level. The real implementation is math, but more on that as we move further through this book.