TL;DR
In the previous post, we gave a high-level overview of the LLM Transformer architecture with examples. Today, we explain what ChatGPT did in “primary school”: the pre-training of an LLM.
The question
What is the main purpose of pre-training an LLM? To develop a broad understanding of language by training on a large, diverse dataset.
And my follow-up prompt:
Please explain with examples how we train an LLM using a large, diverse dataset. For example, please use how the first public ChatGPT version was trained.
Short answer (baseline)
The main purpose of pre-training an LLM is to give it a broad, general understanding of language by exposing it to massive, diverse text and training it to predict the next token.
Everything else (chatting, reasoning, instruction following) builds on top of this.
1️⃣ What “pre-training” means (very precisely)
Pre-training is the phase where an LLM learns:
- Grammar
- Vocabulary
- Facts
- Writing styles
- Basic reasoning patterns
👉 without being taught any specific task
The model is not yet:
- A chatbot
- Helpful
- Polite
- Instruction-following
It is just a language model.
2️⃣ What does “large, diverse dataset” actually mean?
A large, diverse dataset typically includes text from many sources, styles, and domains, for example:
| Source type | What it teaches |
|---|---|
| Books | Long-form structure, narrative |
| Wikipedia | Factual knowledge |
| News articles | Formal writing, current events |
| Forums | Informal language, Q&A |
| Documentation | Technical language |
| Code repositories | Programming syntax |
| Websites | Mixed real-world language |
This diversity is crucial because:
- Language is used differently everywhere
- The model must generalize across contexts
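To make this concrete, here is a minimal Python sketch of how such a mixture is often expressed: a set of sources with sampling weights. The source names and weights below are made up for illustration, not the proportions used by any real model.

```python
import random

# Illustrative mixture of training sources; the weights are made up,
# not the proportions used by any real model.
mixture = {
    "web_pages": 0.55,
    "books":     0.20,
    "news":      0.10,
    "code":      0.10,
    "wikipedia": 0.05,
}

def sample_source(mixture):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

print(sample_source(mixture))  # e.g. "web_pages"
```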
3️⃣ How pre-training works (step by step)
Step 1: Collect text
Huge amounts of publicly available text are collected and filtered.
Example (conceptual):
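A purely conceptual Python sketch of the idea: gather raw documents, then drop duplicates and low-quality text. The filtering rules here are made up for illustration; real pipelines crawl the web at scale and use far more sophisticated quality and deduplication filters.

```python
raw_documents = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",        # exact duplicate
    "buy now!!! click here!!!",       # low-quality / spam-like
    "Transformers are a neural network architecture based on attention.",
]

def clean(docs):
    seen, kept = set(), []
    for doc in docs:
        if doc in seen:                              # drop exact duplicates
            continue
        if "!!!" in doc or len(doc.split()) < 4:     # crude quality filter
            continue
        seen.add(doc)
        kept.append(doc)
    return kept

print(clean(raw_documents))  # only the two clean, unique documents remain
```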
Step 2: Tokenization
Text is broken into tokens (words or word pieces).
Example:
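A sketch using the open-source GPT-2 tokenizer via the tiktoken library (the exact splits depend on which tokenizer a given model uses):

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # the byte-pair encoding used by GPT-2
tokens = enc.encode("Transformers are powerful")
print(tokens)                                  # the integer token ids
print([enc.decode([t]) for t in tokens])       # the corresponding text pieces
```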
Note that a word like “transformers” may be split into subword pieces (for example “transform”, “er”, and “s”). These splits come from frequency statistics learned by the tokenizer (such as byte-pair encoding), not from grammar rules, although they often line up with English morphology: “transform” is a verb, “-er” turns it into a noun, and “-s” makes it plural.
Step 3: Next-token prediction (the only objective)
The model sees a sequence like “The cat sat on the” and is trained to predict the next token (“mat”).
It learns to minimize prediction error across billions of examples.
Formally:
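In the standard autoregressive setup, the model minimizes the negative log-probability of each token given the tokens that come before it:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
$$

where $x_1, \ldots, x_T$ are the tokens of a training sequence and $\theta$ are the model parameters.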
No labels like:
- “This is a noun”
- “This is a fact”
- “This is polite”
All structure emerges from prediction.
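To see what “all structure emerges from prediction” looks like in code, here is a minimal PyTorch sketch of the next-token objective. `TinyLM` is a made-up stand-in for a real Transformer; the point is that the target for every position is simply the following token.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    """Stand-in for a real Transformer: embed tokens, project back to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.proj(self.embed(tokens))   # logits: (batch, seq_len, vocab_size)

model = TinyLM()
batch = torch.randint(0, vocab_size, (2, 16))  # a fake batch of token ids

logits = model(batch[:, :-1])                  # predict from all but the last token
targets = batch[:, 1:]                         # the "label" at each position is the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                # an optimizer step would follow in real training
print(loss.item())
```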
4️⃣ Why diversity matters (very important)
If you trained only on:
- News → model sounds formal
- Code → model sounds like a compiler
- Reddit → model sounds casual
A diverse dataset teaches:
- Style switching
- Context awareness
- Broad world knowledge
This is why pre-training is about breadth, not depth.
5️⃣ Example: how the first public ChatGPT-style models were trained
Let’s use OpenAI ChatGPT (early versions) as an illustrative example.
Phase 1: Pre-training (foundation)
The base model (GPT-3 family) was pre-trained on:
- Large-scale internet text
- Books
- Wikipedia
- Articles
- Code
Goal:
Learn how language works in general.
At this stage, the model could:
- Complete sentences
- Write paragraphs
- Mimic styles
But:
- It was not a good conversationalist
- It could be unsafe or incoherent in dialogue
Phase 2: Fine-tuning (not pre-training, but important context)
After pre-training, the model was:
- Fine-tuned on conversations
- Aligned with human preferences
- Trained using reinforcement learning from human feedback (RLHF)
This is what turned it into ChatGPT.
But without pre-training, this step would be impossible.
6️⃣ Concrete analogy (very helpful)
Think of pre-training like:
Reading the entire internet before learning how to answer questions politely.
- Pre-training → learn the language
- Fine-tuning → learn the behavior
You cannot skip the first step.
7️⃣ What pre-training does not do
Pre-training does not:
- Teach task-specific skills
- Teach safety rules
- Teach company policies
- Make the model “helpful”
It only builds a general linguistic brain.
8️⃣ Visual intuition: pre-training pipeline
Large Text Corpus → Tokenization → Transformer → Next-token prediction → General language understanding
9️⃣ Why this matters for custom LLMs
When you build a custom LLM, you often:
- Start from a pre-trained model
- Then specialize it
Because:
- Pre-training from scratch is extremely expensive
- The general language understanding is already there
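A minimal sketch of that workflow, assuming the Hugging Face transformers library and GPT-2 as a small, openly available base model; your own project would pick a different base model and continue training on domain data:

```python
# Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained weights instead of starting from random ones.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The base model can already continue text; specialization (fine-tuning on
# your own data) would start from these weights rather than from scratch.
inputs = tokenizer("Pre-training gives the model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```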
Conclusion
Teaching an LLM language through pre-training is very expensive, but it forms the foundation for building a chatbot, and it needs no manually created labels: the text itself supplies the training signal. In the next post, we will talk about what an LLM is.