TL;DR
In the previous post, we gave a high-level overview of the LLM Transformer architecture with examples. Today, we explain what ChatGPT did in “primary school”: the pre-training of an LLM.
The question
What is the main purpose of pre-training an LLM? To develop a broad understanding of language by training on a large, diverse dataset.
And my follow-up prompt:
Please explain with examples how we train an LLM using a large, diverse dataset. For example, please use how the first public ChatGPT version was trained.
Short answer (baseline)
The main purpose of pre-training an LLM is to give it a broad, general understanding of language by exposing it to massive, diverse text and training it to predict the next token.
Everything else (chatting, reasoning, instruction following) builds on top of this.
1️⃣ What “pre-training” means (very precisely)
Pre-training is the phase where an LLM learns:
- Grammar
- Vocabulary
- Facts
- Writing styles
- Basic reasoning patterns
👉 without being taught any specific task
The model is not yet:
- A chatbot
- Helpful
- Polite
- Instruction-following
It is just a language model.
2️⃣ What does “large, diverse dataset” actually mean?
A large, diverse dataset typically includes text from many sources, styles, and domains, for example:
| Source type | What it teaches |
|---|---|
| Books | Long-form structure, narrative |
| Wikipedia | Factual knowledge |
| News articles | Formal writing, current events |
| Forums | Informal language, Q&A |
| Documentation | Technical language |
| Code repositories | Programming syntax |
| Websites | Mixed real-world language |
This diversity is crucial because:
- Language is used differently everywhere
- The model must generalize across contexts
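To make this concrete, here is a minimal Python sketch of how such a mixture is often expressed: a set of sources with sampling weights. The source names and weights below are made up for illustration, not the proportions used by any real model.

```python
import random

# Illustrative mixture of training sources; the weights are made up,
# not the proportions used by any real model.
mixture = {
    "web_pages": 0.55,
    "books":     0.20,
    "news":      0.10,
    "code":      0.10,
    "wikipedia": 0.05,
}

def sample_source(mixture):
    """Pick which corpus the next training document is drawn from."""
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights, k=1)[0]

print(sample_source(mixture))  # e.g. "web_pages"
```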
3️⃣ How pre-training works (step by step)
Step 1: Collect text
Huge amounts of publicly available text are collected and filtered.
Example (conceptual):
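A purely conceptual Python sketch of the idea: gather raw documents, then drop duplicates and low-quality text. The filtering rules here are made up for illustration; real pipelines crawl the web at scale and use far more sophisticated quality and deduplication filters.

```python
raw_documents = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",        # exact duplicate
    "buy now!!! click here!!!",       # low-quality / spam-like
    "Transformers are a neural network architecture based on attention.",
]

def clean(docs):
    seen, kept = set(), []
    for doc in docs:
        if doc in seen:                              # drop exact duplicates
            continue
        if "!!!" in doc or len(doc.split()) < 4:     # crude quality filter
            continue
        seen.add(doc)
        kept.append(doc)
    return kept

print(clean(raw_documents))  # only the two clean, unique documents remain
```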
Step 2: Tokenization
Text is broken into tokens (words or word pieces).
Example:
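A sketch using the open-source GPT-2 tokenizer via the tiktoken library (the exact splits depend on which tokenizer a given model uses):

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # the byte-pair encoding used by GPT-2
tokens = enc.encode("Transformers are powerful")
print(tokens)                                  # the integer token ids
print([enc.decode([t]) for t in tokens])       # the corresponding text pieces
```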
Note that a word like “transformers” may be split into subword pieces (for example “transform”, “er”, and “s”). These splits come from frequency statistics learned by the tokenizer (such as byte-pair encoding), not from grammar rules, although they often line up with English morphology: “transform” is a verb, “-er” turns it into a noun, and “-s” makes it plural.
Step 3: Next-token prediction (the only objective)
The model sees a sequence like “The cat sat on the” and is trained to predict the next token (“mat”).
It learns to minimize prediction error across billions of examples.
Formally:
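In the standard autoregressive setup, the model minimizes the negative log-probability of each token given the tokens that come before it:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
$$

where $x_1, \ldots, x_T$ are the tokens of a training sequence and $\theta$ are the model parameters.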
No labels like:
- “This is a noun”
- “This is a fact”
- “This is polite”
All structure emerges from prediction.
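To see what “all structure emerges from prediction” looks like in code, here is a minimal PyTorch sketch of the next-token objective. `TinyLM` is a made-up stand-in for a real Transformer; the point is that the target for every position is simply the following token.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

class TinyLM(nn.Module):
    """Stand-in for a real Transformer: embed tokens, project back to the vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.proj(self.embed(tokens))   # logits: (batch, seq_len, vocab_size)

model = TinyLM()
batch = torch.randint(0, vocab_size, (2, 16))  # a fake batch of token ids

logits = model(batch[:, :-1])                  # predict from all but the last token
targets = batch[:, 1:]                         # the "label" at each position is the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                # an optimizer step would follow in real training
print(loss.item())
```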
4️⃣ Why diversity matters (very important)
If you trained only on:
- News → model sounds formal
- Code → model sounds like a compiler
- Reddit → model sounds casual
A diverse dataset teaches:
- Style switching
- Context awareness
- Broad world knowledge
This is why pre-training is about breadth, not depth.
5️⃣ Example: how the first public ChatGPT-style models were trained
Let’s use OpenAI ChatGPT (early versions) as an illustrative example.
Phase 1: Pre-training (foundation)
The base model (GPT-3 family) was pre-trained on:
- Large-scale internet text
- Books
- Wikipedia
- Articles
- Code
Goal:
Learn how language works in general.
At this stage, the model could:
- Complete sentences
- Write paragraphs
- Mimic styles
But:
- It was not a good conversationalist
- It could be unsafe or incoherent in dialogue
Phase 2: Fine-tuning (not pre-training, but important context)
After pre-training, the model was:
- Fine-tuned on conversations
- Aligned with human preferences
- Trained using reinforcement learning from human feedback (RLHF)
This is what turned it into ChatGPT.
But without pre-training, this step would be impossible.
6️⃣ Concrete analogy (very helpful)
Think of pre-training like:
Reading the entire internet before learning how to answer questions politely.
- Pre-training → learn the language
- Fine-tuning → learn the behavior
You cannot skip the first step.
7️⃣ What pre-training does not do
Pre-training does not:
- Teach task-specific skills
- Teach safety rules
- Teach company policies
- Make the model “helpful”
It only builds a general linguistic brain.
8️⃣ Visual intuition: pre-training pipeline
Large Text Corpus → Tokenization → Transformer → Next-token prediction → General language understanding
9️⃣ Why this matters for custom LLMs
When you build a custom LLM, you often:
- Start from a pre-trained model
- Then specialize it
Because:
- Pre-training from scratch is extremely expensive
- The general language understanding is already there
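A minimal sketch of that workflow, assuming the Hugging Face transformers library and GPT-2 as a small, openly available base model; your own project would pick a different base model and continue training on domain data:

```python
# Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained weights instead of starting from random ones.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The base model can already continue text; specialization (fine-tuning on
# your own data) would start from these weights rather than from scratch.
inputs = tokenizer("Pre-training gives the model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```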
Conclusion
Teaching an LLM language through pre-training is very expensive, but it forms the foundation for building a chatbot, and it needs no manually created labels: the text itself supplies the training signal. In the next post, we will talk about what an LLM is.