Your phone’s keyboard guesses your next word. A large language model (LLM) is the same idea, scaled up: trained on a large share of the internet’s text and tuned through billions of tiny dials.
Type “I’m running late, I’ll be there in ___” and your keyboard offers “5”, “10”, “a”. It’s predicting what usually comes next. An LLM does this for whole paragraphs, emails, and code.
Five layers, each building on the last
1. Tokens - what the model actually reads
Models don’t see letters or whole words; they see tokens - chunks of text that average roughly 4 characters (about 0.75 of a word) in English. Cost and length limits are counted in tokens, not words.
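To make the rule of thumb concrete, here is a minimal sketch that estimates a token count from a character count. The function name `estimate_tokens` and the fixed 4-characters-per-token ratio are illustrative assumptions for English text; a real tokenizer splits text differently and will give different numbers.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text; an assumption, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about one token per 4 characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


message = "I'm running late, I'll be there in 10"
print(estimate_tokens(message))
```

An estimate like this is only good for budgeting; for billing or hard length limits, count with the tokenizer that belongs to the model you are using.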
2. Next-token prediction - how it generates
At each step the model assigns a probability to every token in its vocabulary (typically tens of thousands of candidates), samples one, appends it to the text, and repeats. After “The capital of France is”, the token “Paris” gets a huge probability and “banana” gets almost none.
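One step of that process can be sketched in a few lines. The scores below are made-up numbers for a four-token toy vocabulary, not output from any real model, and the helper names `softmax` and `sample_next_token` are chosen here for illustration.

```python
import math
import random

# Hypothetical raw scores (logits) for a few candidate next tokens
# after the prompt "The capital of France is". Values are made up.
logits = {"Paris": 9.0, "Lyon": 4.0, "a": 2.0, "banana": -3.0}


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = softmax(logits)
rng = random.Random(0)  # fixed seed so the run is repeatable
print(probs)
print(sample_next_token(probs, rng))
```

A real model would append the sampled token to the prompt, compute fresh scores for the longer text, and sample again, one token at a time.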
3. Temperature - the randomness dial
Temperature controls how adventurous the sampling is. Low = almost always pick the most likely token (focused, consistent). High = give unlikely tokens a real chance (creative, but riskier). Prompt “a tagline for a coffee shop”: low temp → “Great coffee, every day.” High temp → “Sip the sunrise.”
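A common way to implement this dial, sketched here under assumptions, is to divide the raw scores by the temperature before turning them into probabilities. The logits and the prompt are hypothetical, and `softmax_with_temperature` is a name chosen for this example; production systems often combine temperature with other sampling controls.

```python
import math

# Hypothetical logits for the next word after "The weather today is".
logits = {"sunny": 3.0, "cloudy": 2.0, "rainy": 1.5, "banana": -2.0}


def softmax_with_temperature(scores, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At 0.2 nearly all of the probability lands on the top token; at 2.0 the distribution flattens and unlikely tokens get a noticeable share.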
4. How it learned - two phases
- Pre-training: read trillions of words, just predicting the next token. No labels needed - the text itself supplies the right answer.
- Post-training (fine-tuning + RLHF, reinforcement learning from human feedback): coach it to be helpful, honest, and safe.
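Why pre-training needs no labels can be shown with a small sketch: every position in a text yields a training example whose answer is simply the token that comes next. Splitting on whitespace stands in for a real tokenizer here, and `next_token_pairs` is a hypothetical helper written for illustration.

```python
def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Pair each prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting stands in for a real tokenizer here (an assumption).
tokens = "The capital of France is Paris".split()
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Six words give five (context, next token) examples with no human annotation, which is why raw text at internet scale is usable as training data.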
5. Scale and parameters
Parameters are the model’s adjustable dials, set during training - modern models have billions. More parameters + more data + more compute → generally more capability. That is the “scaling” story behind the leaps you’ve seen.
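To get a feel for where the dials come from, the sketch below counts the parameters in a tiny stack of fully connected layers. The layer sizes are hypothetical and chosen only for illustration; real LLMs use more elaborate building blocks, but the counting idea is the same.

```python
def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical layer sizes, chosen only for illustration.
layer_sizes = [512, 2048, 512]
total = sum(dense_layer_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)
```

Two modest layers already hold about two million parameters, so stacking many wider layers reaches the billions quickly.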
Be the model
Set the temperature, then sample the next token - just like an LLM.
Low temperature makes the probability bars peaky - the model almost always picks “sunny.” Crank it up and the bars flatten, so even “banana” gets a turn.
- LLMs work in tokens, predicting the next one from probabilities.
- Pre-training builds knowledge; post-training (fine-tuning + RLHF) makes it helpful.
- Scale drives capability; temperature controls randomness.

