Language Models
Understand the core concept of Language Modeling: assigning probabilities to sequences of words.
What is a Language Model?
At its absolute core, a Language Model (LM) does one simple mathematical thing: it assigns probabilities to sequences of words. It determines how "likely" a specific sentence is to exist in a given language.
High Probability (Valid English): "The students opened their books."
Low Probability (Gibberish): "Books their opened students the."
The Goal: Next Word Prediction
By the chain rule of probability, assigning a probability to a full sentence is mathematically equivalent to repeatedly predicting the next word given all the words before it (an autoregressive task).
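Concretely, the chain rule factors a sentence's probability into a product of next-word probabilities:

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```

Each factor is exactly the next-word prediction task: the probability of word \(w_i\) given everything that came before it.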
Given a context such as "I walked into the coffee shop and ordered a ___", a good language model will assign a high probability to words like "latte" or "cappuccino", and a near-zero probability to words like "car" or "elephant".
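The idea can be made concrete with a toy bigram model (the simplest n-gram, conditioning on only one previous word). This is an illustrative sketch on a made-up three-sentence corpus, not a real training setup:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would estimate counts from billions of words.
corpus = "i ordered a latte . i ordered a cappuccino . i drove a car .".split()

# Count bigrams: how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next | prev) from the bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_prob(words):
    """Chain rule with a bigram approximation:
    P(w1..wn) ~ product of P(wi | w(i-1))."""
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= next_word_probs(prev).get(nxt, 0.0)
    return p

# A fluent sentence gets nonzero probability; scrambled word order gets zero.
print(sentence_prob(["i", "ordered", "a", "latte"]))
print(sentence_prob(["latte", "a", "ordered", "i"]))
```

Even this tiny model captures the core behavior: valid word orders score higher than gibberish. Its weakness is the table's point about n-grams: it only ever sees one word of context.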
The Evolution of Language Models
| Era | Model Type | How it predicts the next word |
|---|---|---|
| 1990s | Statistical N-gram Models | Counts how often each word follows the previous (n-1) words in a training corpus. Extremely limited memory. |
| 2010s | Recurrent Neural Nets (RNNs) | Passes a "hidden state" vector left-to-right through the sequence. Can carry context in principle, but long-range dependencies fade due to vanishing gradients. |
| 2018 - Present | Transformer LLMs (GPT) | Uses "Self-Attention" to relate every word in the sentence to every other word simultaneously. Scales to hundreds of billions of parameters and shows strong in-context reasoning abilities. |
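The self-attention mechanism in the last row can be sketched in a few lines of numpy. This is a minimal single-head illustration with an assumed identity Q/K/V projection (real Transformer layers learn separate projection matrices and use many heads):

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask: each position
    attends to itself and earlier positions, never to the future."""
    seq_len, d = x.shape
    q, k, v = x, x, x                       # assumption: identity projections, for brevity
    scores = q @ k.T / np.sqrt(d)           # similarity of every word to every word
    mask = np.triu(np.ones((seq_len, seq_len)), 1).astype(bool)
    scores[mask] = -np.inf                  # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v, weights

x = np.random.default_rng(0).normal(size=(4, 8))  # 4 "words", 8-dim embeddings
out, w = causal_self_attention(x)
```

Unlike an RNN's left-to-right hidden state, every position here is computed in one matrix operation; the causal mask is what makes the model usable for next-word prediction.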