NLP with Deep Learning: Teaching Machines to Understand Language
Natural Language Processing (NLP) has been revolutionized by deep learning. This guide traces the path from word vectors to the pretrained Transformers behind ChatGPT, covering architectures, embeddings, sequence models, attention, and state-of-the-art language models.
Timeline: Word2Vec (static embeddings) → Seq2Seq + attention → Transformer (self-attention) → BERT (pretraining) → GPT-3/4 (LLMs) → multi-modal (text + image)
Why Deep Learning for NLP?
Traditional NLP relied on hand-crafted features (POS tags, parse trees, lexicons). Deep learning enables end-to-end systems that learn hierarchical representations directly from raw text, capturing syntax, semantics, and world knowledge.
Classical NLP
Feature engineering, sparse representations, separate components (tokenizer → tagger → parser → classifier). Brittle pipelines.
Deep Learning NLP
Learned dense embeddings, hierarchical feature extraction, joint training, transfer learning. One model for multiple tasks.
All components differentiable, trained end-to-end.
From Raw Text to Tokens
Tokenization Strategies
- Word-level: Split by space/punctuation. Large vocab (50k+). OOV problem.
- Character-level: No OOV, long sequences.
- Subword (BPE, WordPiece, Unigram): Splits rare words into frequent pieces. Balance between word & char. Used in BERT, GPT.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I love NLP with deep learning!")
# ['i', 'love', 'nl', '##p', 'with', 'deep', 'learning', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)
Word Embeddings: You Shall Know a Word by Its Context
Word2Vec (2013)
CBOW: Predict word from context.
Skip-gram: Predict context from word.
Skip-gram objective: maximize Σ log p(context | word) over observed (word, context) pairs.
GloVe (2014)
Global Vectors. Factorizes word co-occurrence matrix. Combines count-based & prediction-based.
FastText (2016)
Subword information. Each word as bag of character n-grams. Handles OOV, morphology.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, target, context):
        # target: (batch,), context: (batch,)
        v = self.target_embeddings(target)    # (batch, emb)
        u = self.context_embeddings(context)  # (batch, emb)
        score = torch.sum(v * u, dim=1)       # (batch,)
        return score  # use with BCEWithLogitsLoss
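A single training step makes the BCEWithLogitsLoss hint concrete. Below is a minimal sketch of skip-gram with negative sampling, with the embedding tables written out inline; the word IDs are arbitrary placeholders, and a real pipeline would sample negatives by unigram frequency.

```python
import torch
import torch.nn as nn

vocab_size, emb = 100, 16
target_emb = nn.Embedding(vocab_size, emb)
context_emb = nn.Embedding(vocab_size, emb)

target = torch.tensor([3, 3])      # the same center word for both pairs
context = torch.tensor([7, 42])    # one true context word, one sampled negative
labels = torch.tensor([1.0, 0.0])  # 1 = real pair, 0 = negative sample

# Dot product of target and context vectors, as in SkipGram.forward
score = (target_emb(target) * context_emb(context)).sum(dim=1)
loss = nn.functional.binary_cross_entropy_with_logits(score, labels)
loss.backward()  # gradients flow into both embedding tables
```

Training pushes true pairs toward high scores and negatives toward low ones, which is what makes nearby-word vectors similar.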
Recurrent Neural Networks (RNNs, LSTMs, GRUs)
Process text sequentially, maintaining hidden state. Natural fit for variable-length sequences.
RNN
hₜ = tanh(Wᵢₕ xₜ + bᵢₕ + Wₕₕ hₜ₋₁ + bₕₕ).
Vanishing gradients on long sequences.
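The update formula can be checked directly against PyTorch's `nn.RNNCell`, which implements exactly this tanh recurrence (sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=4, hidden_size=3)  # tanh nonlinearity by default
x_t = torch.randn(1, 4)
h_prev = torch.randn(1, 3)

# hₜ = tanh(Wᵢₕ xₜ + bᵢₕ + Wₕₕ hₜ₋₁ + bₕₕ), written out explicitly
h_manual = torch.tanh(x_t @ cell.weight_ih.T + cell.bias_ih
                      + h_prev @ cell.weight_hh.T + cell.bias_hh)

assert torch.allclose(h_manual, cell(x_t, h_prev), atol=1e-6)
```

Because tanh saturates and the same Wₕₕ is multiplied in at every step, gradients shrink (or blow up) over long sequences, which motivates the gated cells below.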
LSTM (1997)
Forget gate, input gate, output gate, cell state. Mitigates vanishing gradients. Long-range dependencies.
GRU (2014)
Update gate, reset gate. Fewer parameters than LSTM. Comparable performance.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)       # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        out = h_n[-1]               # hidden state of the last layer
        return self.fc(out)
Seq2Seq & Attention: The Breakthrough
Encoder-decoder architecture for translation, summarization, chatbots. Attention solves the bottleneck.
Seq2Seq without Attention
Encoder final state → entire sentence representation. Decoder generates. Information loss for long sentences.
Seq2Seq + Attention
Decoder computes context vector as weighted sum of encoder states. Dynamic alignment.
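The weighted sum above can be sketched in a few lines. This is a minimal dot-product (Luong-style) scoring variant with random placeholder states; other variants (e.g. Bahdanau's additive attention) use a small learned network for the scores instead.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder_states = torch.randn(1, 5, 8)  # (batch, src_len, hidden): one state per source token
decoder_state = torch.randn(1, 8)      # current decoder hidden state

# Alignment score of the decoder state against every encoder position
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (1, 5)
weights = F.softmax(scores, dim=1)                                         # sum to 1
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (1, 8)
```

At each decoding step the weights are recomputed, so the decoder reads a different mixture of source states per target token instead of one fixed sentence vector.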
Transformers: The Architecture That Changed Everything
No recurrence. Pure self-attention. Parallelizable, scalable, contextual embeddings.
Self-Attention
Each token attends to all tokens. Captures long-range context.
Multi-Head
Multiple attention views: syntax, semantics, coreference.
Positional Encoding
Injects sequence order information.
Input → [Multi-Head Self-Attention] → Add & Norm → [Feed-Forward] → Add & Norm → Output
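The self-attention step in that block reduces to a few matrix products. A single-head sketch with randomly initialized projection matrices (a real layer learns these, adds masking, and splits into multiple heads):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (batch, seq, d) tensor."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (batch, seq, seq)
    return F.softmax(scores, dim=-1) @ v                      # weighted mix of values

torch.manual_seed(0)
d = 8
x = torch.randn(2, 4, d)                 # batch of 2, seq_len 4
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # same shape as input: (2, 4, 8)
```

Every token's output row mixes information from all four positions at once, which is why the whole sequence can be processed in parallel.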
Pretrained Language Models: The Foundation of Modern NLP
Train on massive unlabeled text, then fine-tune on downstream tasks. Transfer learning for NLP.
BERT (2018)
Bidirectional Encoder. Masked LM + Next Sentence Prediction. Deeply contextual. SOTA on 11 tasks at release.
Encoder-only Understanding
GPT (2018-...)
Autoregressive Decoder. Predict next token. GPT-3 (175B): few-shot, in-context learning. ChatGPT: instruction-tuned.
Decoder-only Generation
T5
Text-to-Text framework. Encoder-decoder.
RoBERTa
BERT with better hyperparameters, more data.
ALBERT
Parameter-efficient via factorization.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize dataset (texts is a list of raw strings)
encodings = tokenizer(texts, truncation=True, padding=True)

# Trainer API (train_dataset / eval_dataset wrap the encodings and labels)
training_args = TrainingArguments(output_dir="./results", evaluation_strategy="epoch")
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
NLP Tasks & Model Heads
Text Classification
Sentiment, spam, topic. [CLS] token + FFN.
NER
Token classification. Linear layer on each token.
QA
Span extraction (BERT: start/end logits).
Translation
Seq2Seq (T5, Marian).
Summarization
BART, Pegasus.
Text Generation
GPT, Llama.
Semantic Similarity
Sentence-BERT.
Zero-shot
Natural language inference.
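Several of these heads share simple building blocks. Sentence-BERT-style similarity, for instance, reduces to pooling token embeddings into one vector per sentence and comparing with cosine similarity. A sketch with random placeholder embeddings standing in for any encoder's output:

```python
import torch
import torch.nn.functional as F

def sentence_embedding(token_embs, mask):
    """Mean-pool token embeddings, ignoring padded positions."""
    mask = mask.unsqueeze(-1).float()
    return (token_embs * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

torch.manual_seed(0)
token_embs = torch.randn(2, 6, 8)                       # (batch, seq, dim) from any encoder
mask = torch.tensor([[1, 1, 1, 1, 0, 0],                # 1 = real token, 0 = padding
                     [1, 1, 1, 1, 1, 1]])
sents = sentence_embedding(token_embs, mask)            # (2, 8): one vector per sentence
similarity = F.cosine_similarity(sents[0], sents[1], dim=0)  # scalar in [-1, 1]
```

Masking before pooling matters: averaging over padding positions would drag short sentences toward the pad embedding.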
Production NLP Pipeline
- Data collection & cleaning
- Tokenization & model selection
- Fine-tuning & evaluation
- Model compression (quantization, distillation)
- Serving (ONNX, TensorRT, TorchServe)
- Latency optimization
- Monitoring drift
- CI/CD for models
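For the compression step, dynamic quantization is often the lowest-effort win: weights are stored as int8 and activations are quantized on the fly at inference. A sketch on a toy stand-in model (applying it to a real Transformer's linear layers works the same way):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Replace nn.Linear layers with int8 dynamically-quantized equivalents
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
out = quantized(x)  # same interface and output shape as the float model
```

Accuracy should be validated after quantizing; for larger drops, distillation (as in DistilBERT) trades more training effort for a smaller model.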
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
dummy_input = torch.randint(0, 1000, (1, 128)) # batch=1, seq_len=128
torch.onnx.export(model, dummy_input, "bert.onnx",
                  input_names=["input_ids"], output_names=["logits"])
NLP with Deep Learning – Cheatsheet
Deep Learning NLP Models Comparison
| Model | Year | Architecture | Pretraining Task | Best For |
|---|---|---|---|---|
| Word2Vec | 2013 | Shallow NN | CBOW/Skip-gram | Static embeddings |
| LSTM | 1997/2013 | Recurrent | - | Sequential modeling |
| Transformer | 2017 | Self-attention | Translation | Parallel sequence processing |
| BERT | 2018 | Encoder (Transformer) | Masked LM + NSP | Understanding tasks |
| GPT-3 | 2020 | Decoder (Transformer) | Autoregressive LM | Few-shot generation |
| T5 | 2019 | Encoder-Decoder | Span corruption | Text-to-text |
The Future of NLP
Efficiency
Pruning, quantization, distillation. Smaller models (DistilBERT, TinyBERT).
Multimodal
CLIP, Flamingo, GPT-4V. Text + images + audio.
Agents
Tool use, reasoning, planning (AutoGPT).
Alignment
RLHF, Constitutional AI. Safety.