Transformers deep dive · 15 questions · 25 min

Transformers MCQ · test your attention knowledge

From self‑attention to BERT & GPT – 15 questions covering multi‑head attention, positional encoding, masking, and core transformer concepts.

Easy: 5 · Medium: 6 · Hard: 4

Topics: Self‑Attention · Multi‑Head · Positional Enc. · Masking

Transformers: the foundation of modern NLP

The Transformer architecture, introduced in "Attention Is All You Need", replaces recurrence with self‑attention. This MCQ tests your understanding of attention mechanisms, positional encoding, and the building blocks of models like BERT and GPT.

Why self‑attention?

Self‑attention allows each token to attend to all other tokens in the sequence, capturing long‑range dependencies and enabling parallelization – a major improvement over RNNs/LSTMs.

Transformer glossary – key concepts

Self‑Attention (Scaled Dot‑Product)

Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. Computes relevance between all pairs of positions.

Multi‑Head Attention

Runs multiple attention heads in parallel, allowing the model to focus on different subspaces.
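A minimal NumPy sketch of the idea, assuming identity-free learned projections passed in as plain matrices (names, shapes, and weights here are illustrative, not from any specific implementation): the model dimension is split into `num_heads` subspaces, attention runs independently in each, and the heads are concatenated and projected back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into num_heads subspaces, attend in each, concat, project."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # reshape (seq_len, d_model) -> (num_heads, seq_len, d_k)
    split = lambda M: M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    heads = softmax(scores) @ Vh                          # (heads, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Each head sees only a `d_k = d_model / num_heads` slice, so the total cost is comparable to one full-width attention while letting heads specialize on different relations.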

Positional Encoding

Adds information about token order (since self‑attention is permutation‑invariant). Usually sine/cosine or learned.
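The sine/cosine variant from the original paper can be sketched in a few lines of NumPy (a simplified version; real implementations often precompute this as a buffer):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe
```

Each dimension oscillates at a different frequency, so every position gets a unique pattern, and relative offsets correspond to fixed linear transformations of the encoding.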

Encoder‑Decoder

Encoder maps input to representations; decoder generates output autoregressively (with cross‑attention).
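Cross-attention is where the two halves meet: queries come from the decoder, keys and values from the encoder output. A toy sketch (it uses the inputs directly as Q/K/V; real models apply learned projections first):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, d_k):
    """Decoder positions (queries) attend over encoder positions (keys/values)."""
    scores = dec_x @ enc_out.T / np.sqrt(d_k)   # (dec_len, enc_len)
    return softmax(scores) @ enc_out            # (dec_len, d_model)
```

Note the score matrix is rectangular: decoder length by encoder length, unlike the square matrix in self-attention.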

Masked Attention

In decoders, causal masking prevents each position from attending to future tokens. (BERT's masked language modeling is a related but distinct idea: it masks input tokens during pretraining rather than masking attention scores.)
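A causal mask is just a lower-triangular matrix of ones: position i may attend to positions 0..i and nothing later. A one-liner sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """1 where attention is allowed (j <= i), 0 where it must be blocked."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))
```

Passing this to an attention function that zeroes out (or sets to -1e9) the masked scores yields autoregressive, GPT-style attention.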

BERT (Encoder‑only)

Bidirectional representations, pretrained with masked LM and next sentence prediction.

GPT (Decoder‑only)

Autoregressive language model trained to predict the next token. Uses causal masking.

# Scaled dot-product attention (simplified; assumes PyTorch tensors)
import math
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:  # note: `if mask:` errors on multi-element tensors
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ V
Interview tip: Be ready to explain why transformers need positional encoding, the difference between BERT and GPT, and how multi‑head attention works. This MCQ covers these distinctions.

Common Transformer interview questions

  • Why is self‑attention considered more parallelizable than RNNs?
  • Explain the role of the scaling factor √dₖ in attention.
  • What is the purpose of multi‑head attention?
  • How does positional encoding work? Why sine/cosine functions?
  • Compare encoder‑only (BERT) vs decoder‑only (GPT) architectures.
  • What is masked attention and where is it used?
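On the scaling question above: a quick numerical illustration of why the √dₖ factor matters (the sizes and random data here are arbitrary, chosen only to make the effect visible). Dot products of d_k-dimensional random vectors have variance on the order of d_k, so unscaled logits push the softmax toward a one-hot distribution with vanishing gradients; dividing by √dₖ brings the variance back to ~1.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))   # 10 keys
raw = K @ q                          # dot products: variance ~ d_k
scaled = raw / np.sqrt(d_k)          # variance brought back to ~1
# Unscaled logits saturate the softmax toward one-hot; scaled stay softer.
print(softmax(raw).max(), softmax(scaled).max())
```

Running this shows the unscaled softmax putting nearly all mass on one key, while the scaled version keeps a much flatter distribution.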