
Attention Mechanism MCQ · test your transformer knowledge

From Bahdanau to BERT – 15 questions covering self‑attention, multi‑head, scaled dot‑product, and modern architectures (Transformer, GPT).

Easy: 5 · Medium: 6 · Hard: 4
Topics: Self‑Attention · Multi‑Head · Scaled Dot‑Product · Transformer

Attention Mechanism: the key to modern NLP

Attention allows models to focus on relevant parts of the input when producing each output. Introduced for machine translation (Bahdanau et al.), it became the foundation of the Transformer architecture (Vaswani et al.) and models like BERT, GPT, and T5. This MCQ covers the core concepts: self‑attention, multi‑head attention, positional encoding, and variants.

Why attention?

It removes the bottleneck of the fixed‑length context vector in RNN encoder–decoders by giving the decoder direct access to all encoder hidden states. The Transformer takes this further, relying entirely on attention and dispensing with recurrence.

Attention glossary – key concepts

Self‑Attention

Attention mechanism where queries, keys, and values come from the same sequence. Each position attends to all positions.

Multi‑Head Attention

Runs multiple attention operations (heads) in parallel, each learning different relationships. Outputs are concatenated and projected.
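To make the split–attend–concatenate–project steps concrete, here is a minimal NumPy sketch of multi‑head self‑attention. The function and variable names are ours, and the projection weights are random placeholders for what would be learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Self-attention over X of shape (seq_len, d_model) with num_heads heads."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads  # per-head dimension
    heads = []
    for _ in range(num_heads):
        # Each head has its own Q/K/V projections (random here, learned in practice)
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                # (seq_len, d_k)
    concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
    W_o = rng.standard_normal((d_model, d_model)) * 0.1
    return concat @ W_o                          # final output projection

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                 # 5 tokens, d_model = 16
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Note that the output has the same shape as the input, which is what lets attention layers be stacked.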

Scaled Dot‑Product Attention

Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Dividing the dot products by √d_k keeps their variance near 1 and prevents the softmax from saturating when d_k is large.
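A quick numerical check of why the 1/√d_k factor matters: dot products of independent unit‑variance vectors have variance roughly d_k, so unscaled logits grow with the key dimension (illustrative sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 512
# 1000 sample query/key pairs with unit-variance components
q = rng.standard_normal((1000, d_k))
k = rng.standard_normal((1000, d_k))
raw = (q * k).sum(axis=-1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)      # scaled as in the Transformer

print(raw.std())     # ≈ sqrt(512) ≈ 22.6 — softmax over such logits saturates
print(scaled.std())  # ≈ 1.0 — keeps softmax gradients healthy
```

With logits spread over a range of ±20 or more, the softmax puts nearly all mass on one position and its gradients vanish; scaling restores a usable spread.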

Positional Encoding

Since Transformers have no recurrence, positional encodings (sinusoidal or learned) inject order information.
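The sinusoidal variant from the original Transformer paper uses sin at even dimensions and cos at odd ones, with wavelengths forming a geometric progression. A minimal sketch (helper name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin terms are 0, cos terms are 1
```

These encodings are simply added to the token embeddings before the first attention layer.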

Encoder‑Decoder Attention

In the Transformer, the decoder attends to the encoder's output: queries come from the decoder, while keys and values come from the encoder.

Masked Self‑Attention

Used in the decoder to prevent each position from attending to future tokens (causal mask).

BERT / GPT

BERT uses bidirectional self‑attention; GPT uses causal (masked) self‑attention for generation.

# Scaled dot‑product attention (NumPy style)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    attn_weights = softmax(scores)
    return attn_weights @ V
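To see the causal mask from the glossary in action, here is a self‑contained sketch (helper names and shapes are ours) showing that a lower‑triangular mask leaves each position attending only to itself and earlier positions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

causal_mask = np.tril(np.ones((seq_len, seq_len)))  # lower triangle = allowed
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(causal_mask == 0, -1e9, scores)   # block future positions
weights = softmax(scores)

print(np.round(weights, 2))
# Row i has nonzero weight only on positions 0..i, and each row sums to 1.
```

This is exactly the masking GPT‑style decoders apply at every layer during autoregressive generation.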
Interview tip: Be ready to explain why scaling is needed in dot‑product attention, the difference between self‑attention and cross‑attention, and how multi‑head attention works. This MCQ covers these distinctions.

Common attention interview questions

  • Why is the dot product scaled by 1/√d_k in Transformer attention?
  • Explain the role of queries, keys, and values in attention.
  • What is the difference between self‑attention and encoder‑decoder attention?
  • How does multi‑head attention improve over single head?
  • Why do Transformers need positional encodings?
  • Describe masked self‑attention in autoregressive models (GPT).