
Attention Mechanism MCQ · test your transformer knowledge

From Bahdanau to BERT – 15 questions covering self‑attention, multi‑head, scaled dot‑product, and modern architectures (Transformer, GPT).

Easy: 5 · Medium: 6 · Hard: 4
Topics: Self‑Attention · Multi‑Head · Scaled Dot‑Product · Transformer

Attention Mechanism: the key to modern NLP

Attention allows models to focus on relevant parts of the input when producing each output. Introduced for machine translation (Bahdanau et al.), it became the foundation of the Transformer architecture (Vaswani et al.) and models like BERT, GPT, and T5. This MCQ covers the core concepts: self‑attention, multi‑head attention, positional encoding, and variants.

Why attention?

It removes the bottleneck of the fixed‑length context vector in RNN encoder–decoders by giving the decoder direct access to all encoder hidden states. The Transformer takes this further, relying entirely on attention and dispensing with recurrence.

Attention glossary – key concepts

Self‑Attention

Attention mechanism where queries, keys, and values come from the same sequence. Each position attends to all positions.

Multi‑Head Attention

Runs multiple attention operations (heads) in parallel, each learning different relationships. Outputs are concatenated and projected.
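To make the split–attend–concatenate–project steps concrete, here is a minimal NumPy sketch of multi‑head self‑attention. The function and variable names are ours, and the projection weights are random placeholders for what would be learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Self-attention over X of shape (seq_len, d_model) with num_heads heads."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads  # per-head dimension
    heads = []
    for _ in range(num_heads):
        # Each head has its own Q/K/V projections (random here, learned in practice)
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                # (seq_len, d_k)
    concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
    W_o = rng.standard_normal((d_model, d_model)) * 0.1
    return concat @ W_o                          # final output projection

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))                 # 5 tokens, d_model = 16
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Note that the output has the same shape as the input, which is what lets attention layers be stacked.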

Scaled Dot‑Product Attention

Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Dividing the dot products by √d_k keeps their variance near 1 and prevents the softmax from saturating when d_k is large.
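A quick numerical check of why the 1/√d_k factor matters: dot products of independent unit‑variance vectors have variance roughly d_k, so unscaled logits grow with the key dimension (illustrative sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 512
# 1000 sample query/key pairs with unit-variance components
q = rng.standard_normal((1000, d_k))
k = rng.standard_normal((1000, d_k))
raw = (q * k).sum(axis=-1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)      # scaled as in the Transformer

print(raw.std())     # ≈ sqrt(512) ≈ 22.6 — softmax over such logits saturates
print(scaled.std())  # ≈ 1.0 — keeps softmax gradients healthy
```

With logits spread over a range of ±20 or more, the softmax puts nearly all mass on one position and its gradients vanish; scaling restores a usable spread.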

Positional Encoding

Since Transformers have no recurrence, positional encodings (sinusoidal or learned) inject order information.
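The sinusoidal variant from the original Transformer paper uses sin at even dimensions and cos at odd ones, with wavelengths forming a geometric progression. A minimal sketch (helper name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin terms are 0, cos terms are 1
```

These encodings are simply added to the token embeddings before the first attention layer.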

Encoder‑Decoder Attention

In the Transformer, the decoder attends to the encoder's output: queries come from the decoder, while keys and values come from the encoder.

Masked Self‑Attention

Used in the decoder to prevent each position from attending to future tokens (causal mask).

BERT / GPT

BERT uses bidirectional self‑attention; GPT uses causal (masked) self‑attention for generation.

# Scaled dot‑product attention (NumPy style)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    attn_weights = softmax(scores)
    return attn_weights @ V
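To see the causal mask from the glossary in action, here is a self‑contained sketch (helper names and shapes are ours) showing that a lower‑triangular mask leaves each position attending only to itself and earlier positions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

causal_mask = np.tril(np.ones((seq_len, seq_len)))  # lower triangle = allowed
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(causal_mask == 0, -1e9, scores)   # block future positions
weights = softmax(scores)

print(np.round(weights, 2))
# Row i has nonzero weight only on positions 0..i, and each row sums to 1.
```

This is exactly the masking GPT‑style decoders apply at every layer during autoregressive generation.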
Interview tip: Be ready to explain why scaling is needed in dot‑product attention, the difference between self‑attention and cross‑attention, and how multi‑head attention works. This MCQ covers these distinctions.

Common attention interview questions

  • Why is the dot product scaled by 1/√d_k in Transformer attention?
  • Explain the role of queries, keys, and values in attention.
  • What is the difference between self‑attention and encoder‑decoder attention?
  • How does multi‑head attention improve over single head?
  • Why do Transformers need positional encodings?
  • Describe masked self‑attention in autoregressive models (GPT).