Attention Mechanism — 15 Interview Questions
Q/K/V, scaled dot-product, causal masking, multi-head attention, and how attention stacks into a Transformer block.
Tags: QKV · Softmax · Multi-head · Causal
1. Intuition: what does attention compute? (Easy)
Answer: A weighted sum of values, where weights (attention scores) say how much each source position matters for the current query—soft lookup over a set of vectors.
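A minimal sketch of the "soft lookup" idea, assuming PyTorch and toy sizes (all names are illustrative):

import torch

values = torch.randn(3, 4)                # 3 source positions, each a 4-dim value vector
weights = torch.tensor([0.7, 0.2, 0.1])   # attention weights for one query: non-negative, sum to 1
output = weights @ values                 # the output is just a weighted (convex) sum of the values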
2. What are Query, Key, and Value? (Easy)
Answer: Three linear projections of the inputs (or of cross-modal sources). The query asks "what do I need?", keys label the slots, and values carry the content that gets mixed by the attention weights.
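A sketch of the three projections, assuming PyTorch and illustrative dimensions:

import torch
import torch.nn as nn

d_model, d_k = 16, 8                        # illustrative sizes
x = torch.randn(5, d_model)                 # 5 tokens
W_q = nn.Linear(d_model, d_k, bias=False)   # query projection
W_k = nn.Linear(d_model, d_k, bias=False)   # key projection
W_v = nn.Linear(d_model, d_k, bias=False)   # value projection
Q, K, V = W_q(x), W_k(x), W_v(x)            # three learned views of the same tokens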
3. Scaled dot-product attention formula. (Medium)
Answer: Attention(Q,K,V) = softmax(QKᵀ / √d_k) V. Scaling by √d_k keeps the dot products from growing too large, so the softmax doesn't saturate.
softmax(QK^T / √d_k) V
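The formula maps almost line for line onto code; a minimal PyTorch sketch (the function name and optional mask argument are illustrative):

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)     # (..., n_q, n_k) similarity logits
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))  # e.g. causal masking (see Q5)
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V                                    # weighted sum of the values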
4. Self-attention vs cross-attention. (Medium)
Answer: Self: Q, K, V from same sequence (e.g. encoder). Cross: Q from one sequence (decoder), K,V from another (encoder output)—decoder attends to source.
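The only difference is where Q and K/V come from; a sketch assuming PyTorch 2.x's F.scaled_dot_product_attention (tensor shapes are illustrative):

import torch
import torch.nn.functional as F

enc = torch.randn(1, 10, 16)   # encoder output: 10 source tokens
dec = torch.randn(1, 7, 16)    # decoder states: 7 target tokens

self_attn  = F.scaled_dot_product_attention(dec, dec, dec)   # Q, K, V from the same sequence
cross_attn = F.scaled_dot_product_attention(dec, enc, enc)   # Q from decoder, K and V from encoder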
5. Causal (look-ahead) mask in decoders. (Medium)
Answer: Set attention logits to −∞ for future positions before softmax—position t cannot attend to t+1,…—preserves autoregressive generation.
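A minimal sketch of a causal mask on a 5×5 score matrix, assuming PyTorch:

import torch

n = 5
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # True above the diagonal = future positions
scores = torch.randn(n, n)
scores = scores.masked_fill(future, float("-inf"))                   # hide the future before softmax
weights = torch.softmax(scores, dim=-1)                              # row t puts zero weight on t+1, ...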
6. Multi-head attention: why multiple heads? (Medium)
Answer: Each head learns different subspaces of relationships in parallel; concatenating heads lets the model capture multiple dependency types (syntax, coreference, etc.).
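A quick sketch with PyTorch's built-in nn.MultiheadAttention (sizes illustrative):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 16)          # (batch, seq, dim)
out, attn_weights = mha(x, x, x)    # self-attention: the same tensor passed as Q, K, and V
# internally: 4 heads of size 16/4 = 4 run in parallel, are concatenated, then projected back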
7. Pre-LN vs Post-LN Transformer (brief). (Hard)
Answer: Post-LN is the original "Attention → Add & Norm" ordering. Pre-LN applies the norm before each sublayer, which is often more stable when training very deep stacks; both appear in the literature.
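A sketch of the ordering difference for the attention sublayer only (PyTorch, illustrative sizes):

import torch
import torch.nn as nn

norm = nn.LayerNorm(16)
attn = nn.MultiheadAttention(16, 4, batch_first=True)
x = torch.randn(2, 10, 16)

post = norm(x + attn(x, x, x)[0])   # Post-LN: sublayer, residual add, then norm

h = norm(x)                         # Pre-LN: norm first...
pre = x + attn(h, h, h)[0]          # ...then sublayer and residual add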
8. Why positional encoding? (Easy)
Answer: Without order information, attention is permutation-invariant, so sinusoidal or learned position encodings are added to the embeddings so that "cat bites dog" ≠ "dog bites cat."
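A sketch of the sinusoidal variant from the original Transformer paper, assuming PyTorch (function name is illustrative):

import torch

def sinusoidal_positions(n_pos, d_model):
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)   # (n_pos, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions
    angles = pos / (10000.0 ** (i / d_model))                     # (n_pos, d_model/2)
    pe = torch.zeros(n_pos, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(10, 16) + sinusoidal_positions(10, 16)            # add positions to token embeddings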
9. Time complexity of self-attention in sequence length n. (Medium)
Answer: O(n² · d) for attention matrix over pairs—quadratic in length is the main bottleneck for long contexts; motivates sparse/linear attention variants.
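A back-of-the-envelope sketch of why the n × n matrix hurts (numbers are illustrative):

n, heads, bytes_fp16 = 8192, 16, 2
attn_matrix_bytes = heads * n * n * bytes_fp16
print(attn_matrix_bytes / 2**30, "GiB per layer, per example")    # 2.0 GiB at n = 8192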
10. Bahdanau (additive) vs Luong (dot) attention. (Hard)
Answer: Older seq2seq: additive scores use a small MLP on [s_t; h_j]; multiplicative/dot uses direct similarity—scaled dot-product is the modern dot family at scale.
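A sketch contrasting the two score functions for a single decoder/encoder state pair (PyTorch, illustrative sizes):

import torch
import torch.nn as nn

d = 8
s_t, h_j = torch.randn(d), torch.randn(d)   # decoder state, encoder state

dot_score = s_t @ h_j                       # Luong / multiplicative: direct similarity

W = nn.Linear(2 * d, d)                     # Bahdanau / additive: small MLP on [s_t; h_j]
v = nn.Linear(d, 1, bias=False)
add_score = v(torch.tanh(W(torch.cat([s_t, h_j]))))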
11. What sits after attention in a Transformer block? (Easy)
Answer: Feed-forward network (MLP) applied per position—typically expand (4d) with GELU/ReLU then project back; residual + norm around each sublayer.
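A sketch of the position-wise FFN sublayer with a pre-LN residual (PyTorch, illustrative sizes):

import torch
import torch.nn as nn

d_model = 16
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand to 4d
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),   # project back to d
)
x = torch.randn(2, 10, d_model)
out = x + ffn(nn.LayerNorm(d_model)(x))   # applied independently at every position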
12. Attention dropout: where? (Easy)
Answer: Dropout on attention weights (after softmax) or on scores in some implementations—regularizes attention patterns.
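A sketch of dropout applied to the post-softmax weights (PyTorch):

import torch
import torch.nn.functional as F

weights = torch.softmax(torch.randn(10, 10), dim=-1)
weights = F.dropout(weights, p=0.1, training=True)   # randomly zeroes weights and rescales the rest
# note: rows no longer sum exactly to 1 after dropout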
13. Vision Transformer: how is attention used? (Medium)
Answer: Split image into patches, embed as tokens, run Transformer encoder self-attention—global mixing of patch relationships without conv inductive bias (with data scale).
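A sketch of the patchify step via a strided convolution (PyTorch; patch size and width are illustrative):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                                     # (batch, channels, H, W)
patch, d_model = 16, 192
to_tokens = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)    # non-overlapping 16x16 patches
tokens = to_tokens(img).flatten(2).transpose(1, 2)                    # (1, 196, 192): 14 x 14 patch tokens
# these tokens (plus a class token and positional embeddings) feed a standard Transformer encoder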
14. FlashAttention (interview one-liner). (Hard)
Answer: IO-aware exact attention implementation that fuses ops and tiles to SRAM—same math, faster training on GPUs for long sequences.
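In PyTorch 2.x, F.scaled_dot_product_attention can dispatch to fused FlashAttention-style kernels on supported GPUs; a sketch (shapes illustrative):

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 1024, 64)   # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
# same math as softmax(QKᵀ / √d_k) V, but tiled kernels avoid materializing the full n x n matrix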
15. Encoder-only vs decoder-only vs encoder–decoder. (Medium)
Answer: Encoder-only (BERT): bidirectional context. Decoder-only (GPT): causal LM. Enc–Dec (T5, original Transformer): encoder sees source, decoder generates target with cross-attention.
Memorize softmax(QKᵀ / √d_k) V and be able to state each matrix's shape (Q: n_q × d_k, K: n_k × d_k, V: n_k × d_v).
Quick review checklist
- Q/K/V; scaled dot-product; self vs cross; causal mask.
- Multi-head; positional encoding; O(n²) cost.
- Transformer block; ViT; encoder/decoder roles.