Paper Summary
This landmark paper ("Attention Is All You Need", Vaswani et al., 2017) introduced the Transformer architecture, which has since become the foundation for most modern NLP models.
Key Contributions
- Self-Attention Mechanism: Replaced recurrence with attention for sequence modeling
- Multi-Head Attention: Allows the model to attend to different representation subspaces
- Positional Encoding: Injects sequence order information without recurrence
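The positional encoding in the paper is sinusoidal: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dimensions
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Each position gets a unique pattern of sine/cosine values, and the encoding is simply added to the token embeddings, so no learned position parameters or recurrence are needed.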
The Architecture
The Transformer consists of:
- Encoder: 6 identical layers with self-attention and feed-forward networks
- Decoder: 6 identical layers with masked self-attention, encoder-decoder attention, and feed-forward networks
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O
$$
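The equation above can be sketched in NumPy. Each head applies scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, to its own projections of Q, K, V; the helper names and the random toy weights below are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O, h):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W_O,
    where head_i = attention(Q W_Q[i], K W_K[i], V W_V[i])."""
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

# Toy self-attention: a sequence of 5 tokens, d_model=16, h=4 heads.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_k = d_model // h
x = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head(x, x, x, W_Q, W_K, W_V, W_O, h)  # Q = K = V = x
print(out.shape)  # (5, 16)
```

Because each head works in a lower-dimensional subspace (d_k = d_model / h), the total cost is similar to single-head attention at full dimension, while letting heads specialize.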
Why This Paper Matters
Before Transformers:
- RNNs/LSTMs were the standard for sequence modeling
- Training was sequential and slow
- Long-range dependencies were difficult to capture
After Transformers:
- Parallel training enabled massive scale
- Models like BERT, GPT, T5 became possible
- Attention became the dominant paradigm
Strengths
- Elegant, simple architecture
- Highly parallelizable
- Strong empirical results
- Clear writing and presentation
Limitations
- Quadratic time and memory in sequence length for self-attention ($O(n^2)$)
- No inherent notion of position (requires positional encoding)
- Large memory footprint for long sequences
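The first and third limitations are two sides of the same issue: every token attends to every other token, so the attention score matrix is $n \times n$. A quick illustration (dimensions chosen arbitrarily):

```python
import numpy as np

# Doubling the sequence length quadruples the number of attention scores,
# and with it the memory needed to hold the score matrix.
d_model = 64
sizes = []
for n in [256, 512, 1024]:
    x = np.zeros((n, d_model))
    scores = x @ x.T          # (n, n) attention score matrix
    sizes.append(scores.size)
    print(n, scores.size)     # 256 65536, 512 262144, 1024 1048576
```

This quadratic growth is what later "efficient attention" lines of work (sparse, low-rank, and linear-attention variants) try to address.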
My Takeaways
This paper is a masterclass in research presentation. The authors clearly motivate the problem, present a clean solution, and provide thorough experiments. A must-read for anyone in ML.