Paper Summary
BERT (Devlin et al., 2018) introduced deep bidirectional pre-training for language models, achieving state-of-the-art results on 11 NLP tasks.
Key Innovations
1. Masked Language Modeling (MLM)
Instead of predicting the next token left-to-right (as GPT does), BERT masks random tokens and predicts them from context on both sides:
Input: The [MASK] sat on the mat
Output: The cat sat on the mat
This allows bidirectional context—the model can look both left and right.
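The corruption procedure can be sketched in plain Python. This is a toy version on whole words (real BERT operates on WordPiece token IDs): the paper selects 15% of positions, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged but still predicted.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy sketch of BERT's MLM corruption (whole words, not WordPieces).

    mask_prob of positions are selected; of those, 80% become [MASK],
    10% become a random vocabulary token, and 10% stay unchanged.
    labels records the original token at every selected position.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)  # kept as-is, but still predicted
        else:
            labels.append(None)  # not selected: no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The 10% random / 10% unchanged cases exist so the model cannot rely on [MASK] always marking the positions it must predict.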
2. Next Sentence Prediction (NSP)
BERT also learns to predict if two sentences follow each other:
[CLS] The cat sat on the mat [SEP] It was a sunny day [SEP]
Label: IsNext / NotNext
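Constructing an NSP training pair can be sketched as follows (function and variable names are illustrative, not from the paper's code): with probability 0.5, sentence B is the true next sentence; otherwise B is sampled at random from the corpus.

```python
import random

def make_nsp_example(sentences, idx, rng):
    """Toy sketch of NSP example construction.

    With probability 0.5, B is the sentence that actually follows A
    (label IsNext); otherwise B is a random corpus sentence (NotNext).
    """
    a = sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(sentences):
        b, label = sentences[idx + 1], "IsNext"
    else:
        b, label = rng.choice(sentences), "NotNext"
    text = "[CLS] " + a + " [SEP] " + b + " [SEP]"
    return text, label

sentences = ["the cat sat on the mat", "it was a sunny day", "dogs bark loudly"]
rng = random.Random(0)
print(make_nsp_example(sentences, 0, rng))
```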
Architecture
- BERT-Base: 12 layers, 768 hidden, 12 heads, 110M parameters
- BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M parameters
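The headline parameter counts can be reproduced with back-of-envelope arithmetic. This sketch assumes BERT's WordPiece vocabulary of 30,522, max position 512, a 4x feed-forward expansion, and counts embeddings, encoder layers, and the pooler (the paper rounds BERT-Large up to 340M):

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    """Back-of-envelope parameter count for a BERT encoder,
    including biases and LayerNorm weights."""
    # Embeddings: token + position + segment tables, plus one LayerNorm
    emb = (vocab + max_pos + 2) * hidden + 2 * hidden
    # Per layer: Q, K, V, O projections; 2-layer FFN; 2 LayerNorms
    attn = 4 * (hidden * hidden + hidden)
    ffn = hidden * (ffn_mult * hidden) + ffn_mult * hidden \
        + (ffn_mult * hidden) * hidden + hidden
    layer = attn + ffn + 2 * (2 * hidden)
    # Pooler: one dense layer on the [CLS] vector
    pooler = hidden * hidden + hidden
    return emb + layers * layer + pooler

print(f"BERT-Base:  ~{bert_param_count(12, 768) / 1e6:.0f}M")
print(f"BERT-Large: ~{bert_param_count(24, 1024) / 1e6:.0f}M")
```

Note the embedding table alone is roughly 24M parameters for BERT-Base, which is why later work focused on factorizing or sharing it.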
Results
BERT achieved SOTA on:
- GLUE benchmark
- SQuAD question answering
- Named entity recognition
- And many more...
Impact
BERT fundamentally changed NLP:
- Pre-train once, fine-tune for many tasks
- Bidirectional context matters
- Scale helps (more data, bigger models)
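The "pre-train once, fine-tune" recipe amounts to putting a small task head on top of the encoder. A minimal sketch of such a head, assuming fake frozen [CLS] vectors as input (in real fine-tuning the whole encoder is updated too, and this tiny logistic head stands in for the classification layer):

```python
import math

def train_cls_head(features, labels, dim, epochs=200, lr=0.1):
    """Toy sketch: train a linear + sigmoid classification head on
    (here fake) [CLS] vectors with plain SGD on binary cross-entropy."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of binary cross-entropy w.r.t. z
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

# Toy "CLS vectors" for two classes
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ys = [1, 1, 0, 0]
w, b = train_cls_head(feats, ys, dim=2)
print(w, b)
```

The point is the cheapness of the task-specific part: a few thousand parameters on top of a pre-trained encoder, versus a bespoke architecture per task.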
Criticisms
- NSP adds little value: RoBERTa dropped it with no loss in downstream performance
- Masking creates a pre-train/fine-tune mismatch: the [MASK] token never appears in downstream data (only partially mitigated by the 80/10/10 corruption rule)
- Computationally expensive to pre-train
My Thoughts
BERT democratized NLP. Before BERT, achieving good results required task-specific architectures. After BERT, you could fine-tune a pre-trained model and get competitive results quickly. This paper, along with the Transformer paper (Vaswani et al., 2017), defined modern NLP.