Paper Summary
BERT (Devlin et al., 2018) introduced deep bidirectional pre-training for language models, achieving state-of-the-art results on 11 NLP tasks.
Key Innovations
1. Masked Language Modeling (MLM)
Instead of predicting the next token left-to-right (as GPT does), BERT masks random tokens and predicts them from context on both sides:
Input: The [MASK] sat on the mat
Output: The cat sat on the mat
This allows bidirectional context—the model can look both left and right.
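The corruption procedure can be sketched in plain Python. This is a toy version on whole words (real BERT operates on WordPiece token IDs): the paper selects 15% of positions, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged but still predicted.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy sketch of BERT's MLM corruption (whole words, not WordPieces).

    mask_prob of positions are selected; of those, 80% become [MASK],
    10% become a random vocabulary token, and 10% stay unchanged.
    labels records the original token at every selected position.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)  # kept as-is, but still predicted
        else:
            labels.append(None)  # not selected: no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The 10% random / 10% unchanged cases exist so the model cannot rely on [MASK] always marking the positions it must predict.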
2. Next Sentence Prediction (NSP)
BERT also learns to predict if two sentences follow each other:
[CLS] The cat sat on the mat [SEP] It was a sunny day [SEP]
Label: IsNext / NotNext
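Constructing an NSP training pair can be sketched as follows (function and variable names are illustrative, not from the paper's code): with probability 0.5, sentence B is the true next sentence; otherwise B is sampled at random from the corpus.

```python
import random

def make_nsp_example(sentences, idx, rng):
    """Toy sketch of NSP example construction.

    With probability 0.5, B is the sentence that actually follows A
    (label IsNext); otherwise B is a random corpus sentence (NotNext).
    """
    a = sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(sentences):
        b, label = sentences[idx + 1], "IsNext"
    else:
        b, label = rng.choice(sentences), "NotNext"
    text = "[CLS] " + a + " [SEP] " + b + " [SEP]"
    return text, label

sentences = ["the cat sat on the mat", "it was a sunny day", "dogs bark loudly"]
rng = random.Random(0)
print(make_nsp_example(sentences, 0, rng))
```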
Architecture
- BERT-Base: 12 layers, 768 hidden, 12 heads, 110M parameters
- BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M parameters
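The headline parameter counts can be reproduced with back-of-envelope arithmetic. This sketch assumes BERT's WordPiece vocabulary of 30,522, max position 512, a 4x feed-forward expansion, and counts embeddings, encoder layers, and the pooler (the paper rounds BERT-Large up to 340M):

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, ffn_mult=4):
    """Back-of-envelope parameter count for a BERT encoder,
    including biases and LayerNorm weights."""
    # Embeddings: token + position + segment tables, plus one LayerNorm
    emb = (vocab + max_pos + 2) * hidden + 2 * hidden
    # Per layer: Q, K, V, O projections; 2-layer FFN; 2 LayerNorms
    attn = 4 * (hidden * hidden + hidden)
    ffn = hidden * (ffn_mult * hidden) + ffn_mult * hidden \
        + (ffn_mult * hidden) * hidden + hidden
    layer = attn + ffn + 2 * (2 * hidden)
    # Pooler: one dense layer on the [CLS] vector
    pooler = hidden * hidden + hidden
    return emb + layers * layer + pooler

print(f"BERT-Base:  ~{bert_param_count(12, 768) / 1e6:.0f}M")
print(f"BERT-Large: ~{bert_param_count(24, 1024) / 1e6:.0f}M")
```

Note the embedding table alone is roughly 24M parameters for BERT-Base, which is why later work focused on factorizing or sharing it.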
Results
BERT achieved SOTA on:
- GLUE benchmark
- SQuAD question answering
- Named entity recognition
- And many more...
Impact
BERT fundamentally changed NLP:
- Pre-train once, fine-tune for many tasks
- Bidirectional context matters
- Scale helps (more data, bigger models)
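The "pre-train once, fine-tune" recipe amounts to putting a small task head on top of the encoder. A minimal sketch of such a head, assuming fake frozen [CLS] vectors as input (in real fine-tuning the whole encoder is updated too, and this tiny logistic head stands in for the classification layer):

```python
import math

def train_cls_head(features, labels, dim, epochs=200, lr=0.1):
    """Toy sketch: train a linear + sigmoid classification head on
    (here fake) [CLS] vectors with plain SGD on binary cross-entropy."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of binary cross-entropy w.r.t. z
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

# Toy "CLS vectors" for two classes
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ys = [1, 1, 0, 0]
w, b = train_cls_head(feats, ys, dim=2)
print(w, b)
```

The point is the cheapness of the task-specific part: a few thousand parameters on top of a pre-trained encoder, versus a bespoke architecture per task.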
Criticisms
- NSP adds little value: RoBERTa dropped it with no loss in downstream performance
- Masking creates a pre-train/fine-tune mismatch: the [MASK] token never appears in downstream data (only partially mitigated by the 80/10/10 corruption rule)
- Computationally expensive to pre-train
My Thoughts
BERT democratized NLP. Before BERT, achieving good results required task-specific architectures. After BERT, you could fine-tune a pre-trained model and get competitive results quickly. This paper, along with the Transformer paper (Vaswani et al., 2017), defined modern NLP.