
Review: BERT - Pre-training of Deep Bidirectional Transformers

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin et al.

NAACL 2019

arXiv →
Oct 15, 2025

Paper Summary

BERT introduced bidirectional pre-training for language models, achieving state-of-the-art results on 11 NLP tasks.

Key Innovations

1. Masked Language Modeling (MLM)

Instead of predicting the next word (like GPT), BERT masks random tokens and predicts them:

Input:  The [MASK] sat on the mat
Output: The cat sat on the mat

This gives the model bidirectional context: it can attend to tokens on both the left and the right of the mask.
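The masking step can be sketched in plain Python. This is illustrative only (function name and 15% rate chosen per the paper; the paper's full recipe also sometimes substitutes random tokens instead of [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select roughly 15% of tokens, replace them with [MASK], and
    return (masked_tokens, labels) where labels hold the originals."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)   # position not scored in the MLM loss
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split(), seed=3)
```

Only the masked positions contribute to the loss, which is why `labels` is `None` everywhere else.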

2. Next Sentence Prediction (NSP)

BERT is also trained to predict whether the second sentence actually follows the first in the original text:

[CLS] The cat sat on the mat [SEP] It was a sunny day [SEP]
Label: IsNext / NotNext
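The input packing above is easy to reproduce. A minimal sketch (pure Python, pre-tokenized input assumed; segment ids distinguish sentence A from sentence B):

```python
def build_nsp_input(sent_a, sent_b):
    """Pack a sentence pair in BERT's input format:
    [CLS] tokens_a [SEP] tokens_b [SEP], with segment id 0 for
    everything up to the first [SEP] and 1 for the rest."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids

tokens, segs = build_nsp_input("the cat sat".split(), "it was sunny".split())
```

The NSP label is predicted from the final hidden state of the [CLS] token.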

Architecture

  • BERT-Base: 12 layers, 768 hidden, 12 heads, 110M parameters
  • BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M parameters

Results

BERT achieved SOTA on:

  • GLUE benchmark
  • SQuAD question answering
  • Named entity recognition
  • And many more...

Impact

BERT fundamentally changed NLP:

  1. Pre-train once, fine-tune for many tasks
  2. Bidirectional context matters
  3. Scale helps (more data, bigger models)

Criticisms

  • NSP was later shown to add little value; RoBERTa drops it with no loss in accuracy
  • Masking creates a train-test mismatch: [MASK] tokens appear during pre-training but never at fine-tuning time
  • Pre-training is computationally expensive
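On the mismatch point, the paper does try to soften it: of the 15% of tokens selected for prediction, only 80% are replaced with [MASK], 10% become a random token, and 10% are left unchanged. A sketch of that rule (pure Python, toy vocabulary for illustration):

```python
import random

def corrupt_token(token, vocab, rng):
    """Apply BERT's 80/10/10 rule to a token already chosen for prediction."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"            # 80%: replace with [MASK]
    elif r < 0.9:
        return rng.choice(vocab)   # 10%: replace with a random token
    else:
        return token               # 10%: keep the original token

rng = random.Random(0)
vocab = ["cat", "dog", "mat", "sun"]
outcomes = [corrupt_token("cat", vocab, rng) for _ in range(10000)]
```

The model still has to predict the original token at every selected position, so it cannot simply learn "copy whatever is not [MASK]".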

My Thoughts

BERT democratized NLP. Before BERT, achieving good results required task-specific architectures. After BERT, you could fine-tune a pre-trained model and get competitive results quickly. This paper, along with the Transformer paper, defined modern NLP.