In 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), which fundamentally transformed the field of natural language processing. Prior to BERT, pretrained models primarily used unidirectional language modeling (like GPT), which could only leverage context in one direction. BERT revolutionized NLP by introducing bidirectional encoder architecture and masked language modeling (MLM), achieving state-of-the-art performance on 11 NLP tasks and ushering in the golden age of the "pretrain-finetune" paradigm.
BERT's success lies not only in its architectural innovations but also in demonstrating that large-scale pretrained models can serve as universal foundations for NLP tasks. Since BERT, variants like RoBERTa, ALBERT, and ELECTRA have continuously emerged, each optimizing BERT's design in different dimensions. Understanding BERT is not just key to understanding modern NLP — it's the starting point for diving into the era of large language models.
This article provides an in-depth analysis of BERT's architecture, training strategies, and finetuning methods, demonstrates practical usage through HuggingFace code examples, and compares various BERT variants and their improvements.
The Rise of the Pretrain-Finetune Paradigm
Before BERT, the mainstream approach in NLP was to train a separate model for each task. This approach had obvious limitations: each task required large amounts of labeled data, which is expensive to obtain; knowledge couldn't be shared across tasks, leading to resource waste.
From Word2Vec to ELMo: Evolution of Pretraining
The idea of pretraining wasn't invented by BERT. As early as 2013, Word2Vec trained word embeddings through unsupervised learning, which could be used as initialization parameters for downstream tasks. However, Word2Vec had a fundamental problem: each word had only one fixed vector representation, unable to handle polysemy.
In 2018, ELMo (Embeddings from Language Models) first proposed context-dependent word representations. ELMo used bidirectional LSTM language models to generate a context-dependent vector for each word.
ELMo's contribution was proving that context-dependent pretrained representations can significantly improve downstream task performance. However, it still used RNN architecture, unable to fully leverage parallel computation, resulting in lower training efficiency.
GPT-1: Attempt at Unidirectional Pretraining
In June 2018, OpenAI released GPT-1 (Generative Pre-trained Transformer), the first model to apply the Transformer architecture to pretraining. GPT-1 used a unidirectional language modeling objective, maximizing L = Σ_t log P(x_t | x_1, …, x_{t-1}).
BERT's Breakthrough: Bidirectional Pretraining
BERT's core innovation is bidirectional pretraining. Unlike GPT's unidirectional modeling, BERT can simultaneously leverage information from both directions of context, giving it a natural advantage in understanding tasks.
BERT's pretrain-finetune paradigm can be summarized as:
- Pretraining Phase: Train on large-scale unlabeled corpora to learn universal language representations
- Finetuning Phase: Finetune on task-specific labeled data to adapt to specific task requirements
The advantages of this paradigm:
- High Data Efficiency: Pretrained models have already learned rich language knowledge; finetuning requires only small amounts of labeled data
- Strong Generalization: The same pretrained model can be used for multiple downstream tasks
- Excellent Performance: Pretrained models typically achieve better performance on downstream tasks
BERT Architecture Deep Dive
BERT is built on the encoder part of Transformer but with key modifications to adapt to bidirectional pretraining needs.
Overall Architecture
BERT's input representation is the sum of three components:
Token Embedding: Maps input tokens to vectors. BERT uses WordPiece tokenization with a vocabulary size of 30,000.
Segment Embedding: Used to distinguish different sentences. BERT can handle sentence pair tasks (like QA) by assigning segment A and segment B embeddings to the tokens of the first and second sentence, respectively.
Position Embedding: Same as Transformer, using learnable positional encodings with a maximum sequence length of 512.
BERT adds a special [CLS] token at the beginning of the input sequence, whose final representation is used for classification tasks. Sentences are separated by [SEP] tokens.
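To make the layout concrete, here is a minimal stdlib sketch of how the tokens and segment ids are assembled (the function name build_bert_input is illustrative; in practice a WordPiece tokenizer handles this, along with subword splitting):

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble a BERT-style input: [CLS] A [SEP] (B [SEP]).

    Returns the token list and the parallel segment ids
    (0 for sentence A and its surrounding special tokens, 1 for sentence B).
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_bert_input(["how", "are", "you"], ["fine", "thanks"])
# tokens: ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
# segs:   [0, 0, 0, 0, 0, 1, 1, 1]
```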
Bidirectional Encoder
BERT's core is the bidirectional Transformer encoder. Unlike GPT's unidirectional attention, each position in BERT can attend to all positions in the sequence (both before and after it). BERT's encoder consists of stacked Transformer encoder layers: 12 for BERT-Base and 24 for BERT-Large.
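The difference from GPT comes down to the attention mask. A small illustrative sketch, assuming 1 means "may attend":

```python
def attention_mask(seq_len, causal=False):
    """Build an attention mask: mask[i][j] == 1 means position i may attend to j.

    BERT uses the full bidirectional mask; GPT zeroes out future positions (j > i).
    """
    return [[0 if causal and j > i else 1 for j in range(seq_len)]
            for i in range(seq_len)]

print(attention_mask(3))               # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(attention_mask(3, causal=True))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
```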
Two BERT Scales
BERT released two model scales:
BERT-Base:
- Layers: 12
- Hidden size: 768
- Attention heads: 12
- Parameters: 110M

BERT-Large:
- Layers: 24
- Hidden size: 1024
- Attention heads: 16
- Parameters: 340M
BERT Training Strategy
BERT uses two unsupervised pretraining tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
Masked Language Modeling (MLM)
MLM is BERT's core pretraining task. The specific approach is:
- Randomly select 15% of tokens in the input sequence for masking
- Among these 15%:
  - 80% are replaced with [MASK]
  - 10% are replaced with random tokens
  - 10% remain unchanged
- The model must predict the original tokens at the masked positions
The cleverness of this strategy:
- 80% use [MASK]: Allows the model to learn the prediction task
- 10% random replacement: Prevents the model from over-relying on [MASK] tokens, improving generalization
- 10% unchanged: Allows the model to learn to leverage contextual information even without explicit mask tokens
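The 80/10/10 rule can be sketched in plain Python. This is a simplified illustration (HuggingFace's data collators do the same thing in vectorized form); the -100 label follows PyTorch's cross-entropy ignore-index convention, and the function name mask_tokens is our own:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15, rng=None):
    """Apply BERT's MLM corruption. Returns (corrupted_ids, labels), where
    labels are -100 (ignored by the loss) except at the selected positions."""
    rng = rng or random.Random()
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:          # select ~15% of positions
            labels[i] = tok                   # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: token left unchanged
    return corrupted, labels
```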
The MLM loss is the cross-entropy over the masked positions: L_MLM = −Σ_{i∈M} log P(x_i | x̃), where M is the set of masked positions and x̃ is the corrupted input.
Next Sentence Prediction (NSP)
The NSP task is used to learn relationships between sentences, which is important for tasks like question answering and natural language inference.
Training data construction: - 50% of samples: Sentence A and Sentence B are consecutive (IsNext) - 50% of samples: Sentence B is randomly selected (NotNext)
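This pair construction can be sketched as follows (the function name make_nsp_pair is illustrative):

```python
import random

def make_nsp_pair(doc_sents, corpus_sents, i, rng=None):
    """Build one NSP training example from sentence i of a document.

    Half the time B is the true next sentence (IsNext); otherwise B is a
    random sentence drawn from the whole corpus (NotNext)."""
    rng = rng or random.Random()
    sent_a = doc_sents[i]
    if rng.random() < 0.5:
        return sent_a, doc_sents[i + 1], "IsNext"
    return sent_a, rng.choice(corpus_sents), "NotNext"
```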
The model uses the [CLS] token representation h_[CLS] for binary classification: P(IsNext | A, B) = softmax(W h_[CLS]).
Joint Training
BERT's final pretraining loss is the sum of both tasks: L = L_MLM + L_NSP.
BERT Variants and Improvements
BERT's success inspired numerous follow-up studies, with various variants optimizing BERT's design in different dimensions.
RoBERTa: A More Robust BERT
RoBERTa (Robustly Optimized BERT Pretraining Approach) was proposed by Facebook in 2019, with main improvements including:
1. Removing the NSP Task
   - Research found that NSP provides limited performance improvement and may even be harmful
   - RoBERTa uses only the MLM task for pretraining
2. Dynamic Masking
   - BERT masks statically during data preprocessing, using the same masking pattern each epoch
   - RoBERTa dynamically generates masks each epoch, increasing training data diversity
3. Larger Batch Size and Longer Training
   - BERT: batch size 256, trained for 1M steps
   - RoBERTa: batch size 8K, trained for longer
4. Larger Training Data
   - In addition to BooksCorpus and Wikipedia, RoBERTa also uses CC-News, OpenWebText, Stories, etc.
RoBERTa surpassed BERT on multiple tasks, demonstrating the importance of optimizing training strategies.
ALBERT: Parameter-Sharing Lightweight BERT
ALBERT (A Lite BERT) was proposed in 2019, with the main goal of reducing parameters while maintaining performance.
1. Factorized Embedding Parameterization
   - In BERT, the embedding dimension E is tied to the hidden dimension H; ALBERT factorizes the V×H embedding matrix into a V×E lookup and an E×H projection (with E ≪ H), sharply reducing embedding parameters
2. Cross-Layer Parameter Sharing
   - All Transformer layers share the same parameters, dramatically reducing the total parameter count
3. Sentence Order Prediction (SOP) Replaces NSP
   - The NSP task is too simple; models mainly learn topic prediction rather than sentence relationships
   - SOP predicts whether two sentences are in reversed order, focusing more on sentence coherence
ALBERT-xxlarge (12 layers) has only 70% of BERT-large's parameters but performs better on multiple tasks.
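The embedding savings are easy to verify with back-of-the-envelope arithmetic. Assuming V ≈ 30K (BERT's vocabulary size), H = 1024, and E = 128 (ALBERT's published choice):

```python
V, H, E = 30000, 1024, 128  # vocab size, hidden size, factorized embedding size

bert_embedding = V * H            # BERT ties embedding size to hidden size
albert_embedding = V * E + E * H  # V x E lookup table plus E x H projection

print(bert_embedding)    # 30720000
print(albert_embedding)  # 3971072  (~7.7x smaller)
```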
ELECTRA: Efficient Pretraining Method
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) was proposed in 2020, replacing MLM with Replaced Token Detection (RTD).
Core Idea:
1. Use a small generator trained with MLM to produce replacement tokens
2. Use a discriminator to determine whether each position's token was replaced
3. Both networks are trained jointly, but only the discriminator is kept for downstream finetuning
Advantages:
- High Training Efficiency: MLM supervises only 15% of tokens; RTD supervises all tokens, achieving higher data utilization
- Better Performance: ELECTRA-small's performance on GLUE approaches BERT-base with far fewer parameters
ELECTRA's loss combines both networks: L = L_MLM(x; θ_G) + λ L_Disc(x; θ_D), where λ weights the discriminator's binary cross-entropy term (λ = 50 in the paper).
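The discriminator's targets are simple to construct, and every position gets one, which is where RTD's data efficiency comes from. A minimal sketch (rtd_labels is an illustrative name):

```python
def rtd_labels(original_ids, corrupted_ids):
    """ELECTRA discriminator targets: 1 where the generator's sampled token
    differs from the original, 0 elsewhere. Note that a position the generator
    happens to fill with the correct token is labeled 0 ("original")."""
    return [int(o != c) for o, c in zip(original_ids, corrupted_ids)]

print(rtd_labels([5, 7, 9, 11], [5, 8, 9, 11]))  # [0, 1, 0, 0]
```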
BERT Variants Comparison Summary
| Model | Main Improvements | Parameters | Training Efficiency | Performance Gain |
|---|---|---|---|---|
| BERT | Baseline | 110M/340M | Baseline | Baseline |
| RoBERTa | Remove NSP, dynamic masking, larger batch | 110M/340M | Similar | +2-3% |
| ALBERT | Parameter sharing, factorized embedding | 12M-235M | Faster | +1-2% |
| ELECTRA | RTD replaces MLM | 14M-335M | Faster | +2-3% |
Downstream Task Finetuning
BERT's strength lies in its generality: the same pretrained model can be adapted to various downstream tasks through finetuning.
Text Classification
Problem Context: Text classification is one of the most common NLP tasks, requiring mapping text to predefined categories. BERT's bidirectional encoding capability allows it to leverage both forward and backward context simultaneously, making it ideal for understanding tasks.
Solution Approach: BERT adds a special [CLS] token at the beginning of the sequence. After passing through multiple Transformer layers, its representation aggregates information from the entire sequence. We add a classification head (linear layer + softmax) on top of this representation to complete the classification task.
Design Considerations:
- The [CLS] token is specifically designed for sequence-level representation
- The classification head is typically a simple linear layer with few parameters, easy to train
- Different pooling strategies (e.g., average pooling) can be used, but [CLS] usually works best
For single-sentence classification tasks (like sentiment analysis), use the [CLS] token representation for classification:
```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("I love this movie!", return_tensors='pt')
outputs = model(**inputs)  # outputs.logits has shape [1, num_labels]
```
Key Points:
- [CLS] token role: Located at the sequence start, aggregates entire-sequence information after multi-layer encoding, suitable for sequence-level tasks
- Classification head design: Typically just a linear layer with hidden_size × num_labels parameters, very lightweight
- Input format: The tokenizer automatically adds [CLS] and [SEP] tokens and handles padding and truncation
Design Trade-offs:
- ✅ Pros: Simple and efficient; the [CLS] token is specifically designed for classification
- ⚠️ Note: Long texts may need truncation, potentially losing information
Common Questions:
- Q: Why use [CLS] instead of other positions? A: [CLS] is specifically trained for sequence-level representation during pretraining and works best
- Q: How to handle multi-class classification? A: Just set num_labels to the number of classes and use cross-entropy loss
Usage Example:

```python
import torch

# Batch processing (tokenizer and model loaded as above)
texts = ["I love this movie!", "This is terrible."]
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# e.g. tensor([1, 0]) if label 1 is positive and 0 negative (assumes a finetuned model)
```
Named Entity Recognition (NER)
For sequence labeling tasks, use each token's representation for classification:
```python
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)  # e.g. BIO tag set

inputs = tokenizer("John lives in New York", return_tensors='pt')
outputs = model(**inputs)  # outputs.logits: [1, seq_len, num_labels], one prediction per token
```
Question Answering (QA)
For reading comprehension tasks, need to predict answer start and end positions in the original text:
```python
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

inputs = tokenizer("Who released BERT?",
                   "BERT was released by researchers at Google in 2018.",
                   return_tensors='pt')
outputs = model(**inputs)
# outputs.start_logits / outputs.end_logits score each position as the answer span's start / end
```
Sentence Pair Classification
For tasks like natural language inference, need to handle sentence pairs:
```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)  # e.g. NLI labels

# Passing two texts makes the tokenizer build "[CLS] A [SEP] B [SEP]" with segment ids
inputs = tokenizer("A man is eating.", "A person is having a meal.", return_tensors='pt')
outputs = model(**inputs)
```
Finetuning Tips
1. Learning Rate Settings
   - Use a smaller learning rate for pretrained layers (e.g., 2e-5)
   - Use a larger learning rate for the classification head (e.g., 1e-4)
```python
from torch.optim import AdamW

# Discriminative learning rates: smaller for the pretrained encoder, larger for the head
optimizer = AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},
    {'params': model.classifier.parameters(), 'lr': 1e-4},
])
```
2. Gradient Accumulation
   - When batch size is limited by memory, use gradient accumulation to simulate larger batches
```python
accumulation_steps = 4  # effective batch size = per-step batch size x 4

for step, batch in enumerate(dataloader):  # dataloader defined elsewhere
    loss = model(**batch).loss / accumulation_steps  # scale so gradients average correctly
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
3. Learning Rate Scheduling
   - Use warmup followed by linear decay
```python
from transformers import get_linear_schedule_with_warmup

# total_steps = number of optimizer updates over all epochs (computed elsewhere)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # e.g. 10% of training for warmup
    num_training_steps=total_steps,
)
```
HuggingFace Practice: Complete Finetuning Pipeline
Below is a complete text classification finetuning pipeline:
```python
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # pre-tokenized datasets, prepared beforehand
    eval_dataset=eval_dataset,
)
trainer.train()
```
BERT's Limitations
Despite BERT's tremendous success, it also has some limitations:
1. High Computational Cost
   - BERT-base has 110M parameters; BERT-large has 340M
   - Inference is relatively slow, limiting real-time applications
2. Weak Generation Capability
   - BERT is a bidirectional encoder, not suited to generation tasks
   - For text generation, autoregressive models like GPT are needed
3. Maximum Sequence Length Limitation
   - BERT's maximum sequence length is 512, so it cannot handle very long texts directly
   - Sliding-window methods can extend this, but their effectiveness is limited
4. Pretraining Data Biased Toward English
   - BERT is mainly trained on English corpora, with limited support for other languages
   - Multilingual BERT (mBERT) supports many languages, but its performance trails monolingual models
Summary
BERT ushered in the pretrain-finetune era of NLP, proving the effectiveness of large-scale pretrained models. Since BERT, various variants and improvements have continuously emerged, driving rapid development in the NLP field.
BERT's core contributions:
1. Bidirectional Pretraining: Achieves bidirectional context understanding through the MLM task
2. Universal Representations: The same model can adapt to multiple downstream tasks
3. Pretrain-Finetune Paradigm: Dramatically reduces data requirements for NLP tasks
Understanding BERT is not just key to understanding modern NLP — it's the starting point for diving into the era of large language models. In the next article, we'll explore the GPT series models to understand the charm of generative language models.
❓ Q&A: BERT Common Questions
Q1: Why does BERT use [MASK] tokens instead of directly predicting the next word?
A: BERT uses MLM task to achieve bidirectional pretraining. If it predicted the next word like GPT, the model could only see left context and couldn't leverage right-side information. MLM masks some tokens and predicts them, allowing the model to simultaneously leverage context from both sides.
Q2: Why can BERT's [CLS] token be used for classification?
A: The [CLS] token is at the beginning of the sequence. After passing through multiple Transformer layers, its representation aggregates information from the entire sequence. Although theoretically any position's token can be used for classification, [CLS] is designed specifically for sequence-level representation, learning effective classification features during pretraining and finetuning.
Q3: What are the main differences between BERT and GPT?
A: Main differences:
- Architecture: BERT is a bidirectional encoder; GPT is a unidirectional decoder
- Pretraining Tasks: BERT uses MLM + NSP; GPT uses language modeling
- Suitable Tasks: BERT excels at understanding tasks (classification, NER, QA); GPT excels at generation tasks
- Context Utilization: BERT sees both forward and backward context; GPT sees only leftward context
Q4: Why did RoBERTa remove the NSP task?
A: Research found that NSP task provides limited performance improvement and may even be harmful. The NSP task is too simple; models mainly learn topic prediction rather than logical relationships between sentences. RoBERTa achieved better performance on multiple tasks by removing NSP and training with longer sequences.
Q5: How does ALBERT reduce parameters?
A: ALBERT mainly uses two methods:
1. Factorized Embedding: Factorizes the word embedding matrix into two smaller matrices, reducing embedding-layer parameters
2. Cross-Layer Parameter Sharing: Shares parameters across all Transformer layers, dramatically reducing the total count
Q6: What are ELECTRA's advantages over BERT?
A: ELECTRA's main advantage is higher training efficiency:
- MLM supervises only 15% of tokens; RTD supervises all tokens, achieving higher data utilization
- Under the same computational budget, ELECTRA performs better
- ELECTRA-small's performance approaches BERT-base with far fewer parameters
Q7: Can BERT be used for generation tasks?
A: BERT itself is not suitable for generation tasks because it is a bidirectional encoder and cannot perform autoregressive generation. However, BERT-style models can be used for generation through some techniques:
- BERT-GEN: Uses a BERT encoder plus an independent decoder
- MASS/BART: Use BERT-style encoder-decoder architectures
For pure generation tasks, autoregressive models like GPT are more suitable.
Q8: How to choose BERT variants?
A: Selection recommendations:
- BERT-base: General choice, balances performance and speed
- RoBERTa: For higher performance when computational resources suffice
- ALBERT: When parameters are limited and a lightweight model is needed
- ELECTRA: For training efficiency, or small models approaching large-model performance
Q9: What should be noted when finetuning BERT?
A: Key considerations:
1. Learning Rate: Use a small learning rate (e.g., 2e-5) to avoid destroying pretrained weights
2. Batch Size: Adjust according to GPU memory; gradient accumulation can help
3. Training Epochs: Usually 2-4 epochs suffice; avoid overfitting
4. Warmup: Use learning rate warmup to stabilize training
5. Regularization: Use dropout and weight decay appropriately
Q10: How does BERT handle multilingual tasks?
A: BERT handles multilingual tasks through:
1. mBERT: Pretrained on multilingual corpora, supporting 100+ languages
2. XLM: Uses cross-lingual pretraining objectives to enhance cross-lingual transfer
3. Monolingual BERT: Dedicated models per language; better performance but more resources
For Chinese tasks, monolingual models such as BERT-Chinese or RoBERTa-Chinese are recommended.
- Post title: NLP (5): BERT and Pretrained Models
- Post author: Chen Kai
- Create time: 2024-02-26 09:30:00
- Post link: https://www.chenk.top/en/nlp-bert-pretrained-models/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.