NLP (5): BERT and Pretrained Models
Chen Kai

In 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), which fundamentally transformed the field of natural language processing. Prior to BERT, pretrained models primarily used unidirectional language modeling (like GPT), which could only leverage context in one direction. BERT revolutionized NLP by introducing bidirectional encoder architecture and masked language modeling (MLM), achieving state-of-the-art performance on 11 NLP tasks and ushering in the golden age of the "pretrain-finetune" paradigm.

BERT's success lies not only in its architectural innovations but also in demonstrating that large-scale pretrained models can serve as universal foundations for NLP tasks. Since BERT, variants like RoBERTa, ALBERT, and ELECTRA have continuously emerged, each optimizing BERT's design in different dimensions. Understanding BERT is not just key to understanding modern NLP — it's the starting point for diving into the era of large language models.

This article provides an in-depth analysis of BERT's architecture, training strategies, and finetuning methods, demonstrates practical usage through HuggingFace code examples, and compares various BERT variants and their improvements.

The Rise of Pretrain-Finetune Paradigm

Before BERT, the mainstream approach in NLP was to train a separate model for each task. This approach had obvious limitations: each task required large amounts of labeled data, which is expensive to obtain; knowledge couldn't be shared across tasks, leading to resource waste.

From Word2Vec to ELMo: Evolution of Pretraining

The idea of pretraining wasn't invented by BERT. As early as 2013, Word2Vec trained word embeddings through unsupervised learning, which could be used as initialization parameters for downstream tasks. However, Word2Vec had a fundamental problem: each word had only one fixed vector representation, unable to handle polysemy.

In 2018, ELMo (Embeddings from Language Models) first proposed context-dependent word representations. ELMo used bidirectional LSTM language models to generate context-dependent vectors for each word:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}$$

where $h_{k,j}^{LM}$ is the hidden state of the $j$-th LSTM layer at position $k$, $s_j^{task}$ is a task-specific weight, and $\gamma^{task}$ is a scaling factor.

ELMo's contribution was proving that context-dependent pretrained representations can significantly improve downstream task performance. However, it still relied on an RNN architecture, which cannot be fully parallelized, resulting in lower training efficiency.

GPT-1: Attempt at Unidirectional Pretraining

In June 2018, OpenAI released GPT-1 (Generative Pre-trained Transformer), the first model to apply the Transformer architecture to pretraining. GPT-1 used a unidirectional language modeling objective:

$$\mathcal{L} = \sum_{i} \log P(w_i \mid w_1, \ldots, w_{i-1})$$

GPT-1 achieved good results on multiple tasks, but unidirectional modeling limited the model's ability to understand context. For example, when processing the sentence "The bank is closed", GPT can only see "The bank is" at that position and cannot leverage "closed" to better understand the meaning of "bank".
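The contrast between unidirectional and bidirectional attention can be sketched with a toy mask (a minimal illustration, not tied to any real model):

```python
import torch

seq_len = 5
# Causal (GPT-style) mask: position i may attend only to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Bidirectional (BERT-style) pattern: every position attends to every position
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# What position 2 can see under each scheme:
print(causal_mask[2])  # tensor([True, True, True, False, False])
print(full_mask[2])    # tensor([True, True, True, True, True])
```

In a GPT-style model the `False` entries are set to $-\infty$ before the softmax, so "closed" can never influence the representation of "bank".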

BERT's Breakthrough: Bidirectional Pretraining

BERT's core innovation is bidirectional pretraining. Unlike GPT's unidirectional modeling, BERT can simultaneously leverage information from both directions of context, giving it a natural advantage in understanding tasks.

BERT's pretrain-finetune paradigm can be summarized as:

  1. Pretraining Phase: Train on large-scale unlabeled corpora to learn universal language representations
  2. Finetuning Phase: Finetune on task-specific labeled data to adapt to specific task requirements

The advantages of this paradigm:

- High Data Efficiency: Pretrained models have already learned rich language knowledge; finetuning requires only small amounts of labeled data
- Strong Generalization: The same pretrained model can be used for multiple downstream tasks
- Excellent Performance: Pretrained models typically achieve better performance on downstream tasks

BERT Architecture Deep Dive

BERT is built on the encoder part of Transformer but with key modifications to adapt to bidirectional pretraining needs.

Overall Architecture

BERT's input representation is the sum of three components:

Token Embedding: Maps input tokens to vectors. BERT uses WordPiece tokenization with a vocabulary size of 30,000.

Segment Embedding: Used to distinguish different sentences. BERT can handle sentence pair tasks (like QA), using segment embedding $E_A$ for the first sentence and $E_B$ for the second sentence.

Position Embedding: Same as Transformer, using learnable positional encodings with a maximum sequence length of 512.

BERT adds a special [CLS] token at the beginning of the input sequence, whose final representation is used for classification tasks. Sentences are separated by [SEP] tokens.
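The packing scheme above can be sketched in plain Python (the `build_bert_input` helper and the toy tokens are illustrative, not part of any library):

```python
# Pack a sentence pair the way BERT expects:
# [CLS] tokens_a [SEP] tokens_b [SEP], with segment ID 0 for the first
# sentence (including [CLS] and the first [SEP]) and 1 for the second.
def build_bert_input(tokens_a, tokens_b=None):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_input(["the", "bank"], ["it", "closed"])
print(tokens)    # ['[CLS]', 'the', 'bank', '[SEP]', 'it', 'closed', '[SEP]']
print(segments)  # [0, 0, 0, 0, 1, 1, 1]
```

The segment IDs index the segment embedding table, which is then summed with the token and position embeddings.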

Bidirectional Encoder

BERT's core is the bidirectional Transformer encoder. Unlike GPT's unidirectional attention, each position in BERT can attend to all positions in the sequence (both before and after it):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

In BERT, $Q$, $K$, and $V$ all come from the same input sequence, allowing each token to "see" information from the entire sequence.

BERT's encoder consists of $L$ stacked Transformer blocks, each containing:

- Multi-Head Self-Attention Mechanism
- Feed-Forward Network
- Residual Connections and Layer Normalization

Two BERT Scales

BERT released two model scales:

BERT-Base:
- Layers: 12
- Hidden dimension: 768
- Attention heads: 12
- Parameters: 110M

BERT-Large:
- Layers: 24
- Hidden dimension: 1024
- Attention heads: 16
- Parameters: 340M

BERT Training Strategy

BERT uses two unsupervised pretraining tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Masked Language Modeling (MLM)

MLM is BERT's core pretraining task. The specific approach is:

  1. Randomly select 15% of tokens in the input sequence for masking
  2. Among these 15%:
    • 80% are replaced with [MASK] token
    • 10% are replaced with random tokens
    • 10% remain unchanged
  3. The model needs to predict the original tokens that were masked

The cleverness of this strategy:

- 80% use [MASK]: allows the model to learn the prediction task
- 10% random replacement: prevents the model from over-relying on the [MASK] token, improving generalization
- 10% unchanged: teaches the model to leverage contextual information even without explicit mask tokens
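The 80/10/10 recipe can be sketched as follows (a simplified illustration: `mask_tokens` and the toy vocabulary are hypothetical, and a real implementation works on token IDs with special tokens excluded from masking):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: each token is selected with probability 15%;
    of the selected tokens, 80% become [MASK], 10% a random token,
    and 10% stay unchanged. Labels keep the original token at selected
    positions so the model can be trained to recover it."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token (the 10% unchanged case)
    return masked, labels

random.seed(0)
tokens = ["the", "bank", "is", "closed", "today"]
masked, labels = mask_tokens(tokens, vocab=["cat", "dog", "run"])
print(masked, labels)
```

Positions with a `None` label contribute nothing to the MLM loss; only the selected 15% are predicted.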

The MLM loss function is:

$$\mathcal{L}_{MLM} = -\sum_{i \in M} \log P(w_i \mid w_{\setminus M})$$

where $M$ is the set of masked positions, and $w_{\setminus M}$ represents all tokens except the masked positions.

Next Sentence Prediction (NSP)

The NSP task is used to learn relationships between sentences, which is important for tasks like question answering and natural language inference.

Training data construction:

- 50% of samples: Sentence A and Sentence B are consecutive (IsNext)
- 50% of samples: Sentence B is randomly selected (NotNext)

The model uses the [CLS] token representation $C$ for binary classification:

$$p = \mathrm{softmax}(WC)$$

The NSP loss function is:

$$\mathcal{L}_{NSP} = -\log p_y$$

where $W$ is a classification matrix and $p_y$ is the predicted probability of the true label $y \in \{\text{IsNext}, \text{NotNext}\}$.
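The 50/50 pair construction can be sketched as (the `make_nsp_pair` helper is illustrative; a production pipeline would also ensure the NotNext branch never samples the true next sentence):

```python
import random

def make_nsp_pair(sentences, idx):
    """Build one NSP training example from a document (a list of sentences):
    50% of the time take the true next sentence (label IsNext), otherwise
    take a random sentence from the corpus (label NotNext)."""
    sent_a = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return sent_a, sentences[idx + 1], "IsNext"
    # Simplified: a real pipeline would exclude sentences[idx + 1] here
    return sent_a, random.choice(sentences), "NotNext"

doc = ["He went to the store.", "He bought milk.", "Penguins live in Antarctica."]
random.seed(1)
pair = make_nsp_pair(doc, 0)
print(pair)
```

Each pair is then packed as `[CLS] A [SEP] B [SEP]` and classified from the [CLS] representation.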

Joint Training

BERT's final pretraining loss is the sum of both tasks:

$$\mathcal{L} = \mathcal{L}_{MLM} + \mathcal{L}_{NSP}$$

Pretraining data: BERT trains on BooksCorpus (800M words) and English Wikipedia (2.5B words), totaling approximately 3.3B words.

BERT Variants and Improvements

BERT's success inspired numerous follow-up studies, with various variants optimizing BERT's design in different dimensions.

RoBERTa: A More Robust BERT

RoBERTa (Robustly Optimized BERT Pretraining Approach) was proposed by Facebook in 2019, with main improvements including:

1. Removing the NSP Task
- Research found that the NSP task provides limited performance improvement and may even be harmful
- RoBERTa uses only the MLM task for pretraining

2. Dynamic Masking
- BERT masks statically during data preprocessing, using the same masking pattern every epoch
- RoBERTa regenerates masks every epoch, increasing training data diversity

3. Larger Batch Size and Longer Training
- BERT: batch size 256, trained for 1M steps
- RoBERTa: batch size 8K, trained for longer

4. Larger Training Data
- In addition to BooksCorpus and Wikipedia, RoBERTa also uses CC-News, OpenWebText, Stories, etc.
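Dynamic masking (improvement 2 above) amounts to resampling the masked positions each epoch instead of fixing them at preprocessing time. A toy sketch (in HuggingFace, `DataCollatorForLanguageModeling` achieves the same effect by masking on the fly per batch):

```python
import random

def dynamic_masks(num_tokens, epochs, mask_prob=0.15, seed=42):
    """Resample which positions are masked at the start of every epoch,
    so the model sees different masking patterns over training."""
    rng = random.Random(seed)
    per_epoch = []
    for _ in range(epochs):
        positions = [i for i in range(num_tokens) if rng.random() < mask_prob]
        per_epoch.append(positions)
    return per_epoch

masks = dynamic_masks(num_tokens=100, epochs=3)
print([len(m) for m in masks])  # roughly 15 positions per epoch
```

Static masking would compute `positions` once and reuse it every epoch; dynamic masking effectively multiplies the diversity of training examples for free.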

RoBERTa surpassed BERT on multiple tasks, demonstrating the importance of optimizing training strategies.

ALBERT: Parameter-Sharing Lightweight BERT

ALBERT (A Lite BERT) was proposed in 2019, with the main goal of reducing parameters while maintaining performance.

1. Factorized Embedding Parameterization
- In BERT, the embedding dimension $E$ must equal the hidden dimension $H$
- ALBERT factorizes the $V \times H$ embedding matrix into two matrices of size $V \times E$ and $E \times H$, where $E \ll H$
- Parameters are reduced from $O(V \times H)$ to $O(V \times E + E \times H)$

2. Cross-Layer Parameter Sharing
- Each layer in BERT has independent parameters
- ALBERT shares parameters across all layers, dramatically reducing the parameter count
- Experiments show that sharing attention and FFN parameters works best

3. Sentence Order Prediction (SOP) Replaces NSP
- The NSP task is too simple; models mainly learn topic prediction rather than sentence relationships
- SOP predicts whether two sentences are in reversed order, focusing more on sentence coherence

ALBERT-xxlarge (12 layers) has only 70% of BERT-large's parameters but performs better on multiple tasks.
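The savings from the factorized embedding are easy to check with illustrative numbers (assuming the 30,000-token WordPiece vocabulary, H = 768, and E = 128):

```python
# Embedding-layer parameter count: direct V x H lookup (BERT) versus
# a V x E lookup followed by an E x H projection (ALBERT-style).
V, H, E = 30_000, 768, 128

bert_embed_params = V * H            # one big embedding matrix
albert_embed_params = V * E + E * H  # factorized: lookup + projection

print(bert_embed_params)    # 23040000
print(albert_embed_params)  # 3938304
```

The embedding layer shrinks by roughly a factor of six, and the vocabulary size $V$ no longer scales with the hidden dimension $H$.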

ELECTRA: Efficient Pretraining Method

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) was proposed in 2020, replacing MLM with Replaced Token Detection (RTD).

Core Idea:
1. Use a small generator trained with MLM to produce replacement tokens
2. Use a discriminator to determine whether each position's token was replaced
3. Generator and discriminator are trained jointly, but only the discriminator is kept for downstream finetuning

Advantages:
- High Training Efficiency: MLM only predicts the 15% of masked tokens, while RTD produces a training signal at every position, achieving higher data utilization
- Better Performance: ELECTRA-small approaches BERT-base on GLUE with far fewer parameters

ELECTRA's discriminator loss:

$$\mathcal{L}_{Disc} = -\sum_{t=1}^{n} \Big[ \mathbb{1}\big(x_t^{corrupt} = x_t\big)\log D\big(x^{corrupt}, t\big) + \mathbb{1}\big(x_t^{corrupt} \neq x_t\big)\log\big(1 - D(x^{corrupt}, t)\big) \Big]$$

where $x^{corrupt}$ is the input sequence with some tokens replaced by the generator, and $D(x^{corrupt}, t)$ is the discriminator's predicted probability that position $t$ still holds the original token.
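The discriminator's training labels are simply a position-wise comparison of the original and corrupted sequences (the token IDs below are illustrative, not from a real vocabulary):

```python
import torch

# Replaced Token Detection labels: 1 where the generator substituted a
# token, 0 where the original token survives.
original_ids  = torch.tensor([101, 2023, 3185, 2003, 2307, 102])
corrupted_ids = torch.tensor([101, 2023, 2482, 2003, 2307, 102])  # one swap

rtd_labels = (original_ids != corrupted_ids).long()
print(rtd_labels)  # tensor([0, 0, 1, 0, 0, 0])
```

Every position yields a label, which is why RTD extracts far more training signal per sequence than MLM's 15%.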

BERT Variants Comparison Summary

| Model | Main Improvements | Parameters | Training Efficiency | Performance Gain |
| --- | --- | --- | --- | --- |
| BERT | Baseline | 110M/340M | Baseline | Baseline |
| RoBERTa | Remove NSP, dynamic masking, larger batch | 110M/340M | Similar | +2-3% |
| ALBERT | Parameter sharing, factorized embedding | 12M-235M | Faster | +1-2% |
| ELECTRA | RTD replaces MLM | 14M-335M | Faster | +2-3% |

Downstream Task Finetuning

BERT's strength lies in its generality: the same pretrained model can be adapted to various downstream tasks through finetuning.

Text Classification

Problem Context: Text classification is one of the most common NLP tasks, requiring mapping text to predefined categories. BERT's bidirectional encoding capability allows it to leverage both forward and backward context simultaneously, making it ideal for understanding tasks.

Solution Approach: BERT adds a special [CLS] token at the beginning of the sequence. After passing through multiple Transformer layers, its representation aggregates information from the entire sequence. We add a classification head (linear layer + softmax) on top of this representation to complete the classification task.

Design Considerations:
- The [CLS] token is specifically designed for sequence-level representation
- The classification head is typically a simple linear layer with few parameters, easy to train
- Different pooling strategies (e.g., average pooling) can be used, but [CLS] usually works best
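As a point of comparison with [CLS], mask-aware mean pooling can be sketched as follows (the `mean_pool` helper is illustrative, not a library function; the dummy tensors stand in for a model's `last_hidden_state` and `attention_mask`):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token representations, excluding padding positions."""
    # last_hidden_state: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]
    mask = attention_mask.unsqueeze(-1).float()     # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)  # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1e-9)        # [batch, 1], avoid div by 0
    return summed / counts

hidden = torch.ones(2, 4, 8)                     # dummy encoder output
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])  # two padded sequences
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

Masking before averaging matters: naively averaging over all positions would dilute the representation with padding vectors.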

For single-sentence classification tasks (like sentiment analysis), use the [CLS] token representation for classification:

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Problem: How to use BERT for text classification?
# Solution: Use [CLS] token representation as sequence-level features for classification

# Load pretrained tokenizer and model
# tokenizer converts text to token IDs and adds special tokens ([CLS], [SEP])
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# num_labels specifies the number of classes; model automatically adds classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Input text
text = "I love this movie!"
# tokenizer returns a dict containing:
# - input_ids: token IDs, shape: [batch_size, seq_len]
# - attention_mask: attention mask, shape: [batch_size, seq_len]
# - token_type_ids: sentence type IDs (all 0s for single sentence), shape: [batch_size, seq_len]
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Forward pass
# Internal model flow:
# 1. Token embedding + position embedding + sentence embedding -> [batch_size, seq_len, hidden_size]
# 2. 12 Transformer encoder layers -> [batch_size, seq_len, hidden_size]
# 3. Extract [CLS] position representation -> [batch_size, hidden_size]
# 4. Classification head (linear layer) -> [batch_size, num_labels]
outputs = model(**inputs)
logits = outputs.logits # shape: [batch_size, num_labels], unnormalized scores
predictions = torch.argmax(logits, dim=-1) # shape: [batch_size], predicted class indices

Key Points:
- [CLS] token role: located at the sequence start, it aggregates information from the entire sequence after multi-layer encoding, suitable for sequence-level tasks
- Classification head design: typically just a linear layer with hidden_size × num_labels parameters, very lightweight
- Input format: the tokenizer automatically adds [CLS] and [SEP] tokens and handles padding and truncation

Design Trade-offs:
- ✅ Pros: simple and efficient; the [CLS] token is specifically designed for classification
- ⚠️ Note: long texts may need truncation, potentially losing information

Common Questions:
- Q: Why use [CLS] instead of other positions? A: [CLS] is specifically trained for sequence-level representation during pretraining and works best
- Q: How to handle multi-class classification? A: Just set num_labels to the number of classes and use cross-entropy loss

Usage Example:

# Batch processing
texts = ["I love this movie!", "This is terrible."]
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# After finetuning, predictions might be tensor([1, 0]) (1 = positive, 0 = negative)

Named Entity Recognition (NER)

For sequence labeling tasks, use each token's representation for classification:

from transformers import BertTokenizer, BertForTokenClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)

# Input text
text = "Barack Obama was born in Hawaii"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)

# Decode
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
ner_tags = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', ...] # Tag list
for token, pred in zip(tokens, predictions[0]):
    print(f"{token}: {ner_tags[pred]}")

Question Answering (QA)

For reading comprehension tasks, need to predict answer start and end positions in the original text:

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Question and context
question = "Where was Barack Obama born?"
context = "Barack Obama was born in Hawaii in 1961."

# Encode
inputs = tokenizer(question, context, return_tensors='pt', padding=True, truncation=True)

# Forward pass
outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Find answer positions
start_idx = torch.argmax(start_scores)
end_idx = torch.argmax(end_scores)

# Decode answer
answer_tokens = inputs['input_ids'][0][start_idx:end_idx+1]
answer = tokenizer.decode(answer_tokens)
print(f"Answer: {answer}")

Sentence Pair Classification

For tasks like natural language inference, need to handle sentence pairs:

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Sentence pair
premise = "A man is playing guitar"
hypothesis = "Someone is making music"

# Encode (automatically adds [SEP] token)
inputs = tokenizer(premise, hypothesis, return_tensors='pt', padding=True, truncation=True)

# Forward pass
outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)

Finetuning Tips

1. Learning Rate Settings
- Use a smaller learning rate for pretrained layers (e.g., 2e-5)
- Use a larger learning rate for the classification head (e.g., 1e-4)

from torch.optim import AdamW

# Group parameters by weight-decay treatment: biases and LayerNorm weights
# are conventionally excluded from weight decay
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        'weight_decay': 0.01
    },
    {
        'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        'weight_decay': 0.0
    }
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
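The differential learning rates from tip 1 can be expressed with per-group `lr` values in the same parameter-group mechanism. A self-contained sketch (the `TinyClassifier` module below is a hypothetical stand-in for a real BERT classifier, used only to keep the example runnable):

```python
import torch.nn as nn
from torch.optim import AdamW

class TinyClassifier(nn.Module):
    """Stand-in for a BERT classifier: a 'bert' body plus a fresh head."""
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(768, 768)      # represents the pretrained encoder
        self.classifier = nn.Linear(768, 2)  # represents the new classification head

model = TinyClassifier()

# Smaller learning rate for pretrained weights, larger for the fresh head
optimizer = AdamW([
    {"params": [p for n, p in model.named_parameters() if n.startswith("bert")],
     "lr": 2e-5},
    {"params": [p for n, p in model.named_parameters() if n.startswith("classifier")],
     "lr": 1e-4},
])
print([g["lr"] for g in optimizer.param_groups])  # [2e-05, 0.0001]
```

The same name-prefix filter works unchanged on a real `BertForSequenceClassification`, whose encoder parameters are also named `bert.*` and head parameters `classifier.*`.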

2. Gradient Accumulation
- When GPU memory limits the batch size, use gradient accumulation to simulate larger batches

accumulation_steps = 4
for i, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

3. Learning Rate Scheduling
- Use warmup followed by linear decay

from transformers import get_linear_schedule_with_warmup

total_steps = len(dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup step count must be an integer
    num_training_steps=total_steps
)

HuggingFace Practice: Complete Finetuning Pipeline

Below is a complete text classification finetuning pipeline:

from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import load_dataset

# 1. Load data and model
dataset = load_dataset("imdb")  # Movie review sentiment analysis
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# 2. Data preprocessing (no padding here; the collator pads per batch)
def preprocess_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512
    )

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# 3. Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# 4. Data collator (pads each batch dynamically to its longest sequence)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    data_collator=data_collator,
)

trainer.train()

# 6. Evaluate
results = trainer.evaluate()
print(results)

BERT's Limitations

Despite BERT's tremendous success, it also has some limitations:

1. High Computational Cost
- BERT-base has 110M parameters; BERT-large has 340M
- Inference is relatively slow, unsuitable for latency-sensitive applications

2. Weak Generation Capability
- BERT is a bidirectional encoder and is not suited to generation tasks
- For text generation, autoregressive models like GPT are needed

3. Maximum Sequence Length Limitation
- BERT's maximum sequence length is 512, so it cannot handle very long texts directly
- Sliding-window methods can extend this, but effectiveness is limited

4. Pretraining Data Biased Toward English
- BERT is mainly trained on English corpora, with limited support for other languages
- Multilingual BERT (mBERT) supports many languages but underperforms monolingual models
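The sliding-window workaround mentioned under limitation 3 can be sketched as follows (the `sliding_window` helper is a toy illustration over raw token IDs; HuggingFace tokenizers offer comparable behavior via the `stride` and `return_overflowing_tokens` arguments):

```python
def sliding_window(token_ids, max_len=512, stride=128):
    """Split a long token sequence into overlapping windows so each chunk
    fits BERT's 512-token limit; predictions are merged across chunks
    downstream. `stride` tokens overlap between consecutive windows."""
    chunks = []
    step = max_len - stride
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return chunks

ids = list(range(1000))  # a dummy 1000-token document
chunks = sliding_window(ids)
print([len(c) for c in chunks])  # [512, 512, 232]
```

The overlap gives each token at least one window where it sits away from a boundary, which mitigates (but does not eliminate) the loss of long-range context.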

Summary

BERT ushered in the pretrain-finetune era of NLP, proving the effectiveness of large-scale pretrained models. Since BERT, various variants and improvements have continuously emerged, driving rapid development in the NLP field.

BERT's core contributions:
1. Bidirectional Pretraining: achieves bidirectional context understanding through the MLM task
2. Universal Representations: the same model adapts to multiple downstream tasks
3. Pretrain-Finetune Paradigm: dramatically reduces labeled-data requirements for NLP tasks

Understanding BERT is not just key to understanding modern NLP — it's the starting point for diving into the era of large language models. In the next article, we'll explore the GPT series models to understand the charm of generative language models.

❓ Q&A: BERT Common Questions

Q1: Why does BERT use [MASK] tokens instead of directly predicting the next word?

A: BERT uses MLM task to achieve bidirectional pretraining. If it predicted the next word like GPT, the model could only see left context and couldn't leverage right-side information. MLM masks some tokens and predicts them, allowing the model to simultaneously leverage context from both sides.

Q2: Why can BERT's [CLS] token be used for classification?

A: The [CLS] token is at the beginning of the sequence. After passing through multiple Transformer layers, its representation aggregates information from the entire sequence. Although theoretically any position's token can be used for classification, [CLS]'s design makes it specifically for sequence-level representation, learning effective classification features during pretraining and finetuning.

Q3: What are the main differences between BERT and GPT?

A: Main differences:
- Architecture: BERT is a bidirectional encoder; GPT is a unidirectional decoder
- Pretraining Tasks: BERT uses MLM + NSP; GPT uses language modeling
- Suitable Tasks: BERT excels at understanding tasks (classification, NER, QA); GPT excels at generation tasks
- Context Utilization: BERT sees context on both sides; GPT sees only the preceding context

Q4: Why did RoBERTa remove the NSP task?

A: Research found that NSP task provides limited performance improvement and may even be harmful. The NSP task is too simple; models mainly learn topic prediction rather than logical relationships between sentences. RoBERTa achieved better performance on multiple tasks by removing NSP and training with longer sequences.

Q5: How does ALBERT reduce parameters?

A: ALBERT mainly uses two methods:
1. Factorized Embedding: factorizes the word embedding matrix into two smaller matrices, reducing embedding-layer parameters
2. Cross-Layer Parameter Sharing: shares parameters across all Transformer layers, dramatically reducing the parameter count

Q6: What are ELECTRA's advantages over BERT?

A: ELECTRA's main advantage is higher training efficiency:
- MLM only predicts the 15% of masked tokens; RTD predicts all tokens, achieving higher data utilization
- Under the same computational budget, ELECTRA performs better
- ELECTRA-small approaches BERT-base in performance with far fewer parameters

Q7: Can BERT be used for generation tasks?

A: BERT itself is not suitable for generation tasks because it's a bidirectional encoder and cannot perform autoregressive generation. However, BERT-style models can support generation through some techniques:
- BERT-GEN: uses a BERT encoder plus an independent decoder
- MASS/BART: use denoising pretraining objectives in an encoder-decoder architecture

For pure generation tasks, autoregressive models like GPT are more suitable.

Q8: How to choose BERT variants?

A: Selection recommendations:
- BERT-base: general choice, balances performance and speed
- RoBERTa: when pursuing higher performance with sufficient computational resources
- ALBERT: when parameters are limited and a lightweight model is needed
- ELECTRA: when pursuing training efficiency, or when a small model approaching large-model performance is needed

Q9: What should be noted when finetuning BERT?

A: Key considerations:
1. Learning Rate: use a small learning rate (e.g., 2e-5) to avoid destroying pretrained weights
2. Batch Size: adjust according to GPU memory; gradient accumulation can help
3. Training Epochs: usually 2-4 epochs suffice; avoid overfitting
4. Warmup: use learning rate warmup to stabilize training
5. Regularization: apply dropout and weight decay appropriately

Q10: How does BERT handle multilingual tasks?

A: BERT handles multilingual tasks through:
1. mBERT: pretrained on multilingual corpora, supports 100+ languages
2. XLM: uses cross-lingual pretraining objectives to strengthen cross-lingual transfer
3. Monolingual BERT: dedicated models trained per language, better performance but more resources required

For Chinese tasks, dedicated models such as BERT-Chinese or RoBERTa-Chinese are recommended.

  • Post title: NLP (5): BERT and Pretrained Models
  • Post author: Chen Kai
  • Create time: 2024-02-26 09:30:00
  • Post link: https://www.chenk.top/en/nlp-bert-pretrained-models/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.