English has abundant labeled data, but there are over 7,000 languages in the world. How can models transfer knowledge learned from English to low-resource languages? Cross-Lingual Transfer enables models trained on English to be directly used on Chinese, Arabic, Swahili — without any target language labeled data.
This article systematically explains methods and implementations of bilingual word embedding alignment, multilingual pre-training, and cross-lingual prompt learning, starting from the mathematical principles of multilingual representation space. We analyze language universals and differences, zero-shot transfer performance, and language selection strategies, and provide complete code (280+ lines) for implementing cross-lingual text classification from scratch.
Problem Definition of Cross-Lingual Transfer
Zero-Shot Cross-Lingual Learning
Scenario: Train on labeled data in a source language $s$ (typically English); test directly on a target language $t$ with no target-language labels.

Formalized as:

$$\min_\theta \; \mathbb{E}_{(x, y) \sim D_s}\left[\mathcal{L}\left(f_\theta(x), y\right)\right], \quad \text{then evaluate } f_\theta \text{ on } D_t^{\text{test}}$$
Challenges:

- Source and target language vocabularies are completely different
- Syntactic structures and word order may differ greatly
- Cultural and pragmatic differences
Few-Shot Cross-Lingual Learning
Scenario: Target language has small amount of labeled data (e.g., 10-100 samples per class).
Formalized as:

$$\min_\theta \; \mathbb{E}_{(x, y) \sim D_s}\left[\mathcal{L}(f_\theta(x), y)\right] + \lambda \, \mathbb{E}_{(x, y) \sim D_t^{\text{few}}}\left[\mathcal{L}(f_\theta(x), y)\right]$$

where $D_t^{\text{few}}$ is the small target-language labeled set.
Multi-Source Language Transfer
Scenario: Transfer from multiple source languages $\{s_1, \dots, s_K\}$ to the target language.

Objective function:

$$\min_\theta \sum_{k=1}^{K} w_k \, \mathbb{E}_{(x, y) \sim D_{s_k}}\left[\mathcal{L}(f_\theta(x), y)\right]$$

where $w_k$ weights each source language.
Advantage: Language diversity provides richer linguistic features.
Evaluation Metrics
Zero-Shot Accuracy:

$$\text{Acc}_{\text{zs}} = \frac{1}{|D_t^{\text{test}}|} \sum_{(x, y) \in D_t^{\text{test}}} \mathbb{1}\left[f_\theta(x) = y\right]$$

Cross-Lingual Transfer Gap:

$$\Delta = \text{Acc}(s) - \text{Acc}(t)$$

Smaller is better; 0 indicates perfect transfer.

Average Performance:

$$\overline{\text{Acc}} = \frac{1}{|T|} \sum_{t \in T} \text{Acc}(t)$$
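These metrics are straightforward to compute; a minimal sketch (the accuracy figures in the example are illustrative):

```python
# Minimal implementations of the three evaluation metrics above.

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def transfer_gap(source_acc, target_acc):
    """Source minus target accuracy; 0 means perfect transfer."""
    return source_acc - target_acc

def average_performance(target_accs):
    """Mean accuracy over all target languages."""
    return sum(target_accs.values()) / len(target_accs)

source_acc = 0.814                                 # e.g., English
target_accs = {"fr": 0.735, "zh": 0.683, "sw": 0.572}
gaps = {l: transfer_gap(source_acc, a) for l, a in target_accs.items()}
avg = average_performance(target_accs)
```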
Mathematical Principles of Multilingual Representation
Shared Semantic Space
Assumption: The ways different languages express the same concept share commonalities at a deep semantic level.
Formalized as: There exists a language-agnostic semantic space $\mathcal{Z}$ with encoders $E_s, E_t$ for each language, such that translations map to nearby points: $E_s(x^{(s)}) \approx E_t(x^{(t)})$ whenever $x^{(s)}$ and $x^{(t)}$ express the same meaning.
Intuition: 猫 (Chinese) and cat (English) should map to the same region in semantic space.
Theoretical Foundations of Language Universals
Universal Grammar
Chomsky's Universal Grammar theory: All human languages share underlying grammatical structures.
Evidence:

- Word orders like SVO and SOV have corresponding relationships at a deep level
- Parts of speech like nouns and verbs exist across all languages
- Recursive structures and question transformations are cross-linguistically universal
Distributional Semantics Hypothesis
"Word meaning is determined by its context" (Distributional
Hypothesis):
Bilingual Word Embedding Alignment
Linear Transformation Assumption
Assume a linear transformation $W$ maps source-language embeddings $X$ onto target-language embeddings $Y$ (columns paired by a seed dictionary):

$$\min_{W} \|WX - Y\|_F^2$$

Procrustes Alignment: constraining $W$ to be orthogonal yields a closed-form solution via SVD:

$$W^* = UV^\top, \quad \text{where } U \Sigma V^\top = \text{SVD}\left(YX^\top\right)$$
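The Procrustes solution has a closed form ($W^* = UV^\top$ from the SVD of $YX^\top$) and fits in a few lines of NumPy. In this illustrative sketch the embedding matrices are $d \times n$, with column $i$ of $X$ and $Y$ forming a seed-dictionary translation pair:

```python
import numpy as np

def procrustes_align(X, Y):
    """Solve min_W ||W X - Y||_F subject to W orthogonal.

    X, Y: (d, n) source / target embedding matrices whose columns are
    paired by a seed bilingual dictionary.
    Closed form: W* = U V^T, where U S V^T = SVD(Y X^T).
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Sanity check: if the target space is an exact rotation of the source,
# Procrustes recovers the rotation.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))                   # 50-dim embeddings, 200 pairs
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # random orthogonal "translation"
Y = Q @ X
W = procrustes_align(X, Y)
print(np.allclose(W @ X, Y))  # → True
```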
Adversarial Training Alignment
Conneau et al.1 proposed an unsupervised alignment method based on adversarial training:

Discriminator $D$: distinguish mapped source embeddings from target language embeddings:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim \mathcal{X}}\left[\log P_D(\text{source} \mid Wx)\right] - \mathbb{E}_{y \sim \mathcal{Y}}\left[\log P_D(\text{target} \mid y)\right]$$

Generator (the alignment matrix $W$): minimize the discriminator's ability to tell them apart:

$$\mathcal{L}_W = -\mathbb{E}_{x \sim \mathcal{X}}\left[\log P_D(\text{target} \mid Wx)\right] - \mathbb{E}_{y \sim \mathcal{Y}}\left[\log P_D(\text{source} \mid y)\right]$$

Intuition: If the discriminator cannot distinguish aligned source embeddings from target embeddings, the alignment is successful.
Multilingual Sentence Representations
Parallel Sentence Alignment
Given a parallel corpus $C = \{(x_i^{(s)}, x_i^{(t)})\}_{i=1}^{N}$ of translated sentence pairs.

Translation Language Modeling (TLM)2:

Jointly model parallel sentence pairs: concatenate each pair, mask tokens in both languages, and predict the masked tokens. To fill a blank, the model can attend to the translation, which forces cross-lingual alignment:

$$\mathcal{L}_{\text{TLM}} = -\sum_{i \in M} \log P\left(w_i \,\middle|\, \left[x^{(s)}; x^{(t)}\right]_{\setminus M}\right)$$
Contrastive Learning
LASER3 uses a contrastive loss that pulls the embeddings of parallel sentence pairs together while pushing apart non-parallel sentences in the batch:

$$\mathcal{L} = -\log \frac{\exp\left(\text{sim}(z^{(s)}, z^{(t)}) / \tau\right)}{\sum_{j} \exp\left(\text{sim}(z^{(s)}, z_j) / \tau\right)}$$

where $z^{(s)}, z^{(t)}$ are the sentence embeddings of a parallel pair and $\tau$ is a temperature.
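Such a contrastive objective can be sketched in NumPy (a generic InfoNCE sketch; the actual LASER training recipe differs in its details):

```python
import numpy as np

def contrastive_loss(Z_s, Z_t, tau=0.1):
    """InfoNCE loss over a batch of parallel sentence pairs.

    Z_s, Z_t: (n, d) L2-normalized sentence embeddings; row i of Z_s is
    a translation of row i of Z_t, and the other rows in the batch act
    as in-batch negatives.
    """
    sim = (Z_s @ Z_t.T) / tau                       # (n, n) similarities
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # positives on the diagonal

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
aligned = contrastive_loss(Z, Z)         # parallel pairs matched correctly
shuffled = contrastive_loss(Z, Z[::-1])  # pairs scrambled
print(aligned < shuffled)  # → True
```

Correctly matched pairs give a much lower loss than scrambled ones, which is exactly the gradient signal that aligns the two languages' sentence spaces.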
Multilingual Pre-trained Models
Multilingual BERT (mBERT)
Architecture and Pre-training
mBERT4 is pre-trained on Wikipedia in 104 languages using:
Masked Language Modeling (MLM):

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P\left(w_i \mid w_{\setminus M}\right)$$

where $M$ is the set of masked positions.
Shared vocabulary: 110K WordPiece tokens covering all languages
Key design:

- No explicit cross-lingual supervision signal (no parallel corpus)
- Sentences from different languages randomly mixed during training
- All layers share parameters
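The MLM corruption can be illustrated with BERT's standard scheme (a toy sketch over whitespace tokens; real mBERT operates on WordPiece subwords): roughly 15% of positions become prediction targets, and of those 80% are replaced by `[MASK]`, 10% by a random vocabulary token, and 10% left unchanged.

```python
import random

def mlm_corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, target_positions) under BERT's MLM scheme."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:        # position selected as target
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"           # 80%: replace with mask token
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # 10%: replace with random token
            # remaining 10%: keep the original token unchanged
    return out, targets

tokens = "the cat sat on the mat".split() * 5
corrupted, targets = mlm_corrupt(tokens, vocab=["dog", "ran"])
```

The model is then trained to predict the original token at each target position.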
Why Does mBERT Work?
Theoretical explanation5:
- Anchor Vocabulary: Numbers, punctuation, English loanwords shared across languages
- Deep parameter sharing: Forces model to learn language-agnostic features
- Code-Switching: Naturally occurring multilingual mixing in training data
Empirical findings:

- mBERT's hidden-layer representations are highly aligned across languages
- Even without a parallel corpus, similar concepts have close representations in different languages
XLM-RoBERTa (XLM-R)
Improved Design
XLM-R6 is pre-trained on 2.5TB text in 100 languages, compared to mBERT:
- Larger model: 550M parameters (mBERT has 110M)
- More data: 2.5TB vs few GB
- Better sampling strategy: sample language $i$ with corpus size $n_i$ with probability

$$p_i \propto \left(\frac{n_i}{\sum_j n_j}\right)^{\alpha}, \quad \alpha = 0.3$$

The exponent $\alpha < 1$ up-weights low-resource languages relative to proportional sampling.
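The effect of exponentiated sampling $p_i \propto (n_i/N)^{\alpha}$ is easy to see in code (the corpus sizes below are illustrative):

```python
def sampling_probs(corpus_sizes, alpha=0.3):
    """p_i ∝ (n_i / N)^alpha; alpha < 1 up-weights low-resource languages."""
    total = sum(corpus_sizes.values())
    weights = {l: (n / total) ** alpha for l, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {l: w / z for l, w in weights.items()}

# English has 1000x more data than Swahili, but with alpha = 0.3 it is
# sampled only ~8x as often (1000 ** 0.3 ≈ 7.9).
probs = sampling_probs({"en": 1_000_000, "sw": 1_000})
```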
Performance Comparison
On XNLI (cross-lingual natural language inference):
| Model | English | Average | Worst Language |
|---|---|---|---|
| mBERT | 81.4 | 65.4 | 58.3 (Urdu) |
| XLM-R | 88.7 | 76.2 | 68.4 (Swahili) |
XLM-R significantly outperforms mBERT across all languages.
mT5
Architecture
mT57 is the multilingual version of T5, covering 101 languages, using:
Text-to-Text framework: All tasks unified as text generation
Denoising Autoencoding:
- Randomly mask text spans
- Model reconstructs complete text
Advantages:

- Generative architecture suitable for seq2seq tasks (translation, summarization)
- Unified framework supports multi-task learning
Comparison with XLM-R
| Dimension | XLM-R | mT5 |
|---|---|---|
| Architecture | Encoder-only | Encoder-decoder |
| Pre-training task | MLM | Denoising |
| Applicable tasks | Classification, tagging | Generation, translation |
| Inference overhead | Low | High |
Zero-Shot Cross-Lingual Transfer
Direct Transfer
Simplest strategy: Train on source language, directly test on target language.
Algorithm:
- Fine-tune the multilingual model with source language data $D_s$
- Evaluate directly on the target language test set $D_t$

Key: The multilingual model's representations are already aligned across languages.
Performance:
On XNLI, English → other languages zero-shot accuracy:
| Target Language | mBERT | XLM-R |
|---|---|---|
| French | 73.5 | 79.2 |
| Chinese | 68.3 | 76.7 |
| Arabic | 64.1 | 73.8 |
| Swahili | 57.2 | 68.4 |
High-resource languages perform better.
Translate-Train
Strategy: Translate source language training data to target language, then train on target language.
Algorithm:
- Use machine translation to translate $D_s$ into the target language, obtaining $\tilde{D}_t$
- Fine-tune the model on $\tilde{D}_t$
- Evaluate on real target language test data
Advantage: Model directly trained on target language, avoiding language differences.
Disadvantages:

- Depends on translation quality (translation errors propagate)
- Semantics may be lost or distorted
Translate-Test
Strategy: Translate target language test data to source language, predict with source language model.
Algorithm:
- Train the model on source language data $D_s$
- At inference, translate each target language input $x^{(t)}$ into the source language, obtaining $\hat{x}^{(s)}$, and predict $\hat{y} = f(\hat{x}^{(s)})$

Advantage: Leverages the high-quality source language model.

Disadvantage: Requires translation at inference, increasing latency and cost.
Ensemble Methods
Translate-Train-All (TTA):
Translate the training data into all target languages and train jointly:

$$\min_\theta \sum_{l \in L} \mathbb{E}_{(x, y) \sim \tilde{D}_l}\left[\mathcal{L}(f_\theta(x), y)\right]$$

where $\tilde{D}_l$ is the training set translated into language $l$.
Advantage: Model sees multiple language expressions, strong generalization.
Disadvantage: High computational cost (requires multiple translations and training).
Cross-Lingual Prompt Learning
Multilingual Prompt Templates
Prompt-Based Learning: Convert task to language model fill-in-the-blank.
English sentiment classification:
```
The movie was great. It was [MASK]. → wonderful
```
Cross-lingual extension: Use multilingual templates.
Chinese:
```
这部电影很好。它[MASK]。 → 很棒
```

(Gloss: "This movie is great. It is [MASK]. → great")
Challenge: Template design varies greatly across languages.
Automatic Template Search
X-FACTR8: Automatically discover cross-lingual prompt templates.
Algorithm:
- Use AutoPrompt9 to search optimal template on English
- Translate template to target language
- Fine-tune template on target language
Example:
English template:

```
[X] is located in [Y]. → [X] is in the country of [MASK].
```

Translated to French:

```
[X] se trouve en [Y]. → [X] est dans le pays de [MASK].
```
Language-Agnostic Prompts
XPROMPT10: Learn language-agnostic continuous prompts.
Model input: prepend $m$ learnable continuous prompt vectors $p_1, \dots, p_m$ to the token embeddings:

$$\left[p_1, \dots, p_m;\; e(x_1), \dots, e(x_n)\right]$$

Training objective: with the multilingual backbone frozen, optimize only the prompt vectors on source language data:

$$\min_{p_{1:m}} \; \mathbb{E}_{(x, y) \sim D_s}\left[\mathcal{L}\left(f_\theta([p_{1:m}; x]), y\right)\right]$$
Advantage: One prompt applicable to all languages, no translation needed.
Code-Switching and Language Mixing
Code-Switching Phenomenon
Code-Switching: Mixing multiple languages within a single sentence.
Example:
```
I'm feeling 很累,想 sleep 了。
```

(Gloss: "I'm feeling very tired and want to sleep.")
Prevalence: Very common in multilingual communities (e.g., Singapore, India, US Latino communities).
Code-Switching Data Augmentation
Strategy: Artificially create code-switching data during training.
Algorithm11:
- Parse sentence dependency tree
- Randomly select words to replace with target language translations
- Maintain grammatical structure
Example:
Original sentence (English):

```
I love this movie very much.
```

Code-switched (English → Chinese):

```
I 喜欢 this 电影 very much.
```
Effect: Improves cross-lingual robustness and zero-shot performance.
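The replacement step can be sketched with a toy bilingual lexicon (the dictionary and `switch_rate` below are illustrative; a production system would derive the lexicon from word alignments and, as described above, respect the dependency structure):

```python
import random

def code_switch(tokens, lexicon, switch_rate=0.3, seed=0):
    """Replace tokens that have a translation in `lexicon` with
    probability `switch_rate`, producing code-switched training data."""
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < switch_rate else t
            for t in tokens]

lexicon = {"love": "喜欢", "movie": "电影"}   # toy English→Chinese dictionary
sentence = "I love this movie very much .".split()
print(" ".join(code_switch(sentence, lexicon, switch_rate=1.0)))
# → I 喜欢 this 电影 very much .
```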
Language Adaptive Pre-training
MALAPT12: Continue pre-training on target language monolingual data.
Algorithm:
- Initialize with multilingual model (e.g., XLM-R)
- Continue MLM training on target language monolingual corpus
- Fine-tune on downstream task
Effect:
| Setting | English → Chinese (XNLI) |
|---|---|
| XLM-R | 76.7 |
| + MALAPT | 79.3 (+2.6) |
Target language pre-training significantly improves performance.
Complete Code Implementation: Cross-Lingual Text Classification
Below is a complete cross-lingual text classification system including multilingual model loading, zero-shot transfer, few-shot fine-tuning, and evaluation.
1 | """ |
Code Explanation
Core Components:
- MultilingualTextClassifier: Classifier based on mBERT
- MultilingualDataset: Multilingual data loading
- Zero-shot transfer: Train on English, test on Chinese/French
Experimental Design:
- Train sentiment classifier on source language (English)
- Zero-shot transfer to target languages (Chinese, French)
- Calculate transfer gaps and average performance
Key Details:
- Use mBERT's shared representation space
- No target language labeled data
- Evaluate cross-lingual transfer effectiveness
Challenges and Frontiers of Cross-Lingual Transfer
Impact of Language Differences
Language Family Similarity
Finding: Languages from similar families transfer better13.
| Source → Target | Accuracy |
|---|---|
| English → French (same family) | 78.3 |
| English → Chinese (different family) | 69.1 |
| French → Spanish (same family) | 81.7 |
Reasons:

- Similar word order (e.g., both SVO)
- Shared vocabulary (Romance languages)
- Close grammatical structures
Writing Systems
Finding: Languages with same writing system transfer more easily.
| Writing System | Example Languages | Transfer Difficulty |
|---|---|---|
| Latin alphabet | English, French, German, Spanish | Low |
| Chinese characters | Chinese, Japanese (partial) | Medium |
| Arabic alphabet | Arabic, Persian | Medium |
| Other (Thai, Korean) | - | High |
Challenges for Low-Resource Languages
Problems:
- Insufficient pre-training data: Few Wikipedia pages (e.g., Swahili has only thousands)
- Low vocabulary coverage: Low-resource languages have small proportion in mBERT's 110K vocabulary
- Language drift: High-resource languages dominate training, low-resource language representations degrade
Improvement directions:
- Specialized vocabulary: Design separate subword vocabulary for low-resource languages
- Data augmentation: Augment low-resource languages with high-resource language translations
- Adaptive pre-training: Continue pre-training on low-resource languages
Bias in Multilingual Models
Problem: Multilingual models exhibit language bias14:
- English usually performs best (most pre-training data)
- Low-resource language performance drops significantly
- Culture-related tasks (e.g., sentiment classification) show large cross-lingual differences
Measurement: Inter-language performance variance:

$$\sigma^2 = \frac{1}{|L|} \sum_{l \in L} \left(\text{Acc}_l - \overline{\text{Acc}}\right)^2$$

A large $\sigma^2$ indicates the model serves some languages much better than others.
Mitigation strategies:
- Balanced sampling: Increase sampling probability for low-resource languages
- Adversarial training: Minimize language discriminator accuracy
- Multi-task learning: Add language identification task to force learning language differences
Frequently Asked Questions
Q1: mBERT doesn't use parallel corpus, why does cross-lingual work?
Key factors:
- Anchor Words:
- Numbers: 1, 2, 3 (shared across all languages)
- Punctuation: , . ! ?
- English loanwords: OK, Internet, COVID
- Deep parameter sharing:
- Forces different languages through same Transformer layers
- Model forced to learn language-agnostic features
- WordPiece decomposition:
- Decomposes words into subword units
- Increases cross-lingual vocabulary overlap
Experimental evidence15: Removing anchor words causes cross-lingual performance to drop by 15-20%.
Q2: How to choose source language?
Empirical rules:
- Data volume priority: Choose language with most labeled data (usually English)
- Language family similarity: If target is French, Spanish is better than Chinese
- Multi-source strategy: Combine multiple source languages (English+German → French)
Experiment: On XNLI, different source languages to French zero-shot accuracy:
| Source Language | Accuracy |
|---|---|
| English | 78.3 |
| Spanish | 81.2 |
| German | 79.7 |
| Chinese | 71.5 |
Spanish is best (Spanish and French are both Romance languages).
Q3: Translate-train vs zero-shot transfer, which is better?
Trade-offs:
| Dimension | Translate-Train | Zero-Shot Transfer |
|---|---|---|
| Performance | Higher (+2-5%) | Lower |
| Cost | High (needs translation) | Low (no translation) |
| Inference latency | Low | Low |
| Translation quality dependency | Yes | No |
Recommendation:

- High-resource target languages: zero-shot transfer (already strong; translation adds little)
- Low-resource target languages: translate-train (compensates for the model's weakness on low-resource languages)
Q4: What makes XLM-R better than mBERT?
Core improvements:
- Larger scale:
- mBERT: Few GB Wikipedia
- XLM-R: 2.5TB CommonCrawl
- More balanced language sampling:
- mBERT: High-resource languages dominate
- XLM-R: $p_i \propto (n_i / N)^{0.3}$ (mitigates imbalance)
- More parameters:
- mBERT: 110M
- XLM-R: 550M
Performance improvement: On XNLI, XLM-R averages 10% higher than mBERT.
Q5: How to handle code-switching?
Strategies:
- Data augmentation:
- Randomly replace words with translations in other languages
- Maintain syntactic structure
- Multilingual pre-training:
- Collect real code-switching data (e.g., Twitter)
- Mix into pre-training corpus
- Language tags:
- Add language ID for each token
- Model learns language switching patterns
Effect: On code-switching benchmark (GLUECoS), adding code-switching data augmentation improves accuracy by 5-10%.
Q6: Can cross-lingual transfer be used for generation tasks?
Yes! Common applications:
- Machine translation: Source language training, target language generation
- Cross-lingual summarization: English document → Chinese summary
- Cross-lingual QA: Chinese question → English answer → translate back to Chinese
Models: mT5, mBART and other encoder-decoder models.
Challenges:

- High fluency requirements for generation
- Need to handle word order differences
- Cultural adaptation (e.g., idiom translation)
Q7: Do multilingual models "forget" high-resource languages?
Yes! This phenomenon is called "Language Competition"16.
Manifestation:

- After fine-tuning on low-resource languages, English performance drops
- Adding new-language pre-training degrades old-language performance
Mitigation:

- Multi-task learning: Optimize all languages simultaneously
- Regularization: Methods like EWC (see Chapter 10 on continual learning)
- Language adapters: Independent parameters for each language
Q8: How to evaluate cross-lingual transfer quality?
Standard benchmarks:
- XNLI: Cross-lingual natural language inference (15 languages)
- XTREME: Cross-lingual multi-task benchmark (40 languages, 9 tasks)
- MLQA: Multilingual question answering (7 languages)
- TyDiQA: Typologically diverse QA (11 languages, covering low-resource languages)
Evaluation metrics:

- Zero-shot accuracy
- Transfer gap
- Inter-language performance variance
Q9: What are theoretical limits of cross-lingual transfer?
Information theory perspective17:
The upper bound of cross-lingual transfer is limited by the mutual information $I(L_s; L_t)$ between the source and target language distributions: a model can only transfer information the two languages share.

Intuition: More similar languages have higher mutual information, and therefore a higher transfer upper bound.

Empirically, languages from the same family share more mutual information and show smaller transfer gaps than languages from different families.
Breakthrough directions:

- Use an intermediate language (pivot language)
- Multilingual pre-training to increase cross-language commonality
Q10: How to add cross-lingual support for new language?
Process:
- Collect monolingual data: Wikipedia, news, social media
- Expand vocabulary: Add subwords for new language
- Adaptive pre-training: Continue MLM on new language
- Zero-shot evaluation: Test on downstream tasks
- Few-shot fine-tuning: Fine-tune with small labeled data if available
Case study: Adding Swahili support:
| Step | Zero-Shot Accuracy |
|---|---|
| Baseline (XLM-R) | 68.4 |
| + Adaptive pre-training | 72.1 (+3.7) |
| + 100-sample fine-tuning | 76.8 (+4.7) |
Q11: What is inference overhead of multilingual models?
Comparison:
| Model | Parameters | Inference Time (Relative) |
|---|---|---|
| BERT-base | 110M | 1.0x |
| mBERT | 110M | 1.0x (same) |
| XLM-R-base | 270M | 1.5x |
| XLM-R-large | 550M | 3.0x |
Conclusion: Multilingual model inference overhead mainly depends on model size, not number of languages.
Optimization:

- Model distillation: Distill XLM-R into a smaller model
- Language-specific pruning: Keep only the target language vocabulary
Q12: Future directions for cross-lingual research?
Hot topics:
- Extremely low-resource languages:
- 7000+ languages on Earth, most without digital resources
- Leverage linguistic knowledge (grammar, phonology)
- Multimodal cross-lingual:
- Image-text cross-lingual alignment
- Video-text cross-lingual understanding
- Cross-lingual commonsense reasoning:
- Cultural differences in commonsense knowledge
- How to transfer culture-related knowledge?
- Interpretability:
- Why does mBERT work cross-lingually?
- Geometric structure of multilingual representations
- Efficient multilingual models:
- Parameter sharing vs language-specific parameters
- Sparse activation (only activate relevant language parameters)
Summary
This article comprehensively introduced cross-lingual transfer techniques:
- Problem definition: Zero-shot, few-shot, multi-source language transfer
- Mathematical principles: Shared semantic space, bilingual word embedding alignment, language universals theory
- Multilingual pre-training: Architecture and comparison of mBERT, XLM-R, mT5
- Transfer strategies: Direct transfer, translate-train, translate-test, ensemble methods
- Prompt learning: Multilingual prompt templates, automatic search, language-agnostic continuous prompts
- Code-switching: Data augmentation, language mixing, adaptive pre-training
- Complete code: 280+ lines implementing cross-lingual text classification from scratch
- Challenges and frontiers: Language differences, low-resource languages, model bias, theoretical limits
Cross-lingual transfer enables AI to benefit 7 billion people globally, breaking down language barriers. In the next chapter, we will explore transfer learning applications in industry and best practices, seeing how to transform theory into productivity.
References
Conneau, A., Lample, G., Ranzato, M. A., et al. (2018). Word translation without parallel data. ICLR.↩︎
Conneau, A., & Lample, G. (2019). Cross-lingual language model pretraining. NeurIPS.↩︎
Artetxe, M., & Schwenk, H. (2019). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL.↩︎
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.↩︎
Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? ACL.↩︎
Conneau, A., Khandelwal, K., Goyal, N., et al. (2020). Unsupervised cross-lingual representation learning at scale. ACL.↩︎
Xue, L., Constant, N., Roberts, A., et al. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. NAACL.↩︎
Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020). How can we know what language models know? TACL.↩︎
Shin, T., Razeghi, Y., Logan IV, R. L., et al. (2020). AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. EMNLP.↩︎
Wu, S., & Dredze, M. (2020). Are all languages created equal in multilingual BERT? RepL4NLP.↩︎
Winata, G. I., Madotto, A., Wu, Z., & Fung, P. (2019). Code-switching BERT: A task-agnostic language model for code-switching. arXiv:1908.05075.↩︎
Alabi, J., Amponsah-Kaakyire, K., Adelani, D., & Eskenazi, M. (2020). Massive vs. curated embeddings for low-resourced languages: the case of Yorùbá and Twi. LREC.↩︎
Hu, J., Ruder, S., Siddhant, A., et al. (2020). XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. ICML.↩︎
Lauscher, A., Ravishankar, V., Vulic, I., & Glavas, G. (2020). From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers. EMNLP.↩︎
Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? ACL.↩︎
Artetxe, M., Ruder, S., & Yogatama, D. (2020). On the cross-lingual transferability of monolingual representations. ACL.↩︎
Zhao, W., Eger, S., Bjerva, J., & Augenstein, I. (2021). Inducing language-agnostic multilingual representations. ACL.↩︎
- Post title:Transfer Learning (11): Cross-Lingual Transfer
- Post author:Chen Kai
- Create time:2025-01-02 10:30:00
- Post link:https://www.chenk.top/transfer-learning-11-cross-lingual-transfer/
- Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.