Word embeddings revolutionized natural language processing by transforming words from sparse one-hot vectors into dense, meaningful representations that capture semantic relationships. Before embeddings, machines saw "king" and "queen" as completely unrelated symbols — just different positions in a vocabulary list. After embeddings, machines learned that these words share gender and royalty concepts, enabling them to solve analogies like "king - man + woman = queen" through simple vector arithmetic.
This article explores the journey from one-hot encodings to modern embedding techniques. We'll examine Word2Vec's innovative training strategies (Skip-gram and CBOW), GloVe's global matrix factorization approach, and FastText's subword extensions. We'll also connect embeddings to language models, showing how predicting context naturally produces semantic representations. By the end, you'll understand not just how to use pre-trained embeddings, but why they work and how to train your own.
From Sparse to Dense: Why Embeddings Matter
The Problem with One-Hot Encoding
Traditional NLP represented words as one-hot vectors: if your vocabulary has $V$ words, each word becomes a vector of length $V$ with a single 1 at that word's index and zeros everywhere else. This representation has several fatal flaws:
Sparsity: Real vocabularies contain 50,000+ words, creating vectors that are 99.999% zeros. This wastes memory and computation.
No Semantic Information: The dot product between any two different one-hot vectors is always zero: $\mathbf{e}_i^\top \mathbf{e}_j = 0$ for $i \neq j$, and the distance between any pair is the same $\sqrt{2}$. The model can't tell that "cat" is more similar to "dog" than to "quantum".

Curse of Dimensionality: With $V = 50{,}000$, even a simple linear classifier needs millions of parameters. Models can't generalize across similar words.
The Embedding Solution
Word embeddings map each word to a dense, low-dimensional vector $\mathbf{v}_w \in \mathbb{R}^d$ (typically 100-300 dimensions). Instead of $V$-dimensional sparse vectors, we get compact representations in which geometry encodes meaning: similar words lie close together, and consistent directions capture relationships like gender or tense.
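The contrast is easy to see in NumPy; a minimal sketch with made-up 3-dimensional vectors standing in for learned embeddings:

```python
import numpy as np

# One-hot vectors in a 5-word vocabulary: every distinct pair is orthogonal
cat_onehot, dog_onehot, quantum_onehot = np.eye(5)[:3]
print(cat_onehot @ dog_onehot, cat_onehot @ quantum_onehot)  # 0.0 0.0

# Toy dense embeddings (values invented for illustration)
emb = {
    "cat":     np.array([0.9, 0.8, 0.1]),
    "dog":     np.array([0.85, 0.75, 0.2]),
    "quantum": np.array([-0.7, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))      # close to 1: similar words
print(cosine(emb["cat"], emb["quantum"]))  # low: unrelated words
```

With one-hot vectors every similarity is zero; with dense vectors, similarity becomes a meaningful, graded quantity.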
The Distributional Hypothesis
The foundation of modern embeddings is Firth's principle: "You shall know a word by the company it keeps." Words appearing in similar contexts tend to have similar meanings:
- "The cat sat on the mat" vs "The dog sat on the mat"
- "The king ruled the kingdom" vs "The queen ruled the kingdom"
Both "cat" and "dog" appear after "the" and before "sat on the mat". Both "king" and "queen" appear before "ruled the kingdom". By training models to predict words from context (or vice versa), we force the model to learn embeddings that capture these distributional patterns.
Word2Vec: Learning from Local Context
Word2Vec, introduced by Mikolov et al. in 2013, popularized neural word embeddings through two efficient architectures: Skip-gram and Continuous Bag-of-Words (CBOW). Both are shallow neural networks trained to predict words from context.
Skip-gram: Predicting Context from Target
Skip-gram takes a target word and predicts surrounding context words within a window of size $c$. Given a sentence, for each position $t$:

- Input: Target word $w_t$
- Output: Context words $w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}$

For example, with window size $c = 2$ and sentence "the quick brown fox jumps":

- Target: "brown" → Context: "the", "quick", "fox", "jumps"
- Target: "fox" → Context: "quick", "brown", "jumps"
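The pair generation above can be sketched in a few lines (a hypothetical helper, not Word2Vec's actual implementation):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = words within `window` positions of the target, excluding itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox", "jumps"]
for target, context in skipgram_pairs(sentence):
    print(target, "->", context)
```

Each sentence position thus yields up to $2c$ training pairs, which is why Skip-gram sees rare words many times.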
Architecture:

1. Input: One-hot vector $\mathbf{x} \in \{0, 1\}^V$ for target word $w_t$
2. Embedding layer: $\mathbf{v}_{w_t} = W^\top \mathbf{x}$, which selects one row of $W \in \mathbb{R}^{V \times d}$ (this is the word embedding)
3. Output layer: For each context position, compute scores $\mathbf{z} = W'^\top \mathbf{v}_{w_t}$ using the output matrix $W' \in \mathbb{R}^{d \times V}$
4. Softmax: Convert scores to probabilities:

$$p(w_O \mid w_t) = \frac{\exp\left(\mathbf{v}'^{\top}_{w_O} \mathbf{v}_{w_t}\right)}{\sum_{w=1}^{V} \exp\left(\mathbf{v}'^{\top}_{w} \mathbf{v}_{w_t}\right)}$$

The objective maximizes the log probability of context words:

$$J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

where $\mathbf{v}_w$ is the embedding of word $w$ (a row of $W$) and $\mathbf{v}'_w$ is its output vector (a column of $W'$).
Why Skip-gram Works: The model learns that words appearing in similar contexts should have similar embeddings. If "cat" and "dog" both need to predict "sat", "runs", and "pet", gradient updates push their vectors together.
CBOW: Predicting Target from Context
CBOW (Continuous Bag-of-Words) reverses the direction: it takes context words and predicts the target word in the center.
- Input: Context words $w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}$
- Output: Target word $w_t$

Architecture:

1. Input: One-hot vectors for all $2c$ context words
2. Embedding: Average context embeddings: $\bar{\mathbf{v}} = \frac{1}{2c} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \mathbf{v}_{w_{t+j}}$
3. Output: Scores $\mathbf{z} = W'^\top \bar{\mathbf{v}}$
4. Softmax: $p(w_t \mid \text{context}) = \dfrac{\exp\left(\mathbf{v}'^{\top}_{w_t} \bar{\mathbf{v}}\right)}{\sum_{w=1}^{V} \exp\left(\mathbf{v}'^{\top}_{w} \bar{\mathbf{v}}\right)}$

The objective is:

$$J = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$
Skip-gram vs CBOW:
- Skip-gram: Slower training but better on rare words (each target generates multiple training examples)
- CBOW: Faster training, smooths over context by averaging embeddings, better on frequent words
The Softmax Bottleneck
Both Skip-gram and CBOW have a computational problem: the softmax denominator requires summing over all $V$ vocabulary words, costing $O(V)$ per training example. With $V = 100{,}000$ or more, this dominates training time. Word2Vec offers two solutions: negative sampling and hierarchical softmax.
Negative Sampling
Instead of computing the full softmax, negative sampling turns the problem into binary classification: distinguish the true context word (positive example) from random noise words (negative examples).
For each (target, context) pair $(w_I, w_O)$:

- Positive example: the pair $(w_I, w_O)$ with label 1, indicating $w_O$ appears in $w_I$'s context
- Negative examples: Sample $k$ random words $w_1, \ldots, w_k$ from a noise distribution $P_n(w)$, creating pairs $(w_I, w_i)$ with label 0

The objective for each positive pair becomes:

$$J = \log \sigma\left(\mathbf{v}'^{\top}_{w_O} \mathbf{v}_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-\mathbf{v}'^{\top}_{w_i} \mathbf{v}_{w_I}\right)\right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
Intuition: We want $\sigma(\mathbf{v}'^{\top}_{w_O} \mathbf{v}_{w_I}) \to 1$ for true context words (pulling their vectors together) and $\sigma(\mathbf{v}'^{\top}_{w_i} \mathbf{v}_{w_I}) \to 0$ for noise words (pushing them apart).

Noise Distribution: Word2Vec uses $P_n(w) \propto U(w)^{3/4}$, the unigram distribution raised to the 0.75 power, which down-weights very frequent words and boosts rare ones.

Complexity: Computing this objective requires evaluating only $k + 1$ dot products (typically $k = 5$ to $20$) instead of $V$, reducing per-example cost from $O(V)$ to $O(k)$.
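A NumPy sketch of this objective for a single positive pair, with random vectors standing in for learned embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k = 100, 5

v_target = rng.normal(scale=0.1, size=d)      # input embedding of w_I
v_context = rng.normal(scale=0.1, size=d)     # output vector of the true context w_O
v_noise = rng.normal(scale=0.1, size=(k, d))  # output vectors of k noise words

# J = log sigma(v'_O . v_I) + sum_i log sigma(-v'_i . v_I)
objective = np.log(sigmoid(v_context @ v_target)) \
          + np.sum(np.log(sigmoid(-(v_noise @ v_target))))

# Only k + 1 = 6 dot products were needed, regardless of vocabulary size
print(objective)
```

Gradient ascent on this quantity updates only the target embedding and the $k + 1$ output vectors involved, which is what makes each step cheap.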
Hierarchical Softmax
Hierarchical softmax organizes vocabulary into a binary tree (typically a Huffman tree based on word frequencies). Each word is a leaf, and each internal node has a learned vector.
To compute $p(w \mid w_I)$:

1. Find the path from root to word $w$: a sequence of left/right decisions
2. Each decision is a binary classifier using the internal node's vector and the embedding $\mathbf{v}_{w_I}$
3. Multiply probabilities along the path

If the path to word $w$ has length $L(w)$, then:

$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left(s_j \cdot \mathbf{u}^{\top}_{n(w,j)} \mathbf{v}_{w_I}\right)$$

where $n(w, j)$ is the $j$-th node on the path, $\mathbf{u}_n$ is that node's vector, and $s_j = +1$ for a left turn and $-1$ for a right turn. Because $\sigma(x) + \sigma(-x) = 1$, these probabilities sum to 1 over all leaves with no explicit normalization.

Complexity: Computing this requires only $O(\log V)$ binary decisions instead of an $O(V)$ sum, since a balanced tree has depth $\log_2 V$. A Huffman tree shortens frequent words' paths even further.
Trade-offs: Hierarchical softmax is faster for very large vocabularies but less flexible than negative sampling. Most implementations default to negative sampling for its simplicity and effectiveness.
Training Details and Hyperparameters
Subsampling Frequent Words: Very frequent words like "the", "is", "a" provide little semantic information but dominate training data. Word2Vec randomly discards word $w$ with probability

$$P(\text{discard } w) = 1 - \sqrt{\frac{t}{f(w)}}$$

where $f(w)$ is the word's relative frequency and $t$ is a threshold (typically $10^{-5}$).

Dynamic Window Size: Instead of always using window size $c$, Word2Vec samples a window size uniformly from $\{1, \ldots, c\}$ for each target word, which implicitly weights nearby context words more heavily than distant ones.

Common Hyperparameters:
- Embedding dimension: 100-300
- Window size: 5-10
- Negative samples: 5-20 (smaller values suffice for larger corpora)
- Subsampling threshold: $10^{-5}$
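A quick numeric check of the subsampling rule $P(\text{discard } w) = 1 - \sqrt{t / f(w)}$ with the usual threshold $t = 10^{-5}$:

```python
def discard_prob(freq, t=1e-5):
    """Probability of discarding a word with relative corpus frequency `freq`."""
    return max(0.0, 1.0 - (t / freq) ** 0.5)

print(discard_prob(0.05))   # a word like "the": discarded ~98.6% of the time
print(discard_prob(1e-4))   # moderately frequent: discarded ~68% of the time
print(discard_prob(1e-6))   # rare word: never discarded (clipped to 0)
```

Words at or below the threshold frequency are always kept, while the most frequent words are aggressively thinned out.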
GloVe: Global Matrix Factorization
While Word2Vec learns from local context windows, GloVe (Global Vectors for Word Representation, Pennington et al. 2014) takes a global view by explicitly factorizing word co-occurrence statistics.
Motivation: Capturing Global Statistics
Consider the co-occurrence probabilities of probe words $k$ with "ice" and "steam", from the GloVe paper:

| Word $k$ | $P(k \mid \text{ice})$ | $P(k \mid \text{steam})$ | Ratio |
|---|---|---|---|
| solid | $1.9 \times 10^{-4}$ | $2.2 \times 10^{-5}$ | 8.9 |
| gas | $6.6 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | 0.085 |
| water | $3.0 \times 10^{-3}$ | $2.2 \times 10^{-3}$ | 1.36 |
| fashion | $1.7 \times 10^{-5}$ | $1.8 \times 10^{-5}$ | 0.96 |

The ratio reveals semantic relationships:

- "solid" is strongly associated with "ice" (ratio $\gg 1$)
- "gas" is strongly associated with "steam" (ratio $\ll 1$)
- "water" and "fashion" discriminate neither (ratio $\approx 1$)
GloVe argues that embeddings should encode these ratios directly.
The GloVe Objective
Let
-
GloVe seeks embeddings
Why This Weighting?
- Down-weight rare pairs: $f(x) \to 0$ as $x \to 0$, reducing the impact of noisy, low-count co-occurrences
- Cap frequent pairs: $f(x) = 1$ for very frequent pairs ($x \ge x_{\max}$), preventing common words from dominating
- Smooth middle range: The $\alpha = 0.75$ power provides smooth interpolation between these regimes
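The weighting function in code, with the paper's default $x_{\max} = 100$ and $\alpha = 0.75$:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))     # ~0.032: a rare pair gets tiny weight
print(glove_weight(50))    # ~0.59: middle range
print(glove_weight(5000))  # 1.0: capped, no matter how frequent
```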
GloVe vs Word2Vec
Similarities: Both produce word embeddings in $\mathbb{R}^d$ that capture distributional semantics, and both perform comparably on similarity and analogy benchmarks when trained on comparable corpora.
Differences:
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Approach | Local context prediction | Global matrix factorization |
| Training | Online (stochastic) | Batch (iterate over co-occurrence matrix) |
| Objective | Cross-entropy (implicit) | Weighted least squares |
| Output | Word embeddings | Word + context embeddings |
Practical Note: GloVe requires building the full co-occurrence matrix $X$ before training. Although $X$ is nominally $V \times V$, it is extremely sparse, so implementations store only the nonzero entries; memory scales with the number of observed word pairs, not $V^2$.
Training GloVe
Step 1: Build Co-occurrence Matrix
Scan the corpus with a symmetric window of size $c$ around each word, incrementing $X_{ij}$ for every (word, context) pair observed. Co-occurrences are weighted by distance: a pair separated by $d$ positions contributes $1/d$ to the count, so adjacent words count fully while distant ones count fractionally.

With window size 10, for instance, a context word ten positions away contributes only $1/10$ as much as an adjacent one.
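Step 1 can be sketched as a toy in-memory counter (real implementations use sparse on-disk storage and stream the corpus; `cooccurrence` is an illustrative helper):

```python
from collections import defaultdict

def cooccurrence(tokens, window=10):
    """Distance-weighted counts: a pair at distance d contributes 1/d to X."""
    X = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):  # look left; add both directions
            weight = 1.0 / (i - j)
            X[(word, tokens[j])] += weight
            X[(tokens[j], word)] += weight
    return X

X = cooccurrence(["the", "cat", "sat", "on", "the", "mat"], window=2)
print(X[("cat", "the")])  # adjacent pair: full weight
print(X[("mat", "on")])   # distance 2: half weight
```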
Step 2: Optimize Embeddings
Use AdaGrad or similar optimizer to minimize the GloVe objective.
Use AdaGrad or a similar optimizer to minimize the GloVe objective. Unlike Word2Vec, we iterate multiple epochs over the same precomputed matrix $X$ rather than streaming the raw corpus, so each epoch is cheap once $X$ is built.
Step 3: Combine Word and Context Embeddings
The final embedding for word $i$ is typically the sum $\mathbf{w}_i + \tilde{\mathbf{w}}_i$; the GloVe authors report that combining the two vector sets gives a small but consistent performance boost.
FastText: Subword Embeddings
Word2Vec and GloVe assign each word a single embedding, treating "unhappiness", "happiness", and "happy" as completely unrelated. FastText (Bojanowski et al., 2017) addresses this by representing words as bags of character n-grams.
Motivation: Morphology and Rare Words
Consider these problems:
- Out-of-vocabulary (OOV): If "unhappiness" wasn't in training data, Word2Vec has no embedding for it
- Morphology ignored: "teacher", "teaching", "teach" share the root "teach" but get unrelated embeddings
- Rare words: Words appearing once or twice get poorly trained embeddings
FastText solves these by building word embeddings from subword units.
Character N-gram Representation
For each word, FastText extracts all character n-grams of lengths 3 to 6, using < and > to mark word boundaries, plus the full marked word as its own unit.

Example: "where" becomes <where>, yielding:

- 3-grams: <wh, whe, her, ere, re>
- 4-grams: <whe, wher, here, ere>
- 5-grams: <wher, where, here>
- 6-grams: <where, where>
- Full word: <where>

Note that the 5-gram "where" is distinct from the full-word unit <where>. Each n-gram $g$ has its own embedding $\mathbf{z}_g$, and a word's embedding is the sum over its n-gram set $G_w$: $\mathbf{v}_w = \sum_{g \in G_w} \mathbf{z}_g$.
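The extraction scheme above in code (a small helper written for this article, not FastText's internals):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with < > boundary markers, plus the full word."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the full word <word> is its own unit
    return grams

grams = char_ngrams("where")
print(sorted(grams))
# A word's vector is then the sum of the embeddings of these units
```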
Training FastText
FastText uses the same Skip-gram or CBOW objectives as Word2Vec, but replaces the word embedding $\mathbf{v}_w$ with the sum of the word's n-gram embeddings. Training proceeds as follows:

1. Extract n-grams for each word in the vocabulary
2. Create embeddings for all unique n-grams (the vocabulary expands from $V$ words to millions of n-grams)
3. Train Skip-gram/CBOW with negative sampling, updating n-gram embeddings
4. Final word embedding is the sum of its n-gram embeddings
OOV Handling: For a new word like "superamazingly" not in training data, extract its n-grams and sum their embeddings. As long as some n-grams appeared in training (e.g., "super", "amaz", "zing", "ly>"), we get a reasonable embedding.
FastText vs Word2Vec/GloVe
Advantages:
- Handles OOV words naturally
- Better on morphologically rich languages (Turkish, Finnish, German)
- Better on rare words (shares information through subwords)
- Smaller model size for certain applications (can compress n-grams)
Disadvantages:
- Slower training (millions of n-grams vs thousands of words)
- May blur distinctions between unrelated words that share character sequences (e.g., "mean" and "meaning" share many n-grams despite diverging meanings)
Language-Specific Performance:
- English: FastText and Word2Vec perform similarly (English has simpler morphology)
- German/Turkish: FastText significantly outperforms (compound words and rich inflection)
- Chinese: Character n-grams are less useful (characters are semantic units, not morphemes)
Language Models and Embeddings
Language models (LMs) predict the probability of text sequences. Training LMs naturally produces word embeddings as a side effect — the hidden representations learned to predict the next word capture semantic information.
N-gram Language Models
An n-gram model predicts the next word based on the previous $n - 1$ words:

$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

Estimation from Counts:

$$P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{\text{count}(w_{t-n+1}, \ldots, w_{t-1}, w_t)}{\text{count}(w_{t-n+1}, \ldots, w_{t-1})}$$

Problems:

1. Sparsity: Most n-grams never appear in training data. With $V = 50{,}000$, there are $V^3 \approx 10^{14}$ possible trigrams, vastly more than any corpus contains, so most counts are zero.
2. No generalization: Counts for "the cat sat" say nothing about "the dog sat"; every n-gram is estimated independently.
Neural Language Models
Neural LMs replace count-based estimation with neural networks. The key insight: instead of treating each n-gram independently, map words to embeddings and use those embeddings to predict the next word.
Architecture (Bengio et al., 2003):

1. Input: Previous $n - 1$ words as one-hot vectors
2. Embedding layer: Convert to embeddings $\mathbf{v}_{w_{t-n+1}}, \ldots, \mathbf{v}_{w_{t-1}}$
3. Concatenation: $\mathbf{x} = [\mathbf{v}_{w_{t-n+1}}; \ldots; \mathbf{v}_{w_{t-1}}]$
4. Hidden layer: $\mathbf{h} = \tanh(W\mathbf{x} + \mathbf{b})$
5. Output: Softmax over the vocabulary: $P(w_t \mid \text{context}) = \text{softmax}(U\mathbf{h} + \mathbf{c})$

Objective: Maximize the log-likelihood of the training data:

$$J = \sum_{t} \log P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$
Why Embeddings Help: - Words with similar embeddings make similar predictions - "The cat sat on the mat" and "The dog sat on the mat" share information through similar embeddings for "cat" and "dog" - Model generalizes to unseen n-grams: if "dog" appears in training but "puppy" doesn't, similar embeddings transfer knowledge
Modern Neural LMs: RNNs and Transformers
Recurrent Neural Networks (RNNs): Instead of fixed n-gram windows, RNNs process sequences of any length by carrying a hidden state:

$$\mathbf{h}_t = f(W_h \mathbf{h}_{t-1} + W_x \mathbf{v}_{w_t}), \qquad P(w_{t+1} \mid w_1, \ldots, w_t) = \text{softmax}(U \mathbf{h}_t)$$

The hidden state summarizes the entire history, removing the fixed-window limitation.
Transformers (covered in later articles): Attention mechanisms replace recurrence, allowing parallel computation and better long-range modeling. Models like GPT, BERT, and their successors are transformer-based LMs trained on massive corpora.
Contextualized Embeddings: Unlike Word2Vec/GloVe, modern LMs produce context-dependent embeddings. The word "bank" gets different representations in "river bank" vs "bank account". This is a fundamental advance we'll explore in future articles on BERT and transformers.
Connection to Word2Vec
Notice that Word2Vec's objectives are closely related to neural LM objectives:
- Skip-gram: Predicts context words from target ≈ simplified LM predicting surrounding words
- CBOW: Predicts target from context ≈ fill-in-the-blank LM
The main simplification: Word2Vec uses shallow networks (one embedding layer, one output layer) and ignores word order in context (bag-of-words assumption). This makes training faster while still capturing distributional semantics.
Evaluating and Visualizing Embeddings
How do we know if embeddings are good? Evaluation falls into two categories: intrinsic (direct embedding quality) and extrinsic (downstream task performance).
Intrinsic Evaluation: Analogies
The famous "king - man + woman = queen" example is an analogy task:
More precisely: given "a is to b as c is to ?", find the word $w^*$ whose embedding maximizes cosine similarity to the query vector:

$$w^* = \arg\max_{w \in V \setminus \{a, b, c\}} \cos\left(\mathbf{v}_w,\ \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c\right)$$
Standard Datasets:
- Google Analogy Dataset: 19,544 questions in categories like:
  - Capital-country: "Paris is to France as Berlin is to ?"
  - Gender: "man is to woman as king is to ?"
  - Comparative: "good is to better as bad is to ?"
  - Plural: "dog is to dogs as cat is to ?"
- MSR Analogy Dataset: 8,000 morphological and syntactic analogies
Scoring: Report accuracy (percentage of questions answered correctly). Typical results:
- Skip-gram with large corpus: 60-70% accuracy
- GloVe: 70-80% accuracy
- Random embeddings: <5% accuracy
Limitations: Some analogies are culturally biased or ambiguous. "Paris is to France" has multiple relationships (capital, part-of, located-in). Critics argue that analogy accuracy doesn't strongly correlate with downstream task performance.
Intrinsic Evaluation: Word Similarity
Word similarity datasets contain human-annotated similarity scores for word pairs. The task: compute cosine similarity between embeddings and correlate with human judgments.
Standard Datasets:
- WordSim-353: 353 word pairs with similarity scores 0-10
  - Example: ("tiger", "cat") → 7.35, ("book", "paper") → 7.46
- SimLex-999: 999 pairs emphasizing true similarity (not relatedness)
  - ("coast", "shore") → high similarity
  - ("coast", "ocean") → high relatedness but lower similarity
- MEN: 3,000 pairs from naturally occurring text
Metric: Spearman correlation $\rho$ between the embeddings' cosine similarities and the human scores, computed across all pairs in the dataset.
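Spearman correlation ranks both score lists and correlates the ranks. A dependency-free sketch (in practice `scipy.stats.spearmanr` handles ties properly; the score lists below are invented):

```python
def rank(values):
    """1-based ranks of the values; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (tie-free case)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [7.35, 7.46, 1.31, 5.77]   # hypothetical human similarity scores
model = [0.82, 0.88, 0.05, 0.61]   # hypothetical embedding cosine similarities
print(spearman(human, model))      # 1.0: the two rankings agree perfectly
```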
Extrinsic Evaluation: Downstream Tasks
The ultimate test: do embeddings improve real NLP tasks?
Common Tasks:
1. Sentiment Analysis: Classify movie reviews as positive/negative
2. Named Entity Recognition: Tag tokens as person, organization, location
3. Text Classification: Categorize news articles by topic
4. Machine Translation: Use embeddings as encoder/decoder initialization
Typical Setup:
- Initialize model with pre-trained embeddings (Word2Vec, GloVe, FastText)
- Fine-tune on task-specific data
- Compare to random initialization or task-specific embeddings
Results: Pre-trained embeddings typically improve accuracy by 2-10% when training data is limited. With massive task-specific data, the advantage diminishes (the model learns good embeddings from scratch).
Visualization: t-SNE and PCA
High-dimensional embeddings ($d$ = 100-300) can't be viewed directly, so we project them to 2D for visualization.

PCA (Principal Component Analysis): Linear projection maximizing variance: project the embeddings onto the top two eigenvectors of their covariance matrix. Fast and deterministic, but a linear map can flatten nonlinear structure.

t-SNE (t-Distributed Stochastic Neighbor Embedding): Nonlinear projection preserving local neighborhoods. t-SNE minimizes the KL divergence between pairwise-similarity distributions in the high-dimensional and low-dimensional spaces:

$$\mathrm{KL}(P \| Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij}$ measures how likely points $i$ and $j$ are neighbors in the original space and $q_{ij}$ in the 2D projection.
Visual Patterns: Good embeddings show:
- Semantic clusters (countries grouped together, animals grouped together)
- Smooth transitions (gradual color changes from "red" to "blue")
- Analogical relationships (parallel vectors for gender, tense, etc.)
Example Clusters:
- Countries: France, Germany, Spain, Italy
- Animals: dog, cat, horse, cow
- Professions: teacher, doctor, engineer, lawyer
Practical Training with Gensim
Let's train Word2Vec and FastText embeddings using Python's Gensim library, and load pre-trained GloVe vectors through Gensim's downloader. We'll use a sample corpus and evaluate the results.
Installing Dependencies
```bash
pip install gensim numpy matplotlib scikit-learn
```
Training Word2Vec
```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "and", "the", "dog", "are", "animals"],
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 selects Skip-gram (sg=0 is CBOW); negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, negative=5, epochs=100, seed=42)

print("Cat embedding shape:", model.wv["cat"].shape)
print("Words similar to 'cat':", model.wv.most_similar("cat", topn=3))
print("Similarity between 'cat' and 'dog':", model.wv.similarity("cat", "dog"))
```
Output (will vary due to small corpus):
```
Cat embedding shape: (100,)
Words similar to 'cat': [('dog', 0.87), ('animals', 0.65), ...]
Similarity between 'cat' and 'dog': 0.8734
```
Training FastText
```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "and", "the", "dog", "are", "animals"],
]

# min_n/max_n control the character n-gram lengths (3 to 6, as in the paper)
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1,
                    sg=1, min_n=3, max_n=6, epochs=100)

# OOV words still get vectors, built from their n-grams
print(ft_model.wv["catlike"].shape)         # works even though "catlike" was never seen
print(ft_model.wv.similarity("cat", "cats"))
```
Training Word2Vec on Large Corpus
For real applications, use larger corpora like Wikipedia, news articles, or domain-specific text:
```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream sentences from a file with one whitespace-tokenized sentence per line
# (the file path here is illustrative)
corpus = LineSentence("wiki_corpus.txt")

model = Word2Vec(corpus, vector_size=300, window=10, min_count=5,
                 sg=1, negative=10, workers=4, epochs=5)
model.save("word2vec_wiki.model")
```
Analogy Testing
```python
# Test analogies (requires sufficient vocabulary)
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)

# Batch evaluation on the Google analogy dataset bundled with Gensim
from gensim.test.utils import datapath
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Analogy accuracy: {score:.2%}")
```
Visualization with t-SNE
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Take a sample of the vocabulary and stack the corresponding vectors
words = model.wv.index_to_key[:200]
vectors = np.array([model.wv[w] for w in words])

# Project the embeddings to 2D (perplexity must be smaller than the point count)
coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.savefig("embeddings_tsne.png", dpi=150)
```
Loading Pre-trained Embeddings
Instead of training from scratch, use pre-trained embeddings:
```python
import gensim.downloader as api

# Downloads the model on first use and caches it locally
wv = api.load("glove-wiki-gigaword-100")

print(wv.most_similar("king", topn=3))
print(wv.similarity("cat", "dog"))
```
Available Pre-trained Models:
- `word2vec-google-news-300`: 3M words, 300-dim, trained on Google News (100B tokens)
- `glove-wiki-gigaword-300`: 400K words, 300-dim, trained on Wikipedia + Gigaword
- `fasttext-wiki-news-subwords-300`: 1M words, 300-dim, includes subword info
Converting Embeddings to Other Formats
```python
# Save in word2vec text format
model.wv.save_word2vec_format("vectors.txt", binary=False)

# Save in the binary format (smaller on disk, faster to load)
model.wv.save_word2vec_format("vectors.bin", binary=True)

# Reload later without the training machinery
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)
```
Frequently Asked Questions
Q1: Why do we need dense embeddings? Can't we just use one-hot vectors?
One-hot vectors treat all words as equally different — the distance between "cat" and "dog" equals the distance between "cat" and "quantum". This prevents models from generalizing. If the training set contains "The cat runs fast" but not "The dog runs fast", a one-hot model can't infer that "dog" might also run fast.
Dense embeddings encode similarity: "cat" and "dog" have similar embeddings because they appear in similar contexts (both are animals, both can sit/run/eat). When a model learns that "cat" relates to "pet", it automatically knows "dog" does too, through their similar embeddings.
Additionally, one-hot vectors waste memory and computation. With $V = 50{,}000$, each word is a 50,000-dimensional vector with a single nonzero entry; a 300-dimensional dense embedding is over 150x smaller, and every dimension carries information.
Q2: What's the difference between Skip-gram and CBOW?
Skip-gram takes a target word and predicts surrounding context words. Given "the quick brown fox jumps", it predicts "the", "quick", "fox", "jumps" from "brown". Each target word generates multiple training examples (one per context word), making it better for rare words.
CBOW takes context words and predicts the target. Given "the quick ___ fox jumps", it predicts "brown" from the context. It averages context embeddings, which smooths noise but loses information about individual context words. CBOW trains faster and works better for frequent words.
Rule of thumb: Use Skip-gram for small datasets or when rare words matter. Use CBOW for large datasets when speed is critical.
Q3: How does negative sampling make training faster?
Standard softmax requires computing scores for all $V$ vocabulary words on every training example just to normalize the denominator: $O(V)$ dot products per example.

Negative sampling changes the task: instead of predicting which word is the correct context (multi-class classification over $V$ classes), the model answers $k + 1$ binary questions: "is this word a true context word, or random noise?" This costs only $k + 1$ dot products per example (typically $k$ = 5-20), a speedup of several orders of magnitude for large vocabularies.
The trick works because the noise words are random, so the model learns to differentiate real context from noise without seeing all possible negatives.
Q4: What does the "0.75 exponent" in negative sampling's noise distribution do?
The noise distribution is $P_n(w) \propto U(w)^{0.75}$, where $U(w)$ is the unigram (raw frequency) distribution. Why raise it to the 0.75 power instead of sampling proportionally to frequency, or uniformly?
The problem: very frequent words like "the", "is", "and" would dominate negative samples, teaching the model "these aren't context words" redundantly. Meanwhile, rare words like "zebra" would rarely appear, so the model never learns to distinguish them.
The 0.75 exponent reduces frequent words' probability and boosts rare words'. If "the" has count 1000 and "zebra" has count 10:

- Proportional sampling: "the" is drawn $1000 / 10 = 100\times$ more often than "zebra"
- With the 0.75 exponent: $1000^{0.75} / 10^{0.75} = 100^{0.75} \approx 31.6\times$ more often

The exponent compresses the frequency range, giving rare words a meaningfully larger share of the negative samples.
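Checking that arithmetic directly:

```python
the_count, zebra_count = 1000, 10

proportional_ratio = the_count / zebra_count
smoothed_ratio = the_count ** 0.75 / zebra_count ** 0.75

print(proportional_ratio)        # 100.0
print(round(smoothed_ratio, 1))  # 31.6
```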
Q5: Why does GloVe use a weighting function $f(X_{ij})$?
Without weighting, the objective treats all co-occurrences equally: $\sum_{i,j} (\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2$. This causes two problems.

Rare pairs have noisy co-occurrences: $X_{ij} = 1$ or $2$ might be a statistical accident, not a meaningful relationship. Fitting these exactly introduces noise.

Frequent pairs dominate: The most common words co-occur millions of times. The optimization focuses on "the-the", "the-is", "is-a" pairs, ignoring less frequent but more informative pairs.

The weighting function $f(x) = \min\left(1, (x / x_{\max})^{0.75}\right)$ fixes both: it vanishes for rare pairs and caps the influence of frequent ones.
The result: balanced optimization across all co-occurrence ranges.
Q6: When should I use FastText instead of Word2Vec?
Use FastText when:
- Your language has rich morphology (German, Turkish, Finnish, Russian, Arabic)
- You have many compound words (German: "Schadenfreude", "Weltanschauung")
- You need to handle OOV words (spelling variations, typos, new words)
- Your vocabulary is large but training data is limited (subword sharing helps)
Use Word2Vec when:
- Your language has simple morphology (English, Chinese)
- Vocabulary is fixed and OOV words are rare
- Speed matters (FastText is slower due to n-grams)
- You want to distinguish between words that happen to share character sequences but have different meanings
Example: In English, "cat" and "category" share the substring "cat" but have unrelated meanings. Word2Vec keeps them separate; FastText might assign them slightly more similar embeddings due to shared n-grams.
Q7: How do I choose embedding dimension $d$?
Larger dimensions capture more information but require more data and computation. Typical values:
- $d$ = 50-100: Small datasets (< 1M tokens), simple tasks, limited memory
- $d$ = 100-300: Medium datasets (1M-1B tokens), general-purpose embeddings
- $d$ = 300-1000: Large datasets (> 1B tokens), specialized domains, when quality matters more than speed
Diminishing returns: Increasing $d$ beyond roughly 300 rarely improves classic word embeddings much; quality plateaus while memory and training cost keep growing linearly.

Rule of thumb: Start with $d = 100$ for quick experiments and $d = 300$ for production; change it only if evaluation shows a clear gain.
Q8: Can I combine Word2Vec with deep learning models like BERT?
Yes, but it's usually unnecessary. Modern models like BERT produce contextualized embeddings— each word gets a different embedding depending on context. For example, "bank" in "river bank" vs "bank account" gets different BERT embeddings.
Word2Vec/GloVe produce static embeddings—"bank" always has the same embedding regardless of context. This is simpler but less powerful.
When to combine:
- You have limited computational resources (BERT requires GPUs; Word2Vec works on CPUs)
- Your task is simple and doesn't need context (e.g., word similarity, document clustering)
- You're working with low-resource languages where BERT isn't available
Typical workflow:
1. Small tasks / limited resources: Use pre-trained Word2Vec/GloVe
2. Medium tasks: Fine-tune BERT
3. Large tasks: Pre-train your own BERT-like model
Q9: Why do embeddings capture analogies like "king - man + woman = queen"?
Embeddings learn from context, and certain relationships appear consistently across contexts:
- "The king ruled the kingdom" vs "The queen ruled the kingdom"
- "The man walked down the street" vs "The woman walked down the street"
The model learns:

- $\mathbf{v}_{\text{queen}} - \mathbf{v}_{\text{king}} \approx \mathbf{v}_{\text{woman}} - \mathbf{v}_{\text{man}}$: the gender offset is consistent across word pairs
- Relationships like royalty, tense, and plurality likewise become consistent vector offsets

The vector difference $\mathbf{v}_{\text{woman}} - \mathbf{v}_{\text{man}}$ encodes the gender relationship because the two words occur in otherwise nearly identical contexts, so their embeddings differ mainly along that one direction.

Mathematically, if we assume additive compositionality:

$$\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$$
Q10: How do I know if my embeddings are good without labeled downstream data?
Use intrinsic evaluation metrics that don't require task-specific labels:
Word similarity correlation: Compare embedding similarities to human judgments on WordSim-353, SimLex-999, etc. If the Spearman correlation is high (e.g., $\rho > 0.6$), your embeddings capture human intuitions.

Analogy accuracy: Test on the Google Analogy Dataset. Good embeddings should achieve > 50% accuracy on semantic analogies.

Nearest neighbors inspection: Check the $k$ nearest neighbors of sample words. For "cat", neighbors should include "dog", "kitten", "feline", not random words. For "Paris", neighbors should include "London", "France", "Berlin".

Clustering coherence: Cluster embeddings (k-means or hierarchical) and inspect the clusters. Good embeddings group semantically related words (countries together, animals together, professions together).
Visualization: Project to 2D with t-SNE and look for semantic structure. Related words should form visible clusters.
These methods aren't perfect — embeddings that score well on analogies might still fail on your specific task — but they provide quick sanity checks without requiring labeled data.
Summary and Future Directions
Word embeddings transformed NLP by encoding semantic relationships in dense vectors. We've covered:
- Evolution from one-hot to dense representations: Solving sparsity and enabling generalization
- Word2Vec (Skip-gram and CBOW): Local context prediction with negative sampling and hierarchical softmax
- GloVe: Global matrix factorization leveraging co-occurrence statistics
- FastText: Subword embeddings handling morphology and OOV words
- Language models: Predicting text sequences while learning embeddings as side effects
- Evaluation: Analogies, word similarity, downstream tasks, and visualization
- Practical training: Using Gensim for Word2Vec, FastText, and pre-trained embeddings
Limitations of Static Embeddings:
Despite their success, Word2Vec, GloVe, and FastText have a fundamental limitation: each word has a single embedding regardless of context. "Bank" means the same thing whether we're discussing rivers or finance. "Apple" represents both fruit and company with the same vector.
Future: Contextualized Embeddings:
Modern models like ELMo, GPT, and BERT produce context-dependent embeddings. Each word gets a different representation based on its sentence:
- "I visited the bank of the river" → "bank" has a "geography" embedding
- "I deposited money at the bank" → "bank" has a "finance" embedding
These models use deep transformers trained on massive corpora, achieving state-of-the-art results across NLP tasks. We'll explore them in future articles.
Key Takeaways:
- Embeddings capture distributional semantics: words in similar contexts get similar vectors
- Training objectives (Skip-gram, CBOW, GloVe) all implement the distributional hypothesis in different ways
- Computational tricks (negative sampling, hierarchical softmax) make training feasible
- Static embeddings remain useful for resource-constrained settings and interpretability
- The field has moved to contextualized embeddings, but understanding Word2Vec and GloVe provides the foundation for modern NLP
Word embeddings opened the door to deep learning in NLP. By representing words as continuous vectors, they enabled neural networks to process language effectively. The journey from discrete symbols to continuous representations continues with transformers and large language models — but it all started with the simple idea that words are defined by their context.
- Post title: NLP (2): Word Embeddings and Language Models
- Post author: Chen Kai
- Create time: 2024-02-08 14:30:00
- Post link: https://www.chenk.top/en/nlp-word-embeddings-lm/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.