NLP (2): Word Embeddings and Language Models
Chen Kai

Word embeddings revolutionized natural language processing by transforming words from sparse one-hot vectors into dense, meaningful representations that capture semantic relationships. Before embeddings, machines saw "king" and "queen" as completely unrelated symbols — just different positions in a vocabulary list. After embeddings, machines learned that these words share gender and royalty concepts, enabling them to solve analogies like "king - man + woman = queen" through simple vector arithmetic.

This article explores the journey from one-hot encodings to modern embedding techniques. We'll examine Word2Vec's innovative training strategies (Skip-gram and CBOW), GloVe's global matrix factorization approach, and FastText's subword extensions. We'll also connect embeddings to language models, showing how predicting context naturally produces semantic representations. By the end, you'll understand not just how to use pre-trained embeddings, but why they work and how to train your own.

From Sparse to Dense: Why Embeddings Matter

The Problem with One-Hot Encoding

Traditional NLP represented words as one-hot vectors: if your vocabulary has V words, each word becomes a V-dimensional vector with a single 1 and the rest 0s. For example, with vocabulary {cat, dog, mat}:

cat = [1, 0, 0], dog = [0, 1, 0], mat = [0, 0, 1]

This encoding has fatal flaws:

  1. Sparsity: Real vocabularies contain 50,000+ words, creating vectors that are 99.999% zeros. This wastes memory and computation.

  2. No Semantic Information: The dot product between any two different one-hot vectors is always zero: cat · dog = 0 and cat · quantum = 0. The model can't tell that "cat" is more similar to "dog" than to "quantum".

  3. Curse of Dimensionality: With V = 50,000, even a simple linear classifier needs millions of parameters. Models can't generalize across similar words.

The Embedding Solution

Word embeddings map each word to a dense, low-dimensional vector (typically 100-300 dimensions). Instead of V sparse dimensions, we get d ≪ V dense dimensions. For example, with d = 4:

cat = [0.2, -0.4, 0.7, 0.1], dog = [0.3, -0.5, 0.6, 0.2]

Now similarity makes sense: related words have high dot products. This is the distributional hypothesis in action: words appearing in similar contexts should have similar embeddings.
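To make the contrast concrete, here is a tiny sketch comparing similarities between dense vectors. The 4-dimensional values are illustrative, not taken from any trained model:

```python
import numpy as np

# Hypothetical 4-dimensional embeddings (illustrative values, not from a trained model)
cat = np.array([0.2, -0.4, 0.7, 0.1])
dog = np.array([0.3, -0.5, 0.6, 0.2])
quantum = np.array([-0.6, 0.8, -0.1, 0.5])

def cosine(a, b):
    """Cosine similarity: dot product of L2-normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, dog))      # high: related words point in similar directions
print(cosine(cat, quantum))  # low (here negative): unrelated words
```

With one-hot vectors, both comparisons would return exactly 0; dense vectors let the geometry carry meaning.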

The Distributional Hypothesis

The foundation of modern embeddings is Firth's principle: "You shall know a word by the company it keeps." Words appearing in similar contexts tend to have similar meanings:

  • "The cat sat on the mat" vs "The dog sat on the mat"
  • "The king ruled the kingdom" vs "The queen ruled the kingdom"

Both "cat" and "dog" appear after "the" and before "sat on the mat". Both "king" and "queen" appear before "ruled the kingdom". By training models to predict words from context (or vice versa), we force the model to learn embeddings that capture these distributional patterns.

Word2Vec: Learning from Local Context

Word2Vec, introduced by Mikolov et al. in 2013, popularized neural word embeddings through two efficient architectures: Skip-gram and Continuous Bag-of-Words (CBOW). Both are shallow neural networks trained to predict words from context.

Skip-gram: Predicting Context from Target

Skip-gram takes a target word and predicts surrounding context words within a window. Given a sentence w_1, …, w_T, for each position t:

  • Input: Target word w_t

  • Output: Context words w_{t-m}, …, w_{t-1}, w_{t+1}, …, w_{t+m} (window size m)

For example, with window size m = 2 and sentence "the quick brown fox jumps":

  • Target: "brown" → Context: "the", "quick", "fox", "jumps"

  • Target: "fox" → Context: "quick", "brown", "jumps"
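The windowing above can be sketched as a small pair generator. This is a minimal illustration of how Skip-gram training examples are produced, not Gensim's actual implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
# target "brown" pairs with: the, quick, fox, jumps
print([c for t, c in pairs if t == "brown"])
```

Each target word yields up to 2m training pairs, which is why Skip-gram extracts more signal per occurrence of a rare word.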

Architecture:

  1. Input: One-hot vector for the target word w_t

  2. Embedding layer: Look up v_{w_t} (this is the word embedding)

  3. Output layer: For each context position, compute scores v'_w · v_{w_t} for every word w

  4. Softmax: Convert scores to probabilities:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1}^{V} exp(v'_w · v_{w_I})

The objective maximizes the log probability of context words:

(1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)

where v_w is the embedding of word w and v'_w is its output vector.

Why Skip-gram Works: The model learns that words appearing in similar contexts should have similar embeddings. If "cat" and "dog" both predict "sat", "mat", "runs", their embeddings will be pushed closer together.

CBOW: Predicting Target from Context

CBOW (Continuous Bag-of-Words) reverses the direction: it takes context words and predicts the target word in the center.

  • Input: Context words w_{t-m}, …, w_{t-1}, w_{t+1}, …, w_{t+m}

  • Output: Target word w_t

Architecture:

  1. Input: One-hot vectors for all 2m context words

  2. Embedding: Average the context embeddings: h = (1/2m) Σ_{j ≠ 0} v_{w_{t+j}}

  3. Output: Scores v'_w · h for every word w

  4. Softmax: P(w_t | context) = exp(v'_{w_t} · h) / Σ_{w} exp(v'_w · h)

The objective is:

(1/T) Σ_{t=1}^{T} log P(w_t | w_{t-m}, …, w_{t-1}, w_{t+1}, …, w_{t+m})

Skip-gram vs CBOW:

  • Skip-gram: Slower training but better on rare words (each target generates multiple training examples)
  • CBOW: Faster training, smooths over context by averaging embeddings, better on frequent words

The Softmax Bottleneck

Both Skip-gram and CBOW have a computational problem: the softmax denominator requires summing over all V words:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1}^{V} exp(v'_w · v_{w_I})

With V = 100,000, computing this denominator for every training example is prohibitively expensive. Word2Vec introduced two solutions: negative sampling and hierarchical softmax.

Negative Sampling

Instead of computing the full softmax, negative sampling turns the problem into binary classification: distinguish the true context word (positive example) from random noise words (negative examples).

For each (w_I, w_O) pair (target and context):

  1. Positive example: label 1, indicating w_O appears in w_I's context
  2. Negative examples: Sample k random words w_1, …, w_k from a noise distribution, each with label 0

The objective for each positive pair becomes:

log σ(v'_{w_O} · v_{w_I}) + Σ_{i=1}^{k} log σ(-v'_{w_i} · v_{w_I})

where σ(x) = 1 / (1 + e^{-x}) is the sigmoid function.

Intuition: We want v'_{w_O} · v_{w_I} to be large (so σ ≈ 1) for true context pairs, and small (so σ(-·) ≈ 1) for random pairs.

Noise Distribution: Word2Vec uses P_n(w) ∝ U(w)^{0.75}, where U(w) is the unigram frequency. The 0.75 exponent reduces the probability of very frequent words and increases it for rare words, creating more informative negative samples.

Complexity: Computing this objective requires evaluating only k + 1 dot products (k is typically 5 to 20), compared to V for the full softmax. This makes training 100-1000× faster.
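The objective above can be written out directly. Below is a minimal sketch with randomly initialized toy embedding tables (the sizes and values are illustrative only), computing the negative-sampling loss for one (target, context) pair:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, k = 50, 1000, 5
# Toy input and output embedding tables, randomly initialized for illustration
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = rng.normal(scale=0.1, size=(V, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target, context, negatives):
    """Negative-sampling loss for one (target, context) pair:
    -[log sigma(v'_c . v_t) + sum_i log sigma(-v'_i . v_t)]"""
    v_t = W_in[target]
    pos = np.log(sigmoid(W_out[context] @ v_t))
    neg = np.sum(np.log(sigmoid(-W_out[negatives] @ v_t)))
    return -(pos + neg)  # negative log-likelihood

negs = rng.integers(0, V, size=k)
loss = neg_sampling_loss(target=3, context=7, negatives=negs)
print(loss)  # only k + 1 = 6 dot products, instead of V = 1000 for full softmax
```

A real trainer would also backpropagate through this loss; the point here is that the cost per example is k + 1 dot products, independent of V.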

Hierarchical Softmax

Hierarchical softmax organizes vocabulary into a binary tree (typically a Huffman tree based on word frequencies). Each word is a leaf, and each internal node has a learned vector.

To compute P(w | w_I):

  1. Find the path from root to word w: a sequence of left/right decisions
  2. Each decision is a binary classifier using the input embedding v_{w_I} and the node's vector
  3. Multiply the probabilities along the path

If the path to word w involves nodes n_1, …, n_L with directions d_1, …, d_L (where d = +1 means left and d = -1 means right):

P(w | w_I) = Π_{j=1}^{L} σ(d_j · v_{n_j} · v_{w_I})

Complexity: Computing this requires O(log V) operations instead of O(V). Frequent words get shorter paths (fewer computations), while rare words have longer paths.

Trade-offs: Hierarchical softmax is faster for very large vocabularies but less flexible than negative sampling. Most implementations default to negative sampling for its simplicity and effectiveness.
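The path-product computation can be sketched in a few lines. The node scores here are toy values standing in for the dot products v_node · v_input; a real implementation would look them up from a learned tree:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_prob(scores, directions):
    """Probability of reaching a leaf word: product over internal nodes of
    sigma(score) for a left turn and sigma(-score) for a right turn."""
    p = 1.0
    for s, d in zip(scores, directions):
        p *= sigmoid(s) if d == "left" else sigmoid(-s)
    return p

# A word 3 levels deep needs only ~log2(V) classifier evaluations, not V
print(path_prob([1.2, -0.4, 2.0], ["left", "right", "left"]))
```

Because σ(s) + σ(-s) = 1 at every node, the leaf probabilities automatically sum to 1 over the vocabulary, so no explicit normalization over V words is needed.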

Training Details and Hyperparameters

Subsampling Frequent Words: Very frequent words like "the", "is", "a" provide little semantic information but dominate training data. Word2Vec randomly discards word w during training with probability:

P(discard w) = 1 - sqrt(t / f(w))

where f(w) is the word's relative frequency and t is a threshold (typically 10^-5). Words with f(w) > t get increasingly likely to be discarded.
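The discard formula is a one-liner; the example frequencies below are made up for illustration:

```python
import math

t = 1e-5  # subsampling threshold from the Word2Vec paper

def discard_prob(freq):
    """Probability of discarding a word with relative frequency freq:
    P = 1 - sqrt(t / freq), clipped at 0 so rare words are never discarded."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

print(discard_prob(0.05))  # a very frequent word like "the": mostly discarded
print(discard_prob(1e-6))  # a rare word: never discarded
```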

Dynamic Window Size: Instead of always using window size m, Word2Vec randomly samples the actual window size uniformly from {1, …, m} for each training instance. This gives higher weight to closer context words.

Common Hyperparameters:

  • Embedding dimension: 100 to 300
  • Window size: 5 to 10
  • Negative samples: 5 to 20
  • Learning rate: Start at 0.025, linearly decay toward 0
  • Minimum word count: Discard words appearing fewer than 5 times

GloVe: Global Matrix Factorization

While Word2Vec learns from local context windows, GloVe (Global Vectors for Word Representation, Pennington et al. 2014) takes a global view by explicitly factorizing word co-occurrence statistics.

Motivation: Capturing Global Statistics

Consider the co-occurrence probabilities of words with "ice" and "steam":

Word     Ratio P(word | "ice") / P(word | "steam")
solid 8.9
gas 0.36
water 1.36
fashion 0.96

The ratio reveals semantic relationships:

  • "solid" is strongly associated with "ice" (ratio ≫ 1)
  • "gas" is strongly associated with "steam" (ratio ≪ 1)
  • "water" relates to both (ratio ≈ 1)
  • "fashion" relates to neither (ratio ≈ 1)

GloVe argues that embeddings should encode these ratios directly.

The GloVe Objective

Let X_ij be the number of times word j appears in the context of word i (within some window in the corpus). Define:

  • X_i = Σ_k X_ik: total context words for word i
  • P_ij = X_ij / X_i: probability of word j in word i's context

GloVe seeks word embeddings w_i and context vectors c_j such that:

w_i · c_j + b_i + b'_j ≈ log X_ij

The full objective is:

J = Σ_{i,j=1}^{V} f(X_ij) (w_i · c_j + b_i + b'_j - log X_ij)²

where b_i, b'_j are bias terms and f is a weighting function:

f(x) = (x / x_max)^α if x < x_max, else 1

Typical values: x_max = 100, α = 0.75.

Why This Weighting?

  1. Down-weight rare pairs: f(x) → 0 as x → 0, reducing the impact of noisy, low-count co-occurrences
  2. Cap frequent pairs: f(x) = 1 for very frequent pairs (x ≥ x_max), preventing common words from dominating
  3. Smooth middle range: The α = 0.75 power provides smooth interpolation between these regimes
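The weighting function is simple enough to write down directly, using the typical values x_max = 100 and α = 0.75 quoted above:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: f(x) = (x / x_max)^alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(1))      # rare pair: heavily down-weighted
print(glove_weight(100))    # at the cap: full weight
print(glove_weight(10000))  # very frequent pair: still capped at 1
```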

GloVe vs Word2Vec

Similarities:

  • Both produce word embeddings in R^d
  • Both learn from co-occurrence patterns
  • Both yield similar performance on analogy and similarity tasks

Differences:

Aspect Word2Vec GloVe
Approach Local context prediction Global matrix factorization
Training Online (stochastic) Batch (iterate over co-occurrence matrix)
Objective Cross-entropy (implicit) Weighted least squares
Output Word embeddings Word + context embeddings

Practical Note: GloVe requires building the full co-occurrence matrix, which needs O(V²) space in the worst case. In practice, X is sparse (most word pairs never co-occur), so sparse matrix formats work well. Word2Vec doesn't need to store any global statistics, making it more memory-efficient for truly massive corpora.

Training GloVe

Step 1: Build Co-occurrence Matrix

Scan the corpus with a symmetric window of size m around each word. For the sentence "the cat sat on the mat", with window size m = 2:

  • "cat" co-occurs with: "the" (1×), "sat" (1×), "on" (0.5×)
  • Weights decrease with distance: a word d positions away contributes 1/d

This produces a sparse V × V matrix.
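The distance-weighted counting above can be sketched as a small helper (a minimal illustration, not the official GloVe preprocessing code):

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Build a distance-weighted co-occurrence dict:
    a pair of words d positions apart contributes weight 1/d."""
    X = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[(w, tokens[j])] += 1.0 / abs(i - j)
    return X

X = cooccurrence(["the cat sat on the mat".split()], window=2)
print(X[("cat", "sat")])  # adjacent: weight 1.0
print(X[("cat", "on")])   # distance 2: weight 0.5
```

For a real corpus you would store this as a sparse matrix and stream sentences from disk rather than keep them in memory.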

Step 2: Optimize Embeddings

Use AdaGrad or a similar optimizer to minimize the GloVe objective. Unlike Word2Vec, we iterate multiple epochs over the same matrix X (typically 50-100 epochs).

Step 3: Combine Word and Context Embeddings

The final embedding for word i is often the sum (or average) of its word vector w_i and context vector c_i:

w_i^final = w_i + c_i

This symmetric treatment often improves performance.

FastText: Subword Embeddings

Word2Vec and GloVe assign each word a single embedding, treating "unhappiness", "happiness", and "happy" as completely unrelated. FastText (Bojanowski et al., 2017) addresses this by representing words as bags of character n-grams.

Motivation: Morphology and Rare Words

Consider these problems:

  1. Out-of-vocabulary (OOV): If "unhappiness" wasn't in training data, Word2Vec has no embedding for it
  2. Morphology ignored: "teacher", "teaching", "teach" share the root "teach" but get unrelated embeddings
  3. Rare words: Words appearing once or twice get poorly trained embeddings

FastText solves these by building word embeddings from subword units.

Character N-gram Representation

For a word, FastText extracts all character n-grams of length 3 to 6 (configurable), plus the full word itself. Add special boundary symbols < and > to mark word boundaries.

Example: "where" with n = 3 to 6 (the word becomes <where>):

  • 3-grams: <wh, whe, her, ere, re>
  • 4-grams: <whe, wher, here, ere>
  • 5-grams: <wher, where, here>
  • 6-grams: <where, where>
  • Full word: <where>

Each n-gram g gets its own embedding z_g. The word embedding is the sum:

v_w = Σ_{g ∈ G_w} z_g

where G_w is the set of n-grams in word w (including the full word).
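The n-gram extraction can be written in a few lines. This is a sketch of the scheme described above, not FastText's exact internal routine:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract character n-grams with boundary markers, plus the full word."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    grams.append(marked)  # the full word is its own unit
    return grams

grams = char_ngrams("where")
print(grams[:5])  # the 3-grams: ['<wh', 'whe', 'her', 'ere', 're>']
```

Summing the embeddings of these units gives v_w; for an unseen word, the same extraction still produces usable n-grams.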

Training FastText

FastText uses the same Skip-gram or CBOW objectives as Word2Vec, but replaces the word embeddingwith the n-gram sum. The full training process:

  1. Extract n-grams for each word in vocabulary
  2. Create embeddings for all unique n-grams (the vocabulary expands from V words to millions of n-grams)
  3. Train Skip-gram/CBOW with negative sampling, updating n-gram embeddings
  4. Final word embedding is the sum of its n-gram embeddings

OOV Handling: For a new word like "superamazingly" not in training data, extract its n-grams and sum their embeddings. As long as some n-grams appeared in training (e.g., "super", "amaz", "zing", "ly>"), we get a reasonable embedding.

FastText vs Word2Vec/GloVe

Advantages: - Handles OOV words naturally - Better on morphologically rich languages (Turkish, Finnish, German) - Better on rare words (shares information through subwords) - Smaller model size for certain applications (can compress n-grams)

Disadvantages: - Slower training (millions of n-grams vs thousands of words) - May blur distinctions between unrelated words sharing character sequences (e.g., "mean" and "meaning" share n-grams but "mean" as a verb vs adjective)

Language-Specific Performance: - English: FastText and Word2Vec perform similarly (English has simpler morphology) - German/Turkish: FastText significantly outperforms (compound words and rich inflection) - Chinese: Character n-grams less useful (characters are semantic units, not morphemes)

Language Models and Embeddings

Language models (LMs) predict the probability of text sequences. Training LMs naturally produces word embeddings as a side effect — the hidden representations learned to predict the next word capture semantic information.

N-gram Language Models

An n-gram model predicts the next word based on the previous n - 1 words:

P(w_t | w_1, …, w_{t-1}) ≈ P(w_t | w_{t-n+1}, …, w_{t-1})

For example, a trigram model (n = 3) uses P(w_t | w_{t-2}, w_{t-1}).

Estimation from Counts:

P(w_t | w_{t-2}, w_{t-1}) = count(w_{t-2}, w_{t-1}, w_t) / count(w_{t-2}, w_{t-1})

Problems:

  1. Sparsity: Most n-grams never appear in training data. With V = 10,000, a trigram model has 10^12 (a trillion) possible trigrams
  2. Storage: Storing all observed n-gram counts requires huge memory
  3. Smoothing complexity: Need sophisticated smoothing (Kneser-Ney, etc.) to handle unseen n-grams
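The count-based estimate is easy to implement on a toy corpus (the sentence below is made up for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count trigrams and their bigram prefixes
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate P(w3 | w1, w2) = count(w1,w2,w3) / count(w1,w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0  # unseen prefix; real systems apply smoothing here
    return trigram_prob_raw(w1, w2, w3)

def trigram_prob_raw(w1, w2, w3):
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# "the cat" occurs twice, followed once by "sat" and once by "ran"
print(trigram_prob("the", "cat", "sat"))
```

The sparsity problem is visible even here: any trigram absent from the corpus gets probability 0, which is exactly what smoothing schemes like Kneser-Ney are designed to repair.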

Neural Language Models

Neural LMs replace count-based estimation with neural networks. The key insight: instead of treating each n-gram independently, map words to embeddings and use those embeddings to predict the next word.

Architecture (Bengio et al., 2003):

  1. Input: Previous n - 1 words as one-hot vectors

  2. Embedding layer: Convert each word w to an embedding C(w)

  3. Concatenation: x = [C(w_{t-n+1}); …; C(w_{t-1})]

  4. Hidden layer: h = tanh(Wx + b)

  5. Output: Softmax over the vocabulary: P(w_t | context) = softmax(Uh + d)

Objective: Maximize the log-likelihood of the training data:

Σ_t log P(w_t | w_{t-n+1}, …, w_{t-1})

Why Embeddings Help: - Words with similar embeddings make similar predictions - "The cat sat on the mat" and "The dog sat on the mat" share information through similar embeddings for "cat" and "dog" - Model generalizes to unseen n-grams: if "dog" appears in training but "puppy" doesn't, similar embeddings transfer knowledge

Modern Neural LMs: RNNs and Transformers

Recurrent Neural Networks (RNNs): Instead of fixed n-gram windows, RNNs process sequences of any length:

h_t = f(h_{t-1}, x_t)

The hidden state h_t summarizes the entire history. LSTMs and GRUs extend this with gating mechanisms to handle long-range dependencies.

Transformers (covered in later articles): Attention mechanisms replace recurrence, allowing parallel computation and better long-range modeling. Models like GPT, BERT, and their successors are transformer-based LMs trained on massive corpora.

Contextualized Embeddings: Unlike Word2Vec/GloVe, modern LMs produce context-dependent embeddings. The word "bank" gets different representations in "river bank" vs "bank account". This is a fundamental advance we'll explore in future articles on BERT and transformers.

Connection to Word2Vec

Notice that Word2Vec's objectives are closely related to neural LM objectives:

  • Skip-gram: Predicts context words from target ≈ simplified LM predicting surrounding words
  • CBOW: Predicts target from context ≈ fill-in-the-blank LM

The main simplification: Word2Vec uses shallow networks (one embedding layer, one output layer) and ignores word order in context (bag-of-words assumption). This makes training faster while still capturing distributional semantics.

Evaluating and Visualizing Embeddings

How do we know if embeddings are good? Evaluation falls into two categories: intrinsic (direct embedding quality) and extrinsic (downstream task performance).

Intrinsic Evaluation: Analogies

The famous "king - man + woman = queen" example is an analogy task: given "a is to b as c is to ?", compute v_b - v_a + v_c and look for the nearest word.

More precisely: find the word w* that maximizes cosine similarity:

w* = argmax_w cos(v_w, v_b - v_a + v_c), excluding a, b, and c
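The search can be sketched with toy vectors. The 3-dimensional embeddings below are hand-crafted so the analogy works; trained embeddings are higher-dimensional and noisier:

```python
import numpy as np

# Hand-crafted illustrative embeddings: dims roughly = (royalty, male, female)
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c):
    """Return the word maximizing cosine similarity to v_b - v_a + v_c,
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman"))  # "queen"
```

Excluding the query words matters: in practice v_b itself is often the nearest neighbor of v_b - v_a + v_c.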

Standard Datasets:

  • Google Analogy Dataset: 19,544 questions in categories like:
    - Capital-country: "Paris is to France as Berlin is to ?"
    - Gender: "man is to woman as king is to ?"
    - Comparative: "good is to better as bad is to ?"
    - Plural: "dog is to dogs as cat is to ?"

  • MSR Analogy Dataset: 8,000 morphological and syntactic analogies

Scoring: Report accuracy (percentage of questions answered correctly). Typical results:

  • Skip-gram with a large corpus: 60-70% accuracy
  • GloVe: 70-80% accuracy
  • Random embeddings: <5% accuracy

Limitations: Some analogies are culturally biased or ambiguous. "Paris is to France" has multiple relationships (capital, part-of, located-in). Critics argue that analogy accuracy doesn't strongly correlate with downstream task performance.

Intrinsic Evaluation: Word Similarity

Word similarity datasets contain human-annotated similarity scores for word pairs. The task: compute cosine similarity between embeddings and correlate with human judgments.

Standard Datasets:

  • WordSim-353: 353 word pairs with similarity scores 0-10
    - Example: ("tiger", "cat") → 7.35, ("book", "paper") → 7.46
  • SimLex-999: 999 pairs emphasizing true similarity (not relatedness)
    - ("coast", "shore") → high similarity
    - ("coast", "ocean") → high relatedness but lower similarity
  • MEN: 3,000 pairs from naturally occurring text

Metric: Spearman correlation ρ between embedding similarities and human scores. Typical results:

  • Word2Vec/GloVe: ρ ≈ 0.6 to 0.7
  • Random embeddings: ρ ≈ 0

Extrinsic Evaluation: Downstream Tasks

The ultimate test: do embeddings improve real NLP tasks?

Common Tasks:

  1. Sentiment Analysis: Classify movie reviews as positive/negative
  2. Named Entity Recognition: Tag tokens as person, organization, location
  3. Text Classification: Categorize news articles by topic
  4. Machine Translation: Use embeddings as encoder/decoder initialization

Typical Setup:

  • Initialize the model with pre-trained embeddings (Word2Vec, GloVe, FastText)
  • Fine-tune on task-specific data
  • Compare to random initialization or task-specific embeddings

Results: Pre-trained embeddings typically improve accuracy by 2-10% when training data is limited. With massive task-specific data, the advantage diminishes (the model learns good embeddings from scratch).

Visualization: t-SNE and PCA

High-dimensional embeddings (d = 100 to 300) can't be plotted directly. Dimensionality reduction projects them to 2D or 3D for visualization.

PCA (Principal Component Analysis): Linear projection maximizing variance:

Y = XW

where W contains the top 2 principal components (eigenvectors of the data's covariance matrix).

t-SNE (t-Distributed Stochastic Neighbor Embedding): Nonlinear projection preserving local neighborhoods. t-SNE minimizes the KL divergence between high-dimensional and low-dimensional probability distributions:

KL(P ‖ Q) = Σ_{i ≠ j} p_ij log(p_ij / q_ij)

where p_ij is the similarity between points i and j in high dimensions, and q_ij is their similarity in 2D.

Visual Patterns: Good embeddings show:

  • Semantic clusters (countries grouped together, animals grouped together)
  • Smooth transitions (gradual changes, e.g. from "red" to "blue")
  • Analogical relationships (parallel vectors for gender, tense, etc.)

Example Clusters:

  • Countries: France, Germany, Spain, Italy
  • Animals: dog, cat, horse, cow
  • Professions: teacher, doctor, engineer, lawyer

Practical Training with Gensim

Let's train Word2Vec, GloVe-like, and FastText embeddings using Python's Gensim library. We'll use a sample corpus and evaluate the results.

Installing Dependencies

pip install gensim numpy matplotlib scikit-learn

Training Word2Vec

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample corpus (replace with your data)
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
    "the quick brown fox jumps over the lazy dog",
    "a cat and a dog are playing in the garden",
]

# Tokenize
tokenized_sentences = [simple_preprocess(sentence) for sentence in sentences]

# Train Skip-gram model
model_sg = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window size
    min_count=1,      # Minimum word frequency
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # Negative sampling
    epochs=100,       # Training iterations
    seed=42
)

# Train CBOW model
model_cbow = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=0,             # CBOW
    negative=5,
    epochs=100,
    seed=42
)

# Get embedding for a word
cat_embedding = model_sg.wv['cat']
print(f"Cat embedding shape: {cat_embedding.shape}")

# Find similar words
similar_words = model_sg.wv.most_similar('cat', topn=5)
print(f"Words similar to 'cat': {similar_words}")

# Compute similarity
similarity = model_sg.wv.similarity('cat', 'dog')
print(f"Similarity between 'cat' and 'dog': {similarity:.4f}")

Output (will vary due to small corpus):

Cat embedding shape: (100,)
Words similar to 'cat': [('dog', 0.87), ('animals', 0.65), ...]
Similarity between 'cat' and 'dog': 0.8734

Training FastText

from gensim.models import FastText

# Train FastText model
model_ft = FastText(
    sentences=tokenized_sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,        # Skip-gram
    negative=5,
    epochs=100,
    min_n=3,     # Min character n-gram length
    max_n=6,     # Max character n-gram length
    seed=42
)

# Get embedding for a word
cats_embedding = model_ft.wv['cats']
print(f"Cats embedding (OOV or in-vocab): {cats_embedding.shape}")

# Compare in-vocab words
print(f"Similarity cat-dog: {model_ft.wv.similarity('cat', 'dog'):.4f}")

# For a truly OOV word (not in our tiny corpus), FastText raises no error:
# it builds an embedding from the word's character n-grams on the fly
kitty_vec = model_ft.wv['kitty']
print(f"Kitty embedding (OOV): {kitty_vec.shape}")
print(f"Similarity cat-kitty: {model_ft.wv.similarity('cat', 'kitty'):.4f}")

Training Word2Vec on Large Corpus

For real applications, use larger corpora like Wikipedia, news articles, or domain-specific text:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assume corpus.txt has one sentence per line
sentences = LineSentence('corpus.txt')

model = Word2Vec(
    sentences=sentences,
    vector_size=300,
    window=10,
    min_count=5,   # Ignore words appearing < 5 times
    sg=1,
    negative=15,
    epochs=5,
    workers=4,     # Parallel training
    seed=42
)

# Save model
model.save('word2vec.model')

# Load model
model = Word2Vec.load('word2vec.model')

Analogy Testing

# Test analogies (requires sufficient vocabulary)
def test_analogy(model, a, b, c, topn=1):
    """
    Test analogy: a is to b as c is to ?
    Example: king is to queen as man is to ?
    """
    try:
        result = model.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
        return result[0][0]
    except KeyError:
        return "OOV word"

# Example (may not work with tiny corpus)
answer = test_analogy(model_sg, 'cat', 'cats', 'dog')
print(f"cat is to cats as dog is to {answer}")

Visualization with t-SNE

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Get all word vectors
words = list(model_sg.wv.index_to_key)
vectors = np.array([model_sg.wv[word] for word in words])

# Apply t-SNE (perplexity must be smaller than the number of points;
# lower it for tiny vocabularies like this one)
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
vectors_2d = tsne.fit_transform(vectors)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.5)

# Annotate words
for i, word in enumerate(words):
    plt.annotate(word, xy=(vectors_2d[i, 0], vectors_2d[i, 1]))

plt.title('Word Embeddings Visualization (t-SNE)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.tight_layout()
plt.savefig('embeddings_tsne.png', dpi=150)
plt.show()

Loading Pre-trained Embeddings

Instead of training from scratch, use pre-trained embeddings:

import gensim.downloader as api

# List available models
print(list(api.info()['models'].keys()))

# Load pre-trained Word2Vec (trained on Google News)
model_pretrained = api.load('word2vec-google-news-300')

# Load pre-trained GloVe
model_glove = api.load('glove-wiki-gigaword-100')

# Load pre-trained FastText
model_fasttext = api.load('fasttext-wiki-news-subwords-300')

# Use them
print(model_pretrained.most_similar('computer', topn=5))
print(model_glove.similarity('king', 'queen'))

Available Pre-trained Models: - word2vec-google-news-300: 3M words, 300-dim, trained on Google News (100B tokens) - glove-wiki-gigaword-300: 400K words, 300-dim, trained on Wikipedia + Gigaword - fasttext-wiki-news-subwords-300: 1M words, 300-dim, includes subword info

Converting Embeddings to Other Formats

# Save in word2vec text format
model_sg.wv.save_word2vec_format('embeddings.txt', binary=False)

# Save in binary format (smaller, faster)
model_sg.wv.save_word2vec_format('embeddings.bin', binary=True)

# Load from word2vec format
from gensim.models import KeyedVectors
embeddings = KeyedVectors.load_word2vec_format('embeddings.txt', binary=False)

Frequently Asked Questions

Q1: Why do we need dense embeddings? Can't we just use one-hot vectors?

One-hot vectors treat all words as equally different — the distance between "cat" and "dog" equals the distance between "cat" and "quantum". This prevents models from generalizing. If the training set contains "The cat runs fast" but not "The dog runs fast", a one-hot model can't infer that "dog" might also run fast.

Dense embeddings encode similarity: "cat" and "dog" have similar embeddings because they appear in similar contexts (both are animals, both can sit/run/eat). When a model learns that "cat" relates to "pet", it automatically knows "dog" does too, through their similar embeddings.

Additionally, one-hot vectors waste memory and computation. With V = 100,000, a one-hot vector requires 100,000 dimensions (99.999% zeros). A 300-dimensional dense embedding captures more information in 0.3% of the space.

Q2: What's the difference between Skip-gram and CBOW?

Skip-gram takes a target word and predicts surrounding context words. Given "the quick brown fox jumps", it predicts "the", "quick", "fox", "jumps" from "brown". Each target word generates multiple training examples (one per context word), making it better for rare words.

CBOW takes context words and predicts the target. Given "the quick ___ fox jumps", it predicts "brown" from the context. It averages context embeddings, which smooths noise but loses information about individual context words. CBOW trains faster and works better for frequent words.

Rule of thumb: Use Skip-gram for small datasets or when rare words matter. Use CBOW for large datasets when speed is critical.

Q3: How does negative sampling make training faster?

Standard softmax requires computing probabilities over all V words:

P(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_{w=1}^{V} exp(v'_w · v_{w_I})

The denominator sums V exponentials — with V = 100,000, that's 100,000 operations per training example.

Negative sampling changes the task: instead of predicting which word is the correct context (multi-class classification over V classes), distinguish the true context word from k random "noise" words (binary classification repeated k + 1 times). With k = 10, you compute only 11 dot products instead of 100,000.

The trick works because the noise words are random, so the model learns to differentiate real context from noise without seeing all possible negatives.

Q4: What does the "0.75 exponent" in negative sampling's noise distribution do?

The noise distribution is P_n(w) ∝ U(w)^{0.75}, where U(w) is the unigram frequency. Without the exponent, we'd sample noise words proportional to their frequency: "the" appears 100× more often than "zebra", so it's sampled 100× more often as a negative example.

The problem: very frequent words like "the", "is", "and" would dominate negative samples, teaching the model "these aren't context words" redundantly. Meanwhile, rare words like "zebra" would rarely appear, so the model never learns to distinguish them.

The 0.75 exponent reduces frequent words' probability and boosts rare words'. If "the" has count 1000 and "zebra" has count 10:

  • Proportional: 1000 : 10 = 100 : 1
  • With 0.75: 1000^0.75 : 10^0.75 ≈ 178 : 5.6 ≈ 32 : 1

Now "zebra" is sampled about 3× more often relative to "the", making negative examples more informative.
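The arithmetic is easy to check in code, using the same toy counts:

```python
counts = {"the": 1000, "zebra": 10}

def noise_dist(counts, alpha=0.75):
    """P_n(w) proportional to count(w)^alpha, normalized to sum to 1."""
    powered = {w: c ** alpha for w, c in counts.items()}
    Z = sum(powered.values())
    return {w: p / Z for w, p in powered.items()}

plain = noise_dist(counts, alpha=1.0)
smoothed = noise_dist(counts, alpha=0.75)
print(plain["the"] / plain["zebra"])        # 100:1
print(smoothed["the"] / smoothed["zebra"])  # roughly 32:1
```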

Q5: Why does GloVe use a weighting function?

Without weighting, the objective would treat all co-occurrences equally:

J = Σ_{i,j} (w_i · c_j + b_i + b'_j - log X_ij)²

where w_i, c_j are word and context vectors and b_i, b'_j are biases. Two problems:

  1. Rare pairs have noisy co-occurrences: X_ij = 1 or X_ij = 2 might be statistical accidents, not meaningful relationships. Fitting these exactly introduces noise.

  2. Frequent pairs dominate: The most common words co-occur millions of times. The optimization focuses on "the-the", "the-is", "is-a" pairs, ignoring less frequent but more informative pairs.

The weighting function f(X_ij) fixes both:

  • f(x) → 0 as x → 0: Low-count pairs get low weight, reducing noise
  • f(x) = 1 for x ≥ x_max: Very frequent pairs are capped, preventing domination

The result: balanced optimization across all co-occurrence ranges.

Q6: When should I use FastText instead of Word2Vec?

Use FastText when: - Your language has rich morphology (German, Turkish, Finnish, Russian, Arabic) - You have many compound words (German: "Schadenfreude", "Weltanschauung") - You need to handle OOV words (spelling variations, typos, new words) - Your vocabulary is large but training data is limited (subword sharing helps)

Use Word2Vec when: - Your language has simple morphology (English, Chinese) - Vocabulary is fixed and OOV words are rare - Speed matters (FastText is slower due to n-grams) - You want to distinguish between words that happen to share character sequences but have different meanings

Example: In English, "cat" and "category" share the substring "cat" but have unrelated meanings. Word2Vec keeps them separate; FastText might assign them slightly more similar embeddings due to shared n-grams.

Q7: How do I choose embedding dimension?

Larger dimensions capture more information but require more data and computation. Typical values:

  • d = 50 to 100: Small datasets (< 1M tokens), simple tasks, limited memory
  • d = 100 to 300: Medium datasets (1M-1B tokens), general-purpose embeddings
  • d = 300 to 600: Large datasets (> 1B tokens), specialized domains, when quality matters more than speed

Diminishing returns: Increasing d from 50 to 100 gives large gains. Increasing from 300 to 600 gives small gains. Beyond d ≈ 300, improvements are marginal unless your corpus is enormous.

Rule of thumb: Start with d = 100 or d = 300. If you have > 1B tokens, try d = 300. Evaluate on your task — sometimes smaller embeddings work better (less overfitting).

Q8: Can I combine Word2Vec with deep learning models like BERT?

Yes, but it's usually unnecessary. Modern models like BERT produce contextualized embeddings — each word gets a different embedding depending on context. For example, "bank" in "river bank" vs "bank account" gets different BERT embeddings.

Word2Vec/GloVe produce static embeddings — "bank" always has the same embedding regardless of context. This is simpler but less powerful.

When to combine: - You have limited computational resources (BERT requires GPUs; Word2Vec works on CPUs) - Your task is simple and doesn't need context (e.g., word similarity, document clustering) - You're working with low-resource languages where BERT isn't available

Typical workflow:

  1. Small tasks / limited resources: Use pre-trained Word2Vec/GloVe
  2. Medium tasks: Fine-tune BERT
  3. Large tasks: Pre-train your own BERT-like model

Q9: Why do embeddings capture analogies like "king - man + woman = queen"?

Embeddings learn from context, and certain relationships appear consistently across contexts:

  • "The king ruled the kingdom" vs "The queen ruled the kingdom"
  • "The man walked down the street" vs "The woman walked down the street"

The model learns: -andboth have a "royalty" component -andboth have a "male" component -andboth have a "female" component

The vector differenceremoves the "male" component, leaving "royalty". Addingadds the "female" component, resulting in a vector close to.

Mathematically, if we assume additive compositionality with component vectors v_royalty, v_male, v_female, so that v_king = v_royalty + v_male, v_queen = v_royalty + v_female, v_man ≈ v_male, and v_woman ≈ v_female, then:

v_king - v_man + v_woman ≈ (v_royalty + v_male) - v_male + v_female = v_royalty + v_female = v_queen

This is a simplification — real embeddings aren't perfectly additive — but it explains why analogies often work.
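The additive story can be checked with hand-made component vectors. This is purely an illustrative sketch: the orthogonal "royalty"/"male"/"female" components are assumptions, not learned embeddings.

```python
import math

# Hand-made, orthogonal semantic components (an idealization).
royalty = [1.0, 0.0, 0.0]
male    = [0.0, 1.0, 0.0]
female  = [0.0, 0.0, 1.0]

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

emb = {
    "king":  add(royalty, male),
    "queen": add(royalty, female),
    "man":   male,
    "woman": female,
}

# king - man + woman = (royalty + male) - male + female = royalty + female
target = add(sub(emb["king"], emb["man"]), emb["woman"])

# The nearest remaining word to the target vector should be "queen".
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

With real embeddings the components are noisy and entangled, so the analogy lands *near* queen rather than exactly on it, which is why evaluation uses nearest-neighbor search rather than exact equality.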

Q10: How do I know if my embeddings are good without labeled downstream data?

Use intrinsic evaluation metrics that don't require task-specific labels:

  1. Word similarity correlation: Compare embedding similarities to human judgments on WordSim-353, SimLex-999, etc. If the correlation is high (e.g., Spearman correlation above roughly 0.6), your embeddings capture human intuitions.

  2. Analogy accuracy: Test on Google Analogy Dataset. Good embeddings should achieve > 50% accuracy on semantic analogies.

  3. Nearest neighbors inspection: Check the k-nearest neighbors for sample words. For "cat", neighbors should include "dog", "kitten", "feline", not random words. For "Paris", neighbors should include "London", "France", "Berlin".

  4. Clustering coherence: Cluster embeddings (k-means or hierarchical) and inspect clusters. Good embeddings group semantically related words (countries together, animals together, professions together).

  5. Visualization: Project to 2D with t-SNE and look for semantic structure. Related words should form visible clusters.

These methods aren't perfect — embeddings that score well on analogies might still fail on your specific task — but they provide quick sanity checks without requiring labeled data.
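As a sketch of the first check (word similarity correlation), here is a from-scratch Spearman computation on toy data. The 2-d vectors, word pairs, and human scores are invented stand-ins for a benchmark like WordSim-353.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

emb = {  # toy 2-d embeddings: animals vs vehicles
    "cat": [0.9, 0.1], "dog": [0.85, 0.2],
    "car": [0.1, 0.9], "truck": [0.15, 0.85],
}
# (word, word, human similarity score on a 0-10 scale)
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5),
         ("cat", "car", 2.0), ("dog", "truck", 2.5)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
print(round(spearman(model_scores, human_scores), 2))  # 0.8
```

In practice you would load a real benchmark file and use a library routine (e.g., SciPy's `spearmanr`, which also handles ties), but the principle is exactly this: rank agreement between model similarity and human judgment.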

Summary and Future Directions

Word embeddings transformed NLP by encoding semantic relationships in dense vectors. We've covered:

  • Evolution from one-hot to dense representations: Solving sparsity and enabling generalization
  • Word2Vec (Skip-gram and CBOW): Local context prediction with negative sampling and hierarchical softmax
  • GloVe: Global matrix factorization leveraging co-occurrence statistics
  • FastText: Subword embeddings handling morphology and OOV words
  • Language models: Predicting text sequences while learning embeddings as side effects
  • Evaluation: Analogies, word similarity, downstream tasks, and visualization
  • Practical training: Using Gensim for Word2Vec, FastText, and pre-trained embeddings

Limitations of Static Embeddings:

Despite their success, Word2Vec, GloVe, and FastText have a fundamental limitation: each word has a single embedding regardless of context. "Bank" means the same thing whether we're discussing rivers or finance. "Apple" represents both fruit and company with the same vector.

Future: Contextualized Embeddings:

Modern models like ELMo, GPT, and BERT produce context-dependent embeddings. Each word gets a different representation based on its sentence:

  • "I visited the bank of the river" → "bank" has a "geography" embedding
  • "I deposited money at the bank" → "bank" has a "finance" embedding

These models use deep transformers trained on massive corpora, achieving state-of-the-art results across NLP tasks. We'll explore them in future articles.

Key Takeaways:

  1. Embeddings capture distributional semantics: words in similar contexts get similar vectors
  2. Training objectives (Skip-gram, CBOW, GloVe) all implement the distributional hypothesis in different ways
  3. Computational tricks (negative sampling, hierarchical softmax) make training feasible
  4. Static embeddings remain useful for resource-constrained settings and interpretability
  5. The field has moved to contextualized embeddings, but understanding Word2Vec and GloVe provides the foundation for modern NLP

Word embeddings opened the door to deep learning in NLP. By representing words as continuous vectors, they enabled neural networks to process language effectively. The journey from discrete symbols to continuous representations continues with transformers and large language models — but it all started with the simple idea that words are defined by their context.

  • Post title: NLP (2): Word Embeddings and Language Models
  • Post author: Chen Kai
  • Create time: 2024-02-08 14:30:00
  • Post link: https://www.chenk.top/en/nlp-word-embeddings-lm/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.