NLP (1): Introduction and Text Preprocessing
Chen Kai

Natural Language Processing (NLP) bridges the gap between human communication and machine understanding. Whether you're building a chatbot, analyzing customer sentiment, or developing the next generation of language models, understanding how to preprocess text is fundamental. This article explores the evolution of NLP from rule-based systems to modern deep learning approaches, then dives deep into the practical techniques that transform raw text into machine-readable features. We'll cover tokenization strategies, normalization techniques, and feature extraction methods with hands-on Python implementations using NLTK, spaCy, and scikit-learn.

The Evolution of Natural Language Processing

Natural Language Processing has undergone several paradigm shifts throughout its history. Understanding this evolution helps us appreciate why current preprocessing techniques exist and when to apply them.

Symbolic Era: Rule-Based Systems

In the 1950s-1980s, NLP relied heavily on hand-crafted rules and symbolic reasoning. Researchers believed that language could be understood through explicit grammatical rules and logical representations. Systems like ELIZA (1966) and SHRDLU (1970) demonstrated limited success but struggled with ambiguity and scale.

Key characteristics:

  • Hand-written grammar rules
  • Logical inference systems
  • Pattern matching with regular expressions
  • Domain-specific expert systems

Limitations:

  • Required extensive manual effort
  • Brittle to language variation
  • Poor generalization to new domains
  • Couldn't handle ambiguity well

Statistical Revolution: Learning from Data

The 1990s brought a statistical revolution to NLP. Instead of encoding rules manually, systems learned patterns from large text corpora. This shift was enabled by increased computing power and the availability of digital text.

Key breakthroughs:

  • Hidden Markov Models (HMMs) for part-of-speech tagging
  • Probabilistic context-free grammars
  • Maximum entropy models
  • N-gram language models

The core idea: if we observe which words frequently follow a given sequence of words, we can estimate:

    P(w_n | w_1, …, w_(n-1)) ≈ count(w_1 … w_n) / count(w_1 … w_(n-1))

This probabilistic approach handled ambiguity better than symbolic systems and scaled to larger datasets. However, these models still relied on hand-engineered features and struggled with long-range dependencies.
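The estimate above can be computed in a few lines of Python. A minimal sketch for bigrams on a toy corpus (the corpus and the `bigram_prob` helper are illustrative):

```python
from collections import Counter

# Toy corpus, already tokenized
tokens = "the cat sat on the mat the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """Maximum-likelihood estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(bigram_prob("the", "cat"))  # 0.666... : 2 of 3 occurrences of "the" precede "cat"
print(bigram_prob("the", "mat"))  # 0.333...
```

Real n-gram models add smoothing (e.g., add-one or Kneser-Ney) so that unseen word pairs don't get probability zero.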

Deep Learning Era: Neural Representations

Around 2013-2015, deep learning fundamentally changed NLP. Word embeddings like Word2Vec and GloVe showed that we could learn dense vector representations where semantic relationships emerged naturally. The key insight: represent words in continuous vector spaces where similar words cluster together.

Word2Vec (Mikolov et al., 2013) introduced two architectures:

  • CBOW (Continuous Bag of Words): Predict target word from context
  • Skip-gram: Predict context words from target word

These embeddings captured semantic relationships through vector arithmetic:

    vec("king") − vec("man") + vec("woman") ≈ vec("queen")
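A toy sketch of this arithmetic with hand-picked 3-dimensional vectors (the values are illustrative, not real Word2Vec embeddings, and `nearest` is a hypothetical helper):

```python
import numpy as np

# Toy 3-dimensional "embeddings" (illustrative values only)
emb = {
    "king":  np.array([1.0, 0.0, 1.0]),
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def nearest(vec, exclude):
    """Return the vocabulary word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

With real embeddings trained on large corpora, the same query returns "queen" as the nearest neighbor among hundreds of thousands of words.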

Recurrent architectures (LSTMs, GRUs) followed, allowing models to process sequences and maintain context. Then came the attention mechanism (Bahdanau et al., 2014), which let models focus on relevant parts of the input.

Transformer Revolution: Attention Is All You Need

The 2017 paper "Attention Is All You Need" (Vaswani et al.) introduced the Transformer architecture, eliminating recurrence entirely in favor of self-attention mechanisms. This enabled:

  • Parallel processing of entire sequences
  • Better capture of long-range dependencies
  • More efficient training on GPUs

The self-attention mechanism computes attention weights as:

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q (queries), K (keys), and V (values) are learned linear projections of the input, and d_k is the key dimension.
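A minimal NumPy sketch of this computation (random matrices stand in for the learned projections; the function name is illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_q, seq_k) similarity scores
    # Row-wise softmax: each query's weights over all keys sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 query positions, d_k = 4
K = rng.standard_normal((5, 4))   # 5 key positions
V = rng.standard_normal((5, 4))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)          # (3, 4)
print(w.sum(axis=-1))     # each row of attention weights sums to 1
```

Subtracting the row maximum before exponentiating is the standard numerically stable softmax; it doesn't change the result.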

Large Language Models: The Modern Era

Building on Transformers, Large Language Models (LLMs) emerged as powerful few-shot and zero-shot learners. Key milestones:

  • BERT (2018): Bidirectional pre-training with masked language modeling
  • GPT-2/3 (2019/2020): Autoregressive generation at scale
  • T5 (2019): Text-to-text framework
  • ChatGPT/GPT-4 (2022/2023): Instruction-tuned conversational agents
  • Claude, Gemini, Llama (2023+): Diverse architectural innovations

These models are pre-trained on massive corpora (hundreds of billions of tokens) and fine-tuned for specific tasks. They've achieved human-level or superhuman performance on many benchmarks.

Key insight: With sufficient scale and data, models can learn linguistic structure, world knowledge, and reasoning capabilities from raw text alone.

Applications of NLP

NLP powers a vast array of modern applications across industries:

Text Classification

  • Sentiment Analysis: Determine positive/negative/neutral sentiment in reviews, social media
  • Spam Detection: Filter unwanted emails and messages
  • Topic Categorization: Automatically assign articles to categories
  • Intent Recognition: Understand user intentions in chatbots

Information Extraction

  • Named Entity Recognition (NER): Identify people, organizations, locations, dates
  • Relation Extraction: Discover relationships between entities
  • Event Detection: Identify events mentioned in news articles
  • Knowledge Graph Construction: Build structured knowledge bases from text

Text Generation

  • Machine Translation: Translate between languages
  • Summarization: Create concise summaries of long documents
  • Question Answering: Generate answers to user questions
  • Creative Writing: Generate stories, poetry, code

Conversational AI

  • Chatbots: Customer service automation
  • Virtual Assistants: Siri, Alexa, Google Assistant
  • Dialogue Systems: Multi-turn conversations

Analysis and Understanding

  • Text Similarity: Find duplicate or similar documents
  • Document Clustering: Group related documents
  • Topic Modeling: Discover latent topics in document collections
  • Semantic Search: Search by meaning rather than keywords

Text Preprocessing Pipeline

Before feeding text into machine learning models, we need to transform raw text into clean, structured features. The preprocessing pipeline typically includes these stages:

Raw Text
    ↓
Text Cleaning
    ↓
Tokenization
    ↓
Normalization (Lowercasing, Stemming/Lemmatization)
    ↓
Stopword Removal
    ↓
Feature Extraction (Bag of Words, TF-IDF, Embeddings)
    ↓
Vectorized Features → Model

Let's explore each stage with practical examples.

Setting Up Your Environment

First, install the required libraries:

pip install nltk spacy scikit-learn matplotlib numpy pandas
python -m spacy download en_core_web_sm

Download NLTK data:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Text Cleaning

Raw text often contains noise that doesn't contribute to meaning: HTML tags, special characters, extra whitespace, URLs, etc. Text cleaning is the first step in preprocessing, removing artifacts that would interfere with downstream NLP tasks.

Problem Context: Web-scraped text, social media posts, and user-generated content contain various noise: HTML markup from web pages, URLs and email addresses, special characters, and inconsistent whitespace. This noise increases vocabulary size unnecessarily and can confuse models that expect clean text.

Solution Approach: Use regular expressions to systematically remove different types of noise in a pipeline. Each cleaning step targets a specific noise type: HTML tags, URLs, emails, special characters, and whitespace normalization. The order matters — removing HTML first prevents tags from interfering with URL detection.

Design Considerations: The cleaning function is aggressive, removing all non-alphabetic characters. This works well for tasks like topic modeling or keyword extraction, but may harm sentiment analysis (where punctuation like "!!!" conveys emotion) or named entity recognition (where numbers and symbols matter). The function should be customized based on the target task.

import re

def clean_text(text):
    """
    Clean raw text by removing various types of noise.

    Function: Removes HTML tags, URLs, emails, special characters, and normalizes whitespace

    Parameters:
        text (str): Raw text that may contain HTML, URLs, emails, special characters

    Returns:
        str: Cleaned text containing only alphabetic characters and spaces

    Processing Steps:
        1. Remove HTML tags (e.g., <p>, <div>)
        2. Remove URLs (http://, https://, www.)
        3. Remove email addresses
        4. Remove all non-alphabetic characters (keep only letters and spaces)
        5. Normalize whitespace (multiple spaces → single space)

    Example:
        >>> text = "<p>Visit https://example.com or email info@test.com</p>"
        >>> clean_text(text)
        'Visit or email'

    Note:
        This is aggressive cleaning. For sentiment analysis, consider preserving
        punctuation and emoticons. For NER, preserve numbers and special characters.
    """
    # Step 1: Remove HTML tags
    # Regex pattern: <[^>]+>
    # - < : literal opening bracket
    # - [^>]+ : one or more characters that are not >
    # - > : literal closing bracket
    # This matches any HTML tag like <p>, <div class="...">, </span>
    text = re.sub(r'<[^>]+>', '', text)
    # Example: "<p>Hello</p>" → "Hello"

    # Step 2: Remove URLs
    # Pattern: http\S+|www\.\S+
    # - http\S+ : "http" followed by one or more non-whitespace characters
    # - | : OR operator
    # - www\.\S+ : "www." followed by one or more non-whitespace characters
    # \S matches any non-whitespace character (captures the entire URL)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Example: "Visit https://example.com/page" → "Visit "

    # Step 3: Remove email addresses
    # Pattern: \S+@\S+
    # - \S+ : one or more non-whitespace characters (username part)
    # - @ : literal @ symbol
    # - \S+ : one or more non-whitespace characters (domain part)
    # This is simplified - real email regex is more complex but this works for most cases
    text = re.sub(r'\S+@\S+', '', text)
    # Example: "Contact info@example.com" → "Contact "

    # Step 4: Remove special characters and digits
    # Pattern: [^a-zA-Z\s]
    # - ^ : negation (match anything NOT in the character class)
    # - a-zA-Z : all lowercase and uppercase letters
    # - \s : whitespace characters (space, tab, newline)
    # This removes everything except letters and spaces
    # Note: This also removes numbers, which may be needed for some tasks
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Example: "Price:$29.99" → "Price "

    # Step 5: Normalize whitespace
    # Pattern: \s+
    # - \s+ : one or more whitespace characters
    # Replace multiple spaces/tabs/newlines with a single space
    # .strip() removes leading and trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Example: "Hello world\n\n" → "Hello world"

    return text

# Example usage
raw_text = """
<p>Check out our website at https://example.com for more info!</p>
Contact us at info@example.com. Price: $29.99 (50% off!)
"""

cleaned = clean_text(raw_text)
print(f"Original: {raw_text}")
print(f"Cleaned: {cleaned}")
# Output: "Check out our website at for more info Contact us at Price off"

Deep Dive: Cleaning Strategies and Trade-offs

Text cleaning seems straightforward, but each step involves important design decisions:

1. HTML Tag Removal

The regex <[^>]+> handles most HTML, but has limitations:

  • Nested tags: Works correctly for <p>text</p>
  • Self-closing tags: Handles <br/>, <img src="..."/>
  • Malformed HTML: May fail on unclosed tags or malformed markup

Alternative: Use BeautifulSoup for robust HTML parsing:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')
text = soup.get_text()

2. URL and Email Detection

The current regex is simplified and may miss edge cases:

  • URLs: Doesn't handle URLs without protocol (e.g., "example.com/page")
  • Emails: Doesn't validate email format strictly
  • Edge cases: May incorrectly match non-URLs containing "@" or "http"

Improved version:

# More robust URL pattern (still simplified)
url_pattern = r'https?://[^\s]+|www\.[^\s]+|[a-zA-Z0-9-]+\.[a-zA-Z]{2,}[^\s]*'
# More robust email pattern
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

3. Special Character Removal

Removing all non-alphabetic characters is aggressive:

  • Problem 1: Numbers removed: "COVID-19" becomes "COVID" (loses important information)
  • Problem 2: Punctuation removed: "amazing!!!" becomes "amazing" (loses emphasis)
  • Problem 3: Currency symbols removed: "$100" becomes empty

Task-specific alternatives:

# For sentiment analysis: preserve punctuation
text = re.sub(r'[^a-zA-Z\s!?.,]', '', text)

# For NER: preserve numbers and some symbols
text = re.sub(r'[^a-zA-Z0-9\s-]', '', text)

# For topic modeling: current approach is fine

4. Whitespace Normalization

Normalizing whitespace is generally safe, but consider:

  • Preserving structure: Some tasks need to preserve line breaks (e.g., poetry, code)
  • Multiple spaces: May indicate intentional formatting (e.g., indentation)

5. Common Issues and Solutions

Issue                      Cause                     Solution
HTML entities not removed  &nbsp;, &amp; remain      Use html.unescape() before cleaning
URLs split incorrectly     Regex too simple          Use a more robust URL detection library
Important info lost        Cleaning too aggressive   Customize cleaning based on task
Performance slow           Processing large texts    Use compiled regex or batch processing
Encoding errors            Non-UTF8 text             Handle encoding before cleaning
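For the first issue, the standard library's `html.unescape()` decodes entities before tag removal. A small sketch (the sample string is illustrative):

```python
import html
import re

raw = "<p>Ben &amp; Jerry&#39;s&nbsp;ice cream</p>"

text = html.unescape(raw)             # decode entities: &, ', and a non-breaking space
text = re.sub(r'<[^>]+>', '', text)   # then strip the tags
print(text)                           # Ben & Jerry's ice cream (the space is U+00A0)
```

Decoding before tag removal matters: entities like `&lt;` could otherwise turn into literal `<` characters after the tag pass and survive cleaning.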

6. Performance Optimization

For large-scale text processing:

import re

# Compile regex patterns (faster for repeated use)
HTML_PATTERN = re.compile(r'<[^>]+>')
URL_PATTERN = re.compile(r'http\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\S+@\S+')
SPECIAL_CHAR_PATTERN = re.compile(r'[^a-zA-Z\s]')
WHITESPACE_PATTERN = re.compile(r'\s+')

def clean_text_optimized(text):
    """Optimized version using compiled regex."""
    text = HTML_PATTERN.sub('', text)
    text = URL_PATTERN.sub('', text)
    text = EMAIL_PATTERN.sub('', text)
    text = SPECIAL_CHAR_PATTERN.sub('', text)
    text = WHITESPACE_PATTERN.sub(' ', text).strip()
    return text

7. Task-Specific Cleaning

Different NLP tasks require different cleaning strategies:

  • Sentiment Analysis: Preserve punctuation, emoticons, capitalization
  • Named Entity Recognition: Preserve numbers, dates, currency symbols
  • Topic Modeling: Current aggressive cleaning is appropriate
  • Machine Translation: Minimal cleaning (preserve structure)
  • Text Classification: Moderate cleaning (remove noise, preserve content)

8. Best Practices

  1. Document cleaning steps: Record what was removed and why
  2. Preserve originals: Keep raw text for debugging and comparison
  3. Test on sample: Verify cleaning doesn't remove important information
  4. Version control: Track cleaning function versions and parameters
  5. Error handling: Handle edge cases (empty strings, None values, encoding errors)

Text cleaning is foundational to NLP pipelines. Understanding the trade-offs and customizing cleaning for your specific task is crucial for achieving good results.

Tokenization: Breaking Text into Units

Tokenization splits text into individual units (tokens) - typically words, but sometimes subwords or characters. This seems simple but involves subtle decisions.

Word Tokenization

The naive approach of splitting on whitespace fails for many cases:

# Naive tokenization
text = "Don't split can't into separate words!"
tokens = text.split()
print(tokens)
# ["Don't", 'split', "can't", 'into', 'separate', 'words!']
# Problem: Contractions aren't handled, punctuation attached

NLTK's word tokenizer handles these cases better:

from nltk.tokenize import word_tokenize

text = "Dr. Smith earned $150,000 in 2023! Isn't that amazing?"
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'earned', '$', '150,000', 'in', '2023', '!',
# 'Is', "n't", 'that', 'amazing', '?']

Notice how it:

  • Keeps "Dr." as one token
  • Separates punctuation
  • Splits contractions ("Isn't" → "Is" + "n't")
  • Handles currency and numbers

spaCy's tokenizer uses linguistic rules:

1
2
3
4
5
6
7
8
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Dr. Smith earned $150,000 in 2023! Isn't that amazing?")
tokens = [token.text for token in doc]
print(tokens)
# ['Dr.', 'Smith', 'earned', '$', '150,000', 'in', '2023', '!',
# 'Is', "n't", 'that', 'amazing', '?']

Sentence Tokenization

Splitting text into sentences is non-trivial due to abbreviations and ambiguous periods:

from nltk.tokenize import sent_tokenize

text = """
Dr. Johnson works at A.I. Corp. He earned his Ph.D. in 2010.
His research focuses on NLP. Does he publish papers? Yes!
"""

sentences = sent_tokenize(text)
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")
# 1. Dr. Johnson works at A.I. Corp.
# 2. He earned his Ph.D. in 2010.
# 3. His research focuses on NLP.
# 4. Does he publish papers?
# 5. Yes!

Subword Tokenization

Modern NLP models use subword tokenization to handle:

  • Rare words
  • Morphological variations
  • Out-of-vocabulary (OOV) words
  • Multilingual text

Byte Pair Encoding (BPE) is a popular approach used in GPT, BERT, and others.

BPE Algorithm:

  1. Start with a vocabulary of individual characters
  2. Iteratively merge the most frequent pair of tokens
  3. Continue until reaching the desired vocabulary size

Example:

Corpus: "low", "lower", "newest", "widest"
Initial: l o w, l o w e r, n e w e s t, w i d e s t

Iteration 1: Merge most frequent pair (e, s) → es
Result: l o w, l o w e r, n e w es t, w i d es t

Iteration 2: Merge (es, t) → est
Result: l o w, l o w e r, n e w est, w i d est

Iteration 3: Merge (l, o) → lo
Result: lo w, lo w e r, n e w est, w i d est

Why BPE matters:

  • Handles rare words: "unbelievable" → "un", "believ", "able"
  • Reduces vocabulary size while maintaining coverage
  • Works across languages

Here's a simplified BPE implementation:

from collections import defaultdict

def get_stats(vocab):
    """Count frequency of adjacent pairs."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the most frequent pair throughout the vocabulary."""
    new_vocab = {}
    bigram = ' '.join(pair)
    replacement = ''.join(pair)

    for word in vocab:
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = vocab[word]
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Learn BPE merges from a vocabulary of space-separated symbols."""
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(f"Merge {i+1}: {best[0]} + {best[1]} → {''.join(best)}")
    return vocab

# Example
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3
}

print("Initial vocabulary:")
for word, freq in vocab.items():
    print(f"  {word}: {freq}")

print("\nLearning BPE merges:")
final_vocab = learn_bpe(vocab.copy(), num_merges=10)

print("\nFinal vocabulary:")
for word, freq in final_vocab.items():
    print(f"  {word}: {freq}")

Output:

Initial vocabulary:
  l o w </w>: 5
  l o w e r </w>: 2
  n e w e s t </w>: 6
  w i d e s t </w>: 3

Learning BPE merges:
Merge 1: e + s → es
Merge 2: es + t → est
Merge 3: est + </w> → est</w>
Merge 4: l + o → lo
Merge 5: lo + w → low
...
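Once merges are learned, tokenizing a new word means replaying them in order. A minimal sketch using merges from the output above (`segment_word` is an illustrative helper; production BPE implementations also track merge priorities and end-of-word markers):

```python
def segment_word(word, merges):
    """Apply learned BPE merges, in order, to split a word into subword tokens."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair in place
            else:
                i += 1
    return tokens

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(segment_word("lowest", merges))  # ['low', 'est']
print(segment_word("widest", merges))  # ['w', 'i', 'd', 'est']
```

Note how "lowest", which never appeared in the training corpus, is still segmented into meaningful subwords learned from "low" and "newest"/"widest".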

Normalization: Standardizing Text

Lowercasing

Converting to lowercase reduces vocabulary size but loses information:

text = "Apple Inc. sells apples in APPLE stores"
print(text.lower())
# "apple inc. sells apples in apple stores"

When to lowercase:

  • ✅ Text classification with limited data
  • ✅ Information retrieval
  • ❌ Named Entity Recognition (Apple Inc. vs. apple)
  • ❌ Sentiment analysis (different emphasis)

Stemming: Crude Suffix Removal

Stemming chops word endings to reach a root form (stem). It uses heuristic rules and can be aggressive:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

words = ['running', 'runs', 'ran', 'easily', 'fairly',
'connection', 'connected', 'connecting']

print(f"{'Word':<15} {'Porter':<15} {'Snowball':<15}")
print("-" * 45)
for word in words:
    print(f"{word:<15} {porter.stem(word):<15} {snowball.stem(word):<15}")

Output:

Word            Porter          Snowball       
---------------------------------------------
running         run             run
runs            run             run
ran             ran             ran
easily          easili          easili
fairly          fairli          fairli
connection      connect         connect
connected       connect         connect
connecting      connect         connect

Problems with stemming:

  • Over-stemming: "university" → "univers", "europe" → "europ"
  • Under-stemming: "aluminum" vs. "aluminium" remain different
  • Not real words: "easili" isn't a valid word

Lemmatization: Vocabulary-Based Normalization

Lemmatization uses vocabulary and morphological analysis to return the dictionary form (lemma):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without POS tags - defaults to noun
words = ['running', 'runs', 'ran', 'better', 'swimming', 'geese']
print("Without POS tags:")
for word in words:
    print(f"  {word} → {lemmatizer.lemmatize(word)}")

# Output:
# running → running (not recognized as verb)
# runs → run
# ran → ran
# better → better
# geese → goose (correct!)

# With POS tags
print("\nWith POS tags (verb):")
for word in ['running', 'runs', 'ran', 'swimming']:
    print(f"  {word} → {lemmatizer.lemmatize(word, pos='v')}")

# Output:
# running → run
# runs → run
# ran → run
# swimming → swim

spaCy's lemmatization (more advanced):

import spacy

nlp = spacy.load('en_core_web_sm')
text = "The geese were running and swimming better than the mice"
doc = nlp(text)

print(f"{'Token':<12} {'Lemma':<12} {'POS':<8}")
print("-" * 32)
for token in doc:
    print(f"{token.text:<12} {token.lemma_:<12} {token.pos_:<8}")

Output:

Token        Lemma        POS     
--------------------------------
The          the          DET
geese        goose        NOUN
were         be           AUX
running      run          VERB
and          and          CCONJ
swimming     swim         VERB
better       well         ADV
than         than         SCONJ
the          the          DET
mice         mouse        NOUN

Stemming vs. Lemmatization:

Aspect        Stemming               Lemmatization
Speed         Fast                   Slower
Accuracy      Lower                  Higher
Output        May not be real words  Real words
Requires POS  No                     Ideally yes
Use case      IR, search             NLU, QA

Stopword Removal

Stopwords are common words ("the", "is", "at") that appear frequently but carry little semantic meaning for many tasks.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
print(f"Number of stopwords: {len(stop_words)}")
print(f"Sample stopwords: {list(stop_words)[:10]}")

# Example
text = "The quick brown fox jumps over the lazy dog"
words = word_tokenize(text.lower())
filtered = [w for w in words if w not in stop_words]

print(f"Original: {words}")
print(f"Filtered: {filtered}")
# Original: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# Filtered: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

When to remove stopwords:

  • ✅ Bag-of-words models with limited features
  • ✅ Traditional IR systems
  • ✅ Topic modeling
  • ❌ Deep learning models (they learn to ignore them)
  • ❌ Sentiment analysis ("not good" vs. "good")
  • ❌ Question answering

Custom stopword lists:

# Add domain-specific stopwords
custom_stops = stop_words.union({'said', 'would', 'could'})

# Remove certain stopwords for specific tasks
sentiment_stops = stop_words - {'not', 'no', 'nor', 'neither'}

Feature Extraction: Bag of Words

Machine learning models require numerical inputs. Bag-of-Words (BoW) represents text as vectors of word counts, ignoring grammar and word order.

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
docs = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning and machine learning"
]

# Create BoW vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# View vocabulary
vocab = vectorizer.get_feature_names_out()
print(f"Vocabulary: {list(vocab)}")

# View matrix
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=vocab)
print("\nBag of Words matrix:")
print(df)

Output:

Vocabulary: ['amazing', 'and', 'deep', 'is', 'learning', 'love', 'machine']

Bag of Words matrix:
   amazing  and  deep  is  learning  love  machine
0        0    0     0   0         1     1        1
1        1    0     0   1         1     0        1
2        0    1     1   0         2     1        1

N-grams: Capture word sequences

# Bigrams (2-grams)
vectorizer_bigram = CountVectorizer(ngram_range=(1, 2))
X_bigram = vectorizer_bigram.fit_transform(docs)

vocab_bigram = vectorizer_bigram.get_feature_names_out()
print(f"Bigram vocabulary size: {len(vocab_bigram)}")
print(f"Sample bigrams: {list(vocab_bigram)[:10]}")

# Output:
# Bigram vocabulary size: 15
# Sample bigrams: ['amazing', 'and', 'and machine', 'deep', 'deep learning',
# 'is', 'is amazing', 'learning', 'learning and', 'learning is']

Limitations of BoW:

  • Loses word order: "dog bites man" ≈ "man bites dog"
  • Ignores semantics: "car" and "automobile" are different
  • High dimensionality with large vocabularies
  • Sparse vectors (mostly zeros)

TF-IDF: Weighted Features

Term Frequency-Inverse Document Frequency (TF-IDF) weighs words by importance. Frequent words in a document but rare across documents score high.

Formula:

    tf-idf(t, d) = tf(t, d) × idf(t)

where tf(t, d) is the frequency of term t in document d, and idf(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t.

Intuition:

  • If "machine" appears 10 times in a doc about ML, it's important locally (high TF)
  • But if "machine" appears in every doc, it's not distinctive (low IDF)
  • Common words like "the" get low TF-IDF scores

TF-IDF (Term Frequency-Inverse Document Frequency) is a fundamental technique for converting text into numerical features. Unlike simple word counts, TF-IDF weights words by their importance: words that appear frequently in a document but rarely across the corpus receive high scores, while common words get low scores.

Problem Context: In text classification and information retrieval, we need to identify which words are most distinctive for each document. Simple word frequency fails because common words like "the" and "is" appear in every document, masking the truly informative terms.

Solution Approach: TF-IDF combines two metrics: TF (term frequency) measures local importance within a document, while IDF (inverse document frequency) measures global rarity across the corpus. The product of these metrics highlights words that are both locally frequent and globally rare — exactly the words that distinguish documents.

Design Considerations: scikit-learn's TfidfVectorizer implements a smoothed version of TF-IDF to avoid division by zero, includes L2 normalization to enable cosine similarity calculations, and uses sparse matrix storage for memory efficiency. The default parameters work well for most tasks, but can be tuned for specific use cases.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

# Sample corpus: 4 documents about machine learning topics
# Note: Text should be preprocessed (tokenized, space-separated) before vectorization
docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning",
    "Computer vision uses deep learning techniques"
]

# ========== Step 1: Create TF-IDF Vectorizer ==========
# TfidfVectorizer parameters (defaults shown):
# - token_pattern: r"(?u)\b\w\w+\b" (matches words with 2+ characters)
# - max_features: None (use all words in vocabulary)
# - min_df: 1 (keep terms that appear in at least 1 document, i.e., no filtering)
# - max_df: 1.0 (ignore words appearing in more than 100% of documents)
# - ngram_range: (1, 1) (use only single words, not phrases)
# - norm: 'l2' (L2 normalization for cosine similarity)
# - smooth_idf: True (add 1 to numerator and denominator to avoid division by zero)
# - sublinear_tf: False (use raw term frequency, not log-scaled)
tfidf_vec = TfidfVectorizer()
# The vectorizer will learn the vocabulary and IDF values from the training data

# ========== Step 2: Fit and Transform ==========
# fit_transform() does two things:
# 1. fit(): Learn vocabulary from all documents, compute IDF for each term
# 2. transform(): Convert each document to TF-IDF vector
X_tfidf = tfidf_vec.fit_transform(docs)
# X_tfidf: Sparse matrix (CSR format), shape (4, vocab_size)
# - Rows: documents
# - Columns: terms (words) in vocabulary
# - Values: TF-IDF scores

# ========== Step 3: Inspect Vocabulary ==========
# get_feature_names_out() returns the vocabulary (list of all unique terms)
vocab = tfidf_vec.get_feature_names_out()
print(f"Vocabulary size: {len(vocab)}")
print(f"Vocabulary: {list(vocab)}")
# Output: ['artificial', 'computer', 'deep', 'intelligence', 'is', 'language', 'learning',
# 'machine', 'natural', 'of', 'processing', 'subset', 'techniques', 'uses', 'vision']
# (the single-character token "a" is dropped by the default token_pattern)

# ========== Step 4: View TF-IDF Matrix ==========
# Convert sparse matrix to dense array for visualization (only for small datasets)
# For large datasets, keep it sparse to save memory
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=vocab)
print("\nTF-IDF matrix:")
print(df_tfidf.round(3))
# Each row is a document, each column is a term
# Values are TF-IDF scores (higher = more important for that document)

# ========== Step 5: Analyze Important Terms ==========
# For each document, find the top 3 most important terms (highest TF-IDF scores)
for idx, doc in enumerate(docs):
    print(f"\nDocument {idx + 1}: '{doc[:50]}...'")
    # Sort terms by TF-IDF score (descending)
    scores = df_tfidf.iloc[idx].sort_values(ascending=False)
    # Get top 3 terms
    top_terms = scores.head(3).to_dict()
    print("Top 3 terms:", top_terms)
    # Interpretation: These terms best distinguish this document from others

# Expected output:
# Document 1: 'Machine learning is a subset of artificial intel...'
# Top 3 terms: {'artificial': 0.416, 'intelligence': 0.416, 'subset': 0.336}
#
# Document 2: 'Deep learning is a subset of machine learning...'
# Top 3 terms: {'deep': 0.424, 'subset': 0.343, 'learning': 0.265}
#
# Document 3: 'Natural language processing uses machine learn...'
# Top 3 terms: {'natural': 0.447, 'processing': 0.447, 'language': 0.447}
#
# Document 4: 'Computer vision uses deep learning techniques...'
# Top 3 terms: {'computer': 0.447, 'vision': 0.447, 'techniques': 0.447}

# ========== Step 6: Inspect IDF Values ==========
# idf_ attribute contains IDF values for each term
print("\nIDF values (higher = rarer across corpus):")
idf_df = pd.DataFrame({
    'term': vocab,
    'idf': tfidf_vec.idf_
}).sort_values('idf', ascending=False)
print(idf_df)
# Terms with high IDF appear in few documents (more distinctive)
# Terms with low IDF appear in many documents (less distinctive)

# ========== Step 7: Transform New Documents ==========
# For new documents, use transform() (not fit_transform())
# This uses the vocabulary and IDF learned from training data
new_doc = ["Machine learning algorithms are powerful"]
new_vector = tfidf_vec.transform(new_doc)
# Terms not in vocabulary (e.g., "algorithms", "are", "powerful") are ignored
# Only "machine" and "learning" will have non-zero values
print(f"\nNew document vector shape: {new_vector.shape}")
print(f"Non-zero values: {new_vector.nnz}") # Number of non-zero elements

Deep Dive: TF-IDF Mathematics and Implementation Details

Understanding TF-IDF requires diving into its mathematical foundations and scikit-learn's specific implementation:

1. TF-IDF Formula Variants

scikit-learn uses a smoothed version of TF-IDF.

Standard formula:

    idf(t) = log(N / df(t))

scikit-learn formula (with smooth_idf=True, the default):

    idf(t) = ln((1 + N) / (1 + df(t))) + 1

where N is the total number of documents and df(t) is the number of documents containing term t.

Differences:

  • Smoothing: the +1 in the numerator and denominator acts as if one extra document contained every term, preventing division by zero
  • +1 term: ensures IDF is always positive, even for terms appearing in all documents

2. Term Frequency (TF) Calculation

By default, scikit-learn uses the raw term count:

    tf(t, d) = count of term t in document d

With sublinear_tf=True, it uses:

    tf(t, d) = 1 + log(count of t in d)

This reduces the impact of very frequent terms within a document.

3. L2 Normalization

Each document vector is normalized by its L2 norm:

    v_norm = v / ||v||2,  where ||v||2 = sqrt(v1^2 + v2^2 + ... + vn^2)

Why normalize?

  • Fair comparison: documents of different lengths can be compared fairly
  • Cosine similarity: normalized vectors enable cosine similarity via a simple dot product
  • Numerical stability: prevents issues with very large vector magnitudes
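These three pieces (smoothed IDF, raw TF, L2 normalization) can be verified against scikit-learn directly. A minimal sketch on a made-up three-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

docs = ["cat sat", "cat ran", "dog ran ran"]

# scikit-learn's TF-IDF with default settings (smooth_idf=True, norm='l2')
vec = TfidfVectorizer(smooth_idf=True, norm='l2')
X = vec.fit_transform(docs).toarray()

# Reproduce it by hand: idf(t) = ln((1 + N) / (1 + df(t))) + 1
counts = CountVectorizer(vocabulary=vec.vocabulary_).fit_transform(docs).toarray()
N = len(docs)
df = (counts > 0).sum(axis=0)          # document frequency per term
idf = np.log((1 + N) / (1 + df)) + 1   # smoothed IDF

manual = counts * idf                                            # raw tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2 normalize

print(np.allclose(X, manual))  # True
```

If the two matrices agree to floating-point precision, the formulas above are exactly what the default TfidfVectorizer computes.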

4. Sparse Matrix Storage

fit_transform() returns a sparse matrix (CSR format), not a dense array:

Advantages:

  • Memory efficiency: TF-IDF matrices are typically 95%+ zeros
  • Computational efficiency: matrix operations skip zero elements

Example: For 1000 documents and 10,000 vocabulary, dense matrix needs 80MB, sparse matrix may need only 2-5MB.
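To see the savings concretely, here is a rough sketch comparing a dense array with a CSR matrix of the same shape; the ~1% density is an assumption chosen to mimic typical TF-IDF sparsity, and scipy.sparse.random is used only to fabricate a matrix with it:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# A 1000-document x 10,000-term matrix at an assumed ~1% density
dense = np.zeros((1000, 10_000), dtype=np.float64)
sparse = sparse_random(1000, 10_000, density=0.01, format='csr', dtype=np.float64)

dense_mb = dense.nbytes / 1e6
# CSR stores three arrays: values, column indices, and row pointers
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.0f} MB, sparse: {sparse_mb:.1f} MB")
```

The dense array costs 80 MB regardless of content; the CSR matrix scales with the number of non-zeros.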

5. Parameter Tuning Guide

Parameter    | Default | Tuning Advice                             | Impact
max_features | None    | Set to 1000-10000 for large datasets      | Controls dimensionality, prevents overfitting
min_df       | 1       | Set to 2 or 0.01 (proportion)             | Filters rare terms, reduces noise
max_df       | 1.0     | Set to 0.8-0.95                           | Automatically filters corpus-wide stopwords
ngram_range  | (1,1)   | (1,2) for phrases                         | Increases expressiveness but also dimensionality
sublinear_tf | False   | True to reduce high-frequency term impact | Emphasizes rarer terms
norm         | 'l2'    | 'l1' or None based on task                | Affects vector distribution
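Rather than hand-picking these values, the parameters can be tuned empirically with a grid search over a pipeline. A sketch, assuming a toy separable corpus and labels (substitute your own data):

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus/labels for illustration only
corpus = ["win a free prize now", "meeting at noon today",
          "free money click now", "lunch with the team tomorrow"] * 5
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [False, True],
    "tfidf__max_df": [0.8, 1.0],
}, cv=2)
grid.fit(corpus, labels)
print(grid.best_params_)
```

Prefixing parameter names with the pipeline step name (`tfidf__`) lets the vectorizer and classifier be tuned jointly against cross-validated performance.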

6. Common Issues and Solutions

Issue                    | Cause                                       | Solution
Memory overflow          | Vocabulary too large or too many documents  | Use max_features or HashingVectorizer
New words ignored        | Fixed vocabulary, new terms not included    | Use HashingVectorizer or retrain periodically
All zeros                | All document terms missing from vocabulary  | Check preprocessing, ensure tokenization matches
Slow computation         | Too many documents or large vocabulary      | Use HashingVectorizer or incremental learning
Dimensionality explosion | Vocabulary grows unbounded                  | Use max_features, min_df, max_df

7. Comparison with Other Methods

Method            | Advantages                                | Disadvantages                           | Use Cases
TF-IDF            | Simple, interpretable, no training needed | No semantics, high dimensionality       | Text classification, IR
Word2Vec          | Captures semantics, lower dimensionality  | Requires pretraining or training time   | Text similarity, semantic analysis
BERT              | Context-aware, strongest performance      | High computational cost, needs GPU      | Complex NLP tasks
HashingVectorizer | Memory efficient, handles new words       | Not interpretable, possible collisions  | Large-scale streaming data

8. Practical Optimization

# Optimized TF-IDF configuration for production
tfidf_vec = TfidfVectorizer(
    max_features=5000,    # Limit feature count
    min_df=2,             # Term must appear in at least 2 documents
    max_df=0.8,           # Term appears in at most 80% of documents
    ngram_range=(1, 2),   # Use unigrams and bigrams
    sublinear_tf=True,    # Use sublinear TF scaling
    norm='l2',            # L2 normalization
    smooth_idf=True       # Smooth IDF
)

# For very large corpora, use HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
hasher = HashingVectorizer(n_features=10000, norm='l2', ngram_range=(1, 2))
X = hasher.transform(corpus)

9. Performance Tips

  1. Preprocessing optimization: Complete all text preprocessing before vectorization
  2. Feature selection: Use max_features and min_df/max_df to limit features
  3. Sparse matrix operations: Use scipy.sparse operations, avoid converting to dense
  4. Parallel preprocessing: TfidfVectorizer itself is single-threaded; if cleaning/tokenization dominates runtime, parallelize that step yourself (e.g., with joblib)
  5. Streaming data: TfidfVectorizer has no partial_fit(); for streaming corpora use the stateless HashingVectorizer instead

TF-IDF remains a cornerstone of text feature extraction. Understanding its mathematics and implementation details enables effective use and tuning in real-world projects.

Output:

TF-IDF matrix:
artificial computer deep intelligence ... subset techniques uses vision
0 0.416 0.000 0.000 0.416 ... 0.336 0.000 0.000 0.000
1 0.000 0.000 0.424 0.000 ... 0.343 0.000 0.000 0.000
2 0.000 0.000 0.000 0.000 ... 0.000 0.000 0.447 0.000
3 0.000 0.447 0.361 0.000 ... 0.000 0.447 0.361 0.447

Document 1: 'Machine learning is a subset of artificial intel...'
Top 3 terms: {'artificial': 0.416, 'intelligence': 0.416, 'subset': 0.336}

Document 2: 'Deep learning is a subset of machine learning...'
Top 3 terms: {'deep': 0.424, 'subset': 0.343, 'learning': 0.265}
...

TF-IDF parameters:

tfidf_vec = TfidfVectorizer(
    max_features=1000,     # Keep only top 1000 features
    min_df=2,              # Ignore terms appearing in < 2 documents
    max_df=0.8,            # Ignore terms appearing in > 80% of documents
    ngram_range=(1, 2),    # Use unigrams and bigrams
    stop_words='english'   # Remove English stopwords
)

Complete Preprocessing Pipeline

Building a reusable preprocessing pipeline is essential for production NLP systems. This class encapsulates all preprocessing steps into a single, configurable interface that can be easily adapted for different tasks.

Problem Context: Real-world NLP projects require consistent preprocessing across training and inference. Without a unified pipeline, preprocessing code gets duplicated, leading to inconsistencies and bugs. A well-designed pipeline allows easy experimentation with different preprocessing strategies.

Solution Approach: Create a class that encapsulates all preprocessing steps (cleaning, tokenization, normalization, stopword removal) with configurable parameters. The class supports both lemmatization (using spaCy) and stemming (using NLTK), allowing flexibility based on task requirements. Methods are designed to handle both single texts and batches efficiently.

Design Considerations: The pipeline uses a modular design where each step (clean, tokenize_and_normalize) can be called independently or together. spaCy is loaded once in __init__ to avoid repeated model loading overhead. The preprocess_corpus method enables batch processing, which is more efficient than processing texts one by one.

import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class TextPreprocessor:
    """
    A reusable text preprocessing pipeline for English text.

    This class provides a unified interface for text cleaning, tokenization,
    normalization, and stopword removal. It supports both lemmatization (spaCy)
    and stemming (NLTK) approaches.

    Attributes:
        use_lemmatization (bool): If True, use spaCy lemmatization; else use NLTK stemming
        remove_stopwords (bool): If True, remove stopwords from tokens
        nlp (spacy.Language): spaCy language model (loaded once for efficiency)
        stemmer (PorterStemmer): NLTK stemmer (if lemmatization disabled)
        stop_words (set): Set of stopwords for filtering

    Methods:
        clean(text): Remove HTML, URLs, emails, and special characters
        tokenize_and_normalize(text): Tokenize and normalize text (lemmatize or stem)
        preprocess(text): Full preprocessing pipeline (clean + tokenize + normalize)
        preprocess_corpus(texts): Batch preprocessing for multiple documents

    Example:
        >>> preprocessor = TextPreprocessor(use_lemmatization=True, remove_stopwords=True)
        >>> text = "I'm learning NLP! Visit https://example.com"
        >>> preprocessor.preprocess(text)
        'learn nlp visit'
    """

    def __init__(self, use_lemmatization=True, remove_stopwords=True):
        """
        Initialize the text preprocessor.

        Parameters:
            use_lemmatization (bool): Use spaCy lemmatization if True, else NLTK stemming
            remove_stopwords (bool): Remove stopwords if True

        Design Notes:
            - spaCy model is loaded once here (expensive operation)
            - Parser and NER are disabled for speed (only need tokenization and lemmatization)
            - Stopwords are loaded as a set for O(1) lookup performance
        """
        self.use_lemmatization = use_lemmatization
        self.remove_stopwords = remove_stopwords

        # Load spaCy model if lemmatization is enabled
        if use_lemmatization:
            # Load model once (expensive, so do it in __init__)
            # disable=['parser', 'ner'] skips dependency parsing and NER,
            # which speeds up processing since we only need tokenization and lemmatization
            self.nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
        else:
            # Fallback to NLTK stemming (faster but less accurate)
            from nltk.stem import PorterStemmer
            self.stemmer = PorterStemmer()

        # Load stopwords if needed
        # Using a set gives O(1) lookup instead of O(n) for a list
        if remove_stopwords:
            from nltk.corpus import stopwords
            self.stop_words = set(stopwords.words('english'))
            # NLTK's English stopword list contains ~179 common words
    def clean(self, text):
        """
        Remove noise from text: HTML tags, URLs, emails, special characters.

        Parameters:
            text (str): Raw text that may contain HTML, URLs, emails, special chars

        Returns:
            str: Cleaned text containing only lowercase letters and spaces

        Processing Steps:
            1. Convert to lowercase
            2. Remove HTML tags
            3. Remove URLs
            4. Remove email addresses
            5. Remove all non-alphabetic characters
            6. Normalize whitespace

        Note:
            This is aggressive cleaning. For sentiment analysis, consider preserving
            punctuation and capitalization.
        """
        # Step 1: Lowercase (standardizes text)
        text = text.lower()

        # Step 2: Remove HTML tags
        # Pattern: <[^>]+> matches any HTML tag
        text = re.sub(r'<[^>]+>', '', text)

        # Step 3: Remove URLs
        # Pattern: http\S+|www\.\S+ matches URLs with or without protocol
        text = re.sub(r'http\S+|www\.\S+', '', text)

        # Step 4: Remove email addresses
        # Pattern: \S+@\S+ matches basic email format
        text = re.sub(r'\S+@\S+', '', text)

        # Step 5: Remove special characters and digits
        # Pattern: [^a-zA-Z\s] keeps only letters and whitespace
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Step 6: Normalize whitespace
        # Replace multiple spaces/tabs/newlines with a single space
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    def tokenize_and_normalize(self, text):
        """
        Tokenize text and normalize tokens (lemmatize or stem).

        Parameters:
            text (str): Cleaned text (should be lowercase, no special chars)

        Returns:
            list[str]: List of normalized tokens (lemmas or stems)

        Processing:
            - If use_lemmatization=True: Use spaCy for tokenization and lemmatization
            - If use_lemmatization=False: Use NLTK for tokenization and stemming
            - Optionally filter stopwords
        """
        if self.use_lemmatization:
            # Use spaCy for tokenization and lemmatization
            # doc contains all tokens with their properties (lemma_, pos_, is_stop, etc.)
            doc = self.nlp(text)
            # Extract lemmas, filtering out whitespace tokens
            # (token.is_space checks if the token is whitespace)
            tokens = [token.lemma_ for token in doc if not token.is_space]
            # Example: "running dogs" → ["run", "dog"]
        else:
            # Use NLTK for tokenization and stemming
            from nltk.tokenize import word_tokenize
            # word_tokenize handles contractions, punctuation, etc.
            tokens = word_tokenize(text)
            # Stem each token (may produce non-words like "studi" from "studies")
            tokens = [self.stemmer.stem(token) for token in tokens]

        # Filter stopwords if enabled
        # Stopwords are high-frequency words with little semantic content
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]
            # Example: ["the", "cat", "sat"] → ["cat", "sat"]

        return tokens

    def preprocess(self, text):
        """
        Full preprocessing pipeline: clean + tokenize + normalize.

        Parameters:
            text (str): Raw text input

        Returns:
            str: Preprocessed text (space-separated tokens)

        Pipeline:
            1. clean(): Remove noise
            2. tokenize_and_normalize(): Tokenize and normalize
            3. Join tokens with spaces (format required by TfidfVectorizer)
        """
        # Step 1: Clean text (remove HTML, URLs, etc.)
        text = self.clean(text)

        # Step 2: Tokenize and normalize
        tokens = self.tokenize_and_normalize(text)

        # Step 3: Join tokens with spaces
        # TfidfVectorizer expects space-separated token strings
        return ' '.join(tokens)

    def preprocess_corpus(self, texts):
        """
        Preprocess multiple documents.

        Parameters:
            texts (list[str]): List of raw text documents

        Returns:
            list[str]: List of preprocessed documents (space-separated tokens)

        Note:
            For large corpora with spaCy, prefer batch processing via nlp.pipe():
                docs = list(self.nlp.pipe(texts, batch_size=1000))
            This is more efficient than processing texts one by one.
        """
        # Process each text through the full pipeline
        return [self.preprocess(text) for text in texts]

# ========== Example Usage ==========
# Create preprocessor instance
# use_lemmatization=True: Use spaCy (more accurate but slower)
# remove_stopwords=True: Filter out common words
preprocessor = TextPreprocessor(use_lemmatization=True, remove_stopwords=True)

# Sample texts with various noise
texts = [
    "Natural Language Processing (NLP) is amazing! Visit https://example.com",
    "Machine learning models are trained on large datasets.",
    "Deep learning has revolutionized computer vision and NLP."
]

# Preprocess all texts
processed = preprocessor.preprocess_corpus(texts)

# Display results
for orig, proc in zip(texts, processed):
    print(f"Original: {orig}")
    print(f"Processed: {proc}\n")

# Expected output:
# Original: Natural Language Processing (NLP) is amazing! Visit https://example.com
# Processed: natural language processing nlp amazing visit
#
# Original: Machine learning models are trained on large datasets.
# Processed: machine learn model train large dataset
#
# Original: Deep learning has revolutionized computer vision and NLP.
# Processed: deep learn revolutionize computer vision nlp

Deep Dive: Pipeline Design and Optimization

This preprocessing pipeline demonstrates several important design patterns and optimization techniques:

1. Object-Oriented Design Benefits

Encapsulating preprocessing in a class provides:

  • State management: Model loading happens once in __init__, not per call
  • Configuration: Parameters (lemmatization vs stemming) set once, used everywhere
  • Reusability: Same instance can process multiple texts consistently
  • Testability: Easy to unit test individual methods

2. Performance Optimizations

Model Loading: spaCy model is loaded once in __init__ (expensive operation, ~1-2 seconds). Loading it per call would be 100-1000x slower.

Batch Processing: For large corpora, use spaCy's nlp.pipe():

def preprocess_corpus_optimized(self, texts):
    """Optimized batch processing with spaCy."""
    if self.use_lemmatization:
        # Batch process with spaCy (much faster)
        cleaned_texts = [self.clean(text) for text in texts]
        docs = list(self.nlp.pipe(cleaned_texts, batch_size=1000))
        processed = []
        for doc in docs:
            tokens = [token.lemma_ for token in doc if not token.is_space]
            if self.remove_stopwords:
                tokens = [t for t in tokens if t not in self.stop_words]
            processed.append(' '.join(tokens))
        return processed
    else:
        return [self.preprocess(text) for text in texts]

3. Design Trade-offs

Aspect             | Current Design          | Alternative        | Trade-off
Model Loading      | Once in __init__        | Per call           | Memory vs speed
Stopword Storage   | Set (O(1) lookup)       | List (O(n) lookup) | Memory vs speed
Method Granularity | Separate clean/tokenize | Single method      | Flexibility vs simplicity
Error Handling     | None (fails fast)       | Try-except         | Robustness vs clarity
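The set-vs-list trade-off is easy to measure. A small sketch timing a membership miss against a list and a set of roughly NLTK-stopword-list size (the word entries are synthetic placeholders):

```python
import timeit

# ~180 entries, roughly the size of NLTK's English stopword list
stopwords_list = [f"word{i}" for i in range(180)]
stopwords_set = set(stopwords_list)

# A miss must scan the entire list, but hashes straight to a set bucket
t_list = timeit.timeit(lambda: "zebra" in stopwords_list, number=50_000)
t_set = timeit.timeit(lambda: "zebra" in stopwords_set, number=50_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On typical hardware the set lookup is one to two orders of magnitude faster, and the gap grows with the stopword count.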

4. Extensibility

The pipeline can be extended for specific needs:

class CustomTextPreprocessor(TextPreprocessor):
    """Extended preprocessor with custom features."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Add custom stopwords
        self.custom_stopwords = {'said', 'according', 'reported'}
        if self.remove_stopwords:
            self.stop_words.update(self.custom_stopwords)

    def clean(self, text):
        """Extended cleaning: keep a trace of URLs instead of deleting them."""
        # Replace URLs BEFORE the base clean() strips them entirely;
        # a plain-letter placeholder survives the non-alphabetic filter
        text = re.sub(r'http\S+|www\.\S+', ' urltoken ', text)
        return super().clean(text)

5. Common Issues and Solutions

Issue                 | Cause                        | Solution
Slow processing       | Processing texts one by one  | Use nlp.pipe() for batch processing
Memory issues         | Loading large spaCy model    | Use a smaller model (sm) or disable components
Inconsistent results  | Model reloaded each time     | Load model once in __init__
Stopwords not removed | Lookup container not updated | Ensure stopwords are stored in a set, not a list
Encoding errors       | Non-UTF-8 text               | Handle encoding before preprocessing

6. Production Considerations

For production deployment:

  1. Error Handling: Add try-except blocks for robustness
  2. Logging: Log preprocessing steps for debugging
  3. Caching: Cache preprocessed results for repeated texts
  4. Versioning: Track preprocessing pipeline versions
  5. Monitoring: Monitor processing time and memory usage
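For point 3, Python's functools.lru_cache gives a zero-dependency cache for repeated texts. A sketch in which `cached_preprocess` is a hypothetical stand-in for the real preprocess() call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_preprocess(text):
    # Stand-in for an expensive preprocessing pipeline
    return " ".join(text.lower().split())

cached_preprocess("Hello   World")
cached_preprocess("Hello   World")  # second call is served from the cache
print(cached_preprocess.cache_info().hits)  # 1
```

Note that lru_cache requires hashable arguments (strings are fine) and caches per process; for multi-process deployments an external cache such as Redis is the usual choice.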

7. Testing Strategy

def test_preprocessor():
    """Unit tests for TextPreprocessor."""
    preprocessor = TextPreprocessor()

    # Test cleaning
    assert preprocessor.clean("Hello <p>World</p>") == "hello world"

    # Test tokenization
    tokens = preprocessor.tokenize_and_normalize("running dogs")
    assert "run" in tokens or "running" in tokens

    # Test full pipeline
    result = preprocessor.preprocess("I'm learning NLP!")
    assert "learn" in result or "learning" in result

This preprocessing pipeline provides a solid foundation for NLP projects. Understanding its design choices and optimization opportunities helps adapt it for specific use cases.

Practical Example: Text Classification

Let's build a complete spam classifier using our preprocessing pipeline:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Sample dataset (in practice, use SMS Spam Collection or similar)
texts = [
    "Congratulations! You've won a $1000 gift card. Call now!",
    "Hey, are we still meeting for dinner tonight?",
    "URGENT: Your account will be closed. Click here immediately!",
    "Can you send me the project report by EOD?",
    "Get rich quick! Amazing investment opportunity!",
    "Don't forget to pick up milk on your way home",
    "You have been selected for a free cruise. Reply YES",
    "Meeting moved to 3pm tomorrow in conference room B",
    "Lose 20 pounds in 2 weeks with this miracle pill!",
    "Thanks for your help with the presentation yesterday"
]

labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] # 1=spam, 0=ham

# Preprocess
preprocessor = TextPreprocessor(use_lemmatization=True, remove_stopwords=False)
processed_texts = preprocessor.preprocess_corpus(texts)

# Vectorize with TF-IDF
vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 2))
X = vectorizer.fit_transform(processed_texts)
y = np.array(labels)

# Split data (normally you'd have more data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train models
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# Evaluate
print("Naive Bayes Performance:")
y_pred_nb = nb_model.predict(X_test)
print(classification_report(y_test, y_pred_nb, target_names=['Ham', 'Spam']))

print("\nLogistic Regression Performance:")
y_pred_lr = lr_model.predict(X_test)
print(classification_report(y_test, y_pred_lr, target_names=['Ham', 'Spam']))

# Test on new examples
new_messages = [
    "Can you review my code changes?",
    "FREE MONEY!!! Click now to claim your prize!!!",
    "grab coffee this weekend"
]

new_processed = preprocessor.preprocess_corpus(new_messages)
new_vectors = vectorizer.transform(new_processed)
predictions = lr_model.predict(new_vectors)

print("\nPredictions on new messages:")
for msg, pred in zip(new_messages, predictions):
    label = "SPAM" if pred == 1 else "HAM"
    print(f"[{label}] {msg}")

Output:

Naive Bayes Performance:
precision recall f1-score support

Ham 1.00 1.00 1.00 1
Spam 1.00 1.00 1.00 2

accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3

Logistic Regression Performance:
precision recall f1-score support

Ham 1.00 1.00 1.00 1
Spam 1.00 1.00 1.00 2

accuracy 1.00 3
macro avg 1.00 1.00 1.00 3
weighted avg 1.00 1.00 1.00 3

Predictions on new messages:
[HAM] Can you review my code changes?
[SPAM] FREE MONEY!!! Click now to claim your prize!!!
[HAM] grab coffee this weekend

Visualizing Text Features

Visualization helps understand our features:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def visualize_tfidf(texts, labels, method='pca'):
    """Visualize high-dimensional TF-IDF vectors in 2D."""
    # Preprocess and vectorize
    preprocessor = TextPreprocessor()
    processed = preprocessor.preprocess_corpus(texts)
    vectorizer = TfidfVectorizer(max_features=100)
    X = vectorizer.fit_transform(processed).toarray()

    # Dimensionality reduction
    if method == 'pca':
        reducer = PCA(n_components=2)
        X_2d = reducer.fit_transform(X)
        title = 'PCA of TF-IDF Features'
    else:
        reducer = TSNE(n_components=2, random_state=42)
        X_2d = reducer.fit_transform(X)
        title = 't-SNE of TF-IDF Features'

    # Plot each class separately so the legend maps labels to colors correctly
    plt.figure(figsize=(10, 6))
    for cls, color, name in [(0, 'blue', 'Ham'), (1, 'red', 'Spam')]:
        idxs = [i for i, l in enumerate(labels) if l == cls]
        plt.scatter(X_2d[idxs, 0], X_2d[idxs, 1], c=color, alpha=0.6, s=100, label=name)

    # Annotate each point with a snippet of its text
    for i, txt in enumerate(texts):
        plt.annotate(txt[:20] + '...', (X_2d[i, 0], X_2d[i, 1]),
                     fontsize=8, alpha=0.7)

    plt.title(title)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.legend()
    plt.tight_layout()
    plt.savefig('tfidf_visualization.png', dpi=150)
    plt.close()

# Example (requires matplotlib)
if __name__ == '__main__':
    visualize_tfidf(texts, labels, method='pca')
    print("Visualization saved as 'tfidf_visualization.png'")

Advanced Preprocessing Considerations

Handling Different Languages

spaCy supports 60+ languages:

# Load German model
nlp_de = spacy.load('de_core_news_sm')
text_de = "Ich liebe maschinelles Lernen und künstliche Intelligenz"
doc = nlp_de(text_de)

for token in doc:
    print(f"{token.text} → {token.lemma_} ({token.pos_})")

Handling Emojis and Special Characters

For sentiment analysis, emojis matter:

import emoji

text = "I love this product! 😍👍"
# Convert emojis to text
text_with_emoji_text = emoji.demojize(text)
print(text_with_emoji_text)
# Output: "I love this product! :smiling_face_with_heart-eyes::thumbs_up:"

Dealing with Contractions

import contractions

text = "I can't believe it's already 5 o'clock!"
expanded = contractions.fix(text)
print(expanded)
# Output: "I cannot believe it is already 5 o'clock!"

Handling Rare Words and Typos

Use spell checking libraries:

from spellchecker import SpellChecker

spell = SpellChecker()
text = "I have a speling problm"
words = text.split()
corrected = [spell.correction(word) for word in words]
print(' '.join(corrected))
# Output: "i have a spelling problem"

When to Use Which Technique

Here's a practical guide:

Task                | Tokenization            | Normalization | Stopword Removal | Feature Method
Search/IR           | Word                    | Stemming      | Yes              | TF-IDF
Sentiment Analysis  | Word/Subword            | Lemmatization | No               | TF-IDF or embeddings
Topic Modeling      | Word                    | Lemmatization | Yes              | BoW or TF-IDF
Machine Translation | Subword (BPE)           | Minimal       | No               | Embeddings
Text Classification | Word                    | Lemmatization | Optional         | TF-IDF
NER                 | Word                    | None          | No               | Embeddings + context
QA Systems          | Subword                 | Minimal       | No               | Contextual embeddings
Modern LLMs         | Subword (BPE/WordPiece) | None          | No               | Learned embeddings

General principles:

  • More data → less preprocessing: deep learning models learn representations; aggressive preprocessing can hurt
  • Less data → more preprocessing: traditional ML benefits from feature engineering
  • Domain-specific → custom rules: medical and legal text may need specialized handling
  • Multilingual → subword tokenization: BPE/SentencePiece work across languages

Questions and Answers

Q1: Why do modern language models like GPT use subword tokenization instead of word-level tokenization?

A: Subword tokenization (BPE, WordPiece) offers several advantages:

  1. Handles rare words: Rare words are split into common subwords. Example: "unhappiness" → "un", "happi", "ness"
  2. Reduces vocabulary size: Instead of millions of words, use 50k subword units
  3. No unknown tokens: Any word can be represented as subword combinations
  4. Multilingual capability: Subwords work across languages (shared roots, morphemes)
  5. Better generalization: Model learns word composition ("re-" prefix meaning)

Trade-off: Sequences become longer (more tokens per sentence), but benefits outweigh costs.

Q2: Should I always remove stopwords?

A: No. It depends on your task:

Remove stopwords when:

  • Using traditional ML with limited features (BoW, TF-IDF)
  • Computing document similarity or clustering
  • Building search engines (to reduce index size)
  • Topic modeling

Keep stopwords when:

  • Using deep learning models (LSTM, Transformer)
  • Doing sentiment analysis ("not good" ≠ "good")
  • Question answering ("who", "what", "where" are critical)
  • Machine translation (grammatical words matter)
  • Named entity recognition (context matters)

Modern neural models learn to attend to important words and ignore stopwords automatically.
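The sentiment caveat is easy to demonstrate with scikit-learn's built-in English stopword list, which includes negations like "not":

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# "not" is on scikit-learn's English stopword list, so naive stopword
# removal maps "not good" and "good" to the same feature set
review = "this is not good"
tokens = [t for t in review.split() if t not in ENGLISH_STOP_WORDS]
print(tokens)  # ['good']
```

The negated review is left looking positive; for sentiment tasks, either keep stopwords or curate a list that preserves negations.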

Q3: What's the difference between stemming and lemmatization, and which should I use?

A:

Stemming:

  • Rule-based suffix removal
  • Fast but crude
  • Output may not be real words ("studies" → "studi")
  • Doesn't require POS tags
  • Use for: information retrieval and search engines where speed matters

Lemmatization:

  • Dictionary-based transformation
  • Slower but accurate
  • Output is real words ("studies" → "study")
  • Benefits from POS tags
  • Use for: NLU tasks, question answering, when semantics matter

Example:

Word: "better"
Stemming: "better" (unchanged, missed the relationship to "good")
Lemmatization: "well" or "good" (depending on context)

Recommendation: Use lemmatization unless you have:

  • Huge datasets where speed is critical
  • Minimal computational resources
  • An IR/search use case where aggressive normalization is acceptable
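The "studies"/"better" behavior above can be checked with NLTK's Porter stemmer (assumes nltk is installed; the stemmer itself needs no downloaded corpora):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Rule-based suffix stripping: output need not be a dictionary word,
# and irregular forms like "better" are left untouched
print(stemmer.stem("studies"))  # studi
print(stemmer.stem("better"))   # better
```

A lemmatizer with POS information would instead map "studies" to "study" and could relate "better" to "good".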

Q4: How do I choose the right n-gram range?

A: Consider these factors:

Unigrams (n=1):

  • Pro: captures individual words, simple
  • Con: loses word order and phrases

Bigrams (n=2):

  • Pro: captures common phrases ("machine learning", "not good")
  • Con: increases vocabulary size, may overfit

Trigrams (n=3):

  • Pro: captures longer phrases ("natural language processing")
  • Con: very sparse, huge vocabulary, overfitting risk

Practical guidelines:

  • Small dataset (<1000 docs): use (1, 1) unigrams only
  • Medium dataset (1k-10k): try (1, 2) unigrams + bigrams
  • Large dataset (>10k): experiment with (1, 3)
  • Always monitor vocabulary size and model performance

Example:

# Start simple
vec = TfidfVectorizer(ngram_range=(1, 1)) # Unigrams only

# If underfitting, add bigrams
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

# Control vocabulary explosion with max_features
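The vocabulary growth that motivates max_features can be observed directly by fitting CountVectorizer with widening n-gram ranges on a tiny two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["natural language processing",
        "language models process natural text"]

# Vocabulary size grows quickly as the n-gram range widens
sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vocab = CountVectorizer(ngram_range=ngram_range).fit(docs).vocabulary_
    sizes[ngram_range] = len(vocab)
    print(ngram_range, len(vocab))
```

Even with eight total tokens, the feature count nearly triples from unigrams to (1, 3); on a real corpus the blow-up is far larger, which is why max_features and min_df matter.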

Q5: How do I handle imbalanced text datasets?

A: Text classification often faces imbalance (e.g., 95% legitimate emails, 5% spam):

Techniques:

  1. Resampling:

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Undersample majority class
    rus = RandomUnderSampler(random_state=42)
    X_resampled, y_resampled = rus.fit_resample(X, y)

    # SMOTE for text (works on TF-IDF vectors)
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)

  2. Class weights:

    from sklearn.linear_model import LogisticRegression

    # Automatically adjust weights inversely proportional to class frequencies
    model = LogisticRegression(class_weight='balanced')

  3. Evaluation metrics:

    # Don't use accuracy! Use:
    from sklearn.metrics import f1_score, precision_recall_curve, roc_auc_score

    # F1 balances precision and recall
    f1 = f1_score(y_test, y_pred, average='weighted')

    # AUC-ROC for imbalanced classes
    auc = roc_auc_score(y_test, y_pred_proba[:, 1])

  4. Collect more minority class data (best solution when possible)

Q6: What's the best way to preprocess text for BERT and other transformers?

A: Transformers have their own tokenizers; don't use traditional preprocessing:

What to do:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Natural Language Processing is amazing!"

# BERT handles everything internally
tokens = tokenizer.tokenize(text)
print(tokens)
# ['natural', 'language', 'processing', 'is', 'amazing', '!']

# Convert to IDs
ids = tokenizer.encode(text, add_special_tokens=True)
print(ids)
# [101, 3019, 2653, 6364, 2003, 6429, 999, 102]
# 101 = [CLS], 102 = [SEP]

What NOT to do:

  • ❌ Don't remove stopwords (BERT learns their importance)
  • ❌ Don't stem/lemmatize (BERT uses subword tokenization)
  • ❌ Don't remove punctuation (it can carry meaning)
  • ❌ Don't lowercase if using cased models

Minimal preprocessing for transformers:

# Only do basic cleaning
def clean_for_transformer(text):
    # Remove excessive whitespace
    text = ' '.join(text.split())
    # Maybe remove HTML and URLs (task-dependent)
    text = re.sub(r'<[^>]+>', '', text)
    return text

The model's tokenizer handles the rest!

Q7: How do I evaluate preprocessing choices?

A: Use empirical evaluation:

Method:

  1. Split data into train/val/test
  2. Train a model with each candidate preprocessing pipeline
  3. Compare validation performance
  4. Choose the best configuration
  5. Report final test performance

Example experiment:

import pandas as pd

# Define preprocessing variations
configs = [
    {'name': 'baseline', 'stem': False, 'lemma': False, 'stop': False},
    {'name': 'stem_only', 'stem': True, 'lemma': False, 'stop': False},
    {'name': 'lemma_only', 'stem': False, 'lemma': True, 'stop': False},
    {'name': 'lemma_stop', 'stem': False, 'lemma': True, 'stop': True},
]

results = []
for config in configs:
    # Preprocess with this config
    preprocessor = TextPreprocessor(
        use_lemmatization=config['lemma'],
        remove_stopwords=config['stop']
    )
    processed = preprocessor.preprocess_corpus(X_train)

    # Train and evaluate
    vec = TfidfVectorizer()
    X_vec = vec.fit_transform(processed)
    model = LogisticRegression()
    model.fit(X_vec, y_train)

    # Validate
    X_val_processed = preprocessor.preprocess_corpus(X_val)
    X_val_vec = vec.transform(X_val_processed)
    score = model.score(X_val_vec, y_val)

    results.append({'config': config['name'], 'accuracy': score})

# Compare
df_results = pd.DataFrame(results)
print(df_results.sort_values('accuracy', ascending=False))

Metrics to track:

  • Accuracy (if the dataset is balanced)
  • F1-score (if imbalanced)
  • Training time
  • Inference time
  • Model size
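To make the accuracy-vs-F1 distinction concrete, here is a small self-contained sketch (the labels are made up for illustration): on an imbalanced set, a model that predicts only the majority class scores high accuracy while macro F1 exposes the failure.

```python
# Accuracy vs. macro F1 on an imbalanced toy dataset.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    # Average per-class F1, so minority classes count as much as the majority.
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# 9 negatives, 1 positive; the model predicts "neg" every time
y_true = ["neg"] * 9 + ["pos"]
y_pred = ["neg"] * 10
print(accuracy(y_true, y_pred))  # 0.9 -- looks fine
print(macro_f1(y_true, y_pred))  # ~0.47 -- reveals the ignored class
```

In practice you would use `sklearn.metrics.f1_score(..., average='macro')`, but the arithmetic above is exactly what it computes.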

Q8: How do I handle domain-specific jargon and abbreviations?

A: Create custom preprocessing rules:

1. Build domain dictionary:

# Medical domain example
medical_expansions = {
    'MI': 'myocardial infarction',
    'HTN': 'hypertension',
    'DM': 'diabetes mellitus',
    'pt': 'patient'
}

def expand_abbreviations(text, expansions):
    words = text.split()
    expanded = [expansions.get(w, w) for w in words]
    return ' '.join(expanded)

text = "pt has HTN and DM"
print(expand_abbreviations(text, medical_expansions))
# Output: "patient has hypertension and diabetes mellitus"
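One caveat: the split-based expander above misses abbreviations glued to punctuation ("HTN," stays unexpanded because the comma is part of the token). A word-boundary regex handles that case; this is a sketch using the same illustrative dictionary:

```python
import re

def expand_abbreviations_re(text, expansions):
    # \b anchors keep "pt" from matching inside longer words.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, expansions)) + r')\b')
    return pattern.sub(lambda m: expansions[m.group(1)], text)

medical_expansions = {
    'MI': 'myocardial infarction',
    'HTN': 'hypertension',
    'DM': 'diabetes mellitus',
    'pt': 'patient'
}

print(expand_abbreviations_re("pt has HTN, DM", medical_expansions))
# patient has hypertension, diabetes mellitus
```

If some keys are prefixes of others, sort the alternation by length (longest first) so the regex prefers the longer match.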

2. Custom tokenization rules:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Don't split on hyphens between letters, so medical terms stay intact.
# The default hyphen infix pattern contains the "-|–|—" character class;
# filter it out of the infix list, then rebuild the infix matcher.
infixes = [pattern for pattern in nlp.Defaults.infixes
           if '-|–|—' not in pattern]
infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

# Now "COVID-19" stays as one token
doc = nlp("COVID-19 is a coronavirus disease")
print([token.text for token in doc])
# ['COVID-19', 'is', 'a', 'coronavirus', 'disease']

3. Domain-specific stopwords:

# Remove/add words specific to your domain
legal_stopwords = {'whereas', 'herein', 'hereby', 'aforementioned'}
tech_stopwords = {'algorithm', 'system', 'method'} # If too common
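In practice you merge domain additions with a base list, and just as importantly "rescue" words the base list would wrongly drop. A minimal sketch with plain sets (the word lists here are illustrative, not a recommendation for any particular corpus):

```python
# Merge a base stopword list with domain additions, keeping exceptions.
base_stopwords = {'the', 'a', 'is', 'in', 'of', 'not'}
legal_stopwords = {'whereas', 'herein', 'hereby', 'aforementioned'}
keep_anyway = {'not'}  # negation matters for sentiment-style tasks

stopwords = (base_stopwords | legal_stopwords) - keep_anyway

def filter_tokens(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "whereas the party is not liable".split()
print(filter_tokens(tokens, stopwords))  # ['party', 'not', 'liable']
```

The same pattern works with NLTK's or spaCy's built-in lists as the base set.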

Q9: What's the impact of preprocessing on model interpretability?

A: Preprocessing affects how you can interpret model decisions:

Aggressive preprocessing reduces interpretability:

# Original text
text = "The movie wasn't good at all"

# After stemming + stopword removal
processed = "movi good" # Loses negation!

# Model sees only "movi good" → predicts positive
# But original sentiment was negative!

For interpretable models:

  1. Keep preprocessing minimal
  2. Document all transformations
  3. Store a mapping from processed → original text
  4. Use techniques that preserve semantics (lemmatization over stemming)

Example with traceability:

class InterpretablePreprocessor:
    def __init__(self):
        self.transformations = []

    def preprocess(self, text):
        original = text

        # Track each transformation
        text = text.lower()
        self.transformations.append(('lowercase', original, text))

        # ... more preprocessing ...

        return text, self.transformations

    def explain(self):
        """Show all transformations."""
        for step, before, after in self.transformations:
            print(f"{step}: '{before}' → '{after}'")

For deep learning models:

  • Use attention visualization to see which tokens matter
  • Apply LIME/SHAP on the processed text
  • Keep preprocessing minimal to preserve original semantics

Q10: How do I build a preprocessing pipeline for production systems?

A: Production pipelines need robustness, speed, and reproducibility:

Key principles:

  1. Version control everything:

    import json

    class ProductionPreprocessor:
        VERSION = "1.2.0"

        def __init__(self):
            self.config = {
                'version': self.VERSION,
                'lowercase': True,
                'remove_urls': True,
                'min_token_length': 2,
                'max_tokens': 512,
                'vocab_size': 10000
            }

        def save_config(self, path):
            with open(path, 'w') as f:
                json.dump(self.config, f)

  2. Handle edge cases:

    import logging

    def robust_preprocess(text):
        # Handle None, empty strings
        if not text or not isinstance(text, str):
            return ""

        # Handle very long texts
        if len(text) > 1_000_000:  # 1M chars
            text = text[:1_000_000]
            logging.warning("Text truncated to 1M chars")

        try:
            # Main preprocessing
            return preprocess(text)
        except Exception as e:
            logging.error(f"Preprocessing failed: {e}")
            return text  # Return original on error

  3. Optimize for speed:

    import spacy

    # Use spaCy's pipe for batch processing
    def preprocess_batch(texts, batch_size=1000):
        # Disable unused components at load time
        nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

        processed = []
        for doc in nlp.pipe(texts, batch_size=batch_size):
            tokens = [token.lemma_ for token in doc if not token.is_stop]
            processed.append(' '.join(tokens))

        return processed

  4. Use consistent serialization:

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Train
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_texts)

    # Save with versioning
    joblib.dump({
        'vectorizer': vectorizer,
        'version': '1.0',
        'date': '2025-02-01',
        'vocab_size': len(vectorizer.vocabulary_)
    }, 'vectorizer_v1.0.pkl')

    # Load in production
    pipeline = joblib.load('vectorizer_v1.0.pkl')
    vectorizer = pipeline['vectorizer']

  5. Monitor in production:

    import logging
    import time

    class MonitoredPreprocessor:
        def preprocess(self, text):
            start = time.time()

            result = self._preprocess(text)

            duration = time.time() - start
            if duration > 1.0:  # Alert if slow
                logging.warning(f"Slow preprocessing: {duration:.2f}s")

            # Track metrics
            self.log_metrics({
                'duration': duration,
                'input_length': len(text),
                'output_length': len(result)
            })

            return result
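The five principles above can be tied together in one minimal sketch: versioned config, defensive input handling, and per-call metrics in a single class. Names and defaults here (e.g. `MAX_CHARS`, the `PipelinePreprocessor` class itself) are illustrative choices, not fixed recommendations.

```python
import json
import logging
import time

class PipelinePreprocessor:
    VERSION = "1.0.0"
    MAX_CHARS = 1_000_000  # illustrative truncation limit

    def __init__(self, lowercase=True):
        self.config = {"version": self.VERSION, "lowercase": lowercase}
        self.metrics = []

    def preprocess(self, text):
        # Edge cases: None, non-strings, empty input
        if not isinstance(text, str) or not text:
            return ""
        # Defensive truncation of pathological inputs
        if len(text) > self.MAX_CHARS:
            logging.warning("Text truncated to %d chars", self.MAX_CHARS)
            text = text[: self.MAX_CHARS]
        start = time.time()
        # Minimal cleaning: collapse whitespace, optional lowercasing
        result = " ".join(text.split())
        if self.config["lowercase"]:
            result = result.lower()
        # Per-call metrics for monitoring
        self.metrics.append({"duration": time.time() - start,
                             "input_length": len(text),
                             "output_length": len(result)})
        return result

    def save_config(self, path):
        with open(path, "w") as f:
            json.dump(self.config, f)

pp = PipelinePreprocessor()
print(pp.preprocess("  Hello   WORLD  "))   # hello world
print(repr(pp.preprocess(None)))            # ''
```

A real deployment would swap the cleaning step for the project's full pipeline, but the surrounding scaffolding (config versioning, guards, metrics) stays the same.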

Conclusion

Text preprocessing bridges raw human language and machine-readable features. We've covered the evolution from symbolic to neural NLP, explored tokenization strategies from word-level to subword methods like BPE, and implemented practical pipelines with stemming, lemmatization, stopword removal, and TF-IDF vectorization.

Key takeaways:

  1. Preprocessing is task-dependent: Search engines need aggressive normalization; deep learning models need minimal preprocessing
  2. Modern NLP favors subword tokenization: BPE and WordPiece handle rare words and multilingual text elegantly
  3. Less is often more: Over-preprocessing can hurt modern neural models that learn representations from data
  4. Always evaluate empirically: Test different preprocessing strategies and measure impact on your specific task
  5. Production systems need robustness: Version control, error handling, and monitoring are critical

As NLP evolves toward even larger language models with better zero-shot capabilities, preprocessing may become less critical for many tasks. However, understanding these fundamentals remains essential for building reliable, efficient, and interpretable NLP systems.

In the next article, we'll explore word embeddings (Word2Vec, GloVe, FastText) and how they capture semantic relationships in continuous vector spaces.

Further Reading

  • Post title: NLP (1): Introduction and Text Preprocessing
  • Post author: Chen Kai
  • Create time: 2024-02-03 09:00:00
  • Post link: https://www.chenk.top/en/nlp-introduction-and-preprocessing/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.