• NLP (6): GPT and Generative Language Models

    If BERT opened the golden age of understanding-based NLP, then the GPT series represents the pinnacle of generative NLP. From GPT-1 in 2018 to GPT-4 in 2023, OpenAI has demonstrated through continuously scaling model size and optimizing training strategies that autoregressive language models can serve as the foundation for artificial general intelligence. GPT's success lies not only in its powerful text generation capabilities but also in demonstrating the magical power of In-Context Learning: models can learn new tasks with just a few examples without updating parameters.

    GPT's core is autoregressive language modeling: given the previous tokens, predict the next token. This seemingly simple objective, combined with the Transformer decoder architecture and large-scale training data, produces astonishing emergent capabilities. Understanding GPT is not just key to understanding modern large language models — it's the starting point for exploring artificial general intelligence.
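    The next-token objective and the generation loop can be sketched with a toy count-based bigram model (all data here is made up; a real GPT replaces the count table with a Transformer decoder, but the decoding loop is the same):

```python
# Minimal sketch of autoregressive generation: a toy bigram "language model"
# fit by counting, then decoded greedily token by token.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ran".split()

# "Training": count next-token frequencies for each token.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(prompt, n_tokens=4):
    """Greedy autoregressive decoding: repeatedly pick the most likely
    next token given the last token and append it to the sequence."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        nxt_counts = bigrams.get(tokens[-1])
        if not nxt_counts:          # no continuation seen in training
            break
        tokens.append(nxt_counts.most_common(1)[0][0])
    return " ".join(tokens)
```

    Greedy decoding is only the simplest strategy; the sampling-based decoders the article covers trade determinism for diversity.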

    This article provides an in-depth exploration of the GPT series evolution, principles of autoregressive language modeling, various decoding strategies, in-context learning mechanisms, and how to evaluate generation quality. We'll also build a dialogue system through practical code, demonstrating GPT's powerful capabilities in real applications.

  • Kernel Methods: From Theory to Practice (RKHS, Common Kernels, and Hyperparameter Tuning)

    A kernel lets you use linear methods on non-linear problems by implicitly mapping data into a (possibly very high-dimensional) feature space. This note builds the intuition (the kernel trick), the math foundation (positive definite kernels, RKHS, Mercer's theorem), and the practical side (how to choose kernels and tune hyperparameters). We cover common kernels (RBF, polynomial, Matérn, periodic), troubleshooting (overfitting, underfitting, numerical issues), and a decision flowchart for kernel selection in SVM, Gaussian Processes, and Kernel PCA.
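    As a minimal illustration of the kernel trick, the RBF Gram matrix can be computed directly from pairwise distances without ever materializing the (infinite-dimensional) feature map; the `lengthscale` parameter and data below are illustrative:

```python
# Sketch of the kernel trick with the RBF kernel:
# K[i, j] = exp(-||x_i - x_j||^2 / (2 * lengthscale^2)).
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * lengthscale**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
K = rbf_kernel(X, X)
# A valid Gram matrix is symmetric positive semidefinite with ones on the diagonal.
```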

  • NLP (5): BERT and Pretrained Models

    In 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), which fundamentally transformed the field of natural language processing. Prior to BERT, pretrained models primarily used unidirectional language modeling (like GPT), which could only leverage context in one direction. BERT revolutionized NLP by introducing bidirectional encoder architecture and masked language modeling (MLM), achieving state-of-the-art performance on 11 NLP tasks and ushering in the golden age of the "pretrain-finetune" paradigm.
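    The masked-language-modeling input construction can be sketched as follows (token strings stand in for vocabulary ids, and the 80/10/10 replacement rule of the original paper is omitted for brevity):

```python
# Sketch of BERT-style MLM input construction: replace a fraction of tokens
# with [MASK] and keep the originals as prediction targets.
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)     # model must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)    # position not scored in the MLM loss
    return inputs, labels
```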

    BERT's success lies not only in its architectural innovations but also in demonstrating that large-scale pretrained models can serve as universal foundations for NLP tasks. Since BERT, variants like RoBERTa, ALBERT, and ELECTRA have continuously emerged, each optimizing BERT's design in different dimensions. Understanding BERT is not just key to understanding modern NLP — it's the starting point for diving into the era of large language models.

    This article provides an in-depth analysis of BERT's architecture, training strategies, and finetuning methods, demonstrates practical usage through HuggingFace code examples, and compares various BERT variants and their improvements.

  • Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Prefix-Tuning is a parameter-efficient way to adapt a frozen language model: instead of updating model weights, you learn a small set of continuous vectors ("prefixes") that steer the model's generation. A key practical variant injects learned prefixes into the attention mechanism as per-layer key/value prefixes. This note explains the method, why reparameterization helps optimization stability, how Prefix-Tuning compares to prompt tuning and LoRA, and what implementation details matter in real training.
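    A minimal sketch of the key/value-prefix idea, with illustrative shapes and no real model (single head, projection weights omitted): learned prefix rows are prepended to the keys and values so every query can attend to them, while the frozen weights stay untouched.

```python
# Sketch of per-layer key/value prefixes in attention. Shapes and names are
# illustrative, not from any library.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_prefix(Q, K, V, prefix_k, prefix_v):
    # Prepend learned prefixes along the sequence axis; only these would be trained.
    K_ext = np.concatenate([prefix_k, K], axis=0)   # (p + n, d)
    V_ext = np.concatenate([prefix_v, V], axis=0)
    scores = Q @ K_ext.T / np.sqrt(Q.shape[-1])     # (n, p + n)
    return softmax(scores) @ V_ext                  # (n, d)

rng = np.random.default_rng(0)
n, p, d = 4, 2, 8   # sequence length, prefix length, head dimension
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
prefix_k, prefix_v = rng.normal(size=(p, d)), rng.normal(size=(p, d))
out = attention_with_prefix(Q, K, V, prefix_k, prefix_v)
```

    With a zero-length prefix this reduces exactly to ordinary attention, which is a handy sanity check during implementation.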

  • NLP (4): Attention Mechanism and Transformer

    The Transformer architecture revolutionized natural language processing by introducing a mechanism that allows models to focus on relevant parts of the input when processing each element. Unlike recurrent networks that process sequences step-by-step, Transformers use attention to capture dependencies regardless of distance, making them both more powerful and more parallelizable. This article explores the evolution from basic sequence-to-sequence models to the full Transformer architecture, diving deep into attention mechanisms, multi-head attention, positional encoding, and providing complete PyTorch implementations that you can run and modify.
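    One self-contained piece worth previewing is the sinusoidal positional encoding: because attention itself is order-invariant, position must be injected explicitly. A sketch of the standard formulation, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos of the same angle:

```python
# Sinusoidal positional encodings, added to token embeddings so the model
# can distinguish token positions.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)             # even dims get sine
    pe[:, 1::2] = np.cos(angle)             # odd dims get cosine
    return pe

pe = positional_encoding(50, 16)
```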

  • NLP (3): RNN and Sequence Modeling

    Sequence data is everywhere in natural language processing — from sentences and documents to time-series conversations. Unlike feedforward networks that treat inputs as independent fixed-size vectors, Recurrent Neural Networks (RNNs) maintain an internal state that evolves as they process sequences step by step. This recurrent connection allows the network to capture temporal dependencies and context, making RNNs a natural choice for language modeling, machine translation, and text generation. However, vanilla RNNs struggle with long-range dependencies due to vanishing gradients. This challenge led to the development of gated architectures like LSTM and GRU, which selectively control information flow and maintain long-term memory. In this article, we'll explore the core mechanics of RNN architectures, understand why gradient issues arise during backpropagation through time, and dive into practical implementations using PyTorch for text generation and sequence-to-sequence tasks.
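    The vanishing-gradient argument can be made concrete in a few lines: backpropagation through time multiplies the gradient by the recurrent Jacobian at every step, so when the recurrent weight matrix has spectral norm below 1 the gradient norm decays geometrically with sequence length (a toy numerical sketch; the tanh derivative, at most 1, would only shrink it further):

```python
# Demonstration of gradient decay through repeated Jacobian products.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, d))
W *= 0.5 / np.linalg.norm(W, 2)   # rescale so the spectral norm is 0.5

grad = np.ones(d)
norms = []
for t in range(30):               # 30 steps of backprop through time
    grad = W.T @ grad             # multiply by the recurrent Jacobian
    norms.append(np.linalg.norm(grad))
# norms shrinks by at least a factor of 2 per step: vanishing gradients.
```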

  • NLP (2): Word Embeddings and Language Models

    Word embeddings revolutionized natural language processing by transforming words from sparse one-hot vectors into dense, meaningful representations that capture semantic relationships. Before embeddings, machines saw "king" and "queen" as completely unrelated symbols — just different positions in a vocabulary list. After embeddings, machines learned that these words share gender and royalty concepts, enabling them to solve analogies like "king - man + woman = queen" through simple vector arithmetic.
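    The analogy arithmetic can be demonstrated with made-up 3-dimensional vectors (dimensions loosely standing for royalty and gender; real embeddings are learned and have hundreds of dimensions, but the offset-and-nearest-neighbor procedure is identical):

```python
# Toy illustration of "king - man + woman = queen" via vector arithmetic
# and cosine-similarity nearest neighbor. Vectors are hand-crafted, not learned.
import numpy as np

vocab = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
}

def analogy(a, b, c, exclude=()):
    """Return the word closest to vec(a) - vec(b) + vec(c) by cosine similarity."""
    target = vocab[a] - vocab[b] + vocab[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(vocab[w], target))
```

    Excluding the query words themselves is the standard evaluation convention, since the nearest neighbor of the offset is often one of the inputs.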

    This article explores the journey from one-hot encodings to modern embedding techniques. We'll examine Word2Vec's innovative training strategies (Skip-gram and CBOW), GloVe's global matrix factorization approach, and FastText's subword extensions. We'll also connect embeddings to language models, showing how predicting context naturally produces semantic representations. By the end, you'll understand not just how to use pre-trained embeddings, but why they work and how to train your own.

  • NLP (1): Introduction and Text Preprocessing

    Natural Language Processing (NLP) bridges the gap between human communication and machine understanding. Whether you're building a chatbot, analyzing customer sentiment, or developing the next generation of language models, understanding how to preprocess text is fundamental. This article explores the evolution of NLP from rule-based systems to modern deep learning approaches, then dives deep into the practical techniques that transform raw text into machine-readable features. We'll cover tokenization strategies, normalization techniques, and feature extraction methods with hands-on Python implementations using NLTK, spaCy, and scikit-learn.
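    A minimal pipeline of the kind the article builds up — lowercasing, regex tokenization, stopword removal — can be sketched in plain Python (the stopword list here is a tiny stand-in; real pipelines in NLTK or spaCy add lemmatization and far better tokenizers):

```python
# Toy text-preprocessing pipeline: normalize, tokenize, filter stopwords.
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of"}

def preprocess(text):
    text = text.lower()                       # normalization
    tokens = re.findall(r"[a-z0-9]+", text)   # crude word tokenization
    return [t for t in tokens if t not in STOPWORDS]
```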

  • HCGR: Hyperbolic Contrastive Graph Representation Learning for Session-based Recommendation

    Session-based recommendation often hides a hierarchical structure: users start with a coarse intent (e.g., "running shoes"), then narrow down to brand, style, size, and price. Euclidean embeddings are good at "flat similarity", but they are not a natural geometry for tree-like growth. HCGR's core idea is to model session graphs in hyperbolic space (specifically the Lorentz model) and use contrastive learning to make the representations more robust and discriminative.
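    The Lorentz-model geometry HCGR builds on can be sketched in a few lines (this shows only the hyperboloid constraint and distance, not the paper's actual encoder or contrastive loss): points live on the set where the Lorentz inner product satisfies <x, x>_L = -1, and distance is d(x, y) = arccosh(-<x, y>_L).

```python
# Sketch of the Lorentz (hyperboloid) model of hyperbolic space.
import numpy as np

def lorentz_inner(x, y):
    # <x, y>_L = -x_0 y_0 + sum_i x_i y_i
    return -x[0] * y[0] + x[1:] @ y[1:]

def lift(v):
    """Map a Euclidean vector v onto the hyperboloid by solving for x_0."""
    return np.concatenate([[np.sqrt(1.0 + v @ v)], v])

def lorentz_dist(x, y):
    # clip guards against arccosh arguments slightly below 1 from rounding
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

x = lift(np.array([0.3, -0.2]))
y = lift(np.array([1.5, 0.7]))
```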

  • Graph Neural Networks for Learning Equivariant Representations of Neural Networks

    Neural network parameters live in a space with strong permutation symmetries: you can reorder hidden units without changing the function, yet the raw weight tensors look completely different. If a representation ignores this, it ends up learning spurious differences and struggles to generalize across architectures or widths. This paper proposes representing a neural network as a neural graph (nodes as neurons/bias features, edges as weights) and then using a GNN to produce equivariant representations that respect these symmetries. This enables tasks like predicting generalization, classifying networks by behavior, retrieving similar architectures, and meta-learning over model populations.
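    The neural-graph encoding can be sketched for a plain MLP (the layout and names below are illustrative, not the paper's exact featurization): biases become node features and weights become directed edge features, so permuting hidden units permutes nodes and edges consistently instead of scrambling a flat weight vector.

```python
# Sketch: encode an MLP as a graph with per-neuron node features (biases)
# and per-connection edge features (weights).
import numpy as np

def mlp_to_neural_graph(weights, biases):
    """weights: list of (out, in) matrices; biases: list of (out,) vectors."""
    sizes = [weights[0].shape[1]] + [W.shape[0] for W in weights]
    n = sum(sizes)
    node_feat = np.zeros(n)           # bias per neuron (0 for input neurons)
    edge_feat = np.zeros((n, n))      # weight on each directed edge
    offsets = np.cumsum([0] + sizes)
    for l, (W, b) in enumerate(zip(weights, biases)):
        src, dst = offsets[l], offsets[l + 1]
        node_feat[dst:dst + len(b)] = b
        edge_feat[src:src + W.shape[1], dst:dst + W.shape[0]] = W.T
    return node_feat, edge_feat
```

    A GNN run over this graph sees the same object no matter how the hidden units of each layer are ordered, which is exactly the symmetry the paper wants the representation to respect.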