• paper2repo: GitHub Repository Recommendation for Academic Papers

    Finding the code behind a paper is often the most frustrating part of reproducing results: links are missing, names drift, and keyword search is noisy. paper2repo frames this as a cross-platform recommendation problem — matching academic papers to relevant GitHub repositories by aligning them in a shared embedding space. It combines text encoders with graph-based signals (e.g., citation/context relations and repository-side structure) via a constrained GCN to learn comparable representations and rank candidate repos. This note summarizes the motivation, how the joint graph is built, what the “constrained” alignment is doing, and which components seem to drive improvements in Hit@K / MAP / MRR.
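    Once papers and repos live in one embedding space, the retrieval step itself reduces to nearest-neighbor ranking. A minimal sketch in pure Python — the 2-D vectors and the `rank_repos` helper are made up for illustration, standing in for the constrained GCN's learned outputs:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rank_repos(paper_vec, repo_vecs):
    # Score every candidate repo against the paper embedding,
    # then return candidate indices sorted best-first plus the raw scores.
    scores = [cosine(paper_vec, r) for r in repo_vecs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# Toy query: repo 1 points the same way as the paper, repo 0 is orthogonal.
order, scores = rank_repos([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0]])
```

    Hit@K, MAP, and MRR are then computed from where the ground-truth repo lands in `order`.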

  • Reinforcement Learning (11): Hierarchical Reinforcement Learning and Meta-Learning

    Traditional reinforcement learning treats complex tasks as flat decision sequences — selecting atomic actions at each timestep, gradually optimizing policies through temporal differences. This paradigm works for simple tasks but becomes inefficient and difficult to generalize in tasks requiring long-term planning and multi-level decisions (such as robot manipulation, game completion, autonomous driving). Humans solve complex problems using hierarchical strategies: decomposing large goals into subgoals (e.g., "make breakfast" decomposes to "brew coffee," "fry eggs," "toast bread"), with each subgoal corresponding to a temporally extended action sequence. Hierarchical Reinforcement Learning (Hierarchical RL) introduces this hierarchical thinking into RL — learning multi-level policies through Temporal Abstraction, improving sample efficiency and interpretability. Meanwhile, traditional RL requires learning from scratch on new tasks, whereas humans can quickly adapt to new scenarios (e.g., quickly learning to ride a motorcycle after learning to ride a bicycle). Meta-Reinforcement Learning (Meta-RL) studies how to "learn to learn" — training across multiple related tasks to enable agents to rapidly adapt to new tasks with minimal samples. From Options' semi-Markov decision processes to MAML's second-order gradient optimization, from MAXQ's value decomposition to RL²'s memory augmentation, hierarchical and meta-learning demonstrate reinforcement learning's enormous potential in structure and generalization. This chapter systematically examines these cutting-edge methods and helps you implement the Options and MAML algorithms through complete code.
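    To make the "second-order" part of MAML concrete, here is a toy sketch on 1-D quadratic task losses, where the inner adaptation step and the gradient *through* it can be written by hand. The task targets, step sizes, and the `maml_step` helper are illustrative choices, not the chapter's implementation:

```python
def maml_step(theta, tasks, alpha=0.1, beta=0.05):
    """One MAML meta-update on toy 1-D tasks with loss_i(t) = (t - target)^2.

    The (1 - 2*alpha) factor below is the second-order term: it comes from
    differentiating through the inner gradient step, not just at its result.
    """
    meta_grad = 0.0
    for target in tasks:
        inner_grad = 2 * (theta - target)        # dL/dtheta at theta
        theta_i = theta - alpha * inner_grad     # inner adaptation step
        # Chain rule through theta_i: dL(theta_i)/dtheta
        meta_grad += 2 * (theta_i - target) * (1 - 2 * alpha)
    return theta - beta * meta_grad / len(tasks)  # outer (meta) update

theta = 0.0
for _ in range(200):
    theta = maml_step(theta, tasks=[-1.0, 3.0])
# theta converges to the initialization that adapts fastest: the task mean.
```

    The `(1 - 2 * alpha)` factor is exactly what a first-order approximation (FOMAML) drops by treating the adapted parameters as independent of the initialization.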

  • Reinforcement Learning (10): Offline Reinforcement Learning

    Traditional reinforcement learning relies on online interaction between agents and environments — collecting experience through trial and error to gradually optimize policies. However, in many real-world scenarios, online interaction is costly or even infeasible: autonomous vehicles cannot freely explore on real roads, medical AI cannot conduct dangerous experiments on patients, and robot errors in production environments can cause massive losses. More importantly, many domains have already accumulated vast amounts of historical data — medical records, traffic logs, user behavior data — and if we could learn from this offline data, the deployment barrier for RL would drop dramatically. Offline reinforcement learning (Offline RL, also known as Batch RL) studies how to learn policies from fixed datasets without further environment interaction. This seemingly simple task is full of challenges: the data distribution mismatches the optimal policy's distribution (distributional shift), and Q-functions produce unreliable estimates on unseen actions (extrapolation error), which can lead to catastrophic failure of the learned policy. From Conservative Q-Learning's pessimistic estimation to Decision Transformer's reframing of RL as sequence modeling, Offline RL's methodology demonstrates how to learn safely under data constraints. This chapter systematically examines Offline RL's core challenges and solutions, and helps you implement the CQL algorithm through complete code.
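    CQL's pessimism is visible in its loss for a single transition with discrete actions: a standard TD error plus a penalty that pushes Q down on all actions (via a logsumexp) and back up on the action actually present in the dataset. A self-contained sketch with toy inputs — the real algorithm batches this and often tunes `alpha` automatically:

```python
from math import exp, log

def cql_loss(q_values, data_action, td_target, alpha=1.0):
    """Conservative Q-Learning loss for one transition (discrete actions).

    q_values: list of Q(s, a) for every action a.
    The penalty (logsumexp(Q) - Q[data_action]) is minimized by lowering Q on
    unseen actions, so out-of-distribution actions end up valued pessimistically.
    """
    logsumexp_q = log(sum(exp(q) for q in q_values))
    conservative_penalty = logsumexp_q - q_values[data_action]
    td_error = (q_values[data_action] - td_target) ** 2
    return td_error + alpha * conservative_penalty

# Toy transition whose TD error is zero: only the penalty remains.
loss = cql_loss([1.0, 2.0], data_action=1, td_target=2.0)
```

    Since logsumexp is always at least the max entry, the penalty is non-negative and vanishes only when the dataset action dominates all others.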

  • Reinforcement Learning (9): Multi-Agent Reinforcement Learning

    When multiple agents interact in the same environment, the fundamental assumption of single-agent reinforcement learning — environment stationarity — breaks down. In autonomous driving, each vehicle is an agent whose decisions affect others; in multiplayer games, opponents' strategy evolution determines your optimal strategy; in robot collaboration, team success depends on each member's coordination. Multi-Agent Reinforcement Learning (MARL) studies how to enable multiple agents to learn cooperative, competitive, or mixed strategies in complex interactions. This field's challenges far exceed those of single-agent RL: the environment appears non-stationary from each agent's perspective (other agents are constantly learning), credit assignment becomes difficult (how to attribute individual contributions when the team succeeds?), and partial observability intensifies uncertainty (you can't see what teammates are doing). But these challenges also bring new opportunities — through modeling other agents, communication protocols, centralized training with decentralized execution, and other techniques, MARL has achieved breakthroughs in complex tasks like StarCraft, Dota 2, and autonomous driving simulation. DeepMind's AlphaStar reached StarCraft Grandmaster level, and OpenAI Five defeated world champions in Dota 2 — these successes demonstrate MARL as a key path toward general intelligence. This chapter will start from game theory's mathematical foundations, progressively delving into independent learning, value decomposition, and multi-agent Actor-Critic methods, with complete code implementing the QMIX algorithm on cooperative tasks.
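    QMIX's central constraint — the team value Q_tot must be monotonic in every agent's individual Q — can be shown with a one-layer sketch where non-negativity of the mixing weights (here via `abs`) provides the guarantee. This is a deliberate simplification: the actual mixer is a hypernetwork that generates these weights conditioned on the global state:

```python
def qmix_mix(agent_qs, weights, bias):
    """Monotonic mixing of per-agent Q-values into Q_tot (one-layer sketch).

    Taking abs() of each weight forces dQ_tot/dQ_i >= 0, so each agent
    greedily maximizing its own Q also maximizes the mixed team value.
    """
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias

# Monotonicity check: raising one agent's Q never lowers Q_tot,
# even though one raw weight is negative before abs() is applied.
base = qmix_mix([1.0, 2.0], weights=[-0.5, 0.3], bias=0.1)
raised = qmix_mix([1.5, 2.0], weights=[-0.5, 0.3], bias=0.1)
```

    This monotonicity is what lets decentralized agents act on their own argmax at execution time while training remains centralized.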

  • Reinforcement Learning (8): AlphaGo and Monte Carlo Tree Search

    In March 2016, when AlphaGo defeated world Go champion Lee Sedol 4-1, the world was stunned — Go had long been considered the ultimate fortress for AI, with a search space of roughly 10^170 positions (far exceeding the universe's estimated 10^80 atoms), making brute-force search completely ineffective. AlphaGo's success wasn't the victory of a single algorithm, but the perfect fusion of Monte Carlo Tree Search (MCTS), deep learning, and reinforcement learning. Even more astonishing, AlphaGo Zero, developed 18 months later, completely discarded human game records and learned purely through self-play from scratch, surpassing millennia of human Go wisdom in just 3 days. Subsequently, AlphaZero generalized this paradigm to chess and shogi, while MuZero went further by breaking free from game rules, learning to plan without knowing the environment dynamics. This series of breakthroughs not only transformed the landscape of game-playing AI but also provided a new paradigm for combining reinforcement learning with search — how to efficiently plan in vast state spaces, how to let AI discover knowledge autonomously, and how to perform forward planning without a model. This chapter will start from MCTS's mathematical foundations, progressively delving into the core designs of AlphaGo, AlphaGo Zero, AlphaZero, and MuZero, with complete code implementations to help you truly understand the essence of these algorithms.
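    The selection step at the heart of MCTS balances exploitation and exploration with the UCB1 formula: pick the child maximizing mean value plus an exploration bonus that shrinks with visits. A minimal sketch — the exploration constant and the `(total_value, visits)` child encoding are illustrative choices:

```python
from math import log, sqrt

def uct_score(total_value, visits, parent_visits, c=1.41):
    """UCB1 score used during MCTS selection: exploit + explore bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return total_value / visits + c * sqrt(log(parent_visits) / visits)

def select_child(children, parent_visits):
    # children: list of (total_value, visits) pairs; return the max-UCT index.
    scores = [uct_score(v, n, parent_visits) for v, n in children]
    return max(range(len(children)), key=lambda i: scores[i])

# A rarely visited child can outscore a well-visited, higher-value one:
choice = select_child([(5.0, 10), (1.0, 1)], parent_visits=11)
```

    AlphaGo replaces the raw bonus with PUCT, weighting exploration by a policy network's prior, but the select-expand-evaluate-backup skeleton is the same.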

  • Reinforcement Learning (7): Imitation Learning and Inverse Reinforcement Learning

    In previous chapters, we learned various reinforcement learning algorithms — from Q-Learning to PPO — all relying on an explicit reward function to guide learning. However, in many real-world scenarios, designing an appropriate reward function is extremely difficult:

    • Autonomous driving: What constitutes "good" driving behavior? Safety first? Comfort priority? Maximum efficiency? How do we balance these goals? How do we quantify "driving like an experienced driver" with a single number?
    • Robot manipulation: How do we write a reward function for teaching a robot to fold clothes, cook, or tidy a room? The final state is easy to define, but how much reward should each intermediate step receive?
    • Game AI: Making an AI learn human player styles, not just maximize scores. Some players prefer aggressive play, others prefer defensive strategies — how do we make AI imitate specific styles?
    • Dialogue systems: What makes a "good" conversation? Interesting? Helpful? Polite? How do we balance these objectives?

    Imitation Learning provides a different path: instead of laboriously designing reward functions, learn directly from expert demonstrations. This is a very natural way of learning — humans learn this way too. Infants learn to walk and talk by imitating their parents, apprentices learn crafts by observing masters, and students learn math by imitating their teachers' problem-solving methods.

    This chapter systematically introduces the core imitation learning methods: from the simplest Behavioral Cloning to DAgger, which corrects distribution shift; from Inverse Reinforcement Learning, which recovers reward functions, to end-to-end adversarial GAIL. We'll dive deep into each method's principles, pros and cons, applicable scenarios, and implementation details.
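    Behavioral Cloning, the simplest method in this lineup, reduces to supervised learning on expert (state, action) pairs. A 1-D least-squares sketch makes that reduction explicit — the toy data is hypothetical, and a real implementation would fit a neural policy on full trajectories:

```python
def behavioral_cloning(states, actions):
    """Fit a linear policy a = w*s + b to expert demonstrations by least squares.

    BC treats imitation as plain regression on (state, action) pairs;
    it never queries the environment or a reward function.
    """
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    var = sum((s - mean_s) ** 2 for s in states)
    w = cov / var
    b = mean_a - w * mean_s
    return lambda s: w * s + b

# Expert demonstrations generated by the (hypothetical) rule a = 2s + 1:
policy = behavioral_cloning([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

    The weakness this exposes is exactly distribution shift: the fit is only trustworthy near states the expert visited, which is the failure mode DAgger's interactive data collection repairs.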

  • Reinforcement Learning (6): PPO and TRPO - Trust Region Policy Optimization

    In Chapter 3, we introduced the basic principles of policy gradient methods: computing gradients through trajectory sampling to directly optimize policy parameters. However, vanilla policy gradient has a fundamental problem — update instability. A single overly large policy update can cause dramatic performance collapse, and since the policy has already changed, recovery from errors becomes extremely difficult. It's like walking a tightrope at the edge of a cliff: one wrong step, and everything is lost.

    Trust Region Methods address this with a core idea: limit the magnitude of each policy update to ensure the new policy doesn't deviate too far from the old one. TRPO (Trust Region Policy Optimization) achieves this through KL divergence constraints, while PPO (Proximal Policy Optimization) uses a simpler clipping mechanism for similar effects. Due to its simplicity and stability, PPO has become the most widely used reinforcement learning algorithm today — from OpenAI Five defeating professional Dota 2 players to ChatGPT's RLHF training pipeline, PPO is everywhere.
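    The clipping mechanism fits in a few lines. For one sample with probability ratio r = π_new(a|s)/π_old(a|s) and advantage A, PPO optimizes min(rA, clip(r, 1-ε, 1+ε)A); the toy numbers below are only for illustration:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for a single sample.

    Clipping removes any incentive to push the ratio outside [1-eps, 1+eps],
    which is what keeps each policy update 'proximal' to the old policy.
    """
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio passes 1 + eps ...
capped = ppo_clip_objective(ratio=1.5, advantage=1.0)      # capped at 1.2
# ... but with negative advantage, a large ratio is still fully penalized.
penalized = ppo_clip_objective(ratio=1.5, advantage=-1.0)  # stays at -1.5
```

    Taking the min (a pessimistic bound) is what makes the objective one-sided: the clip only ever removes incentive, never adds it.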

    This chapter starts from the instability problem of policy gradients, dives deep into the theoretical foundations of trust region methods, explains the mathematical derivation of TRPO and the geometric intuition of natural gradients, then introduces how PPO approximates TRPO's effect with a simple clipping mechanism. We'll also explore practical techniques and PPO's application in RLHF.

  • LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation

    Session-based recommendation (SBR) is a "short-history" problem: given a short click sequence in a session (typically 3–20 clicks), predict the next item without relying on a stable long-term user profile. The difficulty is not conceptual but practical: sessions are short, long-tail items are abundant, cold-start is frequent, and relying purely on interaction graphs (IDs + transition edges) often fails to learn stable representations — new items have almost no edges, long-tail items have very sparse edges, and user exploration introduces significant noise.

    However, real-world systems often have a wealth of underutilized textual side information (titles, descriptions, attributes, reviews). If this semantic information could be leveraged, it could theoretically alleviate cold-start and long-tail problems: even if a new item has no interactions, it still has a title and description; even if a long-tail item has few interactions, its semantic information is still available. The challenge is that traditional GNN-SBR methods struggle to effectively inject textual semantics into session graph modeling — graph models excel at learning structure, LLMs excel at understanding semantics, but their representation spaces are naturally incompatible, and simply concatenating them often fails to train stably.

    LLMGR's core approach is to treat a large language model as a "semantic engine" that converts text into representations alignable with graph nodes; then use a hybrid encoding layer to fuse semantics and graph structure into the same representation space; finally, use a two-stage prompt tuning strategy to first align "node – text" (teaching the model "which description corresponds to which item") and then align "session – behavior patterns" (teaching the model "how to predict next-item intent from session graphs"). This note explains why it is designed this way, what bottlenecks each stage of training solves, how the fusion layer combines semantics with transition patterns, and why it widens the gap over baselines more reliably in sparse and cold-start settings. I'll also preserve the key experimental details and numbers from the paper (e.g., on Amazon Music/Beauty/Pantry datasets, compared to the strongest baseline, HR@20 improves by ~8.68%, NDCG@20 by 10.71%, MRR@20 by 11.75%) to help you evaluate whether this method is worth trying.
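    To picture what "fusing semantics and structure into one space" means mechanically, here is a generic elementwise gated-fusion sketch. To be clear, this is not LLMGR's actual hybrid encoding layer — just the common fusion pattern such layers build on; in practice the gate values would be learned per dimension, not hand-set:

```python
def gated_fusion(text_emb, graph_emb, gate):
    """Blend an LLM text embedding with a GNN node embedding, per dimension.

    gate[i] in [0, 1] decides how much dimension i trusts semantics (text)
    versus structure (graph). For a cold-start item with no edges, a learned
    gate can lean almost entirely on the text side.
    """
    assert len(text_emb) == len(graph_emb) == len(gate)
    return [g * t + (1 - g) * s
            for g, t, s in zip(gate, text_emb, graph_emb)]

# Toy vectors: dimension 0 leans half on text, dimension 1 mostly on graph.
fused = gated_fusion([1.0, 0.0], [0.0, 2.0], [0.5, 0.25])
```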

  • Reinforcement Learning (5): Model-Based RL and World Models

    If Model-Free methods from previous chapters are "learn by doing" — directly optimizing policies or value functions through extensive trial-and-error, then Model-Based methods are "think before doing" — learning environment dynamics models to plan futures in imagination, dramatically improving sample efficiency. Human and animal intelligence heavily relies on internal world models: chess grandmasters simulate dozens of future moves mentally, infants predict object trajectories through physical intuition. DeepMind's AlphaGo plans through Monte Carlo Tree Search in simulated games, OpenAI's Dota 2 agents use environment simulators to "rehearse" strategies during training. The core advantage of Model-Based RL is sample efficiency — in scenarios where real environment interaction is expensive (like robot control, autonomous driving), generating virtual experiences through learned models can achieve Model-Free performance with 1/10 or even 1/100 of the samples. From classic Dyna architecture to World Models combining deep learning, from MuZero's implicit planning to the Dreamer series learning in latent dream spaces, Model-Based RL has demonstrated enormous potential in sample efficiency, generalization, and interpretability. This chapter systematically traces this evolution, deeply analyzing the design motivation, mathematical principles, and implementation details of each algorithm.
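    The classic Dyna architecture mentioned above can be sketched in tabular form: every real transition both updates Q and is stored in a learned model, which then supplies extra "imagined" updates at no interaction cost. The helper names and hyperparameters below are illustrative:

```python
import random

def dyna_q_update(Q, model, s, a, r, s2, actions, alpha=0.1, gamma=0.9,
                  planning_steps=5, rng=random):
    """One real Dyna-Q update plus n planning updates from the learned model.

    Q: dict (state, action) -> value.  model: dict (state, action) ->
    (reward, next_state).  Planning replays remembered transitions, so a
    single real step can drive many value-function updates.
    """
    def q_learn(s, a, r, s2):
        best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))

    q_learn(s, a, r, s2)             # learn from the real transition
    model[(s, a)] = (r, s2)          # remember it in the (deterministic) model
    for _ in range(planning_steps):  # imagined updates from stored transitions
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        q_learn(ps, pa, pr, ps2)

Q, model = {}, {}
dyna_q_update(Q, model, "s0", "a", 1.0, "s1", actions=["a"])
```

    With one stored transition, the real update plus five planning steps apply the same backup six times — the planning loop is where the sample-efficiency gain comes from.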

  • Reinforcement Learning (4): Exploration Strategies and Curiosity-Driven Learning

    One of the central challenges in reinforcement learning is the exploration-exploitation dilemma. An agent that only exploits known good policies may never discover better solutions; but excessive exploration wastes time on low-reward behaviors. Traditional methods like ε-greedy and Boltzmann exploration rely on randomness, which becomes extremely inefficient in high-dimensional state spaces with sparse rewards — imagine a game like Montezuma's Revenge, where the agent needs hundreds of precise steps to receive the first reward; pure random exploration is virtually impossible to succeed. In recent years, inspired by cognitive science theories of "intrinsic motivation," researchers have proposed curiosity-driven learning — rewarding agents with intrinsic rewards for exploring novel states, enabling continuous learning even when external rewards are zero. From count-based methods to ICM's prediction error, from RND's random network distillation to NGU's episodic memory, exploration strategies have evolved into a complete theoretical and engineering framework. This chapter systematically traces this evolution, deeply analyzing the design motivation, mathematical principles, and implementation details of each method, ultimately validating their effectiveness on Atari hard-exploration games.
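    The count-based idea is the easiest to write down: the intrinsic bonus decays as a state's visit count grows, so novelty itself becomes reward. A minimal sketch for tabular/discrete states — the `beta` scale is a hypothetical choice, and high-dimensional observations need pseudo-counts or hashing instead of exact counts:

```python
from math import sqrt

def intrinsic_reward(counts, state, beta=0.1):
    """Count-based exploration bonus: r_int = beta / sqrt(N(s)).

    counts maps state -> visit count and is updated in place. A never-seen
    state earns the full bonus, so the agent keeps moving toward novelty
    even when the external reward is zero everywhere.
    """
    counts[state] = counts.get(state, 0) + 1
    return beta / sqrt(counts[state])

counts = {}
first = intrinsic_reward(counts, "roomA")   # novel state: full bonus
later = intrinsic_reward(counts, "roomA")   # revisit: bonus has decayed
```

    ICM and RND replace the explicit count with a prediction error that plays the same role: large for novel states, shrinking as they become familiar.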