LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation
Chen Kai

Session-based recommendation (SBR) is a "short-history" problem: given a short click sequence in a session (typically 3–20 clicks), predict the next item without relying on a stable long-term user profile. The difficulty is not conceptual but practical: sessions are short, long-tail items are abundant, cold-start is frequent, and relying purely on interaction graphs (IDs + transition edges) often fails to learn stable representations— new items have almost no edges, long-tail items have very sparse edges, and user exploration introduces significant noise.

However, real-world systems often have a wealth of underutilized textual side information (titles, descriptions, attributes, reviews). If this semantic information could be leveraged, it could theoretically alleviate cold-start and long-tail problems: even if a new item has no interactions, it still has a title and description; even if a long-tail item has few interactions, its semantic information is still available. The challenge is that traditional GNN-SBR methods struggle to effectively inject textual semantics into session graph modeling — graph models excel at learning structure, LLMs excel at understanding semantics, but their representation spaces are naturally incompatible, and simply concatenating them often fails to train stably.

LLMGR's core approach is to treat a large language model as a "semantic engine" that converts text into representations alignable with graph nodes; then use a hybrid encoding layer to fuse semantics and graph structure into the same representation space; finally, use a two-stage prompt tuning strategy to first align "node – text" (teaching the model "which description corresponds to which item") and then align "session – behavior patterns" (teaching the model "how to predict next-item intent from session graphs"). This note explains why it is designed this way, what bottlenecks each stage of training solves, how the fusion layer combines semantics with transition patterns, and why it can more stably widen the gap in sparse and cold-start settings. I'll also preserve the key experimental details and numbers from the paper (e.g., on Amazon Music/Beauty/Pantry datasets, compared to the strongest baseline, HR@20 improves by ~8.68%, NDCG@20 by 10.71%, MRR@20 by 11.75%) to help you evaluate whether this method is worth trying.

Paper information

Background: why session recommendation "doesn't learn stably" in sparse/cold-start settings

Session-based recommendation (SBR) systems primarily rely on user interaction sequences to make recommendations. In recent years, graph neural network (GNN)-based methods have become state-of-the-art (SOTA) because they can capture implicit relationships between items and complex transition patterns. However, traditional graph-based recommendation methods mainly rely on user interaction data (clicks, purchases, ratings) while ignoring textual information related to users and items (titles, descriptions, attributes, reviews), which limits their ability to capture implicit context and semantics in user interactions.

In real-world scenarios, the input to session recommendation is typically a very short sequence:

The goal is to predict, or to rank a candidate set. The real bottlenecks are:

  • Short sequences: 3–20 clicks are common, with significant exploration noise, making it hard to extract stable "intent signals" from such short sequences.
  • Abundant long-tail items: Many items have almost no edges (or highly unreliable edges); relying purely on transition graphs makes it difficult to learn meaningful representations.
  • IDs lack semantics: The same "neighbor relationship" could mean items are similar, complementary, or substitutes — transition edges alone cannot distinguish these cases.

In reality, "text" is often the lifeline: even if a new item has no interactions, it still has a title/description/attributes; even if a long-tail item has few interactions, its semantic information is still available. But traditional graph-based SBR methods struggle to leverage text because they are primarily designed for IDs and edges; textual side information is either completely ignored or simply concatenated with a pre-trained BERT representation, which often yields poor results.

Common graph-based SBR: 1-minute recap (to set up LLMGR)

Before diving into LLMGR, let's quickly recap how traditional graph-based session recommendation methods work and where their limitations lie. This will help us understand why LLMGR is designed the way it is.

Graph-based recommendation systems (GRS) use graph neural networks (GNNs) and other graph structure learning algorithms to model relationships between users and items. In recommendation systems, user interaction data is often represented as some form of network or graph. For example, user clicks, purchases, or ratings on items can be viewed as nodes and edges in a graph, where nodes represent users and items, and edges represent interaction behaviors.

The typical graph-based session recommendation pipeline is roughly:

  1. Convert a session's click sequence into a session graph: nodes are items appearing in the session, edges connect consecutively clicked items (directed, possibly weighted).
  2. Use a GNN to perform message passing on the session graph to obtain node representations.
  3. Aggregate into a session representation (via pooling/attention), then perform next-item ranking (via dot product or MLP).

Its shortcoming is also intuitive: it primarily learns from IDs + edges. Once edges are sparse or nodes are cold-start, the representations become unstable.

SR-GNN (Session-based Recommendation with Graph Neural Networks)

SR-GNN is a classic graph neural network (GNN)-based session recommendation method, primarily used to capture user behavior sequences in a session and predict the next item the user might click. Its main idea is to convert item click behaviors in a session into a graph structure and use GNN to learn transition patterns between items.

Core ideas

  • Session graph construction: First, convert a user's click sequence in a session into a directed graph, where nodes represent clicked items and edges represent the click order. This transforms the behavior sequence into a graph structure. The GNN performs information propagation (message passing) on this graph, aggregating information from neighbor nodes to learn each item's representation. The output is an embedding vector for each item node, representing the item's state in the current session and its relationships with other items.

    Although SR-GNN constructs a graph for each session individually, when processing multiple users, the model can learn shared item embeddings across all sessions by learning item transition patterns. These item embeddings not only reflect the local structure of individual sessions but also capture global relationships between items across users and sessions through the model's parameter sharing mechanism. In other words, although graphs are constructed for individual sessions, through model training, all session graphs contribute to the model's global learning — this is a key advantage of graph-based methods and why they work well when there is sufficient interaction data.

  • Session representation generation: After GNN information propagation, SR-GNN obtains an embedding representation for each item (the item's node vector). To summarize the entire session's state, the model needs to generate a session-level representation. Typically, the session representation can be obtained by aggregating all item node embeddings in the current session. Common aggregation methods include average pooling, max pooling, or using the embedding of the last item node (since the last click is often most relevant to the next item).

  • Introducing GRU for temporal dependencies: To compensate for GNN's limitations in capturing long-range dependencies (GNN mainly relies on multi-hop neighbor aggregation; distant dependencies require stacking many layers), SR-GNN introduces GRU (Gated Recurrent Unit). GRU can sequentially update hidden states along the temporal order in a recurrent manner, preserving context information in the time series at each step. This allows GRU to explicitly capture the global temporal order across the entire session, not just local transition patterns.

    By combining GNN (local dependencies in graph structure) and GRU (global temporal dependencies), SR-GNN can more comprehensively model user behavior sequences and improve prediction accuracy. In this step, the model feeds item embeddings in the session (node representations computed by GNN) into GRU as inputs, one by one. The specific steps are:

    • For each item embedding in the session, feed it into GRU in click order.
    • GRU's internal gating mechanisms (update gate and reset gate) automatically decide whether to retain or update the hidden state, capturing temporal dependencies in the sequence.
    • Finally, GRU's hidden state contains the dynamic information of the entire item sequence, i.e., the user's behavior patterns in the session.
  • Predicting the next item: Through GNN, we obtain representations of graph structural relationships between items; through GRU, we capture temporal order information in user click behavior. Finally, SR-GNN combines these two components to predict the most likely next item the user will click in the current session.

    In practice, the model uses the GNN representation of the last item node, combined with GRU's final hidden state, as a comprehensive session representation. This session representation is then passed through a fully connected layer (or MLP) to map into the item space, computing the probability distribution of the user's most likely next click.
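The readout described above can be sketched in a few lines. This is an illustrative toy, not SR-GNN's actual implementation: the GRU cell uses fixed scalar weights instead of learned matrices, and the "fusion" of the last node's vector with the GRU state is simple addition.

```python
# Toy sketch of an SR-GNN-style readout: run a GRU over per-item vectors in
# click order, then combine the final hidden state with the last item's
# vector. Weights are fixed scalars for illustration, not learned parameters.
import math

def _sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, wz=1.0, wr=1.0, wh=1.0):
    """One GRU cell update with toy scalar weights."""
    z = [_sig(wz * (hi + xi)) for hi, xi in zip(h, x)]            # update gate
    r = [_sig(wr * (hi + xi)) for hi, xi in zip(h, x)]            # reset gate
    hh = [math.tanh(wh * (ri * hi + xi)) for ri, hi, xi in zip(r, h, x)]
    return [(1 - zi) * hi + zi * hhi for zi, hi, hhi in zip(z, h, hh)]

def session_repr(item_vecs):
    """GRU over item vectors in click order, fused with the last item's vector."""
    h = [0.0] * len(item_vecs[0])
    for x in item_vecs:
        h = gru_step(h, x)
    # simple additive fusion of temporal (GRU) and structural (last node) views
    return [a + b for a, b in zip(h, item_vecs[-1])]

vecs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # GNN node vectors, click order
s = session_repr(vecs)
```

Because the last click dominates the fused representation, the session vector leans toward the final item's direction, matching the intuition that the last click is most predictive of the next one.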

Advantages

  • SR-GNN can capture local item relationships and transition patterns in user behavior (via GNN).
  • After using GNN, it can effectively capture complex item interaction relationships in sessions (compared to pure sequential models, it can better model nonlinear patterns like "revisits" and "jumps").
  • Combined with GRU, it enhances the ability to process sequential information. It can dynamically update hidden states based on sequence order, capturing changes in user interests throughout the session. For example, a user might be interested in one category of items early in the session but shift to another category later; GRU can dynamically adjust attention to these items during this process. Unlike traditional RNNs, GRU uses gating mechanisms to reduce the vanishing gradient problem, allowing it to better capture temporal dependencies in longer sequences.

Limitations

  • SR-GNN mainly relies on local graph structure and transition patterns; it may underperform in longer or more complex sessions because long-range dependencies require multiple layers of GNN propagation (and stacking too many layers risks over-smoothing, where node representations become too similar and lose distinctiveness).
  • More critically: it almost entirely relies on IDs and edges, with insufficient utilization of textual side information. In cold-start/long-tail scenarios, this shortcoming is especially evident.

GCSAN (Graph Contextualized Self-Attention Network)

GCSAN is another graph neural network-based session recommendation method that improves upon SR-GNN by introducing a self-attention mechanism to capture global contextual information. Its goal is to simultaneously learn both local and global item relationships in a session.

Core ideas

  • Local context capture: Similar to SR-GNN, GCSAN first uses GNN to extract local contextual information in the session, updating each item's representation by aggregating neighbor nodes. Capturing local context allows the model to effectively learn short-range dependencies between items (e.g., "two consecutively clicked items are often related").

  • Introducing self-attention mechanism: To overcome GNN's difficulty in capturing long-range dependencies, GCSAN introduces a self-attention mechanism. Through self-attention, the model can perform weighted computation across all items in the session, identifying items most relevant to the current item, rather than relying solely on neighbor nodes. This allows the model to "skip over" intermediate nodes and directly attend to items at any position in the session.

  • Global context learning: The self-attention mechanism allows the model to assign different importance weights to each item in the session, effectively modeling both global interests and short-term preferences. This way, the model can learn not only short-term local relationships (e.g., "just clicked A, likely to click B") but also capture global information in the session (e.g., "clicked X at the session start, clicked many other things in between, but often returns to items related to X at the end").
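The global weighting idea can be shown with a minimal single-head attention pass. This is a sketch of the mechanism GCSAN builds on, not its actual implementation (no learned query/key/value projections, no scaling):

```python
# Toy self-attention over session items: every item attends to every other
# item via dot-product similarity, so distant items interact directly
# without multi-hop graph propagation.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(vecs):
    """Return attention-weighted item representations (single head, no projections)."""
    out = []
    for q in vecs:
        sims = [sum(a * b for a, b in zip(q, k)) for k in vecs]   # dot-product scores
        w = softmax(sims)                                         # attention weights
        out.append([sum(wi * v[d] for wi, v in zip(w, vecs))
                    for d in range(len(q))])
    return out

# item 0 and item 3 are similar but far apart in click order; attention
# lets item 3 pull information directly from item 0, skipping items 1-2
vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.9, 0.1]]
out = self_attention(vecs)
```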

Advantages

  • Combines GNN and self-attention mechanism, capturing both local context (via GNN's neighbor aggregation) and global dependency information (via self-attention's global weighting).
  • For longer sessions, GCSAN performs better at capturing long-range dependencies (self-attention has $O(n^2)$ complexity in the session length, but for SBR tasks where session length is typically small, this is acceptable).
  • Using self-attention makes the model more flexible in processing complex behavioral sequences without needing to stack many GNN layers.

Limitations

  • Due to the introduction of self-attention, computational complexity is higher (the $O(n^2)$ attention computation), especially when processing large-scale datasets or very long sessions, which may impact training speed and efficiency.
  • The same problem persists: it still mainly relies on ID representations, with very limited utilization of textual semantics.

HCGR (Hyperbolic Contrastive Graph Representation)

HCGR is a recommendation system that performs graph representation learning in non-Euclidean geometric space (hyperbolic space). Traditional recommendation models typically perform data modeling in Euclidean space, but this approach is prone to information distortion in high-dimensional spaces — for example, hierarchical category structures (e.g., "Electronics > Phones > iPhone") are difficult to represent compactly in Euclidean space and require very high dimensions. HCGR attempts to learn in hyperbolic space to more effectively handle complex relationships between users and items.

Core ideas

  • Hyperbolic space representation: HCGR uses hyperbolic space for user and item representation learning. Compared to Euclidean space, hyperbolic space can better represent hierarchical and nonlinear data structures (because hyperbolic space has "negative curvature," allowing tree-like/hierarchical structures to be represented in lower dimensions), enabling more compact capture of relationships between users and items. Hyperbolic space is particularly suitable for modeling complex hierarchical relationships, such as multi-level interactions between users and items (e.g., "user → category preference → specific item").

  • Collaborative graph construction: HCGR learns user preferences by constructing user-item collaborative graphs. Each user and item node is embedded in hyperbolic space, and information is propagated and aggregated through graph neural networks to learn relationships between nodes.

  • Reducing information distortion: In Euclidean space, distances between node representations may not accurately reflect their true relationships (e.g., in high-dimensional space, many points' distances "tend to equalize," losing discriminability). In hyperbolic space, distances between nodes can more compactly represent similarity or dissimilarity between users and items, improving recommendation accuracy.
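The distance behavior described above can be demonstrated with the Poincaré-ball metric, a standard hyperbolic distance that HCGR-style models build on (a sketch; the paper's exact formulation may differ):

```python
# Poincare-ball distance: d(u, v) = arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2))).
# Points near the ball's boundary are far apart even when their Euclidean gap
# is small, which is what lets tree-like hierarchies embed compactly.
import math

def poincare_distance(u, v):
    sq = lambda x: sum(xi * xi for xi in x)
    diff = sq([a - b for a, b in zip(u, v)])
    denom = (1 - sq(u)) * (1 - sq(v))
    return math.acosh(1 + 2 * diff / denom)

# same Euclidean gap (0.1), very different hyperbolic gap:
d_center = poincare_distance([0.0, 0.0], [0.1, 0.0])   # near the origin
d_edge = poincare_distance([0.85, 0.0], [0.95, 0.0])   # near the boundary
```

Near the boundary the same Euclidean step covers a much larger hyperbolic distance, so fine-grained leaf items can spread out there while coarse categories stay near the origin.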

Advantages

  • Using hyperbolic space representation can reduce information distortion problems in high-dimensional data, especially suitable for complex, hierarchical data structures (e.g., category trees, knowledge graphs).
  • Can effectively model user preferences and complex item relationships in non-Euclidean geometric space.
  • HCGR performs well with sparse and high-dimensional data because hyperbolic space can better handle these scenarios (achieving better representation capacity with lower dimensions).

Limitations

  • While introducing hyperbolic space can reduce information distortion, it also increases model complexity and difficulty of understanding. In practical applications, efficiently training and optimizing such models remains a challenge (e.g., gradient descent in hyperbolic space requires special optimizers like Riemannian Adam).
  • Model interpretability is low because user and item representations in hyperbolic space are not intuitively understandable (we have intuition about "distance" in Euclidean space, but "distance" in hyperbolic space is abstract for most people).
  • The same core problem: it also mainly relies on interaction data, with limited utilization of textual information.

Existing problems with graph-based recommendation algorithms (why LLMGR is needed)

Although graph-based recommendation algorithms excel at handling complex user-item interaction relationships, they also face challenges and limitations, especially in sparse/cold-start scenarios:

Long-range dependency problem

When propagating information, GNNs typically can only capture information from neighboring nodes (local context). For long-range dependencies (e.g., correlations between items at the beginning and end of a long session, or patterns like "user clicked A, then clicked B/C/D in between, and finally clicked E which is related to A"), GNN effectiveness is limited. Especially in longer session recommendation tasks, users may have important preference transitions between distant time points, and traditional GNN models need to stack many layers to capture these global behavior patterns (but stacking too many layers risks over-smoothing, where node representations become very similar and lose distinctiveness).

Sparsity problem (the core pain point LLMGR addresses)

Traditional graph-based recommendation systems mainly rely on user interaction data (clicks, purchases, ratings), and interaction data is often highly sparse. In many real-world scenarios, users interact infrequently with items, especially in cold-start scenarios where new users or new items have even sparser data. Since graph neural networks (GNNs) mainly rely on aggregating neighbor nodes to learn node representations, data sparsity makes it difficult for models to effectively capture user preferences and item characteristics.

Specifically:

  • New items (cold-start): Have almost no interaction edges; the representation learned by GNN is essentially random initialization embeddings, very unreliable.
  • Long-tail items: Have few interactions, unstable edge weights; the representation learned by GNN is easily biased by noise.
  • Short sessions: Only 3–5 clicks; the constructed session graph is very sparse; GNN can aggregate very limited neighbor information.

Limited understanding of context (text information is wasted)

Although graph-based recommendation methods can effectively capture structured user-item interaction relationships ("who clicked what, what follows what"), they have limited understanding of textual information related to users or items (e.g., item descriptions, reviews, attribute tags). These models mainly rely on interaction data while ignoring the rich textual information associated with user behavior or items.

In real systems, this textual information is actually very valuable:

  • Two items may never have appeared in the same session (no edge in the graph), but their titles/descriptions are very similar (semantically highly related).
  • A new item may have no interaction data, but its description contains clear category/brand/function information, which are all usable signals.

Traditional graph methods struggle to leverage this text, at most concatenating a pre-trained BERT representation, but this simple concatenation often performs poorly because:

  1. Text representations and graph representations live in different spaces; after concatenation, the model doesn't know how to "align" them.
  2. Pre-trained text encoders (like BERT) are trained on general corpora and may not understand recommendation-specific semantics well (e.g., "iPhone" and "charger" are semantically unrelated in general contexts, but highly complementary in recommendation contexts).

This is the core problem LLMGR aims to solve.

Large language models (LLMs) have demonstrated powerful capabilities in natural language understanding and generation, prompting researchers to explore combining LLMs with GNNs. However, directly converting graph-based session recommendation tasks into natural language tasks faces a mismatch between structured data and natural language. The paper's core challenge is how to represent graph-based SBR tasks as natural language tasks and how to combine LLMs with graph data's graph structure so that both sides' representations can "understand each other."

LLMGR framework: treating LLM as a "semantic engine," not a "recommender"

To address the above challenges, the paper proposes a session recommendation framework combining large language models (LLMs) and graph neural networks (GNNs)— LLMGR (Large Language Model with Graphical Recommendation).

A common pitfall is "letting the LLM directly recommend items." This usually doesn't work because:

  • The candidate set is large (thousands to tens of thousands of items); LLM's token budget is limited.
  • Ranking requires calibrated scores and negative sampling; LLM's generative output is difficult to directly provide usable ranking scores.
  • Online inference cost is too high (running LLM for every request causes latency and cost explosions).

LLMGR's strategy is more pragmatic: treat the LLM as a "semantic module" responsible for extracting textual semantics; leave ranking to models that excel at ranking (GNN + recommendation head).

The framework uses multi-task prompting to combine textual information and graph structure data, and employs a hybrid encoding layer to enhance recommendation effectiveness. Specifically, LLMGR's main contributions include:

Multi-task prompt design: using prompts as "supervision interfaces"

LLMGR designs a series of prompt templates to enable the large language model to understand session graph structure and capture latent preferences in user behavior. Here, prompts are not a UI feature shown to users at deployment time, but rather supervision signal interfaces during training— by designing different tasks, the model is forced to learn correct cross-modal alignments.

These prompts are divided into two main tasks:

Main task: behavior pattern modeling (predicting next-item)

Model user behavior patterns by using prompts to guide the LLM to understand user preferences in a session and predict the next item the user might click. This task is mainly implemented through prompts based on nodes and session graphs. For example (schematic):

Prompt: Given a session graph (and a list of nodes in the session / structured description)
Question: Predict the most suitable next item (for the positive sample in the ranking objective)

The paper provides a concrete example in a schematic diagram (not reproduced here).

The goal of this task is not to have the LLM directly output "the next item is X," but to have it produce a representation that, after fusion and passing through the ranking head, can provide better next-item ranking.

Auxiliary task: node – text alignment (semantic grounding)

Align nodes in the graph with their related textual information, using prompts to help the LLM understand the relationship between item nodes and their textual descriptions. For example (schematic):

Prompt: Below are several node (item) IDs: {v1, v2, v3, ...}
Given an item description/title: "Seagull Pro-G Guitar Stand, Black"
Question: Which node does this text most likely correspond to?

The purpose of this task is to teach the model "which text corresponds to which node/item," thereby anchoring textual semantics to ID representations. In cold-start/long-tail scenarios, this alignment is especially important because even if an item lacks sufficient interaction data, its textual description can still provide a stable semantic anchor.
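Assembling such a prompt is straightforward string templating; the subtlety is that node positions are placeholders to be filled by projected ID embeddings rather than text tokens. The wording below is illustrative, not the paper's exact template:

```python
# Sketch of assembling the auxiliary "node-text alignment" prompt. The
# <node_i> placeholders mark positions where the hybrid encoding layer
# substitutes projected ID embeddings instead of ordinary text tokens.

def build_alignment_prompt(node_ids, description):
    node_slots = ", ".join(f"<node_{i}>" for i in node_ids)
    return (
        f"Below are several item nodes: {node_slots}\n"
        f"Given an item description: \"{description}\"\n"
        f"Question: which node does this description most likely correspond to?"
    )

prompt = build_alignment_prompt([101, 102, 103],
                                "Seagull Pro-G Guitar Stand, Black")
```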

Hybrid encoding layer: bringing ID representations into the same space as LLM

To enable the LLM to effectively process graph structure data, LLMGR designs a hybrid encoding layer. This layer encodes node IDs and graph IDs from the session into vectors of the same dimension as textual information, allowing the LLM to simultaneously process text and graph structure information.

The LLM can naturally process text tokens (via tokenizer + word embedding), but graph models output ID embeddings (dimensions are typically much smaller than LLM's hidden layer dimensions). The key to the hybrid encoding layer is to project ID embeddings through linear mapping to align with text embedding dimensions, then perform fusion.

In the hybrid encoding layer:

  • Node ID embedding transformation: Since node ID embedding dimensions (e.g., 64 or 128) differ from text embedding dimensions (e.g., LLaMA2-7B's hidden layer dimension is 4096), we need to linearly transform node embeddings to project them into dimensions processable by the LLM: $\mathbf{e}'_v = \mathbf{W}\mathbf{e}_v$, where $\mathbf{W} \in \mathbb{R}^{d_t \times d_n}$ is the weight matrix for the linear transformation, projecting node embeddings from their original dimension $d_n$ (e.g., 64) to the same dimension as text embeddings $d_t$ (e.g., 4096).

  • Text embeddings: Textual information (e.g., item titles, descriptions) is converted into embedding representations through the LLM's tokenizer and word embedding layer. Assuming the embedding of text $t$ is $\mathbf{e}_t$, this embedding is input to the LLM together with the projected node ID embeddings.

  • Final input vector: The hybrid encoding layer concatenates text embeddings and node embeddings to generate the LLM's input sequence: $\mathbf{X} = [\mathbf{e}_t; \mathbf{e}'_{v_1}; \mathbf{e}'_{v_2}; \ldots]$. This way, the LLM's input simultaneously contains textual information (semantics) and node information (structure), allowing them to be processed in a unified space.

The overall framework diagram from the paper (not reproduced here):

This diagram shows the LLMGR framework architecture, divided into two parts: the left side is the auxiliary tuning stage (node-text alignment), and the right side is the main tuning stage (behavior pattern modeling). Through these two stages, the model can combine graph structure data and natural language information for more accurate recommendations.

Two-stage prompt tuning strategy: why split into two training steps

To improve model performance, LLMGR adopts a two-stage prompt tuning strategy. This is not arbitrary but designed to avoid a typical optimization trap:

If we jump directly to training the main task (behavior pattern modeling), the model easily falls into a problem where "semantics are not yet aligned and are led astray by behavioral noise." Because session data itself has significant exploration noise (users clicking randomly, misclicks, etc.), if the model doesn't yet know "which text corresponds to which node," it can only rely on IDs and edges, and the result is no different from traditional GNN.

But if we only do alignment (auxiliary task), the model won't learn the structural patterns in sessions that truly determine next-item choices (e.g., "consecutively clicked items are often related," "session endings often return to the opening theme").

So LLMGR splits into two stages:

Stage 1: Auxiliary prompt tuning stage (semantic grounding)

In this stage:

  • Freeze the graph neural network parameters (GNN encoder parameters are frozen), so the model cannot "cheat" by fitting transition patterns to bypass text alignment.
  • Focus on adjusting the hybrid encoding layer and LLM parameters (typically using lightweight adapters like LoRA to avoid the cost of full parameter fine-tuning).
  • Through prompts that align nodes with textual information, learn the association between node IDs and textual information.

The core task of this stage is to teach the model "which text corresponds to which node/item," thereby anchoring textual semantics to ID representations. In cold-start/long-tail scenarios, this alignment is especially important because even if an item lacks sufficient interaction data, its textual description can still provide a stable semantic anchor.

Stage 2: Main prompt tuning stage (behavior pattern alignment)

In this stage:

  • Unfreeze (or partially unfreeze) the graph neural network parameters, allowing the GNN to learn information in the graph structure and adapt to the next-item prediction task.
  • Through behavior pattern modeling prompt tasks, capture user preferences in the session and ultimately predict the next click item.
  • Retain the semantic alignment learned in the first stage (the semantic anchor is not lost) while allowing the model to learn behavior patterns.

In both tuning stages, the loss function remains the same, optimized through cross-entropy: $\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i$, where $y_i$ is the true click label (one-hot) and $\hat{y}_i$ is the model's predicted probability for item $i$.
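With a one-hot target, the cross-entropy objective reduces to the negative log-probability of the true next item. A minimal sketch:

```python
# Cross-entropy with a one-hot target: only the true item's term survives,
# so the loss is -log p(true item).
import math

def cross_entropy(y_true, y_pred):
    """y_true: one-hot list; y_pred: predicted probabilities over items."""
    return -sum(yt * math.log(yp) for yt, yp in zip(y_true, y_pred) if yt > 0)

loss = cross_entropy([0, 1, 0], [0.2, 0.5, 0.3])   # = -log(0.5)
```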

The paper's training schedule provides a very "engineering-oriented" arrangement: auxiliary stage trains for 1 epoch (quickly establishing semantic alignment), main stage trains for about 3 epochs per dataset (original paper setting). The intuition behind this schedule is: semantic alignment is relatively simple (the correspondence between text and nodes is fairly deterministic), not requiring many epochs; but behavior pattern learning is more complex (session data has noise, requiring multiple training rounds to converge).
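The two-stage schedule with its freeze/unfreeze pattern can be written down as a plan. This is an assumed structure for illustration (the component names `hybrid_encoder`, `lora_adapters`, and `gnn` are labels I introduce here, not identifiers from the paper):

```python
# Sketch of the two-stage tuning plan: stage 1 freezes the GNN and trains
# only the hybrid encoding layer / LoRA adapters on the alignment task;
# stage 2 unfreezes the GNN for the main next-item task.

def make_schedule(aux_epochs=1, main_epochs=3):
    plan = []
    for _ in range(aux_epochs):
        plan.append({"task": "node_text_alignment",
                     "trainable": {"hybrid_encoder", "lora_adapters"},
                     "frozen": {"gnn"}})
    for _ in range(main_epochs):
        plan.append({"task": "next_item_prediction",
                     "trainable": {"hybrid_encoder", "lora_adapters", "gnn"},
                     "frozen": set()})
    return plan

plan = make_schedule()   # 1 alignment epoch + 3 main-task epochs
```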

Technical details: GNN message passing and LLM encoding layer (mathematical derivations)

Session graph construction

The foundation of session recommendation tasks is converting user click behavior sequences into graph structures. Assuming a given user click sequence $S = [v_1, v_2, \ldots, v_n]$, we treat each item $v_i$ as a node in the graph, and the click order between items forms edges.

  • Graph representation: The constructed session graph is represented as $G = (V, E)$, where $V$ is the node set and $E$ is the directed edge set. For example, for the sequence $[v_1, v_2, v_3, v_2, v_4]$, the constructed session graph is:
    • Nodes $V = \{v_1, v_2, v_3, v_4\}$ (deduplicated item set)
    • Edges $E = \{(v_1, v_2), (v_2, v_3), (v_3, v_2), (v_2, v_4)\}$ (directed edges representing click order)

Note that an item (here $v_2$) can appear twice in the sequence while being only one node in the graph; both its incoming and outgoing edges are preserved (this is the standard approach in session graph modeling).
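As a concrete sketch of this construction (the sequence below is an illustrative example):

```python
# Build a session graph from a click sequence: nodes are deduplicated items
# (first-seen order preserved), edges are deduplicated consecutive-click pairs.

def session_graph(clicks):
    nodes = list(dict.fromkeys(clicks))                    # dedup, keep order
    edges = list(dict.fromkeys(zip(clicks, clicks[1:])))   # dedup repeated edges
    return nodes, edges

# v2 appears twice but yields one node; its in- and out-edges are all kept
nodes, edges = session_graph(["v1", "v2", "v3", "v2", "v4"])
```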

Information propagation and aggregation (core GNN mechanism)

The core step of graph neural networks is updating each node's representation through a message passing mechanism. We update a node's embedding by aggregating its neighbor nodes. Assuming node $v$'s neighbor node set is $N(v)$ and its layer-$l$ embedding representation is $\mathbf{x}_v^{(l)}$, the information propagation and update steps are:

  • Information aggregation (Aggregator): Aggregate information from node $v$'s neighbor nodes to generate an intermediate state: $\mathbf{t}_v^{(l+1)} = f_{\text{aggregator}}\left( \{\mathbf{x}_u^{(l)} \mid u \in N(v)\} \right)$. Common aggregation functions include mean, sum, max, etc.

  • Node state update (Updater): Use the aggregated neighbor information $\mathbf{t}_v^{(l+1)}$ to update node $v$'s state: $\mathbf{x}_v^{(l+1)} = f_{\text{updater}}\left( \mathbf{x}_v^{(l)}, \mathbf{t}_v^{(l+1)} \right)$. Common update functions include GRU-style updates, simple concatenation followed by MLP, etc.

After $L$ layers of propagation, node $v$'s final embedding representation aggregates information from $L$-hop neighbor nodes (e.g., a 2-layer GNN can aggregate neighbors within 2 hops).
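One propagation layer can be sketched directly from these definitions. Here I pick a mean aggregator and a convex-combination updater for concreteness; the paper's actual aggregator/updater choices may differ:

```python
# One GNN message-passing layer: mean-aggregate each node's in-neighbors,
# then mix the aggregate with the node's previous state.

def propagate_layer(nodes, edges, x):
    """x maps node -> embedding (list of floats); returns layer l+1 embeddings."""
    x_next = {}
    for v in nodes:
        nbrs = [x[u] for (u, w) in edges if w == v]        # in-neighbors N(v)
        if not nbrs:                                       # no messages: keep state
            x_next[v] = x[v]
            continue
        t = [sum(col) / len(nbrs) for col in zip(*nbrs)]   # t_v: mean aggregation
        x_next[v] = [0.5 * a + 0.5 * b                     # updater: 50/50 mix
                     for a, b in zip(x[v], t)]
    return x_next

x0 = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
x1 = propagate_layer(["v1", "v2"], [("v1", "v2")], x0)
```

Stacking this function $L$ times gives each node an $L$-hop receptive field, exactly the behavior described above.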

Graph-level representation generation (Graph Readout)

To obtain a representation of the entire graph (session), we need to aggregate all node embeddings into a graph-level representation. This step is called Graph Readout: $\mathbf{Z} = f_{\text{readout}}\left( \{\mathbf{x}_v^{(l+1)} \mid v \in V\} \right)$. Common aggregation operations include:

  • Mean pooling: Average over all node representations.
  • Max pooling: Take element-wise maximum over all node representations.
  • Attention pooling: Weighted sum based on each node's importance (typically, the last clicked item has the highest weight).
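A minimal sketch of attention pooling, assuming the last-clicked item's embedding serves as the query (a common SBR convention); the raw dot-product scoring and 2-dim vectors are illustrative, and real models add learned projections:

```python
# Sketch: attention-pooling readout weighted toward the last-clicked item.
import math

def attention_readout(node_vecs, last_vec):
    scores = [sum(a * b for a, b in zip(v, last_vec)) for v in node_vecs]
    m = max(scores)                                   # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]               # importance of each node
    dim = len(node_vecs[0])
    return [sum(w * v[d] for w, v in zip(weights, node_vecs)) for d in range(dim)]

z = attention_readout([[1.0, 0.0], [0.0, 1.0]], last_vec=[0.0, 1.0])
# the node matching the last click dominates the session vector
```

Mean and max pooling are the same readout with uniform weights or an element-wise maximum, respectively.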

LLM encoding layer and output layer

Engineering details of the hybrid encoding layer

To enable the LLM to process graph structure data in sessions, LLMGR designs a hybrid encoding layer. This layer combines node IDs, session IDs, and textual information from the graph, encoding these elements into input vectors processable by the LLM.

The key engineering challenge is: Node ID embedding dimensions (e.g., 64) and text embedding dimensions (e.g., 4096) differ — how can they be processed together in the LLM?

The solution is a linear transformation: $\mathbf{e}_v' = \mathbf{W}_p \mathbf{e}_v$, where $\mathbf{W}_p \in \mathbb{R}^{d_t \times d}$ is the weight matrix for the linear transformation, projecting node embeddings from the original dimension $d$ (e.g., 64) to the text-embedding dimension $d_t$ (e.g., 4096).

Then, textual information is converted into the embedding representation $\mathbf{E}_t$ through the LLM's tokenizer and word embedding layer.

Finally, the hybrid encoding layer concatenates (or fuses via gating, attention, etc.) the text embeddings and the projected node embeddings to generate the LLM's input: $\mathbf{H}_0 = [\mathbf{E}_t ; \mathbf{e}_v']$. This way, the LLM's input sequence simultaneously contains text tokens and "projected node representations," allowing them to be processed in a unified representation space.
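A toy sketch of the project-and-concatenate step, with made-up small dimensions (4 → 6) standing in for the paper's 64 → 4096, and placeholder weights in place of a learned matrix:

```python
# Sketch: project a node embedding into the "LLM" dimension, then append it
# to the text-token embedding sequence (all numbers are toy values).
def project(W, e):
    # Matrix-vector product: each output coordinate is a dot product with a row of W.
    return [sum(w_row[j] * e[j] for j in range(len(e))) for w_row in W]

d_node, d_llm = 4, 6
W = [[0.1] * d_node for _ in range(d_llm)]       # placeholder projection weights
node_emb = [1.0, 2.0, 3.0, 4.0]
projected = project(W, node_emb)                 # now lives in the 6-dim space
text_embs = [[0.0] * d_llm, [1.0] * d_llm]       # stand-ins for token embeddings
llm_input = text_embs + [projected]              # unified input sequence for the LLM
```

Only after this projection do node "tokens" and text tokens share a dimension, which is what lets the transformer attend across both.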

LLM output layer

After processing by the LLM, the output is the LLM's final hidden state, denoted $\mathbf{h}$ (typically the last layer's hidden state). To generate recommendation results, we use a multi-layer perceptron (MLP) or a simple linear layer to compute click probabilities for each candidate item:
$$\hat{\mathbf{y}} = \operatorname{softmax}\left( \operatorname{MLP}(\mathbf{h}) \right)$$
where $\operatorname{MLP}(\cdot)$ is the output layer's transformation, generating the click probability distribution $\hat{\mathbf{y}}$ over candidate items.

This output distribution is what we ultimately use to rank candidate items.
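A minimal sketch of the scoring head: dot the final hidden state against per-item vectors, then softmax. The toy sizes and hand-written linear layer are illustrative; a real system scores the full catalog with a learned weight matrix:

```python
# Sketch: linear scoring head + softmax over candidate items.
import math

def score_items(h, item_matrix):
    logits = [sum(a * b for a, b in zip(h, row)) for row in item_matrix]
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]              # click probability per item

probs = score_items([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
# item 0's row aligns best with h, so it gets the highest probability
```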

Experimental design and results: what the paper reports (preserving key numbers)

The paper validates LLMGR's effectiveness through extensive experiments on real-world datasets. I'll preserve the most signal-rich details and numbers from the paper to help you evaluate whether this method is worth trying.

Overall experimental approach

The core of the experiments is to validate LLMGR's effectiveness on different datasets and compare it with other state-of-the-art (SOTA) methods. To better validate LLMGR's performance, the experiments are organized around five core research questions (RQs):

  • RQ1: How does LLMGR perform in session-based recommendation (SBR) scenarios? Can it surpass existing state-of-the-art models?
  • RQ2: What is LLMGR's effectiveness and portability across different models? (i.e., can LLMGR's components be "grafted" onto other baseline models to bring gains?)
  • RQ3: How do each of LLMGR's components (such as auxiliary tasks, graph neural networks, hybrid encoding layer, etc.) separately contribute to overall model performance? (ablation study)
  • RQ4: How does LLMGR handle data sparsity problems, especially in cold-start scenarios, and how does it perform? (This is the most critical question because LLMGR's selling point is alleviating sparsity)
  • RQ5: Can LLMGR provide reasonable explanations for predicting user preferences, thereby improving recommendation effectiveness? (interpretability)

To answer these questions, the experiments set up different datasets, comparison methods, and evaluation metrics to validate LLMGR's performance from multiple angles.

Dataset selection

The experiments use three real-world public datasets, all from Amazon:

  • Music (music-related product dataset): This dataset contains user interactions with music-related products (such as instruments, CDs, audio equipment, etc.), including purchases, clicks, etc.
  • Beauty (beauty product dataset): This dataset records user interactive behaviors in beauty product categories (such as cosmetics, skincare products, beauty tools, etc.).
  • Pantry (household essentials dataset): This dataset contains user purchase behaviors and browsing records related to household daily necessities (such as food, cleaning supplies, etc.).

Why choose these datasets?

  • These datasets cover different types of user behavior patterns and item categories, facilitating testing of LLMGR's adaptability and generalization ability (user behavior patterns for music, beauty, and daily necessities differ significantly).
  • The datasets have high sparsity, especially in cold-start scenarios, suitable for validating how LLMGR handles data sparsity problems. Amazon datasets contain many long-tail items (with only a few interactions), making them ideal scenarios for testing whether textual semantics are useful.
  • These datasets all contain rich textual side information (item titles, descriptions, categories, brands, etc.), suitable for validating LLMGR's ability to leverage textual information.

Data preprocessing

  • To ensure data quality, following convention, users and items with fewer than 5 interactions are filtered out, ensuring the model has sufficient data during training (avoiding training instability caused by extreme sparsity).
  • Data splitting uses "leave-one-out": for each user's interaction sequence, the last item is used for testing, the second-to-last for validation, and the rest for training. This is the standard splitting method for session recommendation tasks.

Comparison methods (baseline models)

To validate LLMGR's effectiveness, the experiments compare it with multiple state-of-the-art baseline methods. These methods perform well in session recommendation tasks and have different principles (Markov chain-based, RNN-based, GNN-based, attention-based), facilitating comprehensive comparison of LLMGR's performance.

Baseline model list:

  • FPMC (Factorized Personalized Markov Chain): Classic Markov chain-based recommendation method that predicts the next likely click by considering the user's most recent interaction. It combines matrix factorization techniques to learn users' long-term preferences and short-term interests.
  • CASER (Convolutional Sequence Embedding Recommendation): A convolutional neural network (CNN)-based recommendation method using horizontal and vertical convolution operations to capture high-order interaction relationships in user behavior sequences (e.g., joint patterns like "after clicking A and B, usually clicks C").
  • GRU4Rec (Gated Recurrent Unit for Recommender Systems): Recurrent neural network (RNN)-based session recommendation method stacking multiple GRU layers to learn user preferences through sequence modeling. This is a classic baseline in session recommendation.
  • NARM (Neural Attentive Session-based Recommendation): A model combining attention mechanisms and RNN, effectively capturing short-term behavior patterns and long-term interests in sessions (via attention mechanism focusing on the most relevant items in the session).
  • STAMP (Short-Term Attention Priority Model): Attention mechanism-based model capturing users' short-term interests (e.g., recently clicked items), extracting users' current interests from historical clicks.
  • SRGNN (Session-based Recommendation with Graph Neural Networks): Graph neural network (GNN)-based model converting click behaviors in sessions into graph structures, using a GNN to learn transition patterns between items (the SR-GNN introduced earlier).
  • GCSAN (Graph Contextualized Self-Attention Network): Combines GNN and self-attention mechanisms to extract both local contextual information and global semantic information in sessions (introduced earlier).
  • NISER (Normalized Item and Session Representations): A GNN-based method that alleviates popularity bias (the model's tendency to always recommend popular items) by normalizing item and session-graph representations.
  • HCGR (Hyperbolic Collaborative Graph Representation): GNN method in non-Euclidean geometric space (hyperbolic space), using hyperbolic space to reduce data distortion in high-dimensional spaces, particularly suitable for recommendation scenarios with power-law distributions (introduced earlier).

These baselines cover methods from traditional approaches (FPMC) to deep learning methods (RNN, CNN, GNN, Attention), providing a fairly comprehensive comparison experimental setup.

Evaluation metrics

In the LLMGR framework, the model's goal is to predict the user's next click behavior in the session. The final prediction result is a click probability distribution over each candidate item. To evaluate the model's recommendation quality, we use several common evaluation metrics (all standard metrics for session recommendation/next-item prediction tasks):

HitRate@K (Hit Rate)

HitRate@K is a commonly used metric to evaluate whether the recommendation system's top K recommendations contain an item the user is actually interested in.

  • Calculation method: If the model's top K recommended items contain the item the user actually clicked, we call it a "hit."

  • Formula:
    $$\text{HitRate@}K = \frac{1}{|S|} \sum_{s \in S} \mathbb{1}\left( v_s \in R_s^K \right)$$
    where:

    - $S$ is the session set in the test set;
    - $v_s$ is the true item in session $s$ (ground truth);
    - $R_s^K$ is the model's top-$K$ recommended list;
    - $\mathbb{1}(\cdot)$ is the indicator function, taking value 1 if the condition is true, 0 otherwise.

HitRate@K measures whether at least one item the user is actually interested in is hit within the top K recommendations. Its value ranges from 0 to 1; the higher the value, the higher the recommendation system's hit rate. This is a "binary" metric (either hit or miss), not concerned with ranking position.

NDCG@K (Normalized Discounted Cumulative Gain)

NDCG (Normalized Discounted Cumulative Gain)@K is a ranking-based metric designed to measure the ranking quality of correct items in the recommendation list. It not only considers whether correct items are recommended but also where these items appear in the recommendation list (ranked higher = higher score).

  • Calculation method: NDCG calculates cumulative gain (CG) and adjusts it based on the item's ranking position (discount). The intuition is: a correct item ranked 1st should score higher than one ranked 10th.

  • Formula:
    $$\text{NDCG@}K = \frac{1}{Z} \sum_{i=1}^{K} \frac{\mathbb{1}(r_i = v_s)}{\log_2(i+1)}$$
    where:

    - $K$ is the recommendation list length;
    - $i$ is the position of a recommended item in the list (1-indexed);
    - $\mathbb{1}(r_i = v_s)$ is the indicator function, indicating whether the item $r_i$ at position $i$ is the true target item;
    - $Z$ is the normalization factor (the ideal DCG), limiting NDCG values to the range 0 to 1 (when the true item is ranked 1st, NDCG is maximal).

NDCG places more emphasis on the order of recommended items: if the model ranks the correct item higher, its score is higher; if it ranks lower (e.g., positions 19, 20), the score decreases due to the $\log_2(i+1)$ discount. Higher NDCG@K indicates better ranking quality of the recommendation results.

MRR@K (Mean Reciprocal Rank)

MRR (Mean Reciprocal Rank)@K is a metric measuring recommendation system accuracy and ranking. It focuses on the ranking of the first correctly recommended item in the list (i.e., "how many items does the user need to scroll through to see the correct item").

  • Calculation method: MRR is the average reciprocal rank of the first correctly recommended item.

  • Formula:
    $$\text{MRR@}K = \frac{1}{|S|} \sum_{s \in S} \frac{1}{\text{rank}_s}$$
    where:

    - $|S|$ is the number of sessions;
    - $\text{rank}_s$ is the position of the first correct item in session $s$'s recommendation list (if no correct item appears in the top $K$, the term is defined as 0).

MRR measures the average ranking position of the first correct item. For example, if the correct item is ranked 2nd on average, MRR ≈0.5; if ranked 5th on average, MRR ≈0.2. Higher values indicate the model ranks correct items higher, improving user experience.
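The three metrics above can be computed together in the single-target next-item setting; a sketch assuming one ground-truth item per session and a ranked candidate list per session:

```python
# Sketch: HitRate@K, NDCG@K, and MRR@K for single-target next-item prediction.
import math

def metrics_at_k(ranked_lists, targets, k):
    hits, ndcg, mrr = 0.0, 0.0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            rank = top_k.index(target) + 1            # 1-indexed position
            hits += 1.0
            ndcg += 1.0 / math.log2(rank + 1)         # single-target ideal DCG = 1
            mrr += 1.0 / rank
        # a miss contributes 0 to every metric
    n = len(targets)
    return hits / n, ndcg / n, mrr / n

hr, ndcg, mrr = metrics_at_k([[3, 1, 2], [5, 4, 6]], targets=[1, 9], k=3)
# target 1 ranks 2nd in the first session; target 9 is missed entirely
```

With a single ground-truth item, the ideal DCG is 1, so the per-session NDCG reduces to $1/\log_2(\text{rank}+1)$.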

Parameter configuration

To ensure experimental fairness, all baseline models and the LLMGR model use the same hyperparameter settings (or are tuned within the same search space):

General settings for baseline models:

  • Mini-batch size: 1024;
  • Dropout rate: 0.3, to prevent overfitting;
  • Learning rate: tuned by grid search;
  • Embedding dimension: 64 (this is the ID embedding dimension);
  • Maximum sequence length: 50 (sessions longer than 50 clicks are truncated).

Model training uses the Adam optimizer. For GNN models (e.g., SRGNN, GCSAN, etc.), the number of GNN aggregation layers is tuned (from 1 to 5 layers) to find the optimal configuration.

LLMGR implementation details (this part is very important for reproducibility):

  • Base LLM: LLMGR is based on LLaMA2-7B (original paper setting) and developed using the HuggingFace library.
  • Model acceleration: Uses DeepSpeed technology, trained on 2 Nvidia Tesla A100 GPUs (this is the paper's hardware setup, indicating training cost is relatively controllable without needing dozens or hundreds of GPUs).
  • ID embedding source: ID embeddings (item embeddings) are directly extracted from a pre-trained GCSAN model and not modified in the experiments (this is an engineering trick to avoid training ID embeddings from scratch and accelerate convergence).
  • Optimizer: Uses the AdamW optimizer to optimize the LLMGR model, with the learning rate tuned by grid search (note that this search space is smaller than the baselines' because LLM fine-tuning typically uses smaller learning rates); the batch size is 16 (limited by the LLM's GPU memory footprint, batch sizes are typically smaller than for traditional models).
  • Learning rate scheduling: Uses cosine scheduler to adjust learning rate (warm-up + cosine decay), with weight decay set to 1e-2 (to prevent overfitting).
  • Training epochs: In the auxiliary task tuning stage, the model trains for 1 epoch (quickly establishing semantic alignment); in the main task tuning stage, the model trains for 3 epochs per dataset (original paper setting).

These details indicate that while LLMGR's training cost is higher than pure GNN models (after all, it uses a 7B LLM), through the two-stage strategy and lightweight tuning (LoRA, small batch size, few epochs), training time remains controllable.
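The warm-up + cosine decay schedule mentioned in the training details can be sketched as follows; the step counts and base learning rate here are illustrative, not the paper's values:

```python
# Sketch: linear warm-up followed by cosine decay of the learning rate.
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    if step < warmup_steps:                           # linear warm-up phase
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

lrs = [cosine_lr(s, total_steps=100, warmup_steps=10, base_lr=1e-4) for s in range(100)]
# rises during warm-up, peaks at base_lr, then decays smoothly toward 0
```

In practice this is what schedulers like `get_cosine_schedule_with_warmup` in HuggingFace `transformers` implement for you.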

Experimental results analysis

RQ1: LLMGR's performance in session recommendation tasks (comparison with SOTA methods)

Experimental results show that compared to existing baseline models (such as GRU4Rec, STAMP, SRGNN, GCSAN, etc.), LLMGR performs better across all metrics (HitRate@K, NDCG@K, MRR@K), especially at high K values (e.g., K=20), where LLMGR's ranking ability is stronger.

Compared to the most competitive baseline model (typically GCSAN or HCGR), LLMGR shows significant improvements on the following metrics:

  • HR@20: Improved by about 8.68% (relative improvement, meaning if baseline is 0.50, LLMGR is 0.5434);
  • NDCG@20: Improved by 10.71%;
  • MRR@20: Improved by 11.75%.

This indicates that the LLMGR model can not only accurately predict the next item the user might click (improved hit rate) but also better rank the recommendation list (NDCG and MRR improvements are more significant), placing correct items in higher positions.

This improvement margin is quite significant in recommendation systems, especially when there are already many strong baselines (GCSAN, HCGR); still being able to widen the gap by nearly 10% indicates that introducing textual semantics is indeed valuable.

RQ2: LLMGR's effectiveness and portability (can it be "grafted" onto other models?)

To validate LLMGR's portability, the experiments apply it to other baseline models to observe performance improvements. The specific approach is: graft LLMGR's "semantic module" (LLM + hybrid encoding layer + multi-task tuning) onto other baseline models (such as GRU4Rec, STAMP) to see if it brings gains.

The experiments show:

  • LLMGR improves performance on all tested models (such as GCSAN, GRU4Rec, STAMP). This indicates LLMGR's design is modular and can serve as a "plugin" to enhance other models.
  • For simpler baseline models (such as GRU4Rec, STAMP, which don't use GNN themselves), LLMGR provides significant performance improvements, indicating that even simpler models, when combined with LLMGR's semantic module, can surpass many SOTA session recommendation methods.

Across different models, LLMGR averaged improvements of about 8.58% (Music dataset) and 17.09% (Beauty dataset). This indicates LLMGR has good portability and can be applied to multiple models to enhance their performance.

This result has high practical value: if you already have a reasonably performing baseline model, you can try grafting LLMGR's semantic module onto it without redesigning the entire system from scratch.

RQ3: Contribution analysis of LLMGR components (ablation study)

To analyze the contribution of each component in LLMGR, ablation studies remove auxiliary tasks (such as node-text alignment tasks) and train only the main task. Results show:

  • After removing auxiliary tasks, model performance significantly declines across multiple metrics, especially NDCG and MRR (these two metrics focus more on ranking quality), indicating that node-text alignment tasks play a key role in improving model ranking ability.
  • In the Music dataset, removing auxiliary tasks caused HitRate@20 to drop by 2.04%; in the Beauty dataset, the drop was larger, with NDCG@20 decreasing by 4.16%.

This ablation study validates the necessity of the two-stage strategy: if we skip the first stage (semantic alignment) and directly train the main task, model performance noticeably worsens, especially in ranking quality (NDCG/MRR decline more significantly). This indicates semantic alignment genuinely helps the model learn better representations, not just "training on more data."

RQ4: Cold-start analysis (LLMGR's core selling point)

To validate LLMGR's performance in cold-start scenarios (this is LLMGR's most important selling point), the experiments partition datasets into "warm-start" and "cold-start" scenarios:

  • Warm-start: User-item interaction data is abundant (e.g., items have 50+ interactions); the system can learn sufficient preference information from this data.
  • Cold-start: User-item interaction data is very sparse (e.g., items have only 5–10 interactions); traditional recommendation systems struggle to learn stable representations from such data.

The experiments show:

  • In cold-start scenarios, LLMGR's performance is significantly better than traditional baseline models, effectively handling data sparsity problems.
  • Compared to warm-start scenarios, LLMGR's performance improvement in cold-start scenarios is even more significant (larger relative improvement margin), mainly due to LLM's language understanding and knowledge transfer capabilities when processing limited data (even if items have few interactions, textual descriptions can still provide stable semantic signals).

This is LLMGR's most convincing result: its gains primarily come from solving the shortcomings of traditional methods (sparsity/cold-start), rather than being "icing on the cake" in warm-start scenarios. This indicates the method's design genuinely targets real pain points.

RQ5: Interpretability analysis

The paper also provides qualitative analysis, showcasing cases of LLMGR's predictions. For example, by examining the model's alignment results on auxiliary tasks, we can see that the model indeed learned to "align similar textual descriptions to similar nodes," and on the main task, the model can provide reasonable next-item predictions based on session graph structure and textual semantics.

This analysis is mainly qualitative (case demonstrations), but it helps understand "why the model works"— not black-box improvement of metrics, but genuine learning of alignment between text and graph structure.

Engineering perspective: how to deploy it in production

The paper demonstrates LLMGR's effectiveness in offline evaluation, but if you actually want to use it in a production system, you need to consider some engineering issues:

Don't make prompts an online dependency

In LLMGR, prompts are mainly supervision signal interfaces during training, used to force the model to learn correct cross-modal alignments. At deployment time, you don't need to run extensive prompt inference for every request— that would cause cost and latency explosions.

A more realistic deployment approach is:

  • Offline pre-computation: For all item text descriptions, run LLM once offline to obtain text representations (text embeddings) and store them.
  • Online lightweight fusion: When a user makes a request, only use a lightweight graph encoder (GNN/sequence model) to process the session, then fuse with pre-computed text representations, and finally perform ranking.
  • LoRA or distillation: If online LLM inference is truly necessary, use LoRA adapters (smaller parameter count) or distill into a smaller model (e.g., distill into BERT-base) to reduce inference cost.
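A toy sketch of the offline-precompute / online-fusion split, with a plain dict standing in for a vector store, `embed_fn` as a placeholder for one offline LLM pass per item, and a dot-product fusion that is purely illustrative:

```python
# Sketch: precompute item text embeddings offline, fuse cheaply online.
def precompute_text_embeddings(item_texts, embed_fn):
    # Run the (expensive) text encoder once per item, offline.
    return {item: embed_fn(text) for item, text in item_texts.items()}

def online_score(session_vec, candidate_items, text_store, alpha=0.5):
    # Cheap online step: fuse the session vector with cached text embeddings.
    scores = {}
    for item in candidate_items:
        text_vec = text_store[item]
        scores[item] = alpha * sum(a * b for a, b in zip(session_vec, text_vec))
    return max(scores, key=scores.get)                # top-1 recommendation

store = precompute_text_embeddings(
    {"guitar": "acoustic guitar", "strings": "nylon strings"},
    embed_fn=lambda t: [float(len(t)), 1.0],          # toy stand-in embedding
)
best = online_score([1.0, 0.0], ["guitar", "strings"], store)
```

The point of the split is that only `online_score` sits on the request path; the LLM never does.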

Text cleaning is crucial

Marketing jargon, repetitive templates, and meaningless modifiers (e.g., "Best Choice! Top Quality! Limited Offer!") make semantics "look similar," actually harming ranking. Before using LLMGR, it's best to do text cleaning:

  • Remove marketing jargon and HTML tags.
  • Extract structured information (brand, category, key attributes).
  • If descriptions are too long, truncate or summarize (LLM's token budget is limited).
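A minimal cleaning pass along these lines; the regexes and the marketing-phrase list are illustrative, not exhaustive:

```python
# Sketch: strip HTML, drop marketing boilerplate, normalize, and truncate.
import re

MARKETING = re.compile(r"(best choice|top quality|limited offer)[!.]*", re.I)

def clean_item_text(text, max_chars=200):
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = MARKETING.sub(" ", text)               # drop marketing boilerplate
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
    return text[:max_chars]                       # crude truncation for token budget

cleaned = clean_item_text("<b>Best Choice!</b> Acoustic Guitar, 41-inch, spruce top")
# -> "Acoustic Guitar, 41-inch, spruce top"
```

For long descriptions, an LLM-generated summary is a better (but costlier) alternative to plain truncation.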

Focus on long-tail/cold-start slices

Improvement in overall metrics doesn't necessarily mean you've solved the sparsity problem — it's possible that gains only came from head items while long-tail items still underperform. It's best to do stratified analysis:

  • Bucket items by interaction count (e.g., <10 times, 10–50 times, 50+ times).
  • Calculate metrics for each bucket separately to see where LLMGR's gains mainly come from.
  • If gains mainly come from head items, it means textual semantics haven't truly solved the sparsity problem; if gains mainly come from long-tail/cold-start buckets, that's genuinely valuable.
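A sketch of the stratified analysis: bucket edges match the example above, while the record format (target item plus hit/miss flag) is hypothetical:

```python
# Sketch: per-bucket hit rate, stratified by item interaction count.
def bucket_of(count):
    if count < 10:
        return "<10"
    if count < 50:
        return "10-50"
    return "50+"

def stratified_hit_rate(records, item_counts):
    """records: list of (target_item, hit_bool); item_counts: {item: #interactions}."""
    totals, hits = {}, {}
    for item, hit in records:
        b = bucket_of(item_counts[item])
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + (1 if hit else 0)
    return {b: hits[b] / totals[b] for b in totals}

counts = {"a": 3, "b": 120, "c": 20}
result = stratified_hit_rate(
    [("a", True), ("b", True), ("b", False), ("c", False)], counts
)
# e.g. {"<10": 1.0, "50+": 0.5, "10-50": 0.0}
```

If the `<10` bucket's metric barely moves when you switch to LLMGR, the semantic signal is not reaching the long tail.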

Computational cost and efficiency tradeoffs

LLMGR uses a 7B LLM, so training cost is definitely higher than pure GNN models. The paper used 2 A100 GPUs; training time is not explicitly stated, but based on experience, the two stages combined likely take several hours to a day (depending on dataset size).

If your system is cost-sensitive, consider:

  • Using a smaller language model (e.g., a ~1B-parameter LLM or BERT-base).
  • Only using LLMGR for cold-start/long-tail items; keep using traditional GNN for head items (hybrid deployment).
  • Periodically updating text representations offline; only doing lightweight fusion online.

Q&A: questions you might ask in practical applications

Why not let the LLM directly "generate the next item"?

Because session recommendation is fundamentally a large-scale ranking problem: a large candidate set (thousands to tens of thousands of items), a need for negative sampling and calibrated scores, and high sensitivity to latency and cost. An LLM's generative output does not directly yield usable, calibrated ranking scores (you can't simply have it emit "item A: 0.87, item B: 0.43, ..." reliably over the whole catalog).

LLMGR's strategy is to let the LLM handle semantics and leave ranking to models that excel at ranking (GNN + MLP head). This leverages LLM's semantic understanding capabilities while maintaining the recommendation system's efficient ranking framework.

Is this equivalent to "BERT embedding + GNN"?

This is a great question and one of the key ablation experiments you should run. Simple "BERT embedding + GNN" (i.e., using pre-trained BERT to encode item text, then concatenating it with GNN's node representations) is indeed a very natural baseline.

LLMGR's claim is: through multi-task prompts + staged alignment, textual semantics can "anchor to nodes more effectively and remain more stable under sparsity," rather than "any text encoder can automatically solve cold-start." Specifically:

  • Auxiliary tasks (node-text alignment) force the model to learn "which text corresponds to which node," which simple BERT embedding concatenation cannot achieve.
  • Two-stage training separates semantic alignment and behavior learning optimization, avoiding mutual interference.
  • Large LLMs like LLaMA2 have stronger semantic understanding (especially zero-shot/few-shot transfer capabilities) than BERT, potentially offering advantages in long-tail/cold-start scenarios.

But this requires experimental validation — if simple "BERT + GNN" can achieve similar results, then LLMGR's cost may not be worth it.

Is the two-stage approach really necessary? Can we do it in one step?

The paper's ablation study (RQ3) has partially answered this question: removing auxiliary tasks leads to noticeable performance decline (especially NDCG/MRR).

From a methodological perspective, the two-stage approach solves a typical optimization trap:

  • When semantics are not aligned, the main task is driven by behavioral noise (the model can only rely on IDs and edges, resulting in no difference from traditional GNN).
  • After semantics are aligned, learning behavior patterns becomes more stable, and it's easier to deliver gains in long-tail/cold-start slices.

However, other training strategies may exist (e.g., joint training + weighted loss), which warrant further exploration.

Does deployment require running LLM for every request?

Not necessarily. A reasonable deployment pattern is:

  • Offline pre-computation: For all item text descriptions, run LLM once offline (or LoRA-finetuned LLM) to obtain text representations, stored in a vector database.
  • Online lightweight computation: When a user makes a request, use a lightweight graph encoder (GNN/sequence model) to process the session graph (this part is fast), then retrieve pre-computed item text representations, perform fusion and ranking (this part is also fast).

This way, online latency is mainly GNN + fusion + ranking, without needing to run LLM inference every time.

What scenarios is LLMGR suitable for? What scenarios is it not suitable for?

Suitable scenarios:

  • Rich item textual side information (titles, descriptions, attributes, reviews).
  • Serious sparsity/cold-start problems (many long-tail items, frequent new item launches).
  • Want to maintain existing recommendation framework (GNN/SBR), just want to inject semantic signals.

Unsuitable scenarios:

  • Very little or poor-quality item text information (e.g., only simple SKU numbers).
  • Very abundant interaction data, cold-start is not a problem (in this case, traditional GNN may suffice; adding LLMGR provides limited gains).
  • Cost-sensitive and cannot accept LLM's training/inference overhead (even with lightweight strategies).

References

  • Post title: LLMGR: Integrating Large Language Models with Graphical Session-Based Recommendation
  • Post author: Chen Kai
  • Create time: 2024-09-02 00:00:00
  • Post link: https://www.chenk.top/en/llmgr/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.