Finding the code behind a paper is often the most frustrating part of reproducing results: links are missing, names drift, and keyword search is noisy. paper2repo frames this as a cross-platform recommendation problem — matching academic papers to relevant GitHub repositories by aligning them in a shared embedding space. It combines text encoders with graph-based signals (e.g., citation/context relations and repository-side structure) via a constrained GCN to learn comparable representations and rank candidate repos. This note summarizes the motivation, how the joint graph is built, what the “constrained” alignment is doing, and which components seem to drive improvements in Hit@K / MAP / MRR.
Background and motivation
What makes GitHub repositories “recommendable”
GitHub is a social code hosting platform where users publish and share open-source projects. A repository typically exposes:
- Description: a short summary of what the repo is about.
- Tags/topics: keywords describing the repo's theme or functionality.
- Stars: implicit feedback that reflects interest/popularity.
Challenges in cross-platform matching
- Heterogeneous data: papers and repositories have different formats and signals.
- Missing explicit links: many papers do not provide a reliable code URL.
- Representation alignment: you need comparable embeddings for papers and repos to compute similarity.
Model overview
paper2repo is a joint model with two main components:
- Text encoder: encodes paper abstracts and repo descriptions/tags into content features.
- Constrained GCN: runs GCNs on the paper graph and the repo graph, and aligns the two embedding spaces via a constraint on “bridged” paper–repo pairs.
Context graph construction
Paper citation graph
- Nodes: papers.
- Edges: citation links (treated as an undirected graph in the paper).
- Node features: text-encoded abstract vectors.
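As a minimal sketch, the undirected citation graph above can be represented as a symmetric adjacency matrix (the function name and toy edge list are my own, for illustration):

```python
import numpy as np

def citation_adjacency(num_papers, citations):
    """Build a symmetric adjacency matrix from (citing, cited) pairs.

    Citation links are treated as undirected edges, matching the setup above.
    """
    A = np.zeros((num_papers, num_papers), dtype=np.float32)
    for citing, cited in citations:
        A[citing, cited] = 1.0
        A[cited, citing] = 1.0  # symmetrize: undirected graph
    return A

# Toy example: paper 0 cites papers 1 and 2, paper 1 cites paper 2.
A = citation_adjacency(3, [(0, 1), (0, 2), (1, 2)])
```

The node-feature matrix (text-encoded abstracts) is kept separate and fed to the GCN alongside this adjacency.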
Repository association graph
Because repositories do not have citations, the repo graph is built from implicit associations such as:
- Co-starring: connect repos that are starred by the same user.
- Tag overlap: connect repos that share high-TF-IDF tags (e.g., weight above 0.3).
- Node features: text-encoded description and tag vectors.
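The two association rules above can be sketched as follows (a simplified illustration; the function name, input shapes, and the exact overlap test are my own, and the 0.3 threshold is the illustrative value mentioned above):

```python
import numpy as np
from itertools import combinations

def repo_association_adjacency(num_repos, stars_by_user, tfidf_tags,
                               tag_threshold=0.3):
    """Connect repos that share a stargazer, or share a tag whose
    TF-IDF weight exceeds `tag_threshold` in both repos.

    stars_by_user: dict user -> set of repo ids that user starred.
    tfidf_tags: per-repo dict mapping tag -> TF-IDF weight.
    """
    A = np.zeros((num_repos, num_repos), dtype=np.float32)
    # Co-starring edges: any two repos starred by the same user.
    for repos in stars_by_user.values():
        for i, j in combinations(sorted(repos), 2):
            A[i, j] = A[j, i] = 1.0
    # Tag-overlap edges: a shared tag with high TF-IDF weight on both sides.
    for i, j in combinations(range(num_repos), 2):
        shared = set(tfidf_tags[i]) & set(tfidf_tags[j])
        if any(tfidf_tags[i][t] > tag_threshold
               and tfidf_tags[j][t] > tag_threshold for t in shared):
            A[i, j] = A[j, i] = 1.0
    return A
```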
Text encoding
Encoding repository text
Word embeddings:
- Tokenize the repo description/tags into a sequence $w_1, w_2, \dots, w_n$.
- Map each token to a $d$-dimensional pretrained vector (e.g., GloVe), $\mathbf{x}_i \in \mathbb{R}^d$.
Convolution layer:
Apply 1D convolutions with multiple window sizes $h$, using kernels $\mathbf{W} \in \mathbb{R}^{h \times d}$. Convolution:

$$c_i = f\left(\mathbf{W} \cdot \mathbf{x}_{i:i+h-1} + b\right)$$

- $\mathbf{x}_{i:i+h-1}$ denotes the token vectors from position $i$ to $i+h-1$.
- $b$ is a bias term and $f$ is a nonlinearity (e.g., ReLU).
Max pooling:
- Apply max-over-time pooling to obtain a fixed-size feature vector.
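The convolution-plus-pooling steps above can be sketched in NumPy (a minimal single-example illustration, not the paper's implementation; kernel shapes and names are my own):

```python
import numpy as np

def text_cnn_encode(X, kernels, biases):
    """CNN text encoder: multi-window 1D convolutions over token
    embeddings, ReLU, then max-over-time pooling.

    X: (n, d) matrix of token embeddings (e.g., pretrained GloVe vectors).
    kernels: list of (h, d, k) arrays, one per window size h.
    biases: list of (k,) arrays, one per kernel.
    Returns the concatenated pooled features.
    """
    pooled = []
    for W, b in zip(kernels, biases):
        h, _, _ = W.shape
        n = X.shape[0]
        # c_i = ReLU(W · x_{i:i+h-1} + b) for each window position i
        feats = np.stack([
            np.maximum(np.einsum('hd,hdk->k', X[i:i + h], W) + b, 0.0)
            for i in range(n - h + 1)
        ])
        pooled.append(feats.max(axis=0))  # max-over-time pooling
    return np.concatenate(pooled)
```

Using several window sizes (e.g., $h = 2, 3$) lets the encoder pick up phrases of different lengths before pooling collapses the sequence to a fixed-size vector.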
Tag encoding:
- Since tags are unordered, average the tag word vectors to get a tag representation.
- Use a fully-connected layer to project it to the same dimensionality as the description features.
Feature fusion:
- Fuse description and tag features (e.g., sum or concatenate) to obtain the repo representation $\mathbf{v}_r$.
Encoding paper text
- Encode paper abstracts with the same CNN encoder to obtain paper embeddings $\mathbf{v}_p$.
Constrained GCN
GCN recap
GCN performs neighborhood aggregation on graphs, combining node features with graph structure to learn node embeddings.
The layer-wise propagation is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

- $\tilde{A} = A + I$ is the adjacency matrix with self-loops, and $\tilde{D}$ is its degree matrix.
- $H^{(l)}$ is the node feature matrix at layer $l$ (with $H^{(0)}$ the input text features), $W^{(l)}$ is a trainable weight matrix, and $\sigma$ is a nonlinearity.
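A single propagation step can be written out directly in NumPy (dense matrices, one layer; purely illustrative, assuming ReLU as the nonlinearity):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Kipf & Welling):
    H' = ReLU(D̃^{-1/2} Ã D̃^{-1/2} H W), where Ã = A + I adds self-loops
    and D̃ is the degree matrix of Ã."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))        # symmetric normalization
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return np.maximum(A_hat @ H @ W, 0.0)          # aggregate, transform, ReLU
```

In paper2repo this layer runs separately on the citation graph (paper features) and on the association graph (repo features).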
Why “constrained”?
Because papers and repositories are embedded on different graphs, naive training yields embeddings in different spaces. paper2repo introduces an alignment constraint that pulls “bridged” paper–repo pairs closer.
Alignment constraint
For each bridged paper embedding $\mathbf{p}_i$ and its matched repo embedding $\mathbf{r}_i$, the two GCN outputs should coincide. So the constraint can be written as:

$$\mathbf{p}_i = \mathbf{r}_i \quad \text{for every bridged pair } i.$$
Loss function
The model uses a WARP (Weighted Approximate-Rank Pairwise) loss to push relevant repos higher in the ranking.
WARP loss
The WARP loss is defined as:

$$L_{\text{WARP}} = \sum_{(p,\, r^+)} L\big(\operatorname{rank}(r^+)\big) \cdot \big|\, \gamma - s(p, r^+) + s(p, r^-) \,\big|_+$$

- $s(p, r)$ is the similarity score between paper $p$ and repo $r$ in the shared space.
- $r^+$ is a positive (matched) repo and $r^-$ is a sampled negative repo.
- $\gamma$ is the margin and $|x|_+ = \max(x, 0)$ is the hinge function.
- $L(k) = \sum_{j=1}^{k} 1/j$ converts the rank of the positive into a weight, penalizing positives that are ranked low more heavily.
- $\operatorname{rank}(r^+)$ counts how many negatives violate the margin, i.e. $\operatorname{rank}(r^+) = \sum_{r^-} \mathbb{1}\big[\gamma + s(p, r^-) > s(p, r^+)\big]$.
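The rank-weighting idea can be sketched for a single paper as follows (a simplified deterministic version; the sampling-based rank estimation of the original WSABIE formulation is omitted, and the function name is my own):

```python
import numpy as np

def warp_loss(pos_score, neg_scores, margin=1.0):
    """WARP-style loss for one (paper, positive repo) pair.

    Counts margin-violating negatives, turns that count k into the rank
    weight L(k) = sum_{j=1}^{k} 1/j, and scales the mean hinge violation
    by that weight."""
    violations = margin - pos_score + neg_scores   # hinge terms per negative
    violators = violations > 0
    k = int(violators.sum())
    if k == 0:
        return 0.0                                 # positive safely on top
    rank_weight = np.sum(1.0 / np.arange(1, k + 1))
    return rank_weight * violations[violators].mean()
```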
Optimization objective
The base objective is:

$$\min_{\Theta} \; L_{\text{WARP}} \quad \text{s.t.} \quad \mathbf{p}_i = \mathbf{r}_i \;\; \text{for every bridged pair } i.$$

To incorporate the alignment constraint, the paper uses a Lagrangian-style reformulation:

$$\min_{\Theta} \; L_{\text{WARP}} + \lambda \sum_i \|\mathbf{p}_i - \mathbf{r}_i\|^2.$$

Because the multiplier is handled by the reformulation rather than fixed a priori, the constraint strength does not become an extra hyperparameter. This avoids hand-tuning $\lambda$.
Training
Positive/negative sampling
Positives:
- A bridged paper and its matched bridged repo form a positive pair.
- To expand positives, the paper treats frequent co-starred repos as additional “related” positives.
Negatives:
- Randomly sample repos from the full repository set as negatives.
- Use more negatives than positives to encourage discriminative learning.
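The sampling scheme above can be sketched as follows (a minimal illustration; the function name, the 4:1 negative ratio, and the uniform sampling are my own assumptions consistent with the notes above):

```python
import random

def sample_training_pairs(bridged_pairs, all_repo_ids, neg_per_pos=4, seed=0):
    """Build (paper, repo, label) triples: each bridged pair is a positive,
    plus `neg_per_pos` repos sampled uniformly at random as negatives."""
    rng = random.Random(seed)
    triples = []
    for paper, repo in bridged_pairs:
        triples.append((paper, repo, 1))
        # Exclude the known positive; other collisions are tolerated here.
        candidates = [r for r in all_repo_ids if r != repo]
        for neg in rng.sample(candidates, neg_per_pos):
            triples.append((paper, neg, 0))
    return triples
```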
Training procedure
Inputs:
- Text features for papers and repos (from the text encoder).
- The paper citation graph and the repo association graph.
Objective:
- Minimize the constrained objective described above: the WARP ranking loss together with the alignment term.
Optimizer:
- Use gradient-based optimization (e.g., Adam) to update the text encoder and GCN parameters.
Outputs:
- Paper and repo embeddings aligned in a shared space, where related pairs are closer.
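Once the embeddings live in one space, recommendation reduces to nearest-neighbor ranking. A minimal sketch (cosine similarity; the function name and the choice of cosine over dot product are my own):

```python
import numpy as np

def recommend_repos(paper_vec, repo_matrix, k=10):
    """Rank repositories for one paper by cosine similarity in the
    shared embedding space; return the indices of the top-k repos."""
    p = paper_vec / np.linalg.norm(paper_vec)
    R = repo_matrix / np.linalg.norm(repo_matrix, axis=1, keepdims=True)
    scores = R @ p                     # cosine similarity per repo
    return np.argsort(-scores)[:k]     # highest scores first
```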
Experiments and results
Datasets
Paper dataset:
- Source: Microsoft Academic API.
- Size: 32,029 papers (top venues, 2010–2018).
Repository dataset:
- Source: GitHub API.
- Size: 7,571 repos, including 2,107 bridged repos.
Setup
Metrics:
- HR@K: whether a relevant repo appears in the top-K.
- MAP@K: average precision with ranking sensitivity.
- MRR@K: mean reciprocal rank of the first relevant repo.
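For reference, two of these metrics can be computed per query as follows (a minimal sketch; function names are my own):

```python
def hit_rate_at_k(ranked, relevant, k):
    """HR@K: 1 if any relevant repo appears in the top-k, else 0."""
    return float(any(r in relevant for r in ranked[:k]))

def mrr_at_k(ranked, relevant, k):
    """MRR@K: reciprocal rank of the first relevant repo in the top-k,
    or 0 if none appears."""
    for i, r in enumerate(ranked[:k], start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0
```

Dataset-level scores are the mean of these per-paper values.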
Baselines:
- NSCR: cross-domain recommendation with deep layers and graph Laplacian terms.
- KGCN: knowledge-graph based GCN recommender.
- CDL: collaborative deep learning (representation + CF).
- NCF: neural collaborative filtering.
- LINE: large-scale network embedding baseline.
- MF: matrix factorization baseline.
- BPR: Bayesian Personalized Ranking matrix factorization baseline.
Results
Overall performance:
- paper2repo outperforms the baselines across metrics, with a particularly strong gain on HR@10.
Interpretation:
- Combining text signals with graph structure and explicitly constraining alignment improves cross-platform matching accuracy.
- The constrained GCN helps map papers and repos into a shared space where similarity becomes more meaningful.
Discussion and limitations
Cold start
- New repositories may lack stars/tags, weakening the repo association graph and hurting recommendation quality.
Number of bridged pairs
- The alignment relies on bridged paper–repo pairs; if there are too few, the cross-platform mapping may be under-trained.
Computational cost
- GCNs can be expensive on large graphs, creating scalability bottlenecks.
Conclusion and future work
paper2repo is an effective cross-platform recommender that links academic papers to GitHub repositories by combining text encoding with constrained GCN-based alignment. The reported experiments show strong performance on cross-platform recommendation benchmarks.
Future directions:
Richer graphs: add more node/edge types (authors, institutions, venues) to build a heterogeneous graph and improve representation power.
Efficiency: adopt scalable GNN variants and sampling to improve large-graph training.
Cold start: incorporate generative or transfer-learning signals for new papers/repos.
Generalization: explore cross-platform embedding with fewer or no bridged pairs.
References
[1] Shao, H., Sun, D., Wu, J., Zhang, Z., Zhang, A., Yao, S., Liu, S., Wang, T., Zhang, C., & Abdelzaher, T. (2020). paper2repo: GitHub Repository Recommendation for Academic Papers. Proceedings of The Web Conference 2020, 580–590.
[2] Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR).
[3] Weston, J., Bengio, S., & Usunier, N. (2011). WSABIE: Scaling Up to Large Vocabulary Image Annotation. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2764–2770.
- Post title: paper2repo: GitHub Repository Recommendation for Academic Papers
- Post author: Chen Kai
- Create time: 2024-10-01 00:00:00
- Post link: https://www.chenk.top/en/paper2repo-github-repository-recommendation/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.