paper2repo: GitHub Repository Recommendation for Academic Papers
Chen Kai

Finding the code behind a paper is often the most frustrating part of reproducing results: links are missing, names drift, and keyword search is noisy. paper2repo frames this as a cross-platform recommendation problem: matching academic papers to relevant GitHub repositories by aligning them in a shared embedding space. It combines text encoders with graph-based signals (e.g., citation relations and repository-side structure) via a constrained GCN to learn comparable representations and rank candidate repos. This note summarizes the motivation, how the joint graph is built, what the “constrained” alignment is doing, and which components seem to drive improvements in HR@K / MAP / MRR.

Background and motivation

What makes GitHub repositories “recommendable”

GitHub is a social code hosting platform where users publish and share open-source projects. A repository typically exposes:

  • Description: a short summary of what the repo is about.
  • Tags/topics: keywords describing the repo’s theme or functionality.
  • Stars: implicit feedback that reflects interest/popularity.

Challenges in cross-platform matching

  • Heterogeneous data: papers and repositories have different formats and signals.
  • Missing explicit links: many papers do not provide a reliable code URL.
  • Representation alignment: you need comparable embeddings for papers and repos to compute similarity.

Model overview

paper2repo is a joint model with two main components:

  1. Text encoder: encodes paper abstracts and repo descriptions/tags into content features.
  2. Constrained GCN: runs GCNs on the paper graph and the repo graph, and aligns the two embedding spaces via a constraint on “bridged” paper–repo pairs.

Context graph construction

Paper citation graph

  • Nodes: papers.
  • Edges: citation links (treated as an undirected graph in the paper).
  • Node features: text-encoded abstract vectors.

Repository association graph

Because repositories do not have citations, the repo graph is built from implicit associations such as:

  • Co-starring: connect repos that are starred by the same user.
  • Tag overlap: connect repos that share tags with high TF-IDF weight (e.g., above 0.3).
  • Node features: text-encoded description and tag vectors.
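
As a rough sketch of how such an association graph could be assembled, the snippet below builds edges from co-starring and tag overlap. The `stars` and `tags` inputs and the 0.3 threshold are illustrative toy data, not the paper's actual pipeline:

```python
from itertools import combinations

# Hypothetical toy inputs: user -> starred repos, repo -> {tag: TF-IDF weight}.
stars = {
    "alice": ["repoA", "repoB"],
    "bob": ["repoB", "repoC"],
}
tags = {
    "repoA": {"gcn": 0.6, "nlp": 0.1},
    "repoB": {"gcn": 0.5},
    "repoC": {"vision": 0.7},
}

def build_repo_edges(stars, tags, tfidf_threshold=0.3):
    edges = set()
    # Co-starring: connect repos starred by the same user.
    for starred in stars.values():
        for a, b in combinations(sorted(set(starred)), 2):
            edges.add((a, b))
    # Tag overlap: connect repos sharing a tag whose TF-IDF exceeds the threshold in both.
    for a, b in combinations(sorted(tags), 2):
        shared = set(tags[a]) & set(tags[b])
        if any(tags[a][t] > tfidf_threshold and tags[b][t] > tfidf_threshold for t in shared):
            edges.add((a, b))
    return edges

edges = build_repo_edges(stars, tags)
# repoA-repoB (co-starred and shared high-TF-IDF "gcn" tag), repoB-repoC (co-starred only).
```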

Text encoding

Encoding repository text

  1. Word embeddings:

    • Tokenize the repo description/tags into a token sequence $w_1, w_2, \dots, w_n$.
    • Map each token to a $d$-dimensional pretrained vector (e.g., GloVe).
  2. Convolution layer:

    • Apply 1D convolutions with multiple window sizes $h$, using $m$ kernels.

    • Convolution: $c_i = f\big(W \cdot x_{i:i+h-1} + b\big)$

      - $x_{i:i+h-1}$ denotes the concatenated token vectors from position $i$ to $i+h-1$.
      - $b$ is a bias term and $f$ is a nonlinearity (e.g., ReLU).

  3. Max pooling:

    • Apply max-over-time pooling to obtain a fixed-size feature vector.
  4. Tag encoding:

    • Since tags are unordered, average the tag word vectors to get a tag representation.
    • Use a fully-connected layer to project it to the same dimensionality as the description features.
  5. Feature fusion:

    • Fuse description and tag features (e.g., sum or concatenate) to obtain the repo representation.
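
The steps above can be sketched in NumPy. The dimensions `d`, `m`, `h` and the random inputs are illustrative, and fusion-by-sum is just one of the options mentioned:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, h = 8, 4, 3  # embedding dim, number of kernels, window size (illustrative values)

def encode_description(token_vecs, W, b):
    """1D convolution over h-token windows, then max-over-time pooling."""
    n = token_vecs.shape[0]
    feats = [
        np.maximum(W @ token_vecs[i : i + h].reshape(-1) + b, 0)  # ReLU(W·x_{i:i+h-1} + b)
        for i in range(n - h + 1)
    ]
    return np.max(np.stack(feats), axis=0)  # max pooling -> fixed-size feature vector

def encode_tags(tag_vecs, P):
    """Tags are unordered: average their word vectors, then project with an FC layer."""
    return P @ tag_vecs.mean(axis=0)

tokens = rng.normal(size=(10, d))   # 10 description tokens, d-dim pretrained vectors
tag_vecs = rng.normal(size=(3, d))  # 3 tags
W, b, P = rng.normal(size=(m, h * d)), np.zeros(m), rng.normal(size=(m, d))

repo_vec = encode_description(tokens, W, b) + encode_tags(tag_vecs, P)  # fuse by sum
```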

Encoding paper text

  • Encode paper abstracts with the same CNN encoder to obtain paper embeddings.

Constrained GCN

GCN recap

GCN performs neighborhood aggregation on graphs, combining node features with graph structure to learn node embeddings.

The layer-wise propagation is:

$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$

- $H^{(l)}$: node representations at layer $l$.
- $\tilde{A} = A + I$: adjacency with self-loops.
- $\tilde{D}$: degree matrix of $\tilde{A}$.
- $W^{(l)}$: trainable weight matrix at layer $l$.
- $\sigma$: activation (e.g., ReLU).
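
A minimal NumPy version of this propagation rule, on a toy 3-node path graph with identity weights:

```python
import numpy as np

def gcn_layer(A, H, W, act=lambda x: np.maximum(x, 0)):
    """One GCN propagation step: H' = act(D^{-1/2} (A+I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])   # adjacency with self-loops
    d = A_tilde.sum(axis=1)            # degrees of the self-looped adjacency
    D_inv_sqrt = np.diag(d ** -0.5)    # symmetric normalization term
    return act(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

# Toy undirected 3-node path graph, 2-dim one-hot-ish features, identity weights.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
H = np.eye(3, 2)
H1 = gcn_layer(A, H, np.eye(2))  # each node's feature mixes with its neighbors'
```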

Why “constrained”?

Because papers and repositories are embedded on different graphs, naive training yields embeddings in different spaces. paper2repo introduces an alignment constraint that pulls “bridged” paper–repo pairs closer.

Alignment constraint

For each bridged paper embedding $p_i$ and its corresponding bridged repo embedding $r_i$, enforce high cosine similarity: $\cos(p_i, r_i) \ge 1 - \epsilon$. With normalized embeddings ($\|p_i\| = \|r_i\| = 1$), cosine similarity reduces to a dot product.

So the constraint can be written as $1 - p_i^\top r_i \le \epsilon$, where $\epsilon$ is a small tolerance (e.g., 0.001).
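
A small check of this constraint, assuming unit-normalized embeddings:

```python
import numpy as np

def constraint_error(p, r):
    """Alignment error 1 - cos(p, r); with unit-norm vectors this is 1 - p·r."""
    p = p / np.linalg.norm(p)
    r = r / np.linalg.norm(r)
    return 1.0 - float(p @ r)

eps = 0.001
p = np.array([1.0, 0.0])
aligned = constraint_error(p, np.array([1.0, 0.0])) <= eps     # identical: satisfied
orthogonal = constraint_error(p, np.array([0.0, 1.0])) <= eps  # orthogonal: violated
```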

Loss function

The model uses a WARP (Weighted Approximate-Rank Pairwise) loss to push relevant repos higher in the ranking.

WARP loss

The WARP loss is defined as:

$\ell_{\text{WARP}} = L\big(\operatorname{rank}(p, r^+)\big) \cdot \big|\lambda - s(p, r^+) + s(p, r^-)\big|_+$

- $(p, r^+)$: a positive pair (paper $p$ and a relevant repo $r^+$).
- $s(\cdot, \cdot)$: similarity score (dot product).
- $r^-$: a negative repo.
- $\lambda$: margin hyperparameter, typically in $[0, 1]$.
- $|\cdot|_+ = \max(\cdot, 0)$: hinge operator.
- $L(k) = \sum_{j=1}^{k} \frac{1}{j}$: a weighting that maps the estimated rank $k$ to a loss.
- $\operatorname{rank}(p, r^+)$: the (margin) rank estimate, defined as $\operatorname{rank}(p, r^+) = \sum_{r^-} \mathbb{1}\big[\lambda - s(p, r^+) + s(p, r^-) > 0\big]$, where $\mathbb{1}[\cdot]$ is an indicator function.
- It counts how many negatives violate the margin, i.e. $s(p, r^-) + \lambda > s(p, r^+)$.
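
A sketch of this loss for one positive and a batch of scored negatives. This variant applies the hinge to the worst violation; implementations differ in which negative's hinge is used:

```python
import numpy as np

def warp_loss(s_pos, s_negs, margin=0.5):
    """L(rank) times the largest hinge violation among the scored negatives."""
    violations = margin - s_pos + s_negs       # hinge argument per negative
    rank = int(np.sum(violations > 0))         # negatives violating the margin
    if rank == 0:
        return 0.0                             # positive already ranked safely
    L = np.sum(1.0 / np.arange(1, rank + 1))   # L(k) = sum_{j=1..k} 1/j
    return float(L * np.max(np.maximum(violations, 0.0)))

s_pos = 0.9                          # score of the positive repo
s_negs = np.array([0.8, 0.2, 0.85])  # two negatives fall inside the 0.5 margin
loss = warp_loss(s_pos, s_negs)      # rank = 2, L = 1.5, max violation = 0.45
```

Because $L(k)$ grows with the number of violating negatives, a positive ranked far down the list is penalized much more heavily than one narrowly beaten by a single negative.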

Optimization objective

The base objective is to minimize the average WARP loss over positive pairs:

$\mathcal{L}_{\text{rank}} = \frac{1}{M} \sum_{(p, r^+)} \ell_{\text{WARP}}(p, r^+)$

- $M$: the number of bridged paper–repo pairs.

To incorporate the alignment constraint, the paper uses a Lagrangian-style reformulation:

$\mathcal{L} = \mathcal{L}_{\text{rank}} + \mu \sum_i \big(1 - p_i^\top r_i\big)$

- $\mu$: Lagrange multiplier controlling the trade-off between ranking loss and constraint.

Because the magnitude of $\mathcal{L}_{\text{rank}}$ changes during training, tuning $\mu$ is difficult. The paper replaces $\mu$ with the dynamically changing $\mathcal{L}_{\text{rank}}$ and normalizes the constraint error, yielding a new objective of the form $\mathcal{L}_{\text{rank}} \cdot (1 + C_e)$.

- $C_e$: mean constraint error, defined as $C_e = \frac{1}{M} \sum_i \big(1 - p_i^\top r_i\big)$.
- With normalized vectors, $p_i^\top r_i \in [-1, 1]$, so $C_e \in [0, 2]$.

This avoids hand-tuning $\mu$ and keeps the ranking loss and the constraint on a comparable scale.
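
The reweighted objective can be sketched as follows, with toy bridged embeddings (the inputs are illustrative, not the paper's data):

```python
import numpy as np

def combined_objective(mean_warp_loss, p_bridged, r_bridged):
    """Reweighted objective: mean WARP loss scaled by (1 + C_e)."""
    # Normalize rows so cosine similarity reduces to a plain dot product.
    p = p_bridged / np.linalg.norm(p_bridged, axis=1, keepdims=True)
    r = r_bridged / np.linalg.norm(r_bridged, axis=1, keepdims=True)
    C_e = float(np.mean(1.0 - np.sum(p * r, axis=1)))  # mean of 1 - p_i·r_i
    return mean_warp_loss * (1.0 + C_e)

p = np.array([[1.0, 0.0], [0.0, 1.0]])
obj_aligned = combined_objective(0.5, p, p)            # C_e = 0 -> loss unchanged
obj_misaligned = combined_objective(0.5, p, p[::-1])   # orthogonal pairs: C_e = 1
```

When the bridged pairs are perfectly aligned the factor is 1 and the objective reduces to the ranking loss; misalignment scales the loss up, so gradients also push the two spaces together.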

Training

Positive/negative sampling

  • Positives:

    • A bridged paper and its matched bridged repo form a positive pair.
    • To expand positives, the paper treats frequently co-starred repos as additional “related” positives.
  • Negatives:

    • Randomly sample repos from the full repository set as negatives.
    • Use more negatives than positives to encourage discriminative learning.
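
A hypothetical sampling routine following these rules; the names `costar_related` and `n_neg` and the toy data are made up for illustration:

```python
import random

def sample_training_triples(bridged, costar_related, all_repos, n_neg=5, seed=0):
    """(paper, repo, label) triples: bridged + co-starred positives, random negatives."""
    rng = random.Random(seed)
    triples = []
    for paper, repo in bridged.items():
        positives = {repo} | costar_related.get(repo, set())
        triples += [(paper, pos, 1) for pos in sorted(positives)]
        # Sample negatives (more than positives in practice) outside the positive set.
        candidates = [r for r in all_repos if r not in positives]
        triples += [(paper, neg, 0) for neg in rng.sample(candidates, min(n_neg, len(candidates)))]
    return triples

bridged = {"paperX": "repoA"}   # paper -> its bridged repo
costar = {"repoA": {"repoB"}}   # frequently co-starred "related" positives
repos = ["repoA", "repoB", "repoC", "repoD", "repoE"]
triples = sample_training_triples(bridged, costar, repos, n_neg=3)
```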

Training procedure

  • Inputs:

    • Text features for papers and repos (from the text encoder).
    • The paper citation graph and the repo association graph.
  • Objective:

    • Minimize the combined objective $\mathcal{L}_{\text{rank}} \cdot (1 + C_e)$.
  • Optimizer:

    • Use gradient-based optimization (e.g., Adam) to update the text encoder and GCN parameters.
  • Outputs:

    • Paper and repo embeddings aligned in a shared space, where related pairs are closer.

Experiments and results

Datasets

  • Paper dataset:

    • Source: Microsoft Academic API.
    • Size: 32,029 papers (top venues, 2010–2018).
  • Repository dataset:

    • Source: GitHub API.
    • Size: 7,571 repos, including 2,107 bridged repos.

Setup

  • Metrics:

    • HR@K: whether a relevant repo appears in the top-K.
    • MAP@K: average precision with ranking sensitivity.
    • MRR@K: mean reciprocal rank of the first relevant repo.
  • Baselines:

    • NSCR: cross-domain recommendation with deep layers and graph Laplacian terms.
    • KGCN: knowledge-graph based GCN recommender.
    • CDL: collaborative deep learning (representation + CF).
    • NCF: neural collaborative filtering.
    • LINE: large-scale network embedding baseline.
    • MF: matrix factorization baseline.
    • BPR: Bayesian Personalized Ranking matrix factorization baseline.
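
For concreteness, the three metrics from the setup can be computed per query as follows. The AP@K normalization shown is one common convention; exact definitions vary across papers:

```python
def hr_at_k(ranked, relevant, k):
    """HR@K: 1 if any relevant repo appears in the top-K, else 0."""
    return int(any(r in relevant for r in ranked[:k]))

def mrr_at_k(ranked, relevant, k):
    """MRR@K: reciprocal rank of the first relevant repo within the top-K."""
    for i, r in enumerate(ranked[:k], start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0

def ap_at_k(ranked, relevant, k):
    """AP@K for one query: precision at each relevant hit, averaged.

    MAP@K then averages this over all query papers."""
    hits, score = 0, 0.0
    for i, r in enumerate(ranked[:k], start=1):
        if r in relevant:
            hits += 1
            score += hits / i
    return score / max(1, min(len(relevant), k))

ranked = ["repoB", "repoA", "repoD", "repoC"]  # model's ranking for one paper
relevant = {"repoA", "repoC"}                  # ground-truth repos
# First relevant hit at rank 2; the second falls outside the top 3.
```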

Results

  • Overall performance:

    • paper2repo outperforms the baselines across metrics, with a particularly strong gain on HR@10.
  • Interpretation:

    • Combining text signals with graph structure and explicitly constraining alignment improves cross-platform matching accuracy.
    • The constrained GCN helps map papers and repos into a shared space where similarity becomes more meaningful.

Discussion and limitations

Cold start

  • New repositories may lack stars/tags, weakening the repo association graph and hurting recommendation quality.

Number of bridged pairs

  • The alignment relies on bridged paper–repo pairs; if there are too few, the cross-platform mapping may be under-trained.

Computational cost

  • GCNs can be expensive on large graphs, creating scalability bottlenecks.

Conclusion and future work

paper2repo is an effective cross-platform recommender that links academic papers to GitHub repositories by combining text encoding with constrained GCN-based alignment. The reported experiments show strong performance on cross-platform recommendation benchmarks.

Future directions:

  • Richer graphs: add more node/edge types (authors, institutions, venues) to build a heterogeneous graph and improve representation power.

  • Efficiency: adopt scalable GNN variants and sampling to improve large-graph training.

  • Cold start: incorporate generative or transfer-learning signals for new papers/repos.

  • Generalization: explore cross-platform embedding with fewer or no bridged pairs.

References

[1] Shao, H., Sun, D., Wu, J., Zhang, Z., Zhang, A., Yao, S., Liu, S., Wang, T., Zhang, C., & Abdelzaher, T. (2020). paper2repo: GitHub Repository Recommendation for Academic Papers. Proceedings of The Web Conference 2020, 580–590.

[2] Kipf, T. N., & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR).

[3] Weston, J., Bengio, S., & Usunier, N. (2011). WSABIE: Scaling Up to Large Vocabulary Image Annotation. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2764–2770.

  • Post title: paper2repo: GitHub Repository Recommendation for Academic Papers
  • Post author: Chen Kai
  • Create time: 2024-10-01 00:00:00
  • Post link: https://www.chenk.top/en/paper2repo-github-repository-recommendation/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.