This article will cover the fundamental intuition behind link prediction, and showcase an embedding based approach on how to solve recommendation system problems using link prediction.
I will also outline a practice problem and guide the reader through on how to implement the proposed solution in Python. It’s a common approach when solving problems in recommendation systems using network analysis.
The focus of this article is going to be to show the reader on how to implement an algorithmic approach to solving link prediction problems using node2vec embeddings and cosine similarity or node2vec embeddings and a modelling based approach. If you’re unfamiliar with node2vec, I’ve written an article on it before, you can check it out below.
Node2Vec in the context of recommendation systems can be used for neighbourhood based applications. This is because node2vec preserves the initial structure of the network, the embeddings from node2vec is a good way to quantify if there should be an edge connecting a pair of nodes or not.
Table of Contents
What is Link Prediction?
Applications of Link Prediction
Problem Statement
Solution Architecture
Requirements
Data
Creating Network
Applying Node2Vec
Recommendations w/ Distance Measures
Modelling Based Recommendations
Create Training Data
Train Model
Evaluate Model
Generate Predictions
Concluding Remarks
Resources
What is Link Prediction?
There are many ways to solve problems in recommendation engines. These solutions range from algorithmic approaches, link prediction algorithms, embedding based solutions, etc. Link prediction is also referred to as graph completion, a common problem in graph theory. In the simplest form, given a network, you want to know if there should be an edge between a pair of nodes. This definition changes slightly depending on the type of network you’re working with. A directed / multi graph can have slightly different interpretations but the fundamental concept of identifying missing edges in a network remains.
Problems in link prediction are also quite common when dealing with temporal networks (networks which change over time). Given a network G at time step t, you would want to predict the edges of the graph G at time step t+1.
Applications of Link Prediction
The main application of using link prediction to solve your problems is in the context of building recommendation systems. Top companies like Amazon, Facebook and Twitter most likely have tried to implement / have implemented a form of link prediction into their recommendation engines. From trying to recommend products to their users, to recommending other users you might be interested in following. This is even more so an interesting approach to solving problems when your network changes and grows over time.
Be aware that a good recommendation is only as good as how you define the problem and the recommendation you want to give. You must keep in mind a variety of metrics which impact the users behaviour, these metrics reflect what users like, dislike, the interests associated with a user and how these things are intertwined with each other.
Problem Statement
Given topic data about a variety of research papers published on arXiv, we want to build a pipeline which will connect two papers based on their similarities.
Solution Architecture
To do this we will create a network with nodes as articles and edges connecting these nodes based on all topics connecting these pairs of articles. Upon creating this network we will use node2vec to generate node embeddings associated with each article. Finally, I will show you two different approaches in building recommendation systems through embeddings, firstly an algorithmic approach which relies on cosine similarity. Then a modelling based approach which will allow us to train a binary classification model which will indicate whether or not a pair of nodes should have an edge or not. This will be a supervised learning model, where the input will be a pair of nodes and the output will be binary, that is, connected if the node should be connected in the network or not.
If you don’t have the node2vec package installed, [here](https://pypi.org/project/arxiv/) is the library documentation to install it through command line. Similarly, you can install the arXiv package in Python with the following instructions here.
Data
We will be hitting the arXiv API to collect data on research papers and publications associated with a variety of keywords. Based on the arXiv terms of use for their API it is completely free and encouraged to use. For more information regarding their terms of use, please reference their documentation which you can find here.
In this article, I will show you how to hit the API through Python to collect the following information necessary for the model we’re building today. If you want to hit this API through other programming languages, or just want more information on how to use the API, I highly encourage you to reference their documentation which you can find here.
We want to hit the arXiv API to gather some information about the latest research papers based on the queries we’ve identified above. This will allow us to then create a network from this research paper data and then we can try to predict links on that network. For the purposes of this article, I will search for a maximum of 100 results per query, but you don’t have to set yourself to the same constraints. The arXiv API allows users to hit up to 300,000 results per query [1]. The function outlined below will generate a CSV fetching the following information :
title, date, article_id, url, main_topic, all_topics, authors, year
You are able to fetch more information like the links, summary, article but I decided not to since those features won’t really be used for the purposes of this analysis and tutorial.
The following is a sample of the output I got upon hitting the arXiv API. You might receive something different depending on the keywords & time you choose to query.
If you have trouble fetching the data, I uploaded a sample CSV to my GitHub for you to download, use and follow along with. You can find it here.
Creating Network
Now that we’ve fetched the data using the axXiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics astro-physics and statistics can be connected to article_id 10 with the topic of statistics and article_id 7 with the topics of astro-physics and math . This will be a multi-edge network where each edge will hold a weight of 1.
The following is my stats associated to the network generated:
This component will cover running node2vec on the graph generated above and creating the associated node embeddings for that network. These embeddings will play a crucial role coming up as they’re the main features necessary for building a link prediction model.
The following is a sample output of the embeddings generated from the research network. The values on the index correspond to the nodes while the columns represent each embedding dimension. I chose to generate a 16 dimensional embedding space, hence the column names from 0 to 15.
Now that we have an embedding vector representing each node in the network, we can then use distance measures like cosine similarity, euclidean distance, manhattan distance, etc. to measure the amount of distance between nodes. The assumption that we’re making by using these distance measures is that nodes in close proximity with each other should also have an edge connecting each other. This is a good assumption to make as node2vec tries to preserve the initial structure of the original input graph. Now we can essentially write out code to measure similarity levels between two vectors using cosine similarity (or a different distance measure) and identify pairs of nodes which don’t currently have an edge between them but do have a large similarity should create an edge between them. This interpretation can be different for multi / weighted / directed graphs. Pick and use a similarity measure appropriate to the network and problem you’re trying to solve. Also be aware that different measures have different interpretations, for this problem you want to pick maximal cosine similarity scores whereas if you were to use something like euclidean distance, you would want to pick the minimal distance between two vectors.
On a side note, I do want to mention that the curse of dimensionality is rampant when solving these types of problems. It’s especially problematic when using euclidean distance in particular to measure the distance between vectors in higher dimensions. The term higher dimensions is broad and open to interpretation, the threshold for a dimension to be "high" is not strictly defined and varies from problem to problem. Without going too deep into the mathematics behind things, euclidean distance is not a good measure to use for sparse or high dimensional vectors. You can reference these two posts (post1, post2) on stack exchange which outlines the mathematical reasoning as to why this is the case.
For more on the curse of dimensionality, refer to this paper by the Computer Science department from the University of Washington.
In this section we’re going to build a binary classifier to predict the probability of a pair of edges to be connected or not. To do this we first need to identify all pairs of nodes which can form an edge, and identify the subset of those pairs which already have an edge between them in the original network. We can then combine the embeddings (through vector addition) associated to all possible edges and pass that into a train test split function to generate training and testing partitions to pass into a classification model.
The model I’ll be using for this tutorial will be a gradient boosting classifier, for your own problems & experiments I advise you to try out a variety of different classifiers and select the overall best performing one. We are building a binary classifier, however in almost all situations regardless of the input data / network there will be a class imbalance on this classifier. Since we’re taking all possible combinations of edges which can be formed in the network, the result is N^N (where N is the number of nodes in the network), which is exponentially large. The only way that this wouldn’t be a binary classifier is if your graph is already almost fully connected.
Create Training Data
Train Model
Evaluate Model
When dealing with class imbalances we can’t look at traditional accuracy measures to evaluate the performance of the model. This is because if the model continuously predicts the class which has the greater amount of data, then regardless of the model never predicting the other class, it will still achieve a high level of accuracy. The way to combat this is to use measures like Matthews Correlation Coefficient.
The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.
Clearly as you can see that this is a poor performing model for it’s given task (reflected by the MCC, ft-score, recall and precision). But that’s alright since this article was simply for an educational purpose. Not every approach you use to solve a given problem will pan out or work, often times its not at fault of the approach but rather the data. In the case of this tutorial, the data was definitely at fault since I was only using a small sample of an actual research network which would be available if I scraped arXiv for more data. But doing that would also increase the computational complexity to a lot of different components of this article (like running node2vec, generating all possible pairs of edges, training the model, etc.).
Concluding Remarks
Hopefully, this article shed light onto how one can build a link prediction pipeline through node embeddings. I’ve went over both an algorithmic approach to link prediction as well as a modelling based approach.
Please do be aware that the topics covered in this article are not trivial, they should not be your first option when it comes to solving problems in recommendation systems. The rule of thumb most data scientists follow when solving problems is that if the simple solutions work, then the more complex solutions should also work. Simpler solutions in recommendation systems involving collaborative filtering / content based are often quite easy to implement and yield relatively informative results and indications whether or not a more complex solution would work for solving this problem as well. Its often not the best to throw a neural network at the problem (like node2vec) for a variety of reasons (like training time, interpretability, model inference time, etc.).
For a guideline on how to implement the simpler solutions in recommendation systems, you can refer to another article I’ve written.