VOOZH about

URL: https://towardsdatascience.com/link-prediction-recommendation-engines-with-node2vec-c97c429351a8/

⇱ Link Prediction Recommendation Engines with Node2Vec | Towards Data Science


Link Prediction Recommendation Engines with Node2Vec

Using Node Embeddings for Link Prediction in Python

13 min read
👁 Image taken by Preethi Viswanathan from Unsplash
Image taken by Preethi Viswanathan from Unsplash

This article will cover the fundamental intuition behind link prediction, and showcase an embedding based approach on how to solve recommendation system problems using link prediction.

I will also outline a practice problem and guide the reader through on how to implement the proposed solution in Python. It’s a common approach when solving problems in recommendation systems using network analysis.

The focus of this article is going to be to show the reader on how to implement an algorithmic approach to solving link prediction problems using node2vec embeddings and cosine similarity or node2vec embeddings and a modelling based approach. If you’re unfamiliar with node2vec, I’ve written an article on it before, you can check it out below.

Node2Vec Explained

Node2Vec in the context of recommendation systems can be used for neighbourhood based applications. This is because node2vec preserves the initial structure of the network, the embeddings from node2vec is a good way to quantify if there should be an edge connecting a pair of nodes or not.


Table of Contents

  • What is Link Prediction?
  • Applications of Link Prediction
  • Problem Statement
  • Solution Architecture
  • Requirements
  • Data
  • Creating Network
  • Applying Node2Vec
  • Recommendations w/ Distance Measures
  • Modelling Based Recommendations
  • Create Training Data
  • Train Model
  • Evaluate Model
  • Generate Predictions
  • Concluding Remarks
  • Resources

What is Link Prediction?

There are many ways to solve problems in recommendation engines. These solutions range from algorithmic approaches, link prediction algorithms, embedding based solutions, etc. Link prediction is also referred to as graph completion, a common problem in graph theory. In the simplest form, given a network, you want to know if there should be an edge between a pair of nodes. This definition changes slightly depending on the type of network you’re working with. A directed / multi graph can have slightly different interpretations but the fundamental concept of identifying missing edges in a network remains.

👁 Given a graph G, let the solid lines represent edges which are currently present in the graph and the dotted lines represent edges which are edges which should be present in the graph but currently are not. Image provided by author.
Given a graph G, let the solid lines represent edges which are currently present in the graph and the dotted lines represent edges which are edges which should be present in the graph but currently are not. Image provided by author.

Problems in link prediction are also quite common when dealing with temporal networks (networks which change over time). Given a network G at time step t, you would want to predict the edges of the graph G at time step t+1.

Applications of Link Prediction

The main application of using link prediction to solve your problems is in the context of building recommendation systems. Top companies like Amazon, Facebook and Twitter most likely have tried to implement / have implemented a form of link prediction into their recommendation engines. From trying to recommend products to their users, to recommending other users you might be interested in following. This is even more so an interesting approach to solving problems when your network changes and grows over time.

Be aware that a good recommendation is only as good as how you define the problem and the recommendation you want to give. You must keep in mind a variety of metrics which impact the users behaviour, these metrics reflect what users like, dislike, the interests associated with a user and how these things are intertwined with each other.

Problem Statement

Given topic data about a variety of research papers published on arXiv, we want to build a pipeline which will connect two papers based on their similarities.

Solution Architecture

To do this we will create a network with nodes as articles and edges connecting these nodes based on all topics connecting these pairs of articles. Upon creating this network we will use node2vec to generate node embeddings associated with each article. Finally, I will show you two different approaches in building recommendation systems through embeddings, firstly an algorithmic approach which relies on cosine similarity. Then a modelling based approach which will allow us to train a binary classification model which will indicate whether or not a pair of nodes should have an edge or not. This will be a supervised learning model, where the input will be a pair of nodes and the output will be binary, that is, connected if the node should be connected in the network or not.

👁 The architecture representing the solution proposed. Image provided by author.
The architecture representing the solution proposed. Image provided by author.

Requirements

Python=3.8.8
arxiv=1.4.2
networkx=2.5
pandas=1.2.4
numpy=1.20.1
node2vec=0.4.4
sklearn=0.24.1

If you don’t have the node2vec package installed, [here](https://pypi.org/project/arxiv/) is the library documentation to install it through command line. Similarly, you can install the arXiv package in Python with the following instructions here.

Data

We will be hitting the arXiv API to collect data on research papers and publications associated with a variety of keywords. Based on the arXiv terms of use for their API it is completely free and encouraged to use. For more information regarding their terms of use, please reference their documentation which you can find here.

In this article, I will show you how to hit the API through Python to collect the following information necessary for the model we’re building today. If you want to hit this API through other programming languages, or just want more information on how to use the API, I highly encourage you to reference their documentation which you can find here.

We want to hit the arXiv API to gather some information about the latest research papers based on the queries we’ve identified above. This will allow us to then create a network from this research paper data and then we can try to predict links on that network. For the purposes of this article, I will search for a maximum of 100 results per query, but you don’t have to set yourself to the same constraints. The arXiv API allows users to hit up to 300,000 results per query [1]. The function outlined below will generate a CSV fetching the following information : title, date, article_id, url, main_topic, all_topics, authors, year You are able to fetch more information like the links, summary, article but I decided not to since those features won’t really be used for the purposes of this analysis and tutorial.

The following is a sample of the output I got upon hitting the arXiv API. You might receive something different depending on the keywords & time you choose to query.

👁 Sample CSV result from hitting arXiv API (Image provided by the author)
Sample CSV result from hitting arXiv API (Image provided by the author)

If you have trouble fetching the data, I uploaded a sample CSV to my GitHub for you to download, use and follow along with. You can find it here.

Creating Network

Now that we’ve fetched the data using the axXiv API, we can generate a network. The network will have the following structure, nodes will be the article_ids and the edges will be all topics connecting a pair of articles. For example, article_id 1 with the following topics astro-physics and statistics can be connected to article_id 10 with the topic of statistics and article_id 7 with the topics of astro-physics and math . This will be a multi-edge network where each edge will hold a weight of 1.

The following is my stats associated to the network generated:

👁 Research network stats. Image provided by the author.
Research network stats. Image provided by the author.

Applying Node2Vec

This component will cover running node2vec on the graph generated above and creating the associated node embeddings for that network. These embeddings will play a crucial role coming up as they’re the main features necessary for building a link prediction model.

The following is a sample output of the embeddings generated from the research network. The values on the index correspond to the nodes while the columns represent each embedding dimension. I chose to generate a 16 dimensional embedding space, hence the column names from 0 to 15.

👁 Sample output of network embeddings generated through node2vec. Image provided by the author.
Sample output of network embeddings generated through node2vec. Image provided by the author.

Recommendations w/ Distance Measures

Now that we have an embedding vector representing each node in the network, we can then use distance measures like cosine similarity, euclidean distance, manhattan distance, etc. to measure the amount of distance between nodes. The assumption that we’re making by using these distance measures is that nodes in close proximity with each other should also have an edge connecting each other. This is a good assumption to make as node2vec tries to preserve the initial structure of the original input graph. Now we can essentially write out code to measure similarity levels between two vectors using cosine similarity (or a different distance measure) and identify pairs of nodes which don’t currently have an edge between them but do have a large similarity should create an edge between them. This interpretation can be different for multi / weighted / directed graphs. Pick and use a similarity measure appropriate to the network and problem you’re trying to solve. Also be aware that different measures have different interpretations, for this problem you want to pick maximal cosine similarity scores whereas if you were to use something like euclidean distance, you would want to pick the minimal distance between two vectors.

On a side note, I do want to mention that the curse of dimensionality is rampant when solving these types of problems. It’s especially problematic when using euclidean distance in particular to measure the distance between vectors in higher dimensions. The term higher dimensions is broad and open to interpretation, the threshold for a dimension to be "high" is not strictly defined and varies from problem to problem. Without going too deep into the mathematics behind things, euclidean distance is not a good measure to use for sparse or high dimensional vectors. You can reference these two posts (post1, post2) on stack exchange which outlines the mathematical reasoning as to why this is the case.

For more on the curse of dimensionality, refer to this paper by the Computer Science department from the University of Washington.

👁 The recommended links to connect to article id 1. These are links which currently do not exist in the network. Image provided by author.
The recommended links to connect to article id 1. These are links which currently do not exist in the network. Image provided by author.

Modelling Based Recommendations

In this section we’re going to build a binary classifier to predict the probability of a pair of edges to be connected or not. To do this we first need to identify all pairs of nodes which can form an edge, and identify the subset of those pairs which already have an edge between them in the original network. We can then combine the embeddings (through vector addition) associated to all possible edges and pass that into a train test split function to generate training and testing partitions to pass into a classification model.

The model I’ll be using for this tutorial will be a gradient boosting classifier, for your own problems & experiments I advise you to try out a variety of different classifiers and select the overall best performing one. We are building a binary classifier, however in almost all situations regardless of the input data / network there will be a class imbalance on this classifier. Since we’re taking all possible combinations of edges which can be formed in the network, the result is N^N (where N is the number of nodes in the network), which is exponentially large. The only way that this wouldn’t be a binary classifier is if your graph is already almost fully connected.

Create Training Data

Train Model

Evaluate Model

When dealing with class imbalances we can’t look at traditional accuracy measures to evaluate the performance of the model. This is because if the model continuously predicts the class which has the greater amount of data, then regardless of the model never predicting the other class, it will still achieve a high level of accuracy. The way to combat this is to use measures like Matthews Correlation Coefficient.

The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.

👁 MCC score on the testing data. Image provided by author.
MCC score on the testing data. Image provided by author.
👁 The classification report on the testing data. Image provided by author.
The classification report on the testing data. Image provided by author.
👁 Testing classification report. Image provided by author.
Testing classification report. Image provided by author.

Generate Predictions

Clearly as you can see that this is a poor performing model for it’s given task (reflected by the MCC, ft-score, recall and precision). But that’s alright since this article was simply for an educational purpose. Not every approach you use to solve a given problem will pan out or work, often times its not at fault of the approach but rather the data. In the case of this tutorial, the data was definitely at fault since I was only using a small sample of an actual research network which would be available if I scraped arXiv for more data. But doing that would also increase the computational complexity to a lot of different components of this article (like running node2vec, generating all possible pairs of edges, training the model, etc.).

Concluding Remarks

Hopefully, this article shed light onto how one can build a link prediction pipeline through node embeddings. I’ve went over both an algorithmic approach to link prediction as well as a modelling based approach.

Please do be aware that the topics covered in this article are not trivial, they should not be your first option when it comes to solving problems in recommendation systems. The rule of thumb most data scientists follow when solving problems is that if the simple solutions work, then the more complex solutions should also work. Simpler solutions in recommendation systems involving collaborative filtering / content based are often quite easy to implement and yield relatively informative results and indications whether or not a more complex solution would work for solving this problem as well. Its often not the best to throw a neural network at the problem (like node2vec) for a variety of reasons (like training time, interpretability, model inference time, etc.).

For a guideline on how to implement the simpler solutions in recommendation systems, you can refer to another article I’ve written.

Recommendation Systems Explained

Furthermore, if you want to follow through in the jupyter notebook associated with this project, you can reference my GitHub.

Resources


If you enjoyed this article, here are a few others which I’ve written that you might also enjoy.

Word2Vec Explained

Linear Algebra Explained Through Graph Theory

Mining & Modelling Character Networks – Part I

Mining & Modelling Character Networks – Part II

Bayesian A/B Testing Explained

Markov Chain Explained

Dynamic Time Warping Explained


Written By

Vatsal

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles