Random Walks with Restart Explained
Understand the Random Walk with Restart algorithm and its associated implementation in Python
The scope of this article is to explain and focus around the conceptual understanding behind the random walk with restart algorithm. A strong mathematical understanding will not be provided here, but I have left links to resources where those interested can investigate further into the mathematics behind this algorithm. I will also provide a documented Python script at the end of the article associated with the implementation of this algorithm.
Table of Content
- Introduction
- What is a Random Walk?
- Random Walk with Restart
- Advantages
- Disadvantages
- Python Implementation
- Resources
Introduction
This algorithm is highly applicable in research as well as the industry. If you’re working with time series data, or able to structure your data as a network then a random walk / random walk with reset can be implemented on it. In finance, it is often argued that the changes in stock prices follow a random walk. This algorithm works well with other stochastic processes like Markov Chains and Monte Carlo Simulations. This simple to understand algorithm was first formally introduced by Karl Pearson, a mathematician who dedicated the majority of his life to building a strong foundation around mathematical statistics.
In our examples below we will be running a random walk on either of the following things. 1) A connected graph G with vertices V and edges E connecting multiple nodes to each other OR 2) A NxN matrix Note : A matrix can be converted into a graph and vice versa
Common problems where this algorithm is applied are :
- Recommendation Systems
- Community Detection
- Clustering
- Image Segmentation
What is a Random Walk?
A random walk is a type of stochastic process. The simplest explanation of a random walk would be through walking. Imagine that each step you take is determined probabilistically. This implies that at each index of time, you have moved in a certain direction based on a probabilistic outcome. This algorithm explores the relationship to each step that you would take and its distance from the initial starting point.
If you started at the origin (large black circle in the image above) with an equal probability of moving in any direction at each time step. The image above outlines 3 possible walks after 4 time steps, each with a different ending point from the origin. If this process was simulated a large number of times with a large enough time step, then you can create a probability distribution associated with the outcome of each random walk ending at a specific location.
In this 2-dimensional example, after many iterations of a random walk, the distribution will look like a 3D bell curve (as shown in figure 2). This can be replicated in N dimensions.
Given this probability distribution, it can be represented as the closeness between a pair of positions.
Random Walk with Restart
Random walk with restart is exactly as a random walk but with one extra component to it. This component can be denoted as the restart probability. Essentially indicating that for every step taken in any direction there is a probability associated with going back to the initial starting position, the origin. In our example above of randomly moving in any direction, imagine now that there is a chance that you would instantly teleport back to the origin after every step based on this restart probability.
This article has an excellent explanation of the mathematics behind the random walk with restart algorithm.
Advantages
- Provides a strong representation of similarity between two nodes in a weighted & unweighted graph
- Has a variety of applications and can be used be used in joint with other stochastic models like Markov Chains
Disadvantages
- Using the same restart probability for every node
- There is no systematic way of choosing the restart probability for each application
Implementation
Here I’ve implemented the random walk with restart algorithm on a given network G. Note, that there are many variations of this algorithm and can be substantially improved depending on the problem you’re trying to solve. The resulting output of this script can be used on other ML algorithms and models like kNN and k means.
A very simple, 1 line implementation of this algorithm can be through a library called networkX. They have a built-in function called PageRank which is essentially a random walk with restart. Below I’ve included a simple example from the networkX documentation for PageRank.
import networkx as nx
G = nx.DiGraph(nx.path_graph(4)) # create the network
pr = nx.pagerank(G, alpha=0.9) # apply PageRank
Resources
- http://www.cs.cmu.edu/~christos/PUBLICATIONS/icdm06-rwr.pdf
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213857#:~:text=However%2C%20RWR%20suffers%20from%20two,each%20application%20without%20theoretical%20justification.
- https://medium.com/@chaitanya_bhatia/random-walk-with-restart-and-its-applications-f53d7c98cb9
- https://www.youtube.com/watch?v=6wUD_gp5WeE&t=85s
- https://en.wikipedia.org/wiki/Random_walk
- https://networkx.org/documentation/networkx-1.10/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
- https://github.com/jinhongjung/pyrwr
If you liked this article, then you might also like my other articles where I’ve explained things like the Monte Carlo method and Markov Chains.
Link Prediction Recommendation Engines with Node2Vec
Mining & Modelling Character Networks – Part II
Share This Article
Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
Write for TDS