The Cora dataset stands as a fundamental resource in the field of graph machine learning, widely utilized for the development and benchmarking of various algorithms. Comprising a network of scientific publications in machine learning, the dataset provides a rich structure that facilitates research into node classification, link prediction, and clustering. This article presents an overview of the Cora dataset, its structure, applications, and the features and labels that define it.
The Cora dataset is a citation network of 2,708 machine-learning papers, organized into seven distinct classes. These papers are interlinked by 5,429 citations, forming a directed graph that maps out how papers cite each other. Each paper is represented by a binary word vector, derived from a dictionary of 1,433 unique words, indicating the presence or absence of specific words in the paper.
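The dataset is primarily divided into two files: cora.content, which lists each paper's ID, binary word vector, and class label, and cora.cites, which lists the citation links between paper IDs.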
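As a rough sketch of how these files can be read, the snippet below assumes the layout of the commonly distributed `cora.content` and `cora.cites` files: one tab-separated line per paper (paper ID, word flags, class label) and one tab-separated pair per citation (cited paper ID, then citing paper ID). The column order, the helper names, and the toy three-word vocabulary are assumptions made for illustration, so check them against your copy of the data.

```python
def parse_content(lines):
    """Parse cora.content-style lines into {paper_id: (features, label)}."""
    papers = {}
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        paper_id, *flags, label = fields
        papers[paper_id] = ([int(flag) for flag in flags], label)
    return papers


def parse_cites(lines):
    """Parse cora.cites-style lines into (cited_id, citing_id) pairs."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            edges.append((fields[0], fields[1]))
    return edges


# Toy stand-ins for the real files (3-word vocabulary instead of 1,433).
sample_content = [
    "31336\t0\t1\t0\tNeural_Networks",
    "1061127\t1\t0\t0\tRule_Learning",
    "1106406\t0\t0\t1\tReinforcement_Learning",
]
sample_cites = ["31336\t1061127", "31336\t1106406"]

papers = parse_content(sample_content)
edges = parse_cites(sample_cites)
print(len(papers), "papers,", len(edges), "citations")  # 3 papers, 2 citations
```

With the real files, the same functions can be given open file handles instead of the sample lists, since both only iterate over lines.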
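The Cora dataset is extensively used for evaluating graph-based machine learning algorithms. Its applications span several key areas: node classification (predicting the topic of a paper), link prediction (predicting citations between papers), and clustering (grouping related papers).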
Each paper in the Cora dataset is described by a binary word vector, which serves as the feature set for the dataset. The presence (1) or absence (0) of each word from a dictionary of 1,433 unique words is recorded in this vector. This high-dimensional feature space captures the content of each paper, enabling detailed analysis and classification.
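To make the encoding concrete, here is a minimal sketch of how a binary word vector can be built from a vocabulary. The five-word vocabulary and the sample text are invented for illustration, and the preprocessing used to build the actual Cora dictionary is not reproduced here.

```python
def binary_word_vector(text, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in `text`."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical 5-word vocabulary; the real Cora dictionary has 1,433 entries.
vocabulary = ["graph", "kernel", "neural", "policy", "bayesian"]
abstract = "A neural approach to graph classification with graph kernels"

vector = binary_word_vector(abstract, vocabulary)
print(vector)  # [1, 0, 1, 0, 0]
```

Note that "graph" appears twice in the text but is still recorded as 1, and "kernels" does not match "kernel" because this sketch does no stemming.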
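The labels in the Cora dataset correspond to the seven classes of machine learning topics: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory.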
These labels provide a categorical classification for each paper, which is used as the target variable in various machine learning tasks.
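Most models expect the target as integers rather than strings, so the labels are usually encoded first. The sketch below uses the class names as they are spelled in the original data files; the alphabetical index order is an arbitrary choice for this example, and libraries that ship Cora may number the classes differently.

```python
# Class names as spelled in the raw label column (underscores included).
CLASS_NAMES = [
    "Case_Based",
    "Genetic_Algorithms",
    "Neural_Networks",
    "Probabilistic_Methods",
    "Reinforcement_Learning",
    "Rule_Learning",
    "Theory",
]
LABEL_TO_ID = {name: index for index, name in enumerate(CLASS_NAMES)}


def encode_labels(labels):
    """Map class-name strings to integer ids (raises KeyError on unknown names)."""
    return [LABEL_TO_ID[label] for label in labels]


def one_hot(label_id, num_classes=len(CLASS_NAMES)):
    """Return a one-hot list with a 1 at position `label_id`."""
    return [1 if index == label_id else 0 for index in range(num_classes)]


ids = encode_labels(["Theory", "Neural_Networks", "Case_Based"])
print(ids)          # [6, 2, 0]
print(one_hot(2))   # [0, 0, 1, 0, 0, 0, 0]
```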
Below are some of the ways to load the Cora dataset in Python:
PyTorch Geometric is a library designed specifically for deep learning on irregular structures such as graphs. It provides a straightforward way to load the Cora dataset.
Install PyTorch Geometric
Here, we will install PyTorch Geometric by using the following command:
pip install torch-geometric
Load the Cora dataset
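The loading step might look like the following minimal sketch. It uses the `Planetoid` class from `torch_geometric.datasets`; the `root` cache directory and the helper name `load_cora` are arbitrary choices made for this example.

```python
def load_cora(root="data/Planetoid"):
    """Return the Cora citation graph as a PyTorch Geometric dataset.

    `root` is an arbitrary local cache directory chosen for this example.
    """
    # Imported here so the library is only required when the helper is called.
    from torch_geometric.datasets import Planetoid

    return Planetoid(root=root, name="Cora")


# Typical use (the first call downloads and processes the raw files):
#   dataset = load_cora()
#   data = dataset[0]   # Cora consists of a single graph
#   print(data)
```

Printing the single graph in the dataset gives a one-line summary of its tensors.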
Output:
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.x
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.tx
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.allx
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.y
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.ty
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.ally
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.graph
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.test.index
Processing...
Done!
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
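In this summary, `x=[2708, 1433]` matches the 2,708 papers and 1,433 dictionary words, while `edge_index=[2, 10556]` is larger than the 5,429 citations quoted earlier. One likely explanation, stated here as an assumption rather than something the source confirms, is that this version of the graph is stored as undirected: each link appears once per direction and duplicate links are merged, which would correspond to 5,278 undirected edges. The toy sketch below illustrates that idea; it is not the library's own implementation.

```python
def to_undirected(edges):
    """Return sorted (src, dst) pairs with every edge present in both directions."""
    both_directions = set()
    for src, dst in edges:
        if src == dst:
            continue  # assumption for this sketch: self-loops are dropped
        both_directions.add((src, dst))
        both_directions.add((dst, src))
    return sorted(both_directions)


# Toy directed citation list: one reciprocal pair and one self-loop.
directed = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 3)]
undirected = to_undirected(directed)
print(len(directed), "->", len(undirected))  # 5 -> 6
```

The reciprocal pair (0, 1) and (1, 0) collapses into a single undirected link, which is why the symmetric count is not simply twice the directed count.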
DGL is another powerful library designed for deep learning on graphs. It simplifies the process of building and training graph neural networks.
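Once DGL is installed (the command is given just below), loading Cora might look like this minimal sketch. It relies on `CoraGraphDataset` from `dgl.data`; the helper name `load_cora_dgl` is made up for this example.

```python
def load_cora_dgl():
    """Return the Cora graph plus its node features and labels via DGL."""
    # Imported here so the library is only required when the helper is called.
    from dgl.data import CoraGraphDataset

    dataset = CoraGraphDataset()      # downloads and caches the data on first use
    graph = dataset[0]                # the dataset holds a single graph
    features = graph.ndata["feat"]    # node feature matrix
    labels = graph.ndata["label"]     # integer class id per node
    return graph, features, labels


# Typical use:
#   graph, features, labels = load_cora_dgl()
#   print(graph)
#   print(graph.ndata["train_mask"].sum())  # number of training nodes
```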
First, install DGL:
pip install dgl
NetworkX is a library for the creation, manipulation, and study of complex networks of nodes and edges.
First, install NetworkX:
pip install networkx
NetworkX does not have built-in support for the Cora dataset, but you can load it manually. Here is an example of how to do this:
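The sketch below is one possible way to do it, not a canonical recipe. It assumes the raw files follow the usual layout (tab-separated `cora.content` lines of paper ID, word flags, and class label; `cora.cites` lines of cited ID then citing ID). The helper name `build_citation_graph` and the file paths in the comments are hypothetical.

```python
def build_citation_graph(content_lines, cites_lines):
    """Build a directed NetworkX graph from cora.content / cora.cites style lines."""
    # Imported here so the library is only required when the helper is called.
    import networkx as nx

    graph = nx.DiGraph()
    for line in content_lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        paper_id, *flags, label = fields
        graph.add_node(paper_id, features=[int(flag) for flag in flags], label=label)

    for line in cites_lines:
        fields = line.split()
        if len(fields) == 2:
            cited, citing = fields
            graph.add_edge(citing, cited)  # edge points from citing paper to cited paper
    return graph


# Typical use (paths are placeholders for wherever the files were unpacked):
#   with open("cora/cora.content") as content, open("cora/cora.cites") as cites:
#       graph = build_citation_graph(content, cites)
#   print(graph.number_of_nodes(), graph.number_of_edges())
```

Storing the word vector and label as node attributes keeps everything in one object, at the cost of memory; for large experiments a separate feature matrix may be preferable.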
TensorFlow can also work with the Cora data through its tf.data and tf.keras APIs. It does not provide a built-in loader for the Cora dataset, but the files can be loaded and preprocessed manually.
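Assuming TensorFlow is installed (the command is given just below), a manual loading step could be sketched as follows. The code parses `cora.content`-style lines (tab-separated paper ID, word flags, class label, which is an assumption about the file layout) and wraps the features and labels in a `tf.data.Dataset`. The helper name `make_tf_dataset` is made up, and the citation edges are deliberately ignored here, so this covers only the feature and label side of the data.

```python
def make_tf_dataset(content_lines, batch_size=32):
    """Turn cora.content-style lines into a batched tf.data.Dataset.

    Returns (dataset, class_names); label ids index into class_names.
    """
    # Imported here so the library is only required when the helper is called.
    import tensorflow as tf

    features, labels = [], []
    for line in content_lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _paper_id, *flags, label = fields
        features.append([float(flag) for flag in flags])
        labels.append(label)

    class_names = sorted(set(labels))
    label_ids = [class_names.index(label) for label in labels]

    dataset = tf.data.Dataset.from_tensor_slices(
        (
            tf.constant(features, dtype=tf.float32),
            tf.constant(label_ids, dtype=tf.int32),
        )
    )
    return dataset.shuffle(len(label_ids)).batch(batch_size), class_names


# Typical use (the path is a placeholder):
#   with open("cora/cora.content") as content:
#       dataset, class_names = make_tf_dataset(content)
```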
First, install TensorFlow:
pip install tensorflow
The Cora dataset is an essential resource for the graph machine learning community, offering a robust platform for testing and developing innovative algorithms. Its structured complexity, combined with rich features and comprehensive labels, makes it an ideal benchmark for advancing the study of complex networks. As graph neural networks and related methodologies continue to evolve, the Cora dataset will remain a critical tool in driving research and education in this dynamic field.