ogbn-arxiv
Graph: The ogbn-arxiv dataset is a directed graph representing the citation network between all Computer Science (CS) arXiv papers indexed by MAG [1]. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of the words in its title and abstract. The embeddings of individual words are computed by running the skip-gram model [2] over the MAG corpus. We also provide the mapping from MAG paper IDs into the raw texts of titles and abstracts here. In addition, each paper is associated with the year in which it was published.
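As a rough illustration of the averaging step described above, the sketch below builds a feature vector as the mean of a paper's word embeddings. The 3-dimensional toy vocabulary and the `average_embedding` helper are hypothetical stand-ins: the dataset itself uses 128-dimensional skip-gram embeddings, and its exact tokenization is not described here.

```python
# Hypothetical toy vocabulary: word -> embedding (3-dim here instead of 128).
toy_embeddings = {
    "graph": [1.0, 0.0, 2.0],
    "neural": [0.0, 2.0, 2.0],
    "network": [2.0, 1.0, 2.0],
}


def average_embedding(text, embeddings):
    """Average the embeddings of the known words in `text`."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    # Component-wise mean over all word vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


paper_feature = average_embedding("Graph Neural Network", toy_embeddings)
print(paper_feature)
```

Unknown words are simply skipped in this sketch; that is an assumption made for the example, not a documented property of the dataset.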
Prediction task: The task is to predict the 40 subject areas of arXiv CS papers, e.g., cs.AI, cs.LG, and cs.OS, which are manually determined (i.e., labeled) by the paper’s authors and arXiv moderators. With the volume of scientific publications doubling every 12 years over the past century, it is practically important to automatically classify each publication’s areas and topics. Formally, the task is to predict the primary categories of the arXiv papers, which is formulated as a 40-class classification problem.
Dataset splitting: We consider a realistic data split based on the publication dates of the papers. The general setting is that ML models are trained on existing papers and then used to predict the subject areas of newly published papers, which allows them to be applied directly in real-world scenarios, such as helping the arXiv moderators. Specifically, we propose to train on papers published until 2017, validate on those published in 2018, and test on those published since 2019.
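The time-based split described above can be sketched in a few lines. The `split_by_year` helper and the toy list of years are hypothetical; in the notebook the actual indices come from `dataset.get_idx_split()`, so this only illustrates the rule (train until 2017, validate on 2018, test from 2019 onward).

```python
def split_by_year(years, train_until=2017, valid_year=2018):
    """Return node indices for train/valid/test based on publication year."""
    split = {"train": [], "valid": [], "test": []}
    for node_id, year in enumerate(years):
        if year <= train_until:
            split["train"].append(node_id)
        elif year == valid_year:
            split["valid"].append(node_id)
        else:
            # Everything published after the validation year goes to test.
            split["test"].append(node_id)
    return split


# Hypothetical publication years for six papers (node ids 0..5).
toy_years = [2013, 2018, 2019, 2016, 2020, 2018]
toy_split = split_by_year(toy_years)
print(toy_split)
```

Every node lands in exactly one of the three lists, so the split sizes add up to the number of nodes.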
Leaderboard
Setup
[1]:
import os
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset
Load and Preprocess the Dataset
[2]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    dataset_name = 'ogbn-arxiv'
    dataset = PygNodePropPredDataset(name=dataset_name,
                                     transform=T.ToSparseTensor())
    data = dataset[0]
    # Make the adjacency matrix symmetric
    data.adj_t = data.adj_t.to_symmetric()
    split_idx = dataset.get_idx_split()
    train_idx = split_idx['train']
[3]:
split_idx
[3]:
{'train': tensor([ 0, 1, 2, ..., 169145, 169148, 169251]),
'valid': tensor([ 349, 357, 366, ..., 169185, 169261, 169296]),
'test': tensor([ 346, 398, 451, ..., 169340, 169341, 169342])}
[4]:
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print('==============================================================')
# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of node features: {data.num_node_features}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of edge features: {data.num_edge_features}')
print(f'Average node degree: {(2*data.num_edges) / data.num_nodes:.2f}')
print("============= split ==========")
print(f"Number of training nodes: {split_idx['train'].shape[0]}")
print(f"Training node label rate: {int(split_idx['train'].shape[0]) / data.num_nodes:.2f}")
print(f"Number of validation nodes: {split_idx['valid'].shape[0]}")
print(f"validation node label rate: {int(split_idx['valid'].shape[0]) / data.num_nodes:.2f}")
print(f"Number of test nodes: {split_idx['test'].shape[0]}")
print(f"test node label rate: {int(split_idx['test'].shape[0]) / data.num_nodes:.2f}")
print("============ properties ===========")
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
print(f'Is directed: {data.is_directed()}')
Dataset: PygNodePropPredDataset():
======================
Number of graphs: 1
Number of features: 128
Number of classes: 40
==============================================================
Number of nodes: 169343
Number of node features: 128
Number of edges: 2315598
Number of edge features: 0
Average node degree: 27.35
============= split ==========
Number of training nodes: 90941
Training node label rate: 0.54
Number of validation nodes: 29799
validation node label rate: 0.18
Number of test nodes: 48603
test node label rate: 0.29
============ properties ===========
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
Is directed: False
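The paper reports 1,166,243 edges, whereas the data loaded here shows 2,315,598. The difference comes from the `to_symmetric()` call in the loading cell: the paper counts directed citation edges, while the symmetrized adjacency matrix stores each undirected edge in both directions. Because the edge count has changed, values derived from it also differ from the paper. In particular, `data.num_edges` already counts both directions here, so the printed average node degree of 27.35 is twice the actual value of about 13.67 (2315598 / 169343).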
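A small sketch of how symmetrization changes the edge count, using a hypothetical four-paper citation graph (the edge set below is invented for illustration and is not taken from the dataset). Adding the reverse of every citation and merging duplicates yields slightly fewer than twice the directed edges whenever two papers cite each other. The last line shows that, once both directions are stored, dividing the stored entries by the number of nodes already gives the average degree.

```python
# Hypothetical toy citation graph; (src, dst) means paper src cites paper dst.
directed_edges = {(0, 1), (1, 0), (2, 0), (3, 2)}
num_nodes = 4

# Symmetrize: add the reverse of every edge; the set merges duplicates,
# so the reciprocal pair (0, 1) / (1, 0) is not counted twice.
symmetric_edges = directed_edges | {(dst, src) for src, dst in directed_edges}

num_directed = len(directed_edges)  # directed citation edges
nnz = len(symmetric_edges)          # stored entries after symmetrization

# Each undirected edge is stored in both directions, so no extra factor of 2.
avg_degree = nnz / num_nodes

print(num_directed, nnz, avg_degree)
```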
[5]:
data
[5]:
Data(num_nodes=169343, x=[169343, 128], node_year=[169343, 1], y=[169343, 1], adj_t=[169343, 169343, nnz=2315598])
x=[169343, 128]: [num_nodes, num_node_features]. Each of the 169343 nodes is assigned a 128-dimensional feature vector.
data.num_node_features: the number of node features. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are computed by running the word2vec model (Mikolov et al., 2013) over the MAG corpus.
node_year: the year in which the corresponding paper was published.
y=[169343, 1]: the node labels.
adj_t: the adjacency matrix.
nnz: the number of non-zero entries in the adjacency matrix.
[6]:
data.x[0]
[6]:
tensor([-0.0579, -0.0525, -0.0726, -0.0266, 0.1304, -0.2414, -0.4492, -0.0184,
-0.0872, 0.1123, -0.0921, -0.2896, -0.0810, 0.0745, -0.1562, -0.0974,
0.1194, 0.6458, 0.0774, -0.0939, -0.4004, 0.3114, -0.5418, 0.0805,
-0.0069, 0.5423, -0.0122, -0.1808, 0.0165, 0.0508, -0.2083, -0.0870,
0.0124, 0.2817, 0.1004, -0.1643, 0.0269, 0.0782, 0.0795, -0.0134,
0.2915, 0.0416, -0.1414, -0.1345, 0.0162, 0.2810, -0.0919, -0.2403,
0.4618, 0.1873, 0.1533, 0.0331, 0.0108, 0.0124, -0.1589, 0.0980,
0.0305, 0.0162, -0.0957, 0.0521, 0.3218, -0.1057, 0.2229, -0.1206,
-0.1723, 0.3954, 0.0883, -0.2219, 0.2310, -0.2096, -0.1125, -0.0644,
0.0697, -0.1574, 0.0223, -0.4190, 0.1344, 0.2605, 0.0417, -0.0935,
-0.0516, -0.0255, 0.7744, 0.0581, 0.0452, 0.0571, -0.5482, -0.0464,
0.8728, 0.0119, 0.3891, -0.0859, 0.1116, 0.0618, 0.0015, 0.0476,
0.0363, 0.2586, 0.2359, -0.0290, -0.1415, 0.7106, -0.0571, -0.1174,
0.3059, 0.1670, -0.1990, 0.1276, 0.0270, 0.5458, -0.1917, -0.0696,
-0.1111, 0.1142, 0.1162, -0.0159, 0.1159, -0.0624, 0.2115, -0.2261,
-0.1856, 0.0532, 0.3329, 0.1042, 0.0074, 0.1734, -0.1728, -0.1401])
[7]:
data.x[0].shape
[7]:
torch.Size([128])
[8]:
type(data.adj_t)
[8]:
torch_sparse.tensor.SparseTensor
[9]:
data.adj_t[0].size(dim=1)
[9]:
169343
[10]:
data.adj_t[0].nnz()
[10]:
291
[11]:
data.adj_t[0].density()
[11]:
0.001718405839036748
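The density value above can be checked by hand, assuming that `density()` is the number of non-zero entries divided by the total number of entries in the slice (this reading of the `torch_sparse` method is an assumption, although it is consistent with the numbers printed in the previous cells). Row 0 has 291 non-zero entries out of 169343 columns, meaning node 0 has 291 neighbors in the symmetrized graph.

```python
# Values copied from the outputs of the cells above.
row_nnz = 291         # data.adj_t[0].nnz()
num_columns = 169343  # data.adj_t[0].size(dim=1)

# Fraction of possible neighbors that node 0 is actually connected to.
row_density = row_nnz / num_columns
print(row_density)
```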
[12]:
data.y[0]
[12]:
tensor([4])
[13]:
data.node_year[0]
[13]:
tensor([2013])
[14]:
data.y.unique()
[14]:
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39])
References
[1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119, 2013.
cs224w colab 2 https://colab.research.google.com/drive/1BRPw3WQjP8ANSFz-4Z1ldtNt9g7zm-bv?usp=sharing