CORA
The Cora dataset consists of 2708 scientific publications, each classified into one of seven classes. The citation network consists of 5429 links. Each publication is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
Original source: web.archive.org
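To make the feature encoding concrete, here is a minimal sketch of how a 0/1-valued word vector could be built. The five-word dictionary and the sample text are made up for illustration (the real Cora dictionary has 1433 words), and the function name `binary_word_vector` is hypothetical.

```python
def binary_word_vector(text, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in dictionary]


# Toy dictionary (hypothetical); the Cora dictionary has 1433 entries.
toy_dictionary = ["graph", "neural", "network", "genetic", "bayesian"]
toy_paper = "A neural network approach to graph classification"

vector = binary_word_vector(toy_paper, toy_dictionary)
print(vector)  # [1, 1, 1, 0, 0]
```

The vector has one entry per dictionary word, so its length is fixed regardless of how long the publication is.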
[1]:
from torch_geometric.datasets import Planetoid
[2]:
dataset = Planetoid(root='/tmp/cora', name='Cora')
[3]:
dataset
[3]:
Cora()
[4]:
from torch_geometric.loader import DataLoader
[5]:
loader = DataLoader(dataset, batch_size=32, shuffle=False)
[6]:
batch_i = 0
for batch in loader:
    print(batch)
    batch_i = batch_i + 1
    print(batch_i)
DataBatch(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], batch=[2708], ptr=[2])
1
[7]:
type(batch)
[7]:
torch_geometric.data.batch.DataBatch
[8]:
data = dataset[0]
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
print('==============================================================')
# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of node features: {data.num_node_features}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of edge features: {data.num_edge_features}')
# edge_index stores each undirected edge in both directions,
# so num_edges already counts every edge twice.
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print("============= split ==========")
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of validation nodes: {data.val_mask.sum()}')
print(f'validation node label rate: {int(data.val_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of test nodes: {data.test_mask.sum()}')
print(f'test node label rate: {int(data.test_mask.sum()) / data.num_nodes:.2f}')
print("============ properties ===========")
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
print(f'Is directed: {data.is_directed()}')
Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7
==============================================================
Number of nodes: 2708
Number of node features: 1433
Number of edges: 10556
Number of edge features: 0
Average node degree: 3.90
============= split ==========
Number of training nodes: 140
Training node label rate: 0.05
Number of validation nodes: 500
validation node label rate: 0.18
Number of test nodes: 1000
test node label rate: 0.37
============ properties ===========
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
Is directed: False
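The split statistics above come from boolean masks with one entry per node. The following pure-Python sketch, using a made-up six-node example rather than Cora, shows how such a mask selects nodes and how a label rate is derived from it. The names `select` and `label_rate` are illustrative helpers, not part of PyTorch Geometric.

```python
def select(values, mask):
    """Keep the entries of `values` whose mask entry is True."""
    return [v for v, keep in zip(values, mask) if keep]


def label_rate(mask):
    """Fraction of nodes covered by the mask."""
    return sum(mask) / len(mask)


# Toy data (hypothetical): six nodes with class labels and a training mask.
labels = [3, 0, 2, 2, 1, 0]
train_mask = [True, False, False, True, False, False]

train_labels = select(labels, train_mask)
print(train_labels)                     # [3, 2]
print(f"{label_rate(train_mask):.2f}")  # 0.33
```

For Cora the same calculation gives 140 / 2708, which rounds to the 0.05 training label rate reported above.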
[9]:
data
[9]:
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
x=[2708, 1433]: node feature matrix with shape [num_nodes, num_node_features]. Each row is the 0/1-valued word vector of one publication, indicating the absence or presence of each of the 1433 dictionary words.
edge_index=[2, 10556]: graph connectivity in COO format, with shape [2, num_edges] and type torch.long. Messages are sent from the nodes in edge_index[0] to the nodes in edge_index[1]. Because the graph is undirected, each link is stored once in each direction.
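As a sketch of the COO layout described above, the snippet below turns a small list of undirected links into a two-row edge list in which each link appears once per direction, then counts node degrees. The four-node toy graph and the helper name `to_coo` are assumptions for illustration, and plain Python lists stand in for the torch.long tensor.

```python
def to_coo(undirected_links):
    """Build [sources, targets] rows, storing every link in both directions."""
    sources, targets = [], []
    for u, v in undirected_links:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]


# Toy graph (hypothetical): 4 nodes, 3 undirected links.
links = [(0, 1), (1, 2), (1, 3)]
num_nodes = 4

edge_index = to_coo(links)
num_edges = len(edge_index[0])

# Degree of a node = number of stored edges that start at it.
degree = [edge_index[0].count(n) for n in range(num_nodes)]

print(edge_index)             # [[0, 1, 1, 2, 1, 3], [1, 0, 2, 1, 3, 1]]
print(num_edges)              # 6
print(degree)                 # [1, 3, 1, 1]
print(num_edges / num_nodes)  # 1.5
```

Because both directions are stored, the number of columns is twice the number of undirected links, and dividing it by the number of nodes gives the average degree directly.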