CORA

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
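To make the feature encoding concrete, here is a minimal sketch of a 0/1-valued word vector. The three-word dictionary and the helper `encode` are hypothetical and exist only for illustration; the real Cora dictionary has 1433 entries and the dataset ships already encoded.

```python
# Toy illustration of a binary bag-of-words vector (not the real Cora vocabulary).
dictionary = ["graph", "neural", "protein"]  # hypothetical 3-word dictionary


def encode(words, dictionary):
    """Return a 0/1 vector: 1 if the dictionary word occurs in `words`, else 0."""
    present = set(words)
    return [1 if word in present else 0 for word in dictionary]


paper_words = ["neural", "network", "graph", "graph"]
vector = encode(paper_words, dictionary)
print(vector)  # [1, 1, 0]
```

Note that the vector records only presence or absence: the repeated word "graph" still maps to a single 1, and words outside the dictionary ("network") are ignored.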

Original source: web.archive.org

[1]:

from torch_geometric.datasets import Planetoid

[2]:

dataset = Planetoid(root='/tmp/cora', name='Cora')

[3]:

dataset

[3]:

Cora()

[4]:

from torch_geometric.loader import DataLoader

[5]:

loader = DataLoader(dataset, batch_size=32, shuffle=False)

[6]:

# Count the batches produced by one pass over the loader.
batch_i = 0
for batch in loader:
    print(batch)
    batch_i += 1

print(batch_i)

DataBatch(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], batch=[2708], ptr=[2])
1
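The loader yields a single batch because it batches whole graphs, not nodes, and Cora is one graph. As a rough sketch, assuming the default behaviour of keeping the last incomplete batch, the number of batches is the ceiling of the number of graphs divided by the batch size. The helper name `num_batches` is introduced here for illustration only and is not part of PyG.

```python
import math


def num_batches(num_graphs, batch_size):
    """Batches per epoch when the last, possibly smaller, batch is kept."""
    return math.ceil(num_graphs / batch_size)


print(num_batches(1, 32))    # 1: a dataset holding a single graph
print(num_batches(100, 32))  # 4: three full batches plus one of 4 graphs
```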

[7]:

type(batch)

[7]:

torch_geometric.data.batch.DataBatch

[8]:

data = dataset[0]

print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

print('==============================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of node features: {data.num_node_features}')
print(f'Number of edges: {data.num_edges}')
print(f'Number of edge features: {data.num_edge_features}')
# num_edges counts each undirected edge once per direction,
# so it already equals the sum of all node degrees.
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')

print("============= split ==========")

print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of validation nodes: {data.val_mask.sum()}')
print(f'validation node label rate: {int(data.val_mask.sum()) / data.num_nodes:.2f}')
print(f'Number of test nodes: {data.test_mask.sum()}')
print(f'test node label rate: {int(data.test_mask.sum()) / data.num_nodes:.2f}')

print("============ properties ===========")
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
print(f'Is directed: {data.is_directed()}')

Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7
==============================================================
Number of nodes: 2708
Number of node features: 1433
Number of edges: 10556
Number of edge features: 0
Average node degree: 3.90
============= split ==========
Number of training nodes: 140
Training node label rate: 0.05
Number of validation nodes: 500
validation node label rate: 0.18
Number of test nodes: 1000
test node label rate: 0.37
============ properties ===========
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
Is directed: False
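The split counts above come from boolean masks with one entry per node. The sketch below mimics this with plain Python lists on a hypothetical six-node graph; in PyG the masks are tensors, but summing them counts the selected nodes in the same way. Assuming the three Cora masks do not overlap, they cover 140 + 500 + 1000 = 1640 of the 2708 nodes, so the remaining nodes belong to no split.

```python
# Hypothetical 6-node example; the real Cora masks have 2708 entries each.
train_mask = [True, True, False, False, False, False]
val_mask = [False, False, True, False, False, False]
test_mask = [False, False, False, True, True, False]
labels = [3, 0, 1, 1, 6, 2]

num_nodes = len(labels)

# Summing a boolean mask counts the nodes it selects.
num_train = sum(train_mask)
print(num_train, f"{num_train / num_nodes:.2f}")  # 2 0.33

# A mask also selects the labels available in a given phase.
train_labels = [y for y, keep in zip(labels, train_mask) if keep]
print(train_labels)  # [3, 0]

# Nodes outside all three masks (here node 5) belong to no split.
unused = [i for i in range(num_nodes)
          if not (train_mask[i] or val_mask[i] or test_mask[i])]
print(unused)  # [5]
```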

[9]:

data

[9]:

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

x=[2708, 1433]: node feature matrix with shape [num_nodes, num_node_features]. Each row is the 0/1-valued word vector of one publication, with one entry per word in the 1433-word dictionary.

edge_index=[2, 10556]: graph connectivity in COO format, with shape [2, num_edges] and type torch.long. When edge_index is a torch.LongTensor, its shape must be [2, num_messages], where messages from the nodes in edge_index[0] are sent to the nodes in edge_index[1]. Each column therefore describes one directed edge.
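As a minimal sketch of the COO layout, the example below builds the two rows of an edge index for a hypothetical four-node undirected graph using plain Python lists (PyG keeps the same layout in a torch.long tensor). Each undirected edge is stored once per direction; if Cora follows the same convention, its 10556 columns would correspond to 5278 distinct undirected links. The degree of a node is then the number of times it appears in the source row.

```python
from collections import Counter

# Hypothetical undirected graph with edges 0-1, 0-2 and 2-3.
undirected_edges = [(0, 1), (0, 2), (2, 3)]

# Store every undirected edge in both directions, as two parallel rows.
sources, targets = [], []
for u, v in undirected_edges:
    sources += [u, v]
    targets += [v, u]
edge_index = [sources, targets]  # shape [2, num_edges]

num_edges = len(edge_index[0])
print(num_edges)  # 6

# Undirected check: every (source, target) pair also appears reversed.
pairs = set(zip(sources, targets))
is_undirected = all((t, s) in pairs for s, t in pairs)
print(is_undirected)  # True

# Degree per node = number of occurrences in the source row.
degree = Counter(sources)
num_nodes = 4
print([degree[i] for i in range(num_nodes)])  # [2, 1, 2, 1]

# With both directions stored, the average degree is num_edges / num_nodes.
print(num_edges / num_nodes)  # 1.5
```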