ogbg-molhiv

Graph: The ogbg-molhiv and ogbg-molpcba datasets are two molecular property prediction datasets of different sizes: ogbg-molhiv (small) and ogbg-molpcba (medium). They are adopted from MoleculeNet [1] and are among the largest of the MoleculeNet datasets. All the molecules are pre-processed using RDKit [2]. Each graph represents a molecule, where nodes are atoms and edges are chemical bonds. Input node features are 9-dimensional, containing atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. The full description of the features is provided in the code. The script that converts a SMILES string [3] to the above graph object is provided with OGB. Note that the script requires RDKit to be installed. The script can be used to pre-process external molecule datasets so that they share the same input feature space as the OGB molecule datasets. This is particularly useful for pre-training graph models, which has the potential to significantly improve generalization performance on the (downstream) OGB datasets [4].
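To make the graph representation concrete, here is a minimal sketch of how a molecule can be stored as a graph. It is hand-built for ethanol (SMILES `CCO`, hydrogens omitted) and is not the OGB conversion script: the helper name `molecule_to_graph` is hypothetical, and each node carries only its atomic number rather than the full 9-dimensional feature vector. Each bond is stored as two directed edges so that the graph is undirected.

```python
def molecule_to_graph(atomic_numbers, bonds):
    """Build a simple graph dict from atoms and undirected bonds.

    atomic_numbers: list of ints, one per atom (node).
    bonds: list of (i, j) atom-index pairs, one per chemical bond.
    """
    sources, targets = [], []
    for i, j in bonds:
        # Store each bond as two directed edges (i -> j and j -> i).
        sources += [i, j]
        targets += [j, i]
    return {
        "num_nodes": len(atomic_numbers),
        "node_feat": [[z] for z in atomic_numbers],  # 1-dim toy features
        "edge_index": [sources, targets],            # shape [2, num_edges]
    }


# Ethanol heavy atoms: C(0) - C(1) - O(2)
graph = molecule_to_graph(atomic_numbers=[6, 6, 8], bonds=[(0, 1), (1, 2)])
print(graph["edge_index"])  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

Two bonds therefore become four entries in `edge_index`, which is worth keeping in mind when reading edge counts from a graph object.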

Besides the two main datasets, we additionally provide 10 smaller datasets from MoleculeNet. They are ogbg-moltox21, ogbg-molbace, ogbg-molbbbp, ogbg-molclintox, ogbg-molmuv, ogbg-molsider, and ogbg-moltoxcast for (multi-task) binary classification, and ogbg-molesol, ogbg-molfreesolv, and ogbg-mollipo for regression. Evaluators are also provided for these datasets. These datasets can be used to stress-test molecule-specific methods or transfer learning [4].

To encode these raw input features, we provide simple modules called AtomEncoder and BondEncoder. They can be used as follows to embed raw atom and bond features and obtain atom_emb and edge_emb.

from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
atom_encoder = AtomEncoder(emb_dim = 100)
bond_encoder = BondEncoder(emb_dim = 100)

atom_emb = atom_encoder(x) # x is input atom feature
edge_emb = bond_encoder(edge_attr) # edge_attr is input edge feature

Datasets

Prediction task: The task is to predict the target molecular properties as accurately as possible, where the molecular properties are cast as binary labels, e.g., whether a molecule inhibits HIV virus replication or not. Note that some datasets (e.g., ogbg-molpcba) can have multiple tasks and can contain nan values, which indicate that the corresponding label is not assigned to the molecule. For the evaluation metric, we closely follow [1]. Specifically, for ogbg-molhiv, we use ROC-AUC for evaluation. For ogbg-molpcba, as the class balance is extremely skewed (only 1.4% of data is positive) and the dataset contains multiple classification tasks, we use the Average Precision (AP) averaged over the tasks as the evaluation metric.
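The two metrics can be sketched in plain Python as follows. This is an illustrative implementation, not the OGB evaluator: the function names are hypothetical, `average_precision` assumes there are no tied scores, and skipping tasks that have no labeled positive is an assumption of this sketch. Labels equal to nan are dropped per task before the metric is computed.

```python
import math


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average Precision: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def mean_ap_over_tasks(label_rows, score_rows):
    """Average AP over tasks, ignoring nan (unassigned) labels in each task."""
    num_tasks = len(label_rows[0])
    per_task = []
    for t in range(num_tasks):
        pairs = [(row_y[t], row_s[t])
                 for row_y, row_s in zip(label_rows, score_rows)
                 if not math.isnan(row_y[t])]
        ys = [y for y, _ in pairs]
        if 1 in ys:  # AP is undefined for a task without a labeled positive
            per_task.append(average_precision(ys, [s for _, s in pairs]))
    return sum(per_task) / len(per_task)


# Single-task example (as in ogbg-molhiv).
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75

# Two-task example with missing labels (as in ogbg-molpcba).
nan = float("nan")
label_rows = [[1, nan], [0, 1], [1, 0], [0, nan]]
score_rows = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.3], [0.1, 0.5]]
print(round(mean_ap_over_tasks(label_rows, score_rows), 3))  # 0.917
```

In the second example the first task has an AP of 5/6 and the second task, evaluated only on its two labeled molecules, has an AP of 1, so the average is 11/12.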

Dataset splitting: We adopt the scaffold splitting procedure that splits the molecules based on their two-dimensional structural frameworks. The scaffold splitting attempts to separate structurally different molecules into different subsets, which provides a more realistic estimate of model performance in prospective experimental settings [1].
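A scaffold split can be sketched as below. This shows one common way to implement the idea and is not necessarily the exact procedure used to produce the OGB split: the function name `scaffold_split` is hypothetical, and the scaffold identifier of each molecule (for example a scaffold SMILES string) is assumed to have been computed beforehand. Molecules are grouped by scaffold, and whole groups are assigned to train, validation, or test, so that no scaffold appears in more than one subset.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold spans two subsets.

    scaffolds: list of scaffold identifiers, one per molecule
               (assumed to be precomputed elsewhere).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first, so rare scaffolds land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return {"train": train, "valid": valid, "test": test}


# Ten toy molecules with four distinct scaffolds.
toy_scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
split = scaffold_split(toy_scaffolds)
print(split)
# {'train': [0, 1, 2, 3, 4, 5, 6, 7], 'valid': [8], 'test': [9]}
```

Because the rarest scaffolds end up in the validation and test sets, the model is evaluated on structures unlike those it was trained on.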

Based on the paper:

* Split Scheme: Scaffold
* Split Ratio: 80/10/10
* Task Type: Binary Class
* Metric: ROC-AUC
* #Graph: 41127
* Average #Nodes: 25.5
* Average #Edges: 27.5
* Average Node Deg: 2.2
* Average Clust Coeff: 0.002
* MaxSCC Ratio: 0.993
* Graph Diameter: 12.0

The MaxSCC ratio shows the fraction of nodes in the largest strongly connected component of the graph.
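As an illustration of this statistic, the sketch below computes the fraction of nodes in the largest component of a single undirected graph. The function name `max_component_ratio` is hypothetical, and the code is a toy example rather than the script used to produce the reported number. For an undirected graph, strongly connected components coincide with ordinary connected components, so a breadth-first search is sufficient.

```python
from collections import deque


def max_component_ratio(num_nodes, edges):
    """Fraction of nodes in the largest connected component.

    edges: list of undirected (i, j) node-index pairs.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    seen = [False] * num_nodes
    largest = 0
    for start in range(num_nodes):
        if seen[start]:
            continue
        # Breadth-first search to measure one component.
        seen[start] = True
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in neighbors[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    queue.append(nxt)
        largest = max(largest, size)
    return largest / num_nodes


# Toy graph: a 4-node chain plus one disconnected node.
ratio = max_component_ratio(5, [(0, 1), (1, 2), (2, 3)])
print(ratio)  # 0.8
```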

baseline from [5]

Load and preprocess the dataset

[1]:
import os
from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.data import DataLoader
[2]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
  # Load the dataset
  dataset = PygGraphPropPredDataset(name='ogbg-molhiv')

  split_idx = dataset.get_idx_split()

  # Check task type
  print('Task type: {}'.format(dataset.task_type))
Downloading http://snap.stanford.edu/ogb/data/graphproppred/csv_mol_download/hiv.zip
Downloaded 0.00 GB: 100%|██████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.25it/s]
Processing...
Extracting dataset/hiv.zip
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████████████████████████████████████████████████████| 41127/41127 [00:00<00:00, 179483.00it/s]
Converting graphs into PyG objects...
100%|███████████████████████████████████████████████████████████| 41127/41127 [00:00<00:00, 79541.81it/s]
Saving...
Task type: binary classification
Done!
[3]:
split_idx
[3]:
{'train': tensor([    3,     4,     5,  ..., 41124, 41125, 41126]),
 'valid': tensor([10127, 10129, 10132,  ..., 22785, 22786, 22788]),
 'test': tensor([    0,     1,     2,  ..., 10122, 10124, 10125])}
[6]:
data = dataset[0]
[9]:
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

print('Since there are many graphs, per-graph statistics are skipped')

# Statistics of a single graph (data = dataset[0]); uncomment to inspect.
# print(f'Number of nodes: {data.num_nodes}')
# print(f'Number of node features: {data.num_node_features}')
# print(f'Number of edges: {data.num_edges}')
# print(f'Number of edge features: {data.num_edge_features}')
# data.num_edges already counts both directions of every bond.
# print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')

print("============= split ==========")

# The split indices refer to graphs (molecules), not to nodes.
print(f"Number of training graphs: {split_idx['train'].shape[0]}")
print(f"Training graph rate: {split_idx['train'].shape[0] / len(dataset):.2f}")
print(f"Number of validation graphs: {split_idx['valid'].shape[0]}")
print(f"Validation graph rate: {split_idx['valid'].shape[0] / len(dataset):.2f}")
print(f"Number of test graphs: {split_idx['test'].shape[0]}")
print(f"Test graph rate: {split_idx['test'].shape[0] / len(dataset):.2f}")

print("============ properties ===========")
# These properties describe the first graph only (data = dataset[0]).
print(f'Contains isolated nodes: {data.has_isolated_nodes()}')
print(f'Contains self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')
print(f'Is directed: {data.is_directed()}')
Dataset: PygGraphPropPredDataset(41127):
======================
Number of graphs: 41127
Number of features: 9
Number of classes: 2
Since there are many graphs, per-graph statistics are skipped
============= split ==========
Number of training graphs: 32901
Training graph rate: 0.80
Number of validation graphs: 4113
Validation graph rate: 0.10
Number of test graphs: 4113
Test graph rate: 0.10
============ properties ===========
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
Is directed: False

References

[1] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[2] Greg Landrum et al. RDKit: Open-source cheminformatics, 2006.

[3] Eric Anderson, Gilman D. Veith, and David Weininger. SMILES: a line notation and computerized interpreter for chemical structures, 1987.

[4] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (ICLR), 2020.

[5] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., … & Leskovec, J. (2020). Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33, 22118-22133.

cs224w colab 2 https://colab.research.google.com/drive/1BRPw3WQjP8ANSFz-4Z1ldtNt9g7zm-bv?usp=sharing

https://ogb.stanford.edu/docs/graphprop/#ogbg-mol

License: MIT