Hanqing Zeng*, Hongkuan Zhou*, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna
IMPORTANT: This a self-contained directory for the C++ implementation of our parallel algorithm proposed in IEEE/IPDPS '19 (journal version: JPDC 2021). If you are only interested in the GraphSAINT minibatch algorithm itself (ICLR '20), but not the parallelization techniques, please ignore this directory.
Contact
Hanqing Zeng ([email protected]), Hongkuan Zhou ([email protected])
Updates
- 2021/02/07: Fix a bug in forward / backward prop for different dimensions.
We rely on Intel MKL for optimized execution of dense matrix multiplication and we require compilation using Intel icc. The parallelization of sampler and subgraph feature aggregation are via OpenMP (provided with icc). You can install Intel Parallel Studio (which includes MKL, icc and OpenMP) following instructions on this page. Students can obtain a free license for the Parallel Studio. Below are the versions we have tested:
- icc >= 19.0.5.281
- mkl >= 2019 Update 5
To compile, first set the environment variable MKLROOT, so that the makefile knows the correct directory to search for the library. For example, if you install Parallel Studio under $HOME directory, then execute the following in your shell:
export MKLROOT=$HOME/intel/mkl
Finally, run from the current directory:
make
Now you are ready to perform parallel GNN training (use GraphSAGE as the backbone architecture).
If the above does not work, you may need to set other environment variables. See this script provided by Intel.
If speed is not critical to you and you just want to check the functionality without going through the trouble of installing MKL, ICC, etc., you can also use the non-MKL compilation. To do this:
- Go to
global.hand comment out line 42 (#define USE_MKL). - Replace the
makefilewithmakefile.nomkl. makein terminal
Currently available datasets:
- PPI
- Flickr
- Amazon
The datasets are available vis this Google Drive link. Download and move the data_cpp folder to the root directory (i.e., parent directory of this directory).
The datasets are stored in binary format, as explained below:
Eleven 32-bit integers, which specify the dimensions of various tensors stored in the other binary files. The following table lists the definition of the eleven integers, and the value of them using Reddit as an example:
| Definition | e.g., Reddit | |
|---|---|---|
| 0 | length of adj_train_indices |
10753238 |
| 1 | length of adj_train_indptr |
232966 (^) |
| 2 | length of adj_full_indices |
23213838 |
| 3 | length of adj_full_indptr |
232966 |
| 4 | number of training nodes | 151701 |
| 5 | number of validation nodes | 55334 |
| 6 | number of test nodes | 23699 |
| 7 | number of rows in the input feature matrix X | 232965 |
| 8 | number of cols in the input feature matrix X | 602 |
| 9 | number of rows in the output label matrix | 232965 |
| 10 | number of cols in the output labe matrix | 41 |
(^) Note: While length of adj_train_indptr equals number of nodes in the full graph plus 1, the training sampler will NOT obtain any node or edge information on the validation/test sets. The reasons are that:
- We remove all edges (u,v) such that u or v belongs to the validation/test sets.
- The initial frontier of the sampler are selected from the set of training nodes.
1D array of type int32. The indices array of the sparse adj_train (in CSR format).
The training adjacency matrix is of shape |V| by |V|, where V is the set of all nodes (training + validation + test). However, an element in adj_train is 1 if and only if there is an edge whose both end points are training nodes. The rest elements are all 0.
1D array of type int32. The indptr array of the sparse adj_train (in CSR format).
1D array of type int32. The indices array of the sparse adj_full (in CSR format).
The full adjacency matrix is of shape |V| by |V|, where V is the set of all nodes (training + validation + test). An element in adj_full is 1 if there is an edge in the full graph (consisting of training + validation + test nodes).
1D array of type int32. The indptr array of the sparse adj_full (in CSR format).
1D array of type int32. It stores node indices of all training nodes.
1D array of type int32. It stores node indices of all test nodes.
1D array of type int32. It stores node indices of all validation nodes.
2D column-major array of type float64. It stores the normalized input feature of all nodes (training + validation + test).
2D column-major array of type float64. It stores the labels of all nodes (training + validation + test).
To run the program after compilation, execute:
./train <dataset> <num_itr> <num_thread> <type_loss> <size_hid> <num_layer> <size_subg> <size_frontier> <rate_learn>
where the first 4 arguments are mandatory. If the other arguments are not provided, the program will use the default value set in ./include/global.h.
- Our C++ training by default uses double precision floating point numbers. You can change it by modifying
t_datain./include/global.h. Note that Tensorflow by default may usefloat32, so be sure to make the data type consistent when comparing training speed with Tensorflow. - You can set
<num_thread>to be the number of physical cores in your system. <type_loss>can be eithersigmoidorsoftmax, and can affect accuracy significantly if not set correctly.<num_itr>specifies number of iterations (NOT number of epochs). We define one iteration as a forward+backward pass using one subgraph.
Examples:
On Reddit:
./train reddit 100 40 softmax
On PPI:
./train ppi 200 40 sigmoid
NOTE: to get high accuracy, it would be necessary to tweak other hyper-parameters such as learning rate, sample size, dropout (to be supported soon), etc.
If accuracy is critical, we recommend you to use the better training algorithm as implemented in the ../graphsaint directory (see ../README.md for details). Alternatively, you can also go to this repo for a Tensorflow implementation of this C++ training algorithm (you may be able to configure hyper-parameters more easily using Tensorflow than using pure C++).
To train on your own data, you can use our script convert.py to convert from the format used in GraphSAINT (python+Tensorflow) to the format used by this C++ version.
python convert.py <dataset name>
Citation
@INPROCEEDINGS{graphsaint-ipdps19,
author={Hanqing Zeng and Hongkuan Zhou and Ajitesh Srivastava and Rajgopal Kannan and Viktor Prasanna},
booktitle={2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
title={Accurate, Efficient and Scalable Graph Embedding},
year={2019},
month={May},
doi={10.1109/IPDPS.2019.00056},
}
@article{graphsaint-jpdc21,
title = {Accurate, efficient and scalable training of Graph Neural Networks},
journal = {Journal of Parallel and Distributed Computing},
volume = {147},
pages = {166-183},
year = {2021},
issn = {0743-7315},
doi = {https://doi.org/10.1016/j.jpdc.2020.08.011},
url = {https://www.sciencedirect.com/science/article/pii/S0743731520303579},
author = {Hanqing Zeng and Hongkuan Zhou and Ajitesh Srivastava and Rajgopal Kannan and Viktor Prasanna},
}
TODO
- The current C++ code implements the algorithm described in our IEEE/IPDPS paper, which is a 'premature' version of our GraphSAINT algorithm. The major differences are that this C++ version does not perform normalization, and currently only supports Frontier (or MRW) sampling. We will update our code to be consistent with the GraphSAINT version.
- The current C++ version does not support dropout yet. We should be able to support dropout easily given the current code framework.
- We will modify the C++ version so that it shares the same interface with the python+Tensorflow version. i.e., users should be able to use the same
ymlfile to configure GNN architecture / hyper-parameters. Conversion of data formats should take place internally during runtime.