| 1 | +# GreaseLM: Graph REASoning Enhanced Language Models for Question Answering |
| 2 | + |
| 3 | +This repo provides the source code & data of our paper [GreaseLM: Graph REASoning Enhanced Language Models for Question Answering](https://arxiv.org/abs/2201.08860) (ICLR 2022 spotlight). If you use any of our code, processed data or pretrained models, please cite: |
| 4 | +```bib |
| 5 | +@inproceedings{zhang2021greaselm, |
| 6 | + title={GreaseLM: Graph REASoning Enhanced Language Models for Question Answering}, |
| 7 | + author={Zhang, Xikun and Bosselut, Antoine and Yasunaga, Michihiro and Ren, Hongyu and Liang, Percy and Manning, Christopher D and Leskovec, Jure}, |
| 8 | + booktitle={International Conference on Learning Representations}, |
| 9 | + year={2022} |
| 10 | +} |
| 11 | +``` |
| 12 | + |
| 13 | +<p align="center"> |
| 14 | + <img src="./figs/greaselm.png" width="600" title="GreaseLM model architecture" alt=""> |
| 15 | +</p> |
| 16 | + |
| 17 | +## 1. Dependencies |
| 18 | + |
| 19 | +- [Python](<https://www.python.org/>) == 3.8 |
| 20 | +- [PyTorch](<https://pytorch.org/get-started/locally/>) == 1.8.0 |
| 21 | +- [transformers](<https://github.com/huggingface/transformers/tree/v3.4.0>) == 3.4.0 |
| 22 | +- [torch-geometric](https://pytorch-geometric.readthedocs.io/) == 1.7.0 |
| 23 | + |
| 24 | +Run the following commands to create a conda environment (assuming CUDA 10.1): |
| 25 | +```bash |
| 26 | +conda create -y -n greaselm python=3.8 |
| 27 | +conda activate greaselm |
| 28 | +pip install numpy==1.18.3 tqdm |
| 29 | +pip install torch==1.8.0+cu101 torchvision -f https://download.pytorch.org/whl/torch_stable.html |
| 30 | +pip install transformers==3.4.0 nltk spacy |
| 31 | +pip install wandb |
| 32 | +conda install -y -c conda-forge tensorboardx |
| 33 | +conda install -y -c conda-forge tensorboard |
| 34 | + |
| 35 | +# for torch-geometric |
| 36 | +pip install torch-scatter==2.0.7 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html |
| 37 | +pip install torch-cluster==1.5.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html |
| 38 | +pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html |
| 39 | +pip install torch-spline-conv==1.2.1 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html |
| 40 | +pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html |
| 41 | +``` |
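After installation, it can help to confirm that the pinned versions are the ones actually in the environment. The following is a minimal sketch, not part of this repo: the helper names `installed_version` and `check_pins` are hypothetical, and it assumes the distribution names match the `pip install` names used above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINNED = {
    "torch": "1.8.0",
    "transformers": "3.4.0",
    "torch-geometric": "1.7.0",
    "torch-scatter": "2.0.7",
    "torch-sparse": "0.6.9",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Drop a local build tag such as "+cu101" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in check_pins(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated `greaselm` environment. If nothing is printed, every pinned package matched.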
| 42 | + |
| 43 | + |
| 44 | +## 2. Download data |
| 45 | + |
| 46 | +### Download and preprocess data yourself |
| 47 | +**Preprocessing the data yourself can take a long time. To download the preprocessed data directly, skip to the next subsection.** |
| 48 | + |
| 49 | +Download the raw ConceptNet, CommonsenseQA, and OpenBookQA data by running |
| 50 | +``` |
| 51 | +./download_raw_data.sh |
| 52 | +``` |
| 53 | + |
| 54 | +Preprocess the raw data by running |
| 55 | +``` |
| 56 | +CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> |
| 57 | +``` |
| 58 | +Set `CUDA_VISIBLE_DEVICES=...` at the beginning of the command to choose which GPU to use. The script will: |
| 59 | +* Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types) |
| 60 | +* Convert the QA datasets into .jsonl files (e.g., stored in `data/csqa/statement/`) |
| 61 | +* Identify all mentioned concepts in the questions and answers |
| 62 | +* Extract a subgraph for each question-answer pair |
| 63 | + |
| 64 | +The script to download and preprocess the [MedQA-USMLE](https://github.com/jind11/MedQA) data and the biomedical knowledge graph based on Disease Database and DrugBank is provided in `utils_biomed/`. |
| 65 | + |
| 66 | +### Directly download preprocessed data |
| 67 | +If you don't want to preprocess the data yourself, you can download all the preprocessed data [here](https://drive.google.com/drive/folders/1T6B4nou5P3u-6jr0z6e3IkitO8fNVM6f?usp=sharing). Download the files into the top-level directory of this repo and unzip them, then move the `medqa_usmle` and `ddb` folders into the `data/` directory. |
| 68 | + |
| 69 | +### Resulting file structure |
| 70 | + |
| 71 | +The resulting file structure should look like this: |
| 72 | + |
| 73 | +```plain |
| 74 | +. |
| 75 | +├── README.md |
| 76 | +├── data/ |
| 77 | + ├── cpnet/ (preprocessed ConceptNet) |
| 78 | + ├── csqa/ |
| 79 | + ├── train_rand_split.jsonl |
| 80 | + ├── dev_rand_split.jsonl |
| 81 | + ├── test_rand_split_no_answers.jsonl |
| 82 | + ├── statement/ (converted statements) |
| 83 | + ├── grounded/ (grounded entities) |
| 84 | + ├── graphs/ (extracted subgraphs) |
| 85 | + ├── ... |
| 86 | + ├── obqa/ |
| 87 | + ├── medqa_usmle/ |
| 88 | + └── ddb/ |
| 89 | +``` |
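Before training, you may want to check that the layout above is in place. The sketch below is not part of this repo: `missing_paths` is a hypothetical helper, and the path list only covers the entries named in the tree (the `...` entries are not checked).

```python
from pathlib import Path

# Relative paths taken from the tree above; "..." entries are not checked.
EXPECTED = [
    "data/cpnet",
    "data/csqa/train_rand_split.jsonl",
    "data/csqa/dev_rand_split.jsonl",
    "data/csqa/test_rand_split_no_answers.jsonl",
    "data/csqa/statement",
    "data/csqa/grounded",
    "data/csqa/graphs",
    "data/obqa",
    "data/medqa_usmle",
    "data/ddb",
]


def missing_paths(repo_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected paths are present.")
```

Run it from the top-level directory of the repo. Any path it lists was either not downloaded or not moved into `data/`.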
| 90 | + |
| 91 | +## 3. Training GreaseLM |
| 92 | +To train GreaseLM on CommonsenseQA, run |
| 93 | +``` |
| 94 | +CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh csqa --data_dir data/ |
| 95 | +``` |
| 96 | +You can list up to 2 GPUs in `CUDA_VISIBLE_DEVICES=...` at the beginning of the command. |
| 97 | + |
| 98 | +Similarly, to train GreaseLM on OpenBookQA, run |
| 99 | +``` |
| 100 | +CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh obqa --data_dir data/ |
| 101 | +``` |
| 102 | + |
| 103 | +To train GreaseLM on MedQA-USMLE, run |
| 104 | +``` |
| 105 | +CUDA_VISIBLE_DEVICES=0 ./run_greaselm__medqa_usmle.sh |
| 106 | +``` |
| 107 | + |
| 108 | +## 4. Pretrained model checkpoints |
| 109 | +You can download a pretrained GreaseLM model on CommonsenseQA [here](https://drive.google.com/file/d/1QPwLZFA6AQ-pFfDR6TWLdBAvm3c_HOUr/view?usp=sharing), which achieves an IH-dev acc. of `79.0` and an IH-test acc. of `74.0`. |
| 110 | + |
| 111 | +You can also download a pretrained GreaseLM model on OpenBookQA [here](https://drive.google.com/file/d/1-QqyiQuU9xlN20vwfIaqYQ_uJMP8d7Pv/view?usp=sharing), which achieves a test acc. of `84.8`. |
| 112 | + |
| 113 | +You can also download a pretrained GreaseLM model on MedQA-USMLE [here](https://drive.google.com/file/d/1j0QxiBiGbv0s9PhseSly6V6uiHWU5IEt/view?usp=sharing), which achieves a test acc. of `38.5`. |
| 114 | + |
| 115 | +## 5. Evaluating a pretrained model checkpoint |
| 116 | +To evaluate a pretrained GreaseLM model checkpoint on CommonsenseQA, run |
| 117 | +``` |
| 118 | +CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh csqa --data_dir data/ --load_model_path /path/to/checkpoint |
| 119 | +``` |
| 120 | +As with training, you can list up to 2 GPUs in `CUDA_VISIBLE_DEVICES=...` at the beginning of the command. |
| 121 | + |
| 122 | +Similarly, to evaluate a pretrained GreaseLM model checkpoint on OpenBookQA, run |
| 123 | +``` |
| 124 | +CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh obqa --data_dir data/ --load_model_path /path/to/checkpoint |
| 125 | +``` |
| 126 | +To evaluate a pretrained GreaseLM model checkpoint on MedQA-USMLE, run |
| 127 | +``` |
| 128 | +INHERIT_BERT=1 CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh medqa_usmle --data_dir data/ --load_model_path /path/to/checkpoint |
| 129 | +``` |
| 130 | + |
| 131 | +## 6. Use your own dataset |
| 132 | +- Convert your dataset to `{train,dev,test}.statement.jsonl` files (see `data/csqa/statement/train.statement.jsonl` for the expected format) |
| 133 | +- Create a directory `data/{yourdataset}/` to store the .jsonl files |
| 134 | +- Modify `preprocess.py` and perform subgraph extraction for your data |
| 135 | +- Modify `utils/parser_utils.py` to support your own dataset |
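The conversion step above could be sketched as follows. This is an illustration only: the field names (`id`, `answerKey`, `question`, `statements`) are an assumed schema that this README does not confirm, so compare them against `data/csqa/statement/train.statement.jsonl` before relying on them. The helpers `to_statement_record` and `write_jsonl` are hypothetical, and joining the question stem with each answer is a placeholder for whatever statement conversion fits your data.

```python
import json


def to_statement_record(example_id, stem, choices, answer_index):
    """Build one record. Field names are an ASSUMED schema; verify against the CSQA example file."""
    labels = [chr(ord("A") + i) for i in range(len(choices))]
    return {
        "id": example_id,
        "answerKey": labels[answer_index],
        "question": {
            "stem": stem,
            "choices": [{"label": l, "text": t} for l, t in zip(labels, choices)],
        },
        # One statement per answer choice; only the correct one is labeled True.
        "statements": [
            {"label": i == answer_index, "statement": f"{stem} {text}"}
            for i, text in enumerate(choices)
        ],
    }


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    record = to_statement_record("q1", "Where do fish live?", ["desert", "water"], 1)
    print(json.dumps(record, indent=2))
```

Writing one file per split with `write_jsonl` gives you the `{train,dev,test}.statement.jsonl` files that the later steps expect.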
| 136 | + |
| 137 | +## 7. Acknowledgment |
| 138 | +This repo is built upon the following work: |
| 139 | +``` |
| 140 | +QA-GNN: Question Answering using Language Models and Knowledge Graphs |
| 141 | +https://github.com/michiyasunaga/qagnn |
| 142 | +``` |
| 143 | +Many thanks to the authors and developers! |