Update README to make it clearer

XikunZhang · XikunZhang · commit 1898eae7c81e · 2022-03-04T21:30:07.000-08:00
diff --git a/README.md b/README.md
@@ -1,13 +1,12 @@
-# GreaseLM: Graph REASoning Enhanced Language Models
+# GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
 
-This repo provides the source code & data of our paper "GreaseLM: Graph REASoning Enhanced Language Models".
+This repo provides the source code & data of our paper [GreaseLM: Graph REASoning Enhanced Language Models for Question Answering](https://arxiv.org/abs/2201.08860) (ICLR 2022 spotlight).
 
 <p align="center">
   <img src="./figs/greaselm.png" width="600" title="GreaseLM model architecture" alt="">
 </p>
 
-## Usage
-### 1. Dependencies
+## 1. Dependencies
 
 - [Python](<https://www.python.org/>) == 3.8
 - [PyTorch](<https://pytorch.org/get-started/locally/>) == 1.8.0
@@ -34,14 +33,17 @@ pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.
 ```
 
 
-### 2. Download data
+## 2. Download data
 
-Download all the raw data -- ConceptNet, CommonsenseQA, OpenBookQA -- by
+### Download and preprocess data yourself
+**Preprocessing the data yourself may take a long time. To download the preprocessed data directly, skip to the next subsection.**
+
+Download the raw ConceptNet, CommonsenseQA, and OpenBookQA data by running
 ```
 ./download_raw_data.sh
 ```
 
-You can preprocess the raw data by running
+You can preprocess these raw data by running
 ```
 CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes>
 ```
@@ -51,32 +53,34 @@ You can specify the GPU you want to use in the beginning of the command `CUDA_VI
 * Identify all mentioned concepts in the questions and answers
 * Extract subgraphs for each q-a pair
 
-**TL;DR**. The preprocessing may take long; for your convenience, you can download all the processed data [here](https://drive.google.com/drive/folders/1T6B4nou5P3u-6jr0z6e3IkitO8fNVM6f?usp=sharing) into the top-level directory of this repo and run
-```
-unzip data_preprocessed.zip
-```
+The script to download and preprocess the [MedQA-USMLE](https://github.com/jind11/MedQA) data and the biomedical knowledge graph based on Disease Database and DrugBank is provided in `utils_biomed/`.
 
-**Add MedQA-USMLE**. Besides the commonsense QA datasets (*CommonsenseQA*, *OpenBookQA*) with the ConceptNet knowledge graph, we added a biomedical QA dataset ([*MedQA-USMLE*](https://github.com/jind11/MedQA)) with a biomedical knowledge graph based on Disease Database and DrugBank. You can download all the data for this from [[here]](https://drive.google.com/file/d/1EqbiNt2ACXVrc9gmoXnzTEo9GJTe9Uor/view?usp=sharing). Unzip it and put the `medqa_usmle` and `ddb` folders inside the `data/` directory.
+### Directly download preprocessed data
+If you don't want to preprocess the data yourself, you can download all the preprocessed data [here](https://drive.google.com/drive/folders/1T6B4nou5P3u-6jr0z6e3IkitO8fNVM6f?usp=sharing). Download the files into the top-level directory of this repo and unzip them. Then move the `medqa_usmle` and `ddb` folders into the `data/` directory.
 
+### Resulting file structure
 
 The resulting file structure should look like this:
 
 ```plain
 .
 ├── README.md
-└── data/
-    ├── cpnet/                 (preprocessed ConceptNet)
-    └── csqa/
+├── data/
+    ├── cpnet/                 (preprocessed ConceptNet)
+    ├── csqa/
         ├── train_rand_split.jsonl
         ├── dev_rand_split.jsonl
         ├── test_rand_split_no_answers.jsonl
         ├── statement/             (converted statements)
         ├── grounded/              (grounded entities)
         ├── graphs/                (extracted subgraphs)
         ├── ...
+    ├── obqa/
+    ├── medqa_usmle/
+    └── ddb/
 ```
 
-### 3. Training GreaseLM
+## 3. Training GreaseLM
 To train GreaseLM on CommonsenseQA, run
 ```
 CUDA_VISIBLE_DEVICES=0 ./run_greaselm.sh csqa --data_dir data/
@@ -93,14 +97,14 @@ To train GreaseLM on MedQA-USMLE, run
 CUDA_VISIBLE_DEVICES=0 ./run_greaselm__medqa_usmle.sh
 ```
 
-### 4. Pretrained model checkpoints
+## 4. Pretrained model checkpoints
 You can download a pretrained GreaseLM model on CommonsenseQA [here](https://drive.google.com/file/d/1QPwLZFA6AQ-pFfDR6TWLdBAvm3c_HOUr/view?usp=sharing), which achieves an IH-dev acc. of `79.0` and an IH-test acc. of `74.0`.
 
 You can also download a pretrained GreaseLM model on OpenBookQA [here](https://drive.google.com/file/d/1-QqyiQuU9xlN20vwfIaqYQ_uJMP8d7Pv/view?usp=sharing), which achieves a test acc. of `84.8`.
 
 You can also download a pretrained GreaseLM model on MedQA-USMLE [here](https://drive.google.com/file/d/1x5nZEprV0Ht8IWViyz3d07uGLXtNjUN1/view?usp=sharing), which achieves a test acc. of `38.5`.
 
-### 5. Evaluating a pretrained model checkpoint
+## 5. Evaluating a pretrained model checkpoint
 To evaluate a pretrained GreaseLM model checkpoint on CommonsenseQA, run
 ```
 CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh csqa --data_dir data/ --load_model_path /path/to/checkpoint
@@ -112,13 +116,13 @@ Similarly, to evaluate a pretrained GreaseLM model checkpoint on OpenBookQA, run
 CUDA_VISIBLE_DEVICES=0 ./eval_greaselm.sh obqa --data_dir data/ --load_model_path /path/to/checkpoint
 ```
 
-### 6. Use your own dataset
+## 6. Use your own dataset
 - Convert your dataset to `{train,dev,test}.statement.jsonl` files in JSON Lines format (see `data/csqa/statement/train.statement.jsonl`)
 - Create a directory in `data/{yourdataset}/` to store the .jsonl files
 - Modify `preprocess.py` and perform subgraph extraction for your data
 - Modify `utils/parser_utils.py` to support your own dataset
 
-## Acknowledgment
+## 7. Acknowledgment
 This repo is built upon the following work:
 ```
 QA-GNN: Question Answering using Language Models and Knowledge Graphs