This repository implements and evaluates an efficient Retrieval-Augmented Generation (RAG) pipeline on HotpotQA.
The core research question is:
Can we reduce RAG latency and prompt cost while preserving answer quality?
We explore two lightweight system optimizations:
- Early-Exit Retrieval: Stop retrieval early when the retrieved evidence appears sufficiently confident.
- Token-Budgeted Context Compression: Select only the most relevant evidence sentences under a fixed prompt token budget.
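As a rough illustration of early-exit retrieval, the sketch below picks a retrieval depth from a list of similarity scores. The stopping rule (stop once the next chunk scores below a fraction of the top score), the function name `early_exit_depth`, and the parameters `min_k` and `rel_drop` are assumptions made for this example; only `max_k = 8` comes from the experiments in this README, and the actual controller in `rag/retriever.py` may use a different confidence signal.

```python
def early_exit_depth(scores, min_k=2, max_k=8, rel_drop=0.6):
    """Pick a retrieval depth k from similarity scores sorted in descending order.

    Hypothetical rule: keep adding chunks until the next one scores below
    rel_drop * (top score), or until max_k is reached.
    """
    limit = min(max_k, len(scores))
    k = min(min_k, limit)
    while k < limit and scores[k] >= rel_drop * scores[0]:
        k += 1
    return k


# Two strong hits followed by a sharp drop: stop at the minimum depth.
print(early_exit_depth([0.82, 0.80, 0.41, 0.40, 0.39]))
# A flat score profile gives no clear cut-off, so retrieval runs to max_k.
print(early_exit_depth([0.70, 0.69, 0.68, 0.66, 0.65, 0.64, 0.63, 0.62, 0.61]))
```

A flat score profile is typical of multi-hop questions, which may help explain why the average depth stays close to `max_k` in the results reported later.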
The final system is evaluated with FLAN-T5-small and FLAN-T5-base, using answer quality, latency, token usage, retrieval depth, error analysis, evidence recall, fallback behavior, and Pareto frontier analysis.
Standard RAG systems often retrieve a fixed number of chunks and append them directly to the model prompt. This design is simple, but inefficient:
- Some queries do not require the full top-k retrieval depth.
- Long prompts increase inference latency and token cost.
- Irrelevant retrieved text can distract the model.
- Multi-hop QA tasks such as HotpotQA require enough evidence coverage, so aggressive pruning may hurt quality.
This project studies the tradeoff between:
- Efficiency: latency, prompt tokens, retrieval depth
- Quality: EM, F1, evidence recall
The goal is not to build a production vector database. Instead, the goal is to build a reproducible semester-scale RAG system that exposes real efficiency-quality tradeoffs.
This project intentionally avoids FAISS to improve reproducibility. It uses dense embeddings and cosine similarity, which are sufficient for the project-scale HotpotQA experiments.
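To make the FAISS-free design concrete, here is a minimal sketch of brute-force cosine-similarity retrieval with NumPy. The function name `top_k_cosine` and the toy 2-D vectors are illustrative assumptions, not the code in `rag/index.py`; a real run would use vectors produced by the embedding model.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # Normalize so that a plain dot product equals cosine similarity.
    # (Assumes no all-zero vectors; a real index would guard against that.)
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # Sorting the negated scores yields descending similarity order.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy "embeddings" standing in for real sentence vectors.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
idx, sims = top_k_cosine(query, docs, k=2)
print(idx, sims)
```

An exhaustive scan like this costs one matrix-vector product per query, which is cheap at the scale of a few hundred HotpotQA examples.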
The following commands reproduce a minimal full-system run with google/flan-t5-base.
# 1. Install dependencies
pip install -r requirements.txt
# 2. Prepare HotpotQA subset
python -m scripts.prepare_hotpotqa --input data/hotpot_dev_distractor_v1.json --output-dir data/hotpot_dev_small --chunk-size 3 --stride 2 --limit 300
# 3. Build retrieval index
python -m scripts.build_index --corpus data/hotpot_dev_small/hotpot_corpus.jsonl --index-dir artifacts/hotpot_index --device cpu
# 4. Run the full system
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --device cpu --mode full_system --output artifacts/base/full_system_hf.jsonl --max-examples 100
Use --device cuda instead of --device cpu if a GPU is available.
| Goal | Script |
|---|---|
| Prepare HotpotQA subset | scripts.prepare_hotpotqa |
| Build retrieval index | scripts.build_index |
| Run one RAG variant | scripts.run_eval |
| Compare four variants | scripts.compare_rag_results |
| Generate error analysis | scripts.error_analysis_hotpot |
| Generate evidence recall report | scripts.evidence_recall_hotpot |
| Run parameter sweep | scripts.run_sweep |
| Plot Pareto frontier | scripts.plot_pareto |
Recommended main workflow:
prepare_hotpotqa → build_index → run_eval × 4 → compare_rag_results
→ error_analysis_hotpot
→ evidence_recall_hotpot
Optional workflow:
run_sweep → plot_pareto
This project provides:
- A runnable RAG pipeline for HotpotQA.
- Four system variants for ablation:
  - baseline
  - early_exit_only
  - token_budget_only
  - full_system
- A fallback experiment for low-confidence retrieval.
- Automatic evaluation scripts for:
  - EM / F1
  - latency
  - prompt tokens
  - retrieval depth
  - error categories
  - evidence recall
  - Pareto frontier
- Comparative analysis between google/flan-t5-small and google/flan-t5-base.
Question
↓
Dense Retriever
↓
Early-Exit Controller
↓
Retrieved Chunks
↓
Sentence Split + Relevance Scoring
↓
Token-Budgeted Evidence Packing
↓
Prompt Construction
↓
FLAN-T5 small / base
↓
Answer
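The sentence scoring and token-budgeted packing stages in the pipeline above could be sketched as follows. Everything here is an assumption for illustration: the naive regex sentence splitter, the word-overlap relevance score, and the use of whitespace-separated words as a stand-in for model tokens. The actual `rag/context_builder.py` may score and count tokens differently.

```python
import re


def split_sentences(text):
    # Naive splitter on ., ! or ? followed by whitespace (illustrative only).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(question, sentence):
    # Hypothetical relevance score: fraction of question words found in the sentence.
    q = set(re.findall(r"\w+", question.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def pack_evidence(question, chunks, budget):
    """Greedily keep the highest-scoring sentences whose total length fits the budget."""
    sentences = [s for chunk in chunks for s in split_sentences(chunk)]
    ranked = sorted(sentences, key=lambda s: overlap_score(question, s), reverse=True)
    kept, used = [], 0
    for sent in ranked:
        cost = len(sent.split())  # whitespace words stand in for model tokens
        if used + cost <= budget:
            kept.append(sent)
            used += cost
    # Restore the original order so the packed context reads naturally.
    kept.sort(key=sentences.index)
    return " ".join(kept), used


chunks = [
    "Paris is the capital of France. It hosts the Louvre.",
    "Berlin is the capital of Germany. The weather is mild.",
]
context, used = pack_evidence("What is the capital of France?", chunks, budget=12)
print(context, used)
```

With a budget of 12 words, only the two sentences that share the most words with the question are kept, and the two low-overlap sentences are dropped.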
| System | Description |
|---|---|
| baseline | Fixed top-k retrieval, no context compression |
| early_exit_only | Adaptive retrieval stopping only |
| token_budget_only | Sentence-level context compression only |
| full_system | Early-exit retrieval + token-budgeted compression |
| full_system + fallback | Full system with low-confidence fallback enabled |
rag_cost_reducer/
├── README.md
├── requirements.txt
├── data/
│   ├── hotpot_dev_distractor_v1.json
│   └── hotpot_dev_small/
│       ├── hotpot_corpus.jsonl
│       └── hotpot_eval.json
├── artifacts/
│   ├── hotpot_index/
│   ├── base/
│   ├── small/
│   ├── reports/
│   └── sweep_base/
├── rag/
│   ├── config.py
│   ├── corpus.py
│   ├── embedder.py
│   ├── index.py
│   ├── retriever.py
│   ├── context_builder.py
│   ├── generator.py
│   ├── pipeline.py
│   └── metrics.py
└── scripts/
    ├── prepare_hotpotqa.py
    ├── build_index.py
    ├── run_eval.py
    ├── compare_rag_results.py
    ├── error_analysis_hotpot.py
    ├── evidence_recall_hotpot.py
    ├── run_sweep.py
    └── plot_pareto.py
Tested with Python 3.10. Python 3.9+ should work.
Create a virtual environment:
python -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
For CPU-only environments, install CPU PyTorch first:
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
CPU is sufficient for reproducing the 100-example experiments, although GPU inference is significantly faster.
Place the raw HotpotQA file here:
data/hotpot_dev_distractor_v1.json
Prepare a smaller experimental subset:
python -m scripts.prepare_hotpotqa --input data/hotpot_dev_distractor_v1.json --output-dir data/hotpot_dev_small --chunk-size 3 --stride 2 --limit 300
This creates:
data/hotpot_dev_small/hotpot_corpus.jsonl
data/hotpot_dev_small/hotpot_eval.json
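One plausible reading of `--chunk-size 3 --stride 2` is a sliding window over sentences: each chunk holds three sentences and consecutive chunks start two sentences apart, so neighbours share one sentence. The sketch below illustrates that reading only; the function name `sliding_chunks` is hypothetical and the real behaviour of `scripts.prepare_hotpotqa` is not confirmed here.

```python
def sliding_chunks(sentences, chunk_size=3, stride=2):
    """Group sentences into overlapping windows of chunk_size, advancing by stride."""
    chunks = []
    for start in range(0, len(sentences), stride):
        window = sentences[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(sentences):
            break  # this window already reached the last sentence
    return chunks


sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(sliding_chunks(sents))
```

Under this assumption, the one-sentence overlap keeps a fact that straddles a chunk boundary retrievable from at least one chunk.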
If the raw HotpotQA file is too large for submission, do not include it in the final artifact. Instead, keep the prepared subset or document where the original dataset should be placed.
python -m scripts.build_index --corpus data/hotpot_dev_small/hotpot_corpus.jsonl --index-dir artifacts/hotpot_index --device cpu
Use --device cuda if a GPU is available.
The following commands run the four main variants with google/flan-t5-base.
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --mode baseline --output artifacts/base/baseline_hf.jsonl --max-examples 100
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --mode early_exit_only --output artifacts/base/early_exit_only_hf.jsonl --max-examples 100
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --mode token_budget_only --output artifacts/base/token_budget_only_hf.jsonl --max-examples 100
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --mode full_system --output artifacts/base/full_system_hf.jsonl --max-examples 100
Each run prints:
- EM
- F1
- evidence hit rate
- evidence recall
- average latency
- prompt tokens
- average retrieval depth
- SLO checks
- fallback statistics, if enabled
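For readers unfamiliar with EM and F1, the sketch below shows the common SQuAD-style definitions: answers are lowercased, stripped of punctuation and articles, then compared exactly (EM) or by token overlap (F1). This is an assumption about how `rag/metrics.py` behaves, not a copy of it, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("Barack Obama", "Obama"))
```

F1 gives partial credit when the prediction contains the gold answer plus extra words, which is why F1 is consistently higher than EM in the results tables.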
After running the four main variants:
python -m scripts.compare_rag_results --baseline artifacts/base/baseline_hf.jsonl --early-exit-only artifacts/base/early_exit_only_hf.jsonl --token-budget-only artifacts/base/token_budget_only_hf.jsonl --full-system artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/base_ablation --title "HotpotQA RAG Ablation Report - FLAN-T5 Base"
Outputs:
artifacts/reports/base_ablation/summary.tsv
artifacts/reports/base_ablation/report_one_page.png
artifacts/reports/base_ablation/report_one_page.pdf
Error analysis categorizes prediction failures into retrieval, reasoning, ambiguity, partial-answer, and yes/no errors.
Example:
python -m scripts.error_analysis_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/error_full_system --title "Error Analysis - Full System"
Recommended commands:
python -m scripts.error_analysis_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/error_baseline --title "Error Analysis - Baseline"
python -m scripts.error_analysis_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/error_early_exit --title "Error Analysis - Early Exit"
python -m scripts.error_analysis_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/error_token_budget --title "Error Analysis - Token Budget"
python -m scripts.error_analysis_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/error_full_system --title "Error Analysis - Full System"
Outputs:
error_summary.tsv
error_examples.txt
error_report.png
error_report.pdf
Main categories:
| Category | Meaning |
|---|---|
| correct | Prediction exactly matches the gold answer |
| retrieval_failure | Gold answer is missing from retrieved context |
| partial_answer | Prediction overlaps with gold but is incomplete |
| ambiguity_or_wrong_entity | Model selects the wrong entity |
| reasoning_or_extraction_failure | Evidence exists, but model fails to use it |
| unknown_output | Model outputs unknown/unanswerable |
| yesno_reasoning_error | Incorrect yes/no decision |
Evidence recall measures whether the gold answer appears in the retrieved context.
Example:
python -m scripts.evidence_recall_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/evidence_full_system --title "Evidence Recall - Full System"
Recommended commands:
python -m scripts.evidence_recall_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/evidence_baseline --title "Evidence Recall - Baseline"
python -m scripts.evidence_recall_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/evidence_early_exit --title "Evidence Recall - Early Exit"
python -m scripts.evidence_recall_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/evidence_token_budget --title "Evidence Recall - Token Budget"
python -m scripts.evidence_recall_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/evidence_full_system --title "Evidence Recall - Full System"
Outputs:
evidence_recall_summary.json
evidence_recall_summary.tsv
evidence_recall_report.png
evidence_recall_report.pdf
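The evidence recall definition given above (the gold answer appears in the retrieved context) can be sketched as a simple substring check averaged over examples. The record field names `gold` and `context` are hypothetical, and the real script may apply additional normalization before matching.

```python
def answer_in_context(gold, context):
    # Case-insensitive substring match; the real script may normalize differently.
    return gold.strip().lower() in context.lower()


def evidence_recall(records):
    """Fraction of records whose retrieved context contains the gold answer."""
    if not records:
        return 0.0
    hits = sum(answer_in_context(r["gold"], r["context"]) for r in records)
    return hits / len(records)


records = [
    {"gold": "Paris", "context": "Paris is the capital of France."},
    {"gold": "1969", "context": "The mission launched from Florida."},
]
print(evidence_recall(records))
```

A substring check of this kind is a lower bound on usable evidence for yes/no questions, where the gold answer rarely appears verbatim in the context.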
Fallback is designed for low-confidence queries. If the retriever is uncertain, it continues retrieval to a more conservative minimum depth.
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --device cpu --mode full_system --enable-fallback 0 --output artifacts/base/full_system_no_fallback.jsonl --max-examples 100
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --device cpu --mode full_system --enable-fallback 1 --fallback-min-k 6 --low-conf-top1-threshold 0.50 --low-conf-margin-threshold 0.10 --output artifacts/base/full_system_with_fallback.jsonl --max-examples 100
Negative result: In the current experiment, fallback is triggered frequently but provides limited quality improvement because the system already operates close to max_k = 8. This result is useful because it shows that fallback policies must be designed together with the available retrieval budget.
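The fallback flags suggest a rule along the following lines: treat a query as low-confidence when the top-1 score is below a threshold or the gap between the top two scores is small, then raise the retrieval depth to at least the fallback minimum. This is an inferred sketch based on the flag names, not the repository's implementation; the function `apply_fallback` and its exact logic are assumptions.

```python
def apply_fallback(scores, k, fallback_min_k=6, top1_threshold=0.50,
                   margin_threshold=0.10, max_k=8):
    """Return (new_k, triggered) for similarity scores sorted in descending order."""
    top1 = scores[0] if scores else 0.0
    margin = top1 - scores[1] if len(scores) > 1 else top1
    low_confidence = top1 < top1_threshold or margin < margin_threshold
    if not low_confidence:
        return k, False
    # Raise the depth to the fallback minimum, but never beyond max_k.
    return min(max(k, fallback_min_k), max_k), True


# Small margin and early exit already chose k=7: triggered, depth unchanged.
print(apply_fallback([0.62, 0.60, 0.55], k=7))
# Same scores but a shallow early exit: depth is raised to 6.
print(apply_fallback([0.62, 0.60, 0.55], k=3))
# Clear winner: fallback is not triggered.
print(apply_fallback([0.90, 0.40, 0.30], k=3))
```

Under this assumed rule, a trigger changes nothing whenever early exit has already selected a depth at or above `fallback_min_k`, which would be consistent with a high fallback rate alongside an unchanged average depth.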
To evaluate multiple budgets and retrieval settings:
python -m scripts.run_sweep --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-base --device cpu --out-dir artifacts/sweep_base --max-examples 100
After running a sweep:
python -m scripts.plot_pareto --summary-dir artifacts/sweep_base --out artifacts/sweep_base/pareto.png
The plot uses:
x-axis: average latency
y-axis: EM
This plot identifies Pareto-efficient configurations: those for which no other configuration is at least as fast and at least as accurate while being strictly better on one of the two.
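Pareto efficiency over latency and EM can be computed with a direct dominance check, sketched below. The function `pareto_frontier` is illustrative and is not taken from `scripts.plot_pareto`; the four data points are borrowed from the FLAN-T5-base ablation table in this README rather than from a sweep.

```python
def pareto_frontier(points):
    """points: list of (name, latency, em). Lower latency and higher EM are better."""
    frontier = []
    for name, lat, em in points:
        # A point is dominated if another is no worse on both axes and better on one.
        dominated = any(
            o_lat <= lat and o_em >= em and (o_lat < lat or o_em > em)
            for _, o_lat, o_em in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


runs = [
    ("baseline", 0.719, 0.260),
    ("early_exit_only", 0.684, 0.240),
    ("token_budget_only", 0.580, 0.290),
    ("full_system", 0.595, 0.270),
]
print(pareto_frontier(runs))
```

On these four points the frontier collapses to a single configuration because `token_budget_only` is both the fastest and the most accurate; a parameter sweep usually yields a longer frontier.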
To test model capacity effects, replace the model name and output directory:
python -m scripts.run_eval --index-dir artifacts/hotpot_index --eval data/hotpot_dev_small/hotpot_eval.json --generator hf --hf-model google/flan-t5-small --device cpu --mode full_system --output artifacts/small/full_system_hf.jsonl --max-examples 100
Repeat for:
baseline
early_exit_only
token_budget_only
full_system
Then run the same comparison, error-analysis, and evidence-recall scripts on artifacts/small/.
| File | Meaning |
|---|---|
| *.jsonl | Per-example predictions and metrics |
| summary.tsv | Aggregated metrics across system variants |
| report_one_page.png/pdf | Main ablation report |
| error_report.png/pdf | Error category visualization |
| error_examples.txt | Example errors grouped by category |
| evidence_recall_report.png/pdf | Evidence recall visualization |
| evidence_recall_summary.json/tsv | Evidence recall statistics |
| pareto.png | Pareto frontier plot |
| System | EM | F1 | Avg Latency (s) | Avg k | Prompt Tokens |
|---|---|---|---|---|---|
| baseline | 0.260 | 0.375 | 0.719 | 8.00 | 431.6 |
| early_exit_only | 0.240 | 0.349 | 0.684 | 7.28 | 420.3 |
| token_budget_only | 0.290 | 0.416 | 0.580 | 8.00 | 286.7 |
| full_system | 0.270 | 0.392 | 0.595 | 7.28 | 286.5 |
Interpretation:
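- token_budget_only achieves the best EM (0.290) and the lowest latency (0.580 s), likely because compression removes noisy context.
- early_exit_only reduces average retrieval depth (8.00 to 7.28) but slightly hurts answer quality (EM 0.240 vs. 0.260 for baseline).
- full_system provides a strong efficiency-quality tradeoff: relative to baseline it cuts latency by about 17% and prompt tokens by about 34% while EM stays slightly higher (0.270 vs. 0.260).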
We use the following practical SLOs:
Latency ≤ 0.6s
Prompt Tokens ≤ 300
EM ≥ 0.25
| System | Latency ≤ 0.6s | Prompt Tokens ≤ 300 | EM ≥ 0.25 |
|---|---|---|---|
| baseline | No | No | Yes |
| early_exit_only | No | No | No |
| token_budget_only | Yes | Yes | Yes |
| full_system | Yes / near target | Yes | Yes |
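The full system satisfies the token and quality SLOs. Its latency of 0.595 s in the main ablation run sits just under the 0.6 s target, while the fallback experiment measured 0.617 s for full_system without fallback, so it is best read as near the target rather than comfortably within it.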
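The SLO table can be reproduced from the ablation metrics with a few comparisons. The sketch below uses the thresholds and the FLAN-T5-base numbers reported in this README; the names `SLOS` and `check_slos` are illustrative and do not refer to code in the repository.

```python
# Thresholds from the SLO list above.
SLOS = {"latency": 0.6, "prompt_tokens": 300, "em": 0.25}


def check_slos(latency, prompt_tokens, em):
    """Return a pass/fail flag per SLO for one system variant."""
    return {
        "latency": latency <= SLOS["latency"],
        "prompt_tokens": prompt_tokens <= SLOS["prompt_tokens"],
        "em": em >= SLOS["em"],
    }


# Metrics copied from the FLAN-T5-base ablation table.
results = {
    "baseline": {"latency": 0.719, "prompt_tokens": 431.6, "em": 0.260},
    "early_exit_only": {"latency": 0.684, "prompt_tokens": 420.3, "em": 0.240},
    "token_budget_only": {"latency": 0.580, "prompt_tokens": 286.7, "em": 0.290},
    "full_system": {"latency": 0.595, "prompt_tokens": 286.5, "em": 0.270},
}
for name, metrics in results.items():
    print(name, check_slos(**metrics))
```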
| System | EM | F1 | Avg Latency (s) | Avg k | Fallback Rate |
|---|---|---|---|---|---|
| full_system_no_fallback | 0.270 | 0.392 | 0.6166 | 7.28 | 0.00 |
| full_system_with_fallback | 0.270 | 0.392 | 0.6408 | 7.28 | 0.66 |
Fallback was triggered in 66% of examples, but it did not improve EM/F1 under the current max_k = 8 setting. This suggests that fallback would require a larger retrieval budget or stronger policy to be effective.
| System | Overall Recall |
|---|---|
| baseline | 0.59 |
| early_exit_only | 0.57 |
| token_budget_only | 0.54 |
| full_system | 0.52 |
Evidence recall decreases as efficiency optimization becomes stronger (from 0.59 for baseline to 0.52 for full_system), which is consistent with missing evidence being a major failure mode.
- Token-budgeted compression improves both efficiency and quality. It reduces prompt tokens by roughly one third and improves EM for flan-t5-base.
- Early-exit retrieval reduces latency but can hurt evidence coverage. It lowers retrieval depth but increases retrieval-related failures.
- The full system achieves a strong tradeoff. It reduces latency and prompt tokens while maintaining competitive answer quality.
- Evidence recall decreases under stronger efficiency optimization. This suggests that missing evidence is a major failure mode.
- Model capacity matters. flan-t5-base is more robust to compressed context than flan-t5-small.
- Fallback detection works, but the current fallback strategy has limited benefit. It triggers frequently but does not improve accuracy under the current max_k = 8 setting.
- Dataset: HotpotQA subset
- Main example count: 100
- Tested Python: 3.10
- Models:
  - google/flan-t5-small
  - google/flan-t5-base
- Main metrics:
  - EM
  - F1
  - latency
  - prompt tokens
  - retrieval depth
  - evidence recall
- Recommended hardware:
  - CPU is sufficient for small-scale reproduction
  - GPU is recommended for faster FLAN-T5 inference
Efficient RAG is not only about reducing token count. It is about preserving useful evidence while removing noise.
Token-budgeted context compression improves the signal-to-noise ratio of the prompt, while early-exit retrieval provides latency savings at the cost of possible evidence loss. The best design balances retrieval coverage, prompt compactness, and model reasoning capacity.