RAG Cost Reducer: Early-Exit Retrieval + Token-Budgeted Context Compression

This repository implements and evaluates an efficient Retrieval-Augmented Generation (RAG) pipeline on HotpotQA.

The core research question is:

Can we reduce RAG latency and prompt cost while preserving answer quality?

We explore two lightweight system optimizations:

Early-Exit Retrieval: stop retrieving once the evidence already retrieved appears sufficient, as judged by retrieval confidence.
Token-Budgeted Context Compression: select only the most relevant evidence sentences under a fixed prompt token budget.
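
The stopping rule for the first optimization is not spelled out in this README. As a minimal sketch, assume the retriever returns similarity scores sorted in descending order, and that confidence is judged from the top-1 score plus the gap between the top hit and the next candidate. The function name, `min_k`, and both thresholds are hypothetical; only `max_k = 8` matches the retrieval budget reported in the results.

```python
from typing import Sequence


def early_exit_depth(
    scores: Sequence[float],
    min_k: int = 2,                # hypothetical lower bound on depth
    max_k: int = 8,                # retrieval budget used in the experiments
    top1_threshold: float = 0.6,   # hypothetical confidence threshold
    gap_threshold: float = 0.15,   # hypothetical score-gap threshold
) -> int:
    """Return how many ranked chunks to keep.

    `scores` must be sorted in descending order. Retrieval stops at depth k
    when the top hit is strong and the candidate at rank k+1 scores clearly
    lower than the top hit. Otherwise the full budget is used.
    """
    limit = min(max_k, len(scores))
    if limit == 0:
        return 0
    for k in range(min(min_k, limit), limit):
        confident = scores[0] >= top1_threshold
        clear_gap = scores[0] - scores[k] >= gap_threshold
        if confident and clear_gap:
            return k
    return limit


# A sharp drop after the second hit: stop early.
sharp = [0.82, 0.80, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35]
# Flat, weak scores: keep the full budget.
flat = [0.5] * 10
print(early_exit_depth(sharp), early_exit_depth(flat))
```

A rule of this shape only saves work when scores separate cleanly, which is one reason multi-hop questions may still need the full depth.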

The final system is evaluated with FLAN-T5-small and FLAN-T5-base, using answer quality, latency, token usage, retrieval depth, error analysis, evidence recall, fallback behavior, and Pareto frontier analysis.

1. Motivation

Standard RAG systems often retrieve a fixed number of chunks and append them directly to the model prompt. This design is simple, but inefficient:

Some queries do not require the full top-k retrieval depth.
Long prompts increase inference latency and token cost.
Irrelevant retrieved text can distract the model.
Multi-hop QA tasks such as HotpotQA require enough evidence coverage, so aggressive pruning may hurt quality.

This project studies the tradeoff between:

Efficiency: latency, prompt tokens, retrieval depth
Quality: EM, F1, evidence recall

The goal is not to build a production vector database. Instead, the goal is to build a reproducible semester-scale RAG system that exposes real efficiency-quality tradeoffs.

This project intentionally avoids FAISS to improve reproducibility. It uses dense embeddings and cosine similarity, which are sufficient for the project-scale HotpotQA experiments.
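
To illustrate what retrieval without FAISS can look like, here is a minimal sketch of exhaustive cosine-similarity ranking. It uses only the standard library to stay dependency-free; the function names are hypothetical, and the repository's `rag/retriever.py` may well use vectorized operations instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    k: int = 8,
) -> List[Tuple[int, float]]:
    """Score every document embedding and return the k best (index, score) pairs."""
    scored = [(i, cosine(query, d)) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" for three chunks.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))
```

Scoring every chunk is linear in corpus size, which is acceptable for a subset of a few hundred HotpotQA questions.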

2. Quick Start

The following commands reproduce a minimal full-system run with google/flan-t5-base.

# 1. Install dependencies
pip install -r requirements.txt

# 2. Prepare HotpotQA subset
python -m scripts.prepare_hotpotqa   --input data/hotpot_dev_distractor_v1.json   --output-dir data/hotpot_dev_small   --chunk-size 3   --stride 2   --limit 300

# 3. Build retrieval index
python -m scripts.build_index   --corpus data/hotpot_dev_small/hotpot_corpus.jsonl   --index-dir artifacts/hotpot_index   --device cpu

# 4. Run the full system
python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --output artifacts/base/full_system_hf.jsonl   --max-examples 100

Use --device cuda instead of --device cpu if a GPU is available.

3. What to Run

Goal	Script
Prepare HotpotQA subset	`scripts.prepare_hotpotqa`
Build retrieval index	`scripts.build_index`
Run one RAG variant	`scripts.run_eval`
Compare four variants	`scripts.compare_rag_results`
Generate error analysis	`scripts.error_analysis_hotpot`
Generate evidence recall report	`scripts.evidence_recall_hotpot`
Run parameter sweep	`scripts.run_sweep`
Plot Pareto frontier	`scripts.plot_pareto`

Recommended main workflow:

prepare_hotpotqa → build_index → run_eval × 4 → compare_rag_results
                 → error_analysis_hotpot
                 → evidence_recall_hotpot

Optional workflow:

run_sweep → plot_pareto

4. Main Contributions

This project provides:

A runnable RAG pipeline for HotpotQA.
Four system variants for ablation:
- baseline
- early_exit_only
- token_budget_only
- full_system
A fallback experiment for low-confidence retrieval.
Automatic evaluation scripts for:
- EM / F1
- latency
- prompt tokens
- retrieval depth
- error categories
- evidence recall
- Pareto frontier
Comparative analysis between google/flan-t5-small and google/flan-t5-base.

5. System Overview

Question
   ↓
Dense Retriever
   ↓
Early-Exit Controller
   ↓
Retrieved Chunks
   ↓
Sentence Split + Relevance Scoring
   ↓
Token-Budgeted Evidence Packing
   ↓
Prompt Construction
   ↓
FLAN-T5 small / base
   ↓
Answer
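
The sentence-split, relevance-scoring, and packing stages above can be sketched as follows. This is an illustration under stated assumptions, not the code in `rag/context_builder.py`: relevance is approximated by word overlap with the question, token cost is approximated by whitespace word count rather than the FLAN-T5 tokenizer, and all function names are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(question: str, sentence: str) -> float:
    """Fraction of question words that also occur in the sentence (assumed scorer)."""
    q_words = set(re.findall(r"\w+", question.lower()))
    s_words = set(re.findall(r"\w+", sentence.lower()))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


def pack_evidence(question: str, chunks: List[str], token_budget: int = 300) -> List[str]:
    """Greedily keep the highest-scoring sentences that fit in the budget."""
    sentences = [s for chunk in chunks for s in split_sentences(chunk)]
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: overlap_score(question, pair[1]),
        reverse=True,
    )
    picked, used = [], 0
    for idx, sentence in ranked:
        cost = len(sentence.split())  # word count as a stand-in for tokens
        if used + cost > token_budget:
            continue  # skip sentences that would overflow the budget
        picked.append(idx)
        used += cost
    # Restore the original order so the prompt reads naturally.
    return [sentences[i] for i in sorted(picked)]


chunks = [
    "Alpha is a 1999 film. It was directed by Jane Roe.",
    "Bananas are yellow. The film Alpha won awards.",
]
print(pack_evidence("Who directed the film Alpha?", chunks, token_budget=10))
```

Note that with a budget of 10 words the overlap scorer drops the sentence that actually names the director, which mirrors the risk that aggressive pruning removes needed evidence.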

System Variants

System	Description
`baseline`	Fixed top-k retrieval, no context compression
`early_exit_only`	Adaptive retrieval stopping only
`token_budget_only`	Sentence-level context compression only
`full_system`	Early-exit retrieval + token-budgeted compression
`full_system` + fallback	Full system with low-confidence fallback enabled

6. Repository Structure

rag_cost_reducer/
├── README.md
├── requirements.txt
├── data/
│   ├── hotpot_dev_distractor_v1.json
│   └── hotpot_dev_small/
│       ├── hotpot_corpus.jsonl
│       └── hotpot_eval.json
├── artifacts/
│   ├── hotpot_index/
│   ├── base/
│   ├── small/
│   ├── reports/
│   └── sweep_base/
├── rag/
│   ├── config.py
│   ├── corpus.py
│   ├── embedder.py
│   ├── index.py
│   ├── retriever.py
│   ├── context_builder.py
│   ├── generator.py
│   ├── pipeline.py
│   └── metrics.py
└── scripts/
    ├── prepare_hotpotqa.py
    ├── build_index.py
    ├── run_eval.py
    ├── compare_rag_results.py
    ├── error_analysis_hotpot.py
    ├── evidence_recall_hotpot.py
    ├── run_sweep.py
    └── plot_pareto.py

7. Installation

Tested with Python 3.10. Python 3.9+ should work.

Create a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install --upgrade pip
pip install -r requirements.txt

For CPU-only environments, install CPU PyTorch first:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

CPU is sufficient for reproducing the 100-example experiments, although GPU inference is significantly faster.

8. Prepare HotpotQA Data

Place the raw HotpotQA file here:

data/hotpot_dev_distractor_v1.json

Prepare a smaller experimental subset:

python -m scripts.prepare_hotpotqa   --input data/hotpot_dev_distractor_v1.json   --output-dir data/hotpot_dev_small   --chunk-size 3   --stride 2   --limit 300

This creates:

data/hotpot_dev_small/hotpot_corpus.jsonl
data/hotpot_dev_small/hotpot_eval.json
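
The `--chunk-size 3 --stride 2` flags suggest overlapping windows. The sketch below assumes both values count sentences, which this README does not state explicitly; the function name is hypothetical.

```python
from typing import List


def sliding_chunks(sentences: List[str], chunk_size: int = 3, stride: int = 2) -> List[List[str]]:
    """Group sentences into windows of `chunk_size`, advancing by `stride`."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    start = 0
    while start < len(sentences):
        chunks.append(sentences[start:start + chunk_size])
        if start + chunk_size >= len(sentences):
            break  # the last window already reaches the end
        start += stride
    return chunks


paragraph = ["s0", "s1", "s2", "s3", "s4"]
print(sliding_chunks(paragraph))
```

With a stride smaller than the chunk size, adjacent chunks share one sentence, so a fact that straddles a chunk boundary still appears intact in at least one chunk.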

If the raw HotpotQA file is too large for submission, do not include it in the final artifact. Instead, keep the prepared subset or document where the original dataset should be placed.

9. Build the Retrieval Index

python -m scripts.build_index   --corpus data/hotpot_dev_small/hotpot_corpus.jsonl   --index-dir artifacts/hotpot_index   --device cpu

Use --device cuda if a GPU is available.

10. Run the Main Experiments

The following commands run the four main variants with google/flan-t5-base.

10.1 Baseline

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode baseline   --output artifacts/base/baseline_hf.jsonl   --max-examples 100

10.2 Early Exit Only

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode early_exit_only   --output artifacts/base/early_exit_only_hf.jsonl   --max-examples 100

10.3 Token Budget Only

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode token_budget_only   --output artifacts/base/token_budget_only_hf.jsonl   --max-examples 100

10.4 Full System

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode full_system   --output artifacts/base/full_system_hf.jsonl   --max-examples 100

Each run prints:

EM
F1
evidence hit rate
evidence recall
average latency
prompt tokens
average retrieval depth
SLO checks
fallback statistics, if enabled
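
For reference, EM and F1 are commonly computed with SQuAD-style answer normalization. The sketch below assumes that convention (lowercasing, stripping punctuation and English articles); the implementation in `rag/metrics.py` may differ in details.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(f1_score("Barack Obama", "Obama"), 3))
```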

11. Generate the Main Ablation Report

After running the four main variants:

python -m scripts.compare_rag_results   --baseline artifacts/base/baseline_hf.jsonl   --early-exit-only artifacts/base/early_exit_only_hf.jsonl   --token-budget-only artifacts/base/token_budget_only_hf.jsonl   --full-system artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/base_ablation   --title "HotpotQA RAG Ablation Report - FLAN-T5 Base"

Outputs:

artifacts/reports/base_ablation/summary.tsv
artifacts/reports/base_ablation/report_one_page.png
artifacts/reports/base_ablation/report_one_page.pdf

12. Error Analysis

Error analysis categorizes prediction failures into retrieval, reasoning, ambiguity, partial-answer, and yes/no errors.

Example:

python -m scripts.error_analysis_hotpot   --input artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/error_full_system   --title "Error Analysis - Full System"

Recommended commands:

python -m scripts.error_analysis_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/error_baseline --title "Error Analysis - Baseline"
python -m scripts.error_analysis_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/error_early_exit --title "Error Analysis - Early Exit"
python -m scripts.error_analysis_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/error_token_budget --title "Error Analysis - Token Budget"
python -m scripts.error_analysis_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/error_full_system --title "Error Analysis - Full System"

Outputs:

error_summary.tsv
error_examples.txt
error_report.png
error_report.pdf

Main categories:

Category	Meaning
`correct`	Prediction exactly matches the gold answer
`retrieval_failure`	Gold answer is missing from retrieved context
`partial_answer`	Prediction overlaps with gold but is incomplete
`ambiguity_or_wrong_entity`	Model selects the wrong entity
`reasoning_or_extraction_failure`	Evidence exists, but model fails to use it
`unknown_output`	Model outputs unknown/unanswerable
`yesno_reasoning_error`	Incorrect yes/no decision

13. Evidence Recall Analysis

Evidence recall measures whether the gold answer appears in the retrieved context.
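
As a minimal sketch of this definition, the check below treats an example as a hit when the normalized gold answer is a substring of the normalized context. The record keys `gold` and `context` are hypothetical and may not match the fields in the `*.jsonl` outputs.

```python
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def answer_in_context(gold: str, context: str) -> bool:
    """True when the normalized gold answer occurs in the normalized context."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(context)


def evidence_recall(examples: List[Dict[str, str]]) -> float:
    """Fraction of examples whose context contains the gold answer."""
    if not examples:
        return 0.0
    hits = sum(answer_in_context(ex["gold"], ex["context"]) for ex in examples)
    return hits / len(examples)


examples = [
    {"gold": "Paris", "context": "The capital of France is Paris."},
    {"gold": "Berlin", "context": "Munich is in Bavaria."},
]
print(evidence_recall(examples))
```

A substring test of this kind cannot credit yes/no questions, whose gold answer rarely appears literally in the evidence, so those would need separate handling.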

Example:

python -m scripts.evidence_recall_hotpot   --input artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/evidence_full_system   --title "Evidence Recall - Full System"

Recommended commands:

python -m scripts.evidence_recall_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/evidence_baseline --title "Evidence Recall - Baseline"
python -m scripts.evidence_recall_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/evidence_early_exit --title "Evidence Recall - Early Exit"
python -m scripts.evidence_recall_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/evidence_token_budget --title "Evidence Recall - Token Budget"
python -m scripts.evidence_recall_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/evidence_full_system --title "Evidence Recall - Full System"

Outputs:

evidence_recall_summary.json
evidence_recall_summary.tsv
evidence_recall_report.png
evidence_recall_report.pdf

14. Optional: Fallback Experiment

Fallback is designed for low-confidence queries. If the retriever is uncertain, it continues retrieval to a more conservative minimum depth.
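
A minimal sketch of such a policy follows, using the threshold values from the command-line flags shown in this section. It assumes that the margin is the top-1 score minus the top-2 score and that either condition alone marks a query as low-confidence; the README does not confirm either assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def is_low_confidence(
    scores: Sequence[float],
    top1_threshold: float = 0.50,    # --low-conf-top1-threshold
    margin_threshold: float = 0.10,  # --low-conf-margin-threshold
) -> bool:
    """Flag a query when the best hit is weak or barely beats the runner-up."""
    if not scores:
        return True
    top1 = scores[0]
    margin = top1 - scores[1] if len(scores) > 1 else float("inf")
    return top1 < top1_threshold or margin < margin_threshold


def apply_fallback(
    k: int,
    scores: Sequence[float],
    fallback_min_k: int = 6,  # --fallback-min-k
    max_k: int = 8,
) -> Tuple[int, bool]:
    """Raise the retrieval depth to at least fallback_min_k for uncertain queries."""
    if is_low_confidence(scores):
        return min(max(k, fallback_min_k), max_k), True
    return k, False


print(apply_fallback(3, [0.45, 0.20]))  # weak top hit: depth is raised
print(apply_fallback(7, [0.80, 0.78]))  # small margin: triggered, depth unchanged
print(apply_fallback(3, [0.90, 0.50]))  # confident: no fallback
```

The second call shows a case where fallback fires but cannot change anything because the depth already exceeds `fallback_min_k`.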

Full System without Fallback

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --enable-fallback 0   --output artifacts/base/full_system_no_fallback.jsonl   --max-examples 100

Full System with Fallback

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --enable-fallback 1   --fallback-min-k 6   --low-conf-top1-threshold 0.50   --low-conf-margin-threshold 0.10   --output artifacts/base/full_system_with_fallback.jsonl   --max-examples 100

Negative result: In the current experiment, fallback is triggered frequently but provides limited quality improvement because the system already operates close to max_k = 8. This result is useful because it shows that fallback policies must be designed together with the available retrieval budget.

15. Optional: Parameter Sweep

To evaluate multiple budgets and retrieval settings:

python -m scripts.run_sweep   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --out-dir artifacts/sweep_base   --max-examples 100

16. Optional: Pareto Frontier

After running a sweep:

python -m scripts.plot_pareto   --summary-dir artifacts/sweep_base   --out artifacts/sweep_base/pareto.png

The plot uses:

x-axis: average latency
y-axis: EM

This plot identifies the Pareto-efficient configurations: those for which no other configuration is at least as good in both latency and EM and strictly better in one of them.
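
The dominance test behind the frontier can be sketched as below. The input uses the four FLAN-T5-base variants from Section 19.1 as sample points; the function name is hypothetical and `scripts.plot_pareto` may compute the frontier differently.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, avg latency in seconds, EM)


def pareto_front(points: List[Point]) -> List[str]:
    """Return names of points not dominated by any other point.

    Lower latency and higher EM are better. A point is dominated when another
    point is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, latency, em in points:
        dominated = any(
            other_lat <= latency
            and other_em >= em
            and (other_lat < latency or other_em > em)
            for _, other_lat, other_em in points
        )
        if not dominated:
            front.append(name)
    return front


variants = [
    ("baseline", 0.719, 0.260),
    ("early_exit_only", 0.684, 0.240),
    ("token_budget_only", 0.580, 0.290),
    ("full_system", 0.595, 0.270),
]
print(pareto_front(variants))  # ['token_budget_only']
```

On these four points a single variant dominates the rest, so a sweep over more budgets and depths is what makes the frontier informative.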

17. Optional: Run FLAN-T5-Small Experiments

To test model capacity effects, replace the model name and output directory:

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-small   --device cpu   --mode full_system   --output artifacts/small/full_system_hf.jsonl   --max-examples 100

Repeat for:

baseline
early_exit_only
token_budget_only
full_system

Then run the same comparison, error-analysis, and evidence-recall scripts on artifacts/small/.
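
To avoid retyping the command four times, a small driver can build it per mode. This is a convenience sketch, not a script shipped in the repository: it reuses only the flags shown above, assumes the `<mode>_hf.jsonl` naming used for the base runs, and defaults to a dry run that prints the commands.

```python
import shlex
import subprocess
from typing import List

MODES = ["baseline", "early_exit_only", "token_budget_only", "full_system"]


def build_command(
    mode: str,
    model: str = "google/flan-t5-small",
    out_dir: str = "artifacts/small",
    device: str = "cpu",
    max_examples: int = 100,
) -> List[str]:
    """Assemble the run_eval invocation for one system variant."""
    return [
        "python", "-m", "scripts.run_eval",
        "--index-dir", "artifacts/hotpot_index",
        "--eval", "data/hotpot_dev_small/hotpot_eval.json",
        "--generator", "hf",
        "--hf-model", model,
        "--device", device,
        "--mode", mode,
        "--output", f"{out_dir}/{mode}_hf.jsonl",
        "--max-examples", str(max_examples),
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for mode in MODES:
        cmd = build_command(mode)
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```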

18. Generated Artifacts

File	Meaning
`*.jsonl`	Per-example predictions and metrics
`summary.tsv`	Aggregated metrics across system variants
`report_one_page.png/pdf`	Main ablation report
`error_report.png/pdf`	Error category visualization
`error_examples.txt`	Example errors grouped by category
`evidence_recall_report.png/pdf`	Evidence recall visualization
`evidence_recall_summary.json/tsv`	Evidence recall statistics
`pareto.png`	Pareto frontier plot

19. Experimental Results

19.1 Main Results with FLAN-T5-Base

System	EM	F1	Avg Latency (s)	Avg k	Prompt Tokens
baseline	0.260	0.375	0.719	8.00	431.6
early_exit_only	0.240	0.349	0.684	7.28	420.3
token_budget_only	0.290	0.416	0.580	8.00	286.7
full_system	0.270	0.392	0.595	7.28	286.5

Interpretation:

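token_budget_only achieves the best EM (0.290) and the lowest latency (0.580 s), which is consistent with compression removing noisy context.
early_exit_only reduces average retrieval depth (8.00 to 7.28) but slightly hurts answer quality (EM 0.260 to 0.240).
full_system cuts latency by about 17% (0.719 s to 0.595 s) and prompt tokens by about 34% (431.6 to 286.5) relative to baseline, with slightly higher EM (0.270 vs. 0.260).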

19.2 SLO Summary

We use the following practical SLOs:

Latency ≤ 0.6s
Prompt Tokens ≤ 300
EM ≥ 0.25

System	Latency ≤ 0.6s	Prompt Tokens ≤ 300	EM ≥ 0.25
baseline	No	No	Yes
early_exit_only	No	No	No
token_budget_only	Yes	Yes	Yes
full_system	Yes / near target	Yes	Yes

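The full system satisfies the token and quality SLOs. Its latency is borderline: the main run (0.595 s) narrowly meets the 0.6 s target, while the separate no-fallback run in Section 19.3 measured 0.6166 s.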
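
The SLO table can be recomputed directly from the Section 19.1 numbers. The sketch below hard-codes those reported values; the dictionary keys and function name are hypothetical.

```python
from typing import Dict

# Thresholds from the SLO list above.
SLOS = {"latency_s": 0.6, "prompt_tokens": 300.0, "em": 0.25}

# Reported FLAN-T5-base results from Section 19.1.
RESULTS = {
    "baseline": {"em": 0.260, "latency_s": 0.719, "prompt_tokens": 431.6},
    "early_exit_only": {"em": 0.240, "latency_s": 0.684, "prompt_tokens": 420.3},
    "token_budget_only": {"em": 0.290, "latency_s": 0.580, "prompt_tokens": 286.7},
    "full_system": {"em": 0.270, "latency_s": 0.595, "prompt_tokens": 286.5},
}


def check_slos(row: Dict[str, float], slos: Dict[str, float] = SLOS) -> Dict[str, bool]:
    """Return pass/fail for each SLO given one system's metrics."""
    return {
        "latency": row["latency_s"] <= slos["latency_s"],
        "tokens": row["prompt_tokens"] <= slos["prompt_tokens"],
        "em": row["em"] >= slos["em"],
    }


for name, row in RESULTS.items():
    print(name, check_slos(row))
```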

19.3 Fallback Comparison

System	EM	F1	Avg Latency (s)	Avg k	Fallback Rate
full_system_no_fallback	0.270	0.392	0.6166	7.28	0.00
full_system_with_fallback	0.270	0.392	0.6408	7.28	0.66

Fallback was triggered in 66% of examples, but average retrieval depth stayed at 7.28 and EM/F1 did not improve under the current max_k = 8 setting. This suggests that fallback would require a larger retrieval budget or a stronger policy to be effective.

19.4 Evidence Recall

System	Overall Recall
baseline	0.59
early_exit_only	0.57
token_budget_only	0.54
full_system	0.52

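Evidence recall decreases as efficiency optimizations are added (0.59 for baseline down to 0.52 for full_system), which points to missing evidence as an important failure mode. Recall alone does not determine answer quality, however: token_budget_only has lower recall than baseline (0.54 vs. 0.59) but higher EM (0.290 vs. 0.260).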

20. Key Findings

Token-budgeted compression improves both efficiency and quality: it reduces prompt tokens by roughly one third (431.6 to 286.7) and improves EM for flan-t5-base (0.260 to 0.290).
Early-exit retrieval reduces latency but can hurt evidence coverage: it lowers retrieval depth but increases retrieval-related failures.
The full system achieves a strong tradeoff: it reduces latency and prompt tokens while maintaining competitive answer quality.
Evidence recall decreases under stronger efficiency optimization, which points to missing evidence as an important failure mode.
Model capacity matters: flan-t5-base is more robust to compressed context than flan-t5-small.
Fallback detection works, but the current fallback strategy has limited benefit: it triggers frequently but does not improve accuracy under the current max_k = 8 setting.

21. Reproducibility Notes

Dataset: HotpotQA subset
Main example count: 100
Tested Python: 3.10
Models:
- google/flan-t5-small
- google/flan-t5-base
Main metrics:
- EM
- F1
- latency
- prompt tokens
- retrieval depth
- evidence recall
Recommended hardware:
- CPU is sufficient for small-scale reproduction
- GPU is recommended for faster FLAN-T5 inference

22. Final Takeaway

Efficient RAG is not only about reducing token count. It is about preserving useful evidence while removing noise.

Token-budgeted context compression improves the signal-to-noise ratio of the prompt, while early-exit retrieval provides latency savings at the cost of possible evidence loss. The best design balances retrieval coverage, prompt compactness, and model reasoning capacity.
