android24/RAGCostReducer

RAG Cost Reducer: Early-Exit Retrieval + Token-Budgeted Context Compression

This repository implements and evaluates an efficient Retrieval-Augmented Generation (RAG) pipeline on HotpotQA.

The core research question is:

Can we reduce RAG latency and prompt cost while preserving answer quality?

We explore two lightweight system optimizations:

  1. Early-Exit Retrieval
    Stop retrieval early when the retrieved evidence appears sufficiently confident.

  2. Token-Budgeted Context Compression
    Select only the most relevant evidence sentences under a fixed prompt token budget.
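
One way to picture early-exit retrieval is as a scan over ranked candidates that stops once a confidence test passes. The sketch below is illustrative only: the function name `early_exit_depth`, the parameters `min_k`, `top1_threshold`, and `margin_threshold`, and the stopping rule itself are assumptions made for this example, not the repository's actual controller.

```python
from typing import Sequence


def early_exit_depth(
    scores: Sequence[float],
    min_k: int = 2,
    max_k: int = 8,
    top1_threshold: float = 0.5,
    margin_threshold: float = 0.1,
) -> int:
    """Return how many ranked candidates to keep.

    `scores` must be sorted in descending order. Hypothetical rule: once at
    least `min_k` candidates are kept, stop as soon as the best score is high
    enough and the next candidate trails it by at least `margin_threshold`.
    """
    limit = min(max_k, len(scores))
    if limit == 0:
        return 0
    confident = scores[0] >= top1_threshold
    for k in range(min(min_k, limit), limit):
        # scores[k] is the first candidate that would be left out.
        if confident and scores[0] - scores[k] >= margin_threshold:
            return k
    # No confident stopping point was found: use the full budget.
    return limit


ranked = [0.82, 0.80, 0.79, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35]
print(early_exit_depth(ranked))  # 3
```

Under this rule a query with a weak top score never exits early, so it always pays for the full `max_k` depth.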

The final system is evaluated with FLAN-T5-small and FLAN-T5-base, using answer quality, latency, token usage, retrieval depth, error analysis, evidence recall, fallback behavior, and Pareto frontier analysis.


1. Motivation

Standard RAG systems often retrieve a fixed number of chunks and append them directly to the model prompt. This design is simple, but inefficient:

  • Some queries do not require the full top-k retrieval depth.
  • Long prompts increase inference latency and token cost.
  • Irrelevant retrieved text can distract the model.
  • Multi-hop QA tasks such as HotpotQA require enough evidence coverage, so aggressive pruning may hurt quality.

This project studies the tradeoff between:

Efficiency: latency, prompt tokens, retrieval depth
Quality: EM, F1, evidence recall

The goal is not to build a production vector database. Instead, the goal is to build a reproducible semester-scale RAG system that exposes real efficiency-quality tradeoffs.

This project intentionally avoids FAISS to improve reproducibility. It uses dense embeddings and cosine similarity, which are sufficient for the project-scale HotpotQA experiments.
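
Retrieval without FAISS can be as simple as a brute-force cosine-similarity scan. The following pure-Python sketch shows the idea on toy vectors. The helper names `cosine` and `top_k` are hypothetical, and the actual code in rag/retriever.py most likely works on embedding-model output with vectorized operations.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], corpus: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print([i for i, _ in top_k([1.0, 0.0], corpus, k=2)])  # [0, 2]
```

A full scan costs time linear in the corpus size per query, which is acceptable for a corpus built from a few hundred HotpotQA examples.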


2. Quick Start

The following commands reproduce a minimal full-system run with google/flan-t5-base.

# 1. Install dependencies
pip install -r requirements.txt

# 2. Prepare HotpotQA subset
python -m scripts.prepare_hotpotqa   --input data/hotpot_dev_distractor_v1.json   --output-dir data/hotpot_dev_small   --chunk-size 3   --stride 2   --limit 300

# 3. Build retrieval index
python -m scripts.build_index   --corpus data/hotpot_dev_small/hotpot_corpus.jsonl   --index-dir artifacts/hotpot_index   --device cpu

# 4. Run the full system
python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --output artifacts/base/full_system_hf.jsonl   --max-examples 100

Use --device cuda instead of --device cpu if a GPU is available.


3. What to Run

| Goal | Script |
| --- | --- |
| Prepare HotpotQA subset | scripts.prepare_hotpotqa |
| Build retrieval index | scripts.build_index |
| Run one RAG variant | scripts.run_eval |
| Compare four variants | scripts.compare_rag_results |
| Generate error analysis | scripts.error_analysis_hotpot |
| Generate evidence recall report | scripts.evidence_recall_hotpot |
| Run parameter sweep | scripts.run_sweep |
| Plot Pareto frontier | scripts.plot_pareto |

Recommended main workflow:

prepare_hotpotqa → build_index → run_eval × 4 → compare_rag_results
                 → error_analysis_hotpot
                 → evidence_recall_hotpot

Optional workflow:

run_sweep → plot_pareto

4. Main Contributions

This project provides:

  • A runnable RAG pipeline for HotpotQA.
  • Four system variants for ablation:
    • baseline
    • early_exit_only
    • token_budget_only
    • full_system
  • A fallback experiment for low-confidence retrieval.
  • Automatic evaluation scripts for:
    • EM / F1
    • latency
    • prompt tokens
    • retrieval depth
    • error categories
    • evidence recall
    • Pareto frontier
  • Comparative analysis between google/flan-t5-small and google/flan-t5-base.

5. System Overview

Question
   ↓
Dense Retriever
   ↓
Early-Exit Controller
   ↓
Retrieved Chunks
   ↓
Sentence Split + Relevance Scoring
   ↓
Token-Budgeted Evidence Packing
   ↓
Prompt Construction
   ↓
FLAN-T5 small / base
   ↓
Answer
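
The sentence-scoring and packing stages of this pipeline can be sketched as a greedy selection under a token budget. This is a minimal illustration under stated assumptions: relevance is approximated by word overlap with the question, token cost is approximated by whitespace word count, and the names `split_sentences`, `overlap_score`, and `pack_evidence` are hypothetical. The repository's rag/context_builder.py may score sentences and count tokens differently.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(question: str, sentence: str) -> float:
    """Fraction of question words that also occur in the sentence."""
    q_words = set(re.findall(r"\w+", question.lower()))
    s_words = set(re.findall(r"\w+", sentence.lower()))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


def pack_evidence(question: str, chunks: List[str], budget: int) -> List[str]:
    """Greedily keep the highest-scoring sentences that fit in `budget` tokens."""
    sentences = [s for chunk in chunks for s in split_sentences(chunk)]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: overlap_score(question, sentences[i]),
        reverse=True,
    )
    kept, used = [], 0
    for i in ranked:
        cost = len(sentences[i].split())  # stand-in for a real tokenizer count
        if used + cost <= budget:
            kept.append(i)
            used += cost
    # Restore original order so the packed context reads naturally.
    return [sentences[i] for i in sorted(kept)]
```

Greedy packing is not optimal in the knapsack sense, but it is cheap and keeps the most relevant sentences first, which is usually what matters for a short prompt.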

System Variants

| System | Description |
| --- | --- |
| baseline | Fixed top-k retrieval, no context compression |
| early_exit_only | Adaptive retrieval stopping only |
| token_budget_only | Sentence-level context compression only |
| full_system | Early-exit retrieval + token-budgeted compression |
| full_system + fallback | Full system with low-confidence fallback enabled |

6. Repository Structure

rag_cost_reducer/
├── README.md
├── requirements.txt
├── data/
│   ├── hotpot_dev_distractor_v1.json
│   └── hotpot_dev_small/
│       ├── hotpot_corpus.jsonl
│       └── hotpot_eval.json
├── artifacts/
│   ├── hotpot_index/
│   ├── base/
│   ├── small/
│   ├── reports/
│   └── sweep_base/
├── rag/
│   ├── config.py
│   ├── corpus.py
│   ├── embedder.py
│   ├── index.py
│   ├── retriever.py
│   ├── context_builder.py
│   ├── generator.py
│   ├── pipeline.py
│   └── metrics.py
└── scripts/
    ├── prepare_hotpotqa.py
    ├── build_index.py
    ├── run_eval.py
    ├── compare_rag_results.py
    ├── error_analysis_hotpot.py
    ├── evidence_recall_hotpot.py
    ├── run_sweep.py
    └── plot_pareto.py

7. Installation

Tested with Python 3.10. Python 3.9+ should work.

Create a virtual environment:

python -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install --upgrade pip
pip install -r requirements.txt

For CPU-only environments, install CPU PyTorch first:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

CPU is sufficient for reproducing the 100-example experiments, although GPU inference is significantly faster.


8. Prepare HotpotQA Data

Place the raw HotpotQA file here:

data/hotpot_dev_distractor_v1.json

Prepare a smaller experimental subset:

python -m scripts.prepare_hotpotqa   --input data/hotpot_dev_distractor_v1.json   --output-dir data/hotpot_dev_small   --chunk-size 3   --stride 2   --limit 300

This creates:

data/hotpot_dev_small/hotpot_corpus.jsonl
data/hotpot_dev_small/hotpot_eval.json
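
The --chunk-size 3 and --stride 2 flags suggest overlapping sliding-window chunks. The sketch below assumes both values are measured in sentences, which is an assumption about scripts.prepare_hotpotqa rather than something the script's interface confirms. The function name `window_chunks` is hypothetical.

```python
from typing import List


def window_chunks(
    sentences: List[str], chunk_size: int = 3, stride: int = 2
) -> List[List[str]]:
    """Split a paragraph's sentences into overlapping windows.

    With chunk_size=3 and stride=2, consecutive chunks share one sentence.
    """
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(sentences[start:start + chunk_size])
        # Stop once a window has reached the end of the paragraph.
        if start + chunk_size >= len(sentences):
            break
    return chunks


print(window_chunks(["s0", "s1", "s2", "s3", "s4"]))
# [['s0', 's1', 's2'], ['s2', 's3', 's4']]
```

The overlap keeps a fact that spans a chunk boundary retrievable from at least one chunk, at the cost of some duplicated text in the corpus.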

If the raw HotpotQA file is too large for submission, do not include it in the final artifact. Instead, keep the prepared subset or document where the original dataset should be placed.


9. Build the Retrieval Index

python -m scripts.build_index   --corpus data/hotpot_dev_small/hotpot_corpus.jsonl   --index-dir artifacts/hotpot_index   --device cpu

Use --device cuda if a GPU is available.


10. Run the Main Experiments

The following commands run the four main variants with google/flan-t5-base.

10.1 Baseline

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode baseline   --output artifacts/base/baseline_hf.jsonl   --max-examples 100

10.2 Early Exit Only

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode early_exit_only   --output artifacts/base/early_exit_only_hf.jsonl   --max-examples 100

10.3 Token Budget Only

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode token_budget_only   --output artifacts/base/token_budget_only_hf.jsonl   --max-examples 100

10.4 Full System

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --mode full_system   --output artifacts/base/full_system_hf.jsonl   --max-examples 100

Each run prints:

  • EM
  • F1
  • evidence hit rate
  • evidence recall
  • average latency
  • prompt tokens
  • average retrieval depth
  • SLO checks
  • fallback statistics, if enabled
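
For reference, EM and F1 on HotpotQA are conventionally computed with SQuAD-style answer normalization. The sketch below follows that common convention; it is not copied from rag/metrics.py, which may differ in detail, and the function names are chosen for this example.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
```

Dataset-level EM and F1 are then the means of these per-example scores.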

11. Generate the Main Ablation Report

After running the four main variants:

python -m scripts.compare_rag_results   --baseline artifacts/base/baseline_hf.jsonl   --early-exit-only artifacts/base/early_exit_only_hf.jsonl   --token-budget-only artifacts/base/token_budget_only_hf.jsonl   --full-system artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/base_ablation   --title "HotpotQA RAG Ablation Report - FLAN-T5 Base"

Outputs:

artifacts/reports/base_ablation/summary.tsv
artifacts/reports/base_ablation/report_one_page.png
artifacts/reports/base_ablation/report_one_page.pdf

12. Error Analysis

Error analysis categorizes prediction failures into retrieval, partial-answer, ambiguity, reasoning, unknown-output, and yes/no errors.

Example:

python -m scripts.error_analysis_hotpot   --input artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/error_full_system   --title "Error Analysis - Full System"

Recommended commands:

python -m scripts.error_analysis_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/error_baseline --title "Error Analysis - Baseline"
python -m scripts.error_analysis_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/error_early_exit --title "Error Analysis - Early Exit"
python -m scripts.error_analysis_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/error_token_budget --title "Error Analysis - Token Budget"
python -m scripts.error_analysis_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/error_full_system --title "Error Analysis - Full System"

Outputs:

error_summary.tsv
error_examples.txt
error_report.png
error_report.pdf

Main categories:

| Category | Meaning |
| --- | --- |
| correct | Prediction exactly matches the gold answer |
| retrieval_failure | Gold answer is missing from retrieved context |
| partial_answer | Prediction overlaps with gold but is incomplete |
| ambiguity_or_wrong_entity | Model selects the wrong entity |
| reasoning_or_extraction_failure | Evidence exists, but model fails to use it |
| unknown_output | Model outputs unknown/unanswerable |
| yesno_reasoning_error | Incorrect yes/no decision |
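
These categories can be assigned by a small rule cascade. The sketch below is one plausible ordering of string-matching rules written for illustration. The function `categorize`, the rule order, and the matching heuristics are all assumptions, and scripts.error_analysis_hotpot may apply different rules.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and keep only word characters, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def categorize(pred: str, gold: str, context: str) -> str:
    """Assign one error category using simple heuristics (illustrative only)."""
    p, g, c = _norm(pred), _norm(gold), _norm(context)
    if p == g:
        return "correct"
    if p in {"", "unknown", "unanswerable"}:
        return "unknown_output"
    if g in {"yes", "no"}:
        return "yesno_reasoning_error"
    if f" {g} " not in f" {c} ":
        # The gold answer never reached the prompt.
        return "retrieval_failure"
    if set(p.split()) & set(g.split()):
        return "partial_answer"
    if f" {p} " in f" {c} ":
        # The model copied some other span from the context.
        return "ambiguity_or_wrong_entity"
    return "reasoning_or_extraction_failure"


print(categorize("Nolan", "Christopher Nolan", "Directed by Christopher Nolan."))
# partial_answer
```

Because the rules are checked in order, each prediction lands in exactly one category, so the category counts sum to the number of evaluated examples.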

13. Evidence Recall Analysis

Evidence recall measures whether the gold answer appears in the retrieved context.
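
A minimal sketch of this measurement, assuming recall is a normalized substring check of the gold answer against the concatenated context, is shown below. The names `answer_in_context` and `evidence_recall` are hypothetical, and the repository script may match answers differently (for example against HotpotQA supporting facts).

```python
import re
from typing import Iterable, List, Tuple


def _norm(text: str) -> str:
    """Lowercase and keep only word characters, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def answer_in_context(gold: str, context_chunks: Iterable[str]) -> bool:
    """True if the normalized gold answer occurs as a whole-word span."""
    g = _norm(gold)
    if not g:
        return False
    ctx = " ".join(_norm(chunk) for chunk in context_chunks)
    return f" {g} " in f" {ctx} "


def evidence_recall(examples: List[Tuple[str, List[str]]]) -> float:
    """Fraction of (gold, context_chunks) pairs whose answer is in context."""
    if not examples:
        return 0.0
    hits = sum(answer_in_context(gold, chunks) for gold, chunks in examples)
    return hits / len(examples)


examples = [
    ("Christopher Nolan", ["It was directed by Christopher Nolan."]),
    ("Paris", ["Berlin is in Germany."]),
]
print(evidence_recall(examples))  # 0.5
```

A string-match definition like this undercounts yes/no questions, whose gold answers rarely appear verbatim in the evidence, so it is most informative for span-style answers.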

Example:

python -m scripts.evidence_recall_hotpot   --input artifacts/base/full_system_hf.jsonl   --out-dir artifacts/reports/evidence_full_system   --title "Evidence Recall - Full System"

Recommended commands:

python -m scripts.evidence_recall_hotpot --input artifacts/base/baseline_hf.jsonl --out-dir artifacts/reports/evidence_baseline --title "Evidence Recall - Baseline"
python -m scripts.evidence_recall_hotpot --input artifacts/base/early_exit_only_hf.jsonl --out-dir artifacts/reports/evidence_early_exit --title "Evidence Recall - Early Exit"
python -m scripts.evidence_recall_hotpot --input artifacts/base/token_budget_only_hf.jsonl --out-dir artifacts/reports/evidence_token_budget --title "Evidence Recall - Token Budget"
python -m scripts.evidence_recall_hotpot --input artifacts/base/full_system_hf.jsonl --out-dir artifacts/reports/evidence_full_system --title "Evidence Recall - Full System"

Outputs:

evidence_recall_summary.json
evidence_recall_summary.tsv
evidence_recall_report.png
evidence_recall_report.pdf

14. Optional: Fallback Experiment

Fallback is designed for low-confidence queries. If the retriever is uncertain, it continues retrieval to a more conservative minimum depth.
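
The fallback flags used in the commands in this section (--fallback-min-k, --low-conf-top1-threshold, --low-conf-margin-threshold) suggest a rule like the one sketched here. How the repository actually combines the two thresholds is not documented in this README, so the OR condition and the function name `apply_fallback` are assumptions.

```python
from typing import Sequence, Tuple


def apply_fallback(
    k: int,
    scores: Sequence[float],
    fallback_min_k: int = 6,
    top1_threshold: float = 0.50,
    margin_threshold: float = 0.10,
    max_k: int = 8,
) -> Tuple[int, bool]:
    """Return (final_k, triggered) for a query whose early-exit depth is `k`.

    `scores` are retrieval scores in descending order. The query counts as
    low-confidence if the top score is weak or the top two scores are close.
    """
    top1 = scores[0] if scores else 0.0
    top2 = scores[1] if len(scores) > 1 else 0.0
    low_confidence = top1 < top1_threshold or (top1 - top2) < margin_threshold
    if not low_confidence:
        return k, False
    # Raise the depth to the conservative minimum, but never above max_k.
    return min(max(k, fallback_min_k), max_k), True


print(apply_fallback(7, [0.45, 0.44]))  # (7, True)
print(apply_fallback(3, [0.45, 0.44]))  # (6, True)
```

The first call shows why a fallback can fire without changing anything: when early exit already keeps 7 chunks, a minimum depth of 6 has no effect.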

Full System without Fallback

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --enable-fallback 0   --output artifacts/base/full_system_no_fallback.jsonl   --max-examples 100

Full System with Fallback

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --mode full_system   --enable-fallback 1   --fallback-min-k 6   --low-conf-top1-threshold 0.50   --low-conf-margin-threshold 0.10   --output artifacts/base/full_system_with_fallback.jsonl   --max-examples 100

Negative result: In the current experiment, fallback is triggered frequently but provides limited quality improvement because the system already operates close to max_k = 8. This result is useful because it shows that fallback policies must be designed together with the available retrieval budget.


15. Optional: Parameter Sweep

To evaluate multiple budgets and retrieval settings:

python -m scripts.run_sweep   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-base   --device cpu   --out-dir artifacts/sweep_base   --max-examples 100

16. Optional: Pareto Frontier

After running a sweep:

python -m scripts.plot_pareto   --summary-dir artifacts/sweep_base   --out artifacts/sweep_base/pareto.png

The plot uses:

x-axis: average latency
y-axis: EM

This plot identifies Pareto-efficient configurations: those for which no other configuration is at least as good on both latency and EM and strictly better on one of them.
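
The dominance test behind the frontier can be sketched in a few lines. The function name `pareto_frontier` is hypothetical and this is not the code in scripts.plot_pareto. The example data are the latency and EM values reported for FLAN-T5-Base in Section 19.1.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return names of non-dominated configurations, sorted by latency.

    `points` maps a configuration name to (latency_s, em). Lower latency and
    higher EM are better.
    """
    frontier = []
    for name, (lat, em) in points.items():
        dominated = any(
            o_lat <= lat and o_em >= em and (o_lat < lat or o_em > em)
            for other, (o_lat, o_em) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


results = {
    "baseline": (0.719, 0.260),
    "early_exit_only": (0.684, 0.240),
    "token_budget_only": (0.580, 0.290),
    "full_system": (0.595, 0.270),
}
print(pareto_frontier(results))  # ['token_budget_only']
```

With only these four points the frontier collapses to a single configuration, which is why a parameter sweep is needed to obtain an informative curve.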


17. Optional: Run FLAN-T5-Small Experiments

To test model capacity effects, replace the model name and output directory:

python -m scripts.run_eval   --index-dir artifacts/hotpot_index   --eval data/hotpot_dev_small/hotpot_eval.json   --generator hf   --hf-model google/flan-t5-small   --device cpu   --mode full_system   --output artifacts/small/full_system_hf.jsonl   --max-examples 100

Repeat for:

baseline
early_exit_only
token_budget_only
full_system

Then run the same comparison, error-analysis, and evidence-recall scripts on artifacts/small/.


18. Generated Artifacts

| File | Meaning |
| --- | --- |
| *.jsonl | Per-example predictions and metrics |
| summary.tsv | Aggregated metrics across system variants |
| report_one_page.png/pdf | Main ablation report |
| error_report.png/pdf | Error category visualization |
| error_examples.txt | Example errors grouped by category |
| evidence_recall_report.png/pdf | Evidence recall visualization |
| evidence_recall_summary.json/tsv | Evidence recall statistics |
| pareto.png | Pareto frontier plot |

19. Experimental Results

19.1 Main Results with FLAN-T5-Base

| System | EM | F1 | Avg Latency (s) | Avg k | Prompt Tokens |
| --- | --- | --- | --- | --- | --- |
| baseline | 0.260 | 0.375 | 0.719 | 8.00 | 431.6 |
| early_exit_only | 0.240 | 0.349 | 0.684 | 7.28 | 420.3 |
| token_budget_only | 0.290 | 0.416 | 0.580 | 8.00 | 286.7 |
| full_system | 0.270 | 0.392 | 0.595 | 7.28 | 286.5 |

Interpretation:

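  • token_budget_only achieves the best EM (0.290) and the lowest latency (0.580 s), plausibly because compression removes noisy context.
  • early_exit_only reduces average retrieval depth from 8.00 to 7.28 but lowers EM from 0.260 to 0.240.
  • full_system cuts latency by about 17% and prompt tokens by about 34% relative to baseline while slightly improving EM (0.270 vs. 0.260).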

19.2 SLO Summary

We use the following practical SLOs:

Latency ≤ 0.6s
Prompt Tokens ≤ 300
EM ≥ 0.25

| System | Latency ≤ 0.6s | Prompt Tokens ≤ 300 | EM ≥ 0.25 |
| --- | --- | --- | --- |
| baseline | No | No | Yes |
| early_exit_only | No | No | No |
| token_budget_only | Yes | Yes | Yes |
| full_system | Yes (narrowly, 0.595 s) | Yes | Yes |

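In the main run, the full system meets the latency SLO only narrowly (0.595 s against a 0.6 s target) and satisfies the token and quality SLOs. The separate no-fallback run in Section 19.3 measured 0.617 s, so its latency should be treated as borderline.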
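
The SLO checks can be reproduced directly from the Section 19.1 numbers. The helper `check_slos` and the constant names below are hypothetical, and the thresholds are the three SLOs listed in this section.

```python
from typing import Dict

SLO_MAX_LATENCY_S = 0.6
SLO_MAX_PROMPT_TOKENS = 300
SLO_MIN_EM = 0.25


def check_slos(latency_s: float, prompt_tokens: float, em: float) -> Dict[str, bool]:
    """Evaluate one system against the three SLOs."""
    return {
        "latency": latency_s <= SLO_MAX_LATENCY_S,
        "prompt_tokens": prompt_tokens <= SLO_MAX_PROMPT_TOKENS,
        "em": em >= SLO_MIN_EM,
    }


# (avg latency in seconds, prompt tokens, EM) from the main results table.
RESULTS = {
    "baseline": (0.719, 431.6, 0.260),
    "early_exit_only": (0.684, 420.3, 0.240),
    "token_budget_only": (0.580, 286.7, 0.290),
    "full_system": (0.595, 286.5, 0.270),
}

for name, row in RESULTS.items():
    print(name, check_slos(*row))
```

Running this on the reported averages marks token_budget_only and full_system as passing all three checks, and early_exit_only as failing all three.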

19.3 Fallback Comparison

| System | EM | F1 | Avg Latency (s) | Avg k | Fallback Rate |
| --- | --- | --- | --- | --- | --- |
| full_system_no_fallback | 0.270 | 0.392 | 0.6166 | 7.28 | 0.00 |
| full_system_with_fallback | 0.270 | 0.392 | 0.6408 | 7.28 | 0.66 |

Fallback was triggered in 66% of examples, but EM, F1, and average retrieval depth (7.28) were unchanged under the current max_k = 8 setting, while average latency rose from 0.617 s to 0.641 s. This suggests that fallback would require a larger retrieval budget or a stronger policy to be effective.

19.4 Evidence Recall

| System | Overall Recall |
| --- | --- |
| baseline | 0.59 |
| early_exit_only | 0.57 |
| token_budget_only | 0.54 |
| full_system | 0.52 |

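Evidence recall falls as the efficiency optimizations become more aggressive, from 0.59 for baseline to 0.52 for full_system, which suggests that missing evidence is an important failure mode. Recall alone does not determine answer quality, however: token_budget_only has lower recall than baseline (0.54 vs. 0.59) yet higher EM.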


20. Key Findings

  1. Token-budgeted compression improves both efficiency and quality.
    It reduces prompt tokens by roughly one third and improves EM for flan-t5-base.

  2. Early-exit retrieval reduces latency but can hurt evidence coverage.
    It lowers retrieval depth but increases retrieval-related failures.

  3. The full system achieves a strong tradeoff.
    It reduces latency by about 17% and prompt tokens by about 34% relative to baseline, with slightly higher EM (0.270 vs. 0.260) for flan-t5-base.

  4. Evidence recall decreases under stronger efficiency optimization.
    This points to missing evidence as an important failure mode, although token_budget_only improves EM despite lower recall.

  5. Model capacity matters.
    flan-t5-base is more robust to compressed context than flan-t5-small.

  6. Fallback detection works, but the current fallback strategy has limited benefit.
    It triggers frequently but does not improve accuracy under the current max_k = 8 setting.


21. Reproducibility Notes

  • Dataset: HotpotQA subset
  • Main example count: 100
  • Tested Python: 3.10
  • Models:
    • google/flan-t5-small
    • google/flan-t5-base
  • Main metrics:
    • EM
    • F1
    • latency
    • prompt tokens
    • retrieval depth
    • evidence recall
  • Recommended hardware:
    • CPU is sufficient for small-scale reproduction
    • GPU is recommended for faster FLAN-T5 inference

22. Final Takeaway

Efficient RAG is not only about reducing token count. It is about preserving useful evidence while removing noise.

Token-budgeted context compression improves the signal-to-noise ratio of the prompt, while early-exit retrieval provides latency savings at the cost of possible evidence loss. The best design balances retrieval coverage, prompt compactness, and model reasoning capacity.
