A graph-based algorithm for merging partially overlapping rows (entities) into a coherent table using shared label-value tuples, selected-context gain, tuple weighting, and path penalty.
This repository contains the reference implementation and demo for the paper:
A Context-Aware Graph-Based Algorithm for Merging Rows (Entities) with Selected-Match Gain and Path Penalty
Original paper: contex-aware-row-merge.pdf
Revised paper (v2): contex-aware-row-merge-v2.pdf
The goal is to merge sparse, partially overlapping rows into more complete rows by traversing a graph of related rows.
Each row is a set of label-value tuples, for example:
R1 = [(L1, V1)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]
Rows are connected when they share tuples. The algorithm scores candidate paths through the graph and resolves missing labels using:
- local strength — overlap between connected rows
- selected-match gain — preference for rows consistent with already selected tuples
- tuple weight — importance of specific tuples
- path penalty — discourages unnecessarily long paths
The result is a deterministic, interpretable merge process with provenance.
This approach is intended for cases where extracted entities or records are:
- sparse
- partially overlapping
- spread across multiple files or sources
- too structured for LLM-only/RAG-style consolidation
- required to be explainable and reproducible
Typical use cases include:
- entity resolution
- record linkage
- structured data enrichment
- log/event consolidation
- graph-based AI workflows
R = Row
L = Label
V = Value
T = Tuple (L, V)
B = Bucket
Input rows:
R1 = [(L1, V1)]
R2 = [(L1, V2)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]
R4 = [(L3, V3), (L4, V5), (L5, V6)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R6 = [(L1, V2), (L3, V3), (L4, V5)]
Connections:
R1is connected toR5via(L1, V1)R5is connected toR3via(L3, V3)and(L4, V4)
So merging R1, R5, and R3 produces:
[(L1, V1), (L3, V3), (L4, V4), (L5, V5)]
Similarly:
R2is connected toR6via(L1, V2)R6is connected toR4via(L3, V3)and(L4, V5)
So merging R2, R6, and R4 produces:
[(L1, V2), (L3, V3), (L4, V5), (L5, V6)]
Expected output:
R1: [(L1, V1), (L3, V3), (L4, V4), (L5, V5)]
R2: [(L1, V2), (L3, V3), (L4, V5), (L5, V6)]
Imagine a collection of structured or unstructured files related to a domain such as finance.
Examples:
- CSV exports
- JSON records
- raw text reports
- logs
- OCR/NER outputs
After extracting entities from these files, each extracted object can be represented as a sparse row of label-value tuples.
Example:
R1 = [(Company, "XYZ Corp"), (Amount, "$500M")]
Another file may contain:
R2 = [(Company, "XYZ Corp"), (Revenue, "$50M"), (Country, "USA")]
The algorithm can merge overlapping rows and infer more complete records by following graph connections between shared tuples.
Resulting unified table:
-----------------------------------------------------
| Company | Market Cap | Revenue | Country |
-----------------------------------------------------
| XYZ Corp | $500M | $50M | USA |
| ABC Inc. | $300M | $25M | UK |
-----------------------------------------------------
This makes the output easier to store in a database and query later.
- Deterministic — no model hallucination in the merge step
- Interpretable — the result is driven by explicit graph paths and scores
- Configurable — scoring can be tuned with tuple weights and path penalty
- Practical — works well for sparse, partially overlapping extracted records
- Extensible — can be parallelized or adapted to larger/distributed settings
At a high level:
- Build a graph where rows are nodes.
- Connect rows that share one or more tuples.
- Score candidate paths using:
- local overlap
- selected-match gain
- tuple weights
- path penalty
- Resolve missing labels by choosing the best-scoring reachable supporting row.
- Merge rows into a more complete output table.
The revised paper also discusses a stricter compatibility-aware interpretation and related proof refinements.
- Original paper: contex-aware-row-merge.pdf
- Revised paper (v2): contex-aware-row-merge-v2.pdf
Suggested note:
The revised paper clarifies the formalization, improves pseudocode and proofs, and adds a simplified special-case variant for comparison with the original scoring-based formulation.
- Clone the repository
- Run:
python3 ./main.py \
--entity-files ./demo/transactions_entities.json,./demo/log_entities.json,./demo/inventory_entities.json \
--labels TRANSACTION_ID,TRANSACTION_DATE,TRANSACTION_AMOUNT,USER,USER_ID,PRODUCT_MAKE,PRODUCT_MODEL,PRODUCT_PRICEThis repo is a reference implementation and experiment around graph-based row/entity merging from sparse evidence.
It is intended as:
- a demo of the algorithm
- a companion to the paper
- a starting point for experimentation in entity consolidation workflows
This repository focuses on deterministic graph-based merging rather than generative synthesis. It is best suited for workflows where:
- extracted fields are already available
- provenance matters
- reproducibility matters
- users want explicit control over how merges happen