Context-Aware Graph-Based Row Merging

A graph-based algorithm for merging partially overlapping rows (entities) into a coherent table using shared label-value tuples, selected-context gain, tuple weighting, and path penalty.

This repository contains the reference implementation and demo for the paper:

A Context-Aware Graph-Based Algorithm for Merging Rows (Entities) with Selected-Match Gain and Path Penalty

Original paper: contex-aware-row-merge.pdf
Revised paper (v2): contex-aware-row-merge-v2.pdf

Overview

The goal is to merge sparse, partially overlapping rows into more complete rows by traversing a graph of related rows.

Each row is a set of label-value tuples, for example:

R1 = [(L1, V1)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]

Rows are connected when they share tuples. The algorithm scores candidate paths through the graph and resolves missing labels using:

local strength — overlap between connected rows
selected-match gain — preference for rows consistent with already selected tuples
tuple weight — importance of specific tuples
path penalty — discourages unnecessarily long paths

The result is a deterministic, interpretable merge process with provenance.

Why this is useful

This approach is intended for cases where extracted entities or records are:

sparse
partially overlapping
spread across multiple files or sources
too structured for LLM-only/RAG-style consolidation
required to be explainable and reproducible

Typical use cases include:

entity resolution
record linkage
structured data enrichment
log/event consolidation
graph-based AI workflows

Glossary

R  = Row
L  = Label
V  = Value
T  = Tuple (L, V)
B  = Bucket

Example

Input rows:

R1 = [(L1, V1)]
R2 = [(L1, V2)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]
R4 = [(L3, V3), (L4, V5), (L5, V6)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R6 = [(L1, V2), (L3, V3), (L4, V5)]

Connections:

R1 is connected to R5 via (L1, V1)
R5 is connected to R3 via (L3, V3) and (L4, V4)

So merging R1, R5, and R3 produces:

[(L1, V1), (L3, V3), (L4, V4), (L5, V5)]

Similarly:

R2 is connected to R6 via (L1, V2)
R6 is connected to R4 via (L3, V3) and (L4, V5)

So merging R2, R6, and R4 produces:

[(L1, V2), (L3, V3), (L4, V5), (L5, V6)]

Expected output:

R1: [(L1, V1), (L3, V3), (L4, V4), (L5, V5)]
R2: [(L1, V2), (L3, V3), (L4, V5), (L5, V6)]

Real-world example: consolidating entities from text files

Imagine a collection of structured or unstructured files related to a domain such as finance.

Examples:

CSV exports
JSON records
raw text reports
logs
OCR/NER outputs

After extracting entities from these files, each extracted object can be represented as a sparse row of label-value tuples.

Example:

R1 = [(Company, "XYZ Corp"), (Amount, "$500M")]

Another file may contain:

R2 = [(Company, "XYZ Corp"), (Revenue, "$50M"), (Country, "USA")]

The algorithm can merge overlapping rows and infer more complete records by following graph connections between shared tuples.

Resulting unified table:

-----------------------------------------------------
| Company    | Market Cap   | Revenue   | Country   |
-----------------------------------------------------
| XYZ Corp   | $500M        | $50M      | USA       |
| ABC Inc.   | $300M        | $25M      | UK        |
-----------------------------------------------------

This makes the output easier to store in a database and query later.

Benefits

Deterministic — no model hallucination in the merge step
Interpretable — the result is driven by explicit graph paths and scores
Configurable — scoring can be tuned with tuple weights and path penalty
Practical — works well for sparse, partially overlapping extracted records
Extensible — can be parallelized or adapted to larger/distributed settings

Algorithm summary

At a high level:

Build a graph where rows are nodes.
Connect rows that share one or more tuples.
Score candidate paths using:
- local overlap
- selected-match gain
- tuple weights
- path penalty
Resolve missing labels by choosing the best-scoring reachable supporting row.
Merge rows into a more complete output table.

The revised paper also discusses a stricter compatibility-aware interpretation and related proof refinements.

Papers

Original paper: contex-aware-row-merge.pdf
Revised paper (v2): contex-aware-row-merge-v2.pdf

Suggested note:

The revised paper clarifies the formalization, improves pseudocode and proofs, and adds a simplified special-case variant for comparison with the original scoring-based formulation.

How to run the demo

Clone the repository
Run:

python3 ./main.py \
  --entity-files ./demo/transactions_entities.json,./demo/log_entities.json,./demo/inventory_entities.json \
  --labels TRANSACTION_ID,TRANSACTION_DATE,TRANSACTION_AMOUNT,USER,USER_ID,PRODUCT_MAKE,PRODUCT_MODEL,PRODUCT_PRICE

Repository purpose

This repo is a reference implementation and experiment around graph-based row/entity merging from sparse evidence.

It is intended as:

a demo of the algorithm
a companion to the paper
a starting point for experimentation in entity consolidation workflows

Notes

This repository focuses on deterministic graph-based merging rather than generative synthesis. It is best suited for workflows where:

extracted fields are already available
provenance matters
reproducibility matters
users want explicit control over how merges happen

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
demo		demo
tests		tests
README.md		README.md
contex-aware-row-merge-v2.pdf		contex-aware-row-merge-v2.pdf
contex-aware-row-merge.pdf		contex-aware-row-merge.pdf
core.py		core.py
main.py		main.py
paper_engineering.pdf		paper_engineering.pdf
paper_engineering.tex		paper_engineering.tex
report_noise_evaluation.pdf		report_noise_evaluation.pdf
report_noise_evaluation.tex		report_noise_evaluation.tex
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Context-Aware Graph-Based Row Merging

Overview

Why this is useful

Glossary

Example

Real-world example: consolidating entities from text files

Benefits

Algorithm summary

Papers

How to run the demo

Repository purpose

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Context-Aware Graph-Based Row Merging

Overview

Why this is useful

Glossary

Example

Real-world example: consolidating entities from text files

Benefits

Algorithm summary

Papers

How to run the demo

Repository purpose

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages