Skip to content

dmgcodevil/ContextLens

Repository files navigation

Context-Aware Graph-Based Row Merging

A graph-based algorithm for merging partially overlapping rows (entities) into a coherent table using shared label-value tuples, selected-context gain, tuple weighting, and path penalty.

This repository contains the reference implementation and demo for the paper:

A Context-Aware Graph-Based Algorithm for Merging Rows (Entities) with Selected-Match Gain and Path Penalty

Original paper: contex-aware-row-merge.pdf
Revised paper (v2): contex-aware-row-merge-v2.pdf


Overview

The goal is to merge sparse, partially overlapping rows into more complete rows by traversing a graph of related rows.

Each row is a set of label-value tuples, for example:

R1 = [(L1, V1)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]

Rows are connected when they share tuples. The algorithm scores candidate paths through the graph and resolves missing labels using:

  • local strength — overlap between connected rows
  • selected-match gain — preference for rows consistent with already selected tuples
  • tuple weight — importance of specific tuples
  • path penalty — discourages unnecessarily long paths

The result is a deterministic, interpretable merge process with provenance.


Why this is useful

This approach is intended for cases where extracted entities or records are:

  • sparse
  • partially overlapping
  • spread across multiple files or sources
  • too structured for LLM-only/RAG-style consolidation
  • required to be explainable and reproducible

Typical use cases include:

  • entity resolution
  • record linkage
  • structured data enrichment
  • log/event consolidation
  • graph-based AI workflows

Glossary

R  = Row
L  = Label
V  = Value
T  = Tuple (L, V)
B  = Bucket

Example

Input rows:

R1 = [(L1, V1)]
R2 = [(L1, V2)]
R3 = [(L3, V3), (L4, V4), (L5, V5)]
R4 = [(L3, V3), (L4, V5), (L5, V6)]
R5 = [(L1, V1), (L3, V3), (L4, V4)]
R6 = [(L1, V2), (L3, V3), (L4, V5)]

Connections:

  • R1 is connected to R5 via (L1, V1)
  • R5 is connected to R3 via (L3, V3) and (L4, V4)

So merging R1, R5, and R3 produces:

[(L1, V1), (L3, V3), (L4, V4), (L5, V5)]

Similarly:

  • R2 is connected to R6 via (L1, V2)
  • R6 is connected to R4 via (L3, V3) and (L4, V5)

So merging R2, R6, and R4 produces:

[(L1, V2), (L3, V3), (L4, V5), (L5, V6)]

Expected output:

R1: [(L1, V1), (L3, V3), (L4, V4), (L5, V5)]
R2: [(L1, V2), (L3, V3), (L4, V5), (L5, V6)]

Real-world example: consolidating entities from text files

Imagine a collection of structured or unstructured files related to a domain such as finance.

Examples:

  • CSV exports
  • JSON records
  • raw text reports
  • logs
  • OCR/NER outputs

After extracting entities from these files, each extracted object can be represented as a sparse row of label-value tuples.

Example:

R1 = [(Company, "XYZ Corp"), (Amount, "$500M")]

Another file may contain:

R2 = [(Company, "XYZ Corp"), (Revenue, "$50M"), (Country, "USA")]

The algorithm can merge overlapping rows and infer more complete records by following graph connections between shared tuples.

Resulting unified table:

-----------------------------------------------------
| Company    | Market Cap   | Revenue   | Country   |
-----------------------------------------------------
| XYZ Corp   | $500M        | $50M      | USA       |
| ABC Inc.   | $300M        | $25M      | UK        |
-----------------------------------------------------

This makes the output easier to store in a database and query later.


Benefits

  • Deterministic — no model hallucination in the merge step
  • Interpretable — the result is driven by explicit graph paths and scores
  • Configurable — scoring can be tuned with tuple weights and path penalty
  • Practical — works well for sparse, partially overlapping extracted records
  • Extensible — can be parallelized or adapted to larger/distributed settings

Algorithm summary

At a high level:

  1. Build a graph where rows are nodes.
  2. Connect rows that share one or more tuples.
  3. Score candidate paths using:
    • local overlap
    • selected-match gain
    • tuple weights
    • path penalty
  4. Resolve missing labels by choosing the best-scoring reachable supporting row.
  5. Merge rows into a more complete output table.

The revised paper also discusses a stricter compatibility-aware interpretation and related proof refinements.


Papers

Suggested note:

The revised paper clarifies the formalization, improves pseudocode and proofs, and adds a simplified special-case variant for comparison with the original scoring-based formulation.


How to run the demo

  1. Clone the repository
  2. Run:
python3 ./main.py \
  --entity-files ./demo/transactions_entities.json,./demo/log_entities.json,./demo/inventory_entities.json \
  --labels TRANSACTION_ID,TRANSACTION_DATE,TRANSACTION_AMOUNT,USER,USER_ID,PRODUCT_MAKE,PRODUCT_MODEL,PRODUCT_PRICE

Repository purpose

This repo is a reference implementation and experiment around graph-based row/entity merging from sparse evidence.

It is intended as:

  • a demo of the algorithm
  • a companion to the paper
  • a starting point for experimentation in entity consolidation workflows

Notes

This repository focuses on deterministic graph-based merging rather than generative synthesis. It is best suited for workflows where:

  • extracted fields are already available
  • provenance matters
  • reproducibility matters
  • users want explicit control over how merges happen

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors