feat: SQL COUNT(DISTINCT column) support via HyperLogLog Sketch #367

@akanksha-akkihal

Summary
Support SQL queries of the form:

    SELECT srcip, COUNT(DISTINCT dstip) AS unique_peers
    FROM netflow_table
    WHERE time BETWEEN DATEADD(s, -11, NOW()) AND DATEADD(s, -10, NOW())
    GROUP BY srcip

This should work end-to-end through ASAPQuery’s precompute/streaming engine using a HyperLogLog (HLL) sketch (asap_sketchlib::HllSketch).

The implementation should:

  • Support approximate distinct counting via HLL
  • Integrate with the precompute pipeline
  • Be configurable via versioned inference + streaming configs
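To make the first two requirements concrete, here is a minimal HyperLogLog sketch written from scratch. It is only an illustration of the technique: `ToyHll` and its methods are hypothetical names, and nothing here reflects the actual `asap_sketchlib::HllSketch` API. It shows the two properties the precompute pipeline would rely on: inserts are idempotent for repeated values, and sketches from separate windows or partitions can be merged by taking the register-wise maximum.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative HyperLogLog. Hypothetical type, NOT asap_sketchlib::HllSketch.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyHll {
    precision: u32,     // number of hash bits used to pick a register
    registers: Vec<u8>, // 2^precision registers, each holding a max rank
}

impl ToyHll {
    pub fn new(precision: u32) -> Self {
        assert!((4..=16).contains(&precision), "precision must be in 4..=16");
        ToyHll { precision, registers: vec![0; 1 << precision] }
    }

    /// Record one value. Re-inserting the same value never changes the sketch.
    pub fn insert<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let x = hasher.finish();
        // Top `precision` bits select the register.
        let idx = (x >> (64 - self.precision)) as usize;
        // Rank = position of the first 1 bit in the remaining bits (1-based).
        let rest = x << self.precision;
        let rank = (rest.leading_zeros().min(64 - self.precision) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Merge another sketch of the same precision (register-wise max).
    pub fn merge(&mut self, other: &ToyHll) {
        assert_eq!(self.precision, other.precision, "precision mismatch");
        for (a, b) in self.registers.iter_mut().zip(&other.registers) {
            *a = (*a).max(*b);
        }
    }

    /// Approximate number of distinct values inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = match self.registers.len() {
            16 => 0.673,
            32 => 0.697,
            64 => 0.709,
            _ => 0.7213 / (1.0 + 1.079 / m),
        };
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

In the query above, one such sketch would be kept per `srcip` group and fed each `dstip`; `estimate()` would then supply `unique_peers`. With precision 12 (4096 registers) the expected relative error is roughly 1.6%.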

Current Status
COUNT(DISTINCT column) is currently not supported in the streaming/precompute path because of gaps across multiple layers:

| Layer | Gap |
| --- | --- |
| SQL parser | DISTINCT inside COUNT is ignored or not normalized; aggregation remains COUNT instead of cardinality |
| Statistic mapping | No mapping from AggregationOperator::Cardinality to Statistic::Cardinality |
| Capability matching | Statistic::Cardinality only maps to SetAggregator / DeltaSetAggregator, not HLL |
| Precompute | Missing HllAccumulator, factory dispatch, and serde support for SketchType::HLL |
| SQL matcher | SQLPatternMatcher rejects "CARDINALITY" and enforces distinct targets to be value_columns only |
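The first three gaps above amount to a missing chain of mappings, which could be sketched roughly as follows. This is an assumption-laden outline, not the project's code: the enum names echo identifiers mentioned in this issue, but their variants, the `AccumulatorKind` enum, and all three function names are hypothetical stand-ins for whatever the real types look like.

```rust
// Hypothetical stand-ins for the real ASAPQuery types; shapes are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregationOperator {
    Count,
    Cardinality,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Statistic {
    Count,
    Cardinality,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccumulatorKind {
    Counter,
    SetAggregator,
    DeltaSetAggregator,
    HllAccumulator,
}

/// Parser step: COUNT(DISTINCT x) should normalize to Cardinality, not Count.
pub fn normalize_count(distinct: bool) -> AggregationOperator {
    if distinct {
        AggregationOperator::Cardinality
    } else {
        AggregationOperator::Count
    }
}

/// Statistic mapping step: the operator-to-statistic link that is missing.
pub fn statistic_for(op: AggregationOperator) -> Statistic {
    match op {
        AggregationOperator::Count => Statistic::Count,
        AggregationOperator::Cardinality => Statistic::Cardinality,
    }
}

/// Capability matching step: Cardinality gains an HLL-backed option
/// alongside the exact set-based aggregators.
pub fn accumulators_for(stat: Statistic) -> &'static [AccumulatorKind] {
    match stat {
        Statistic::Count => &[AccumulatorKind::Counter],
        Statistic::Cardinality => &[
            AccumulatorKind::SetAggregator,
            AccumulatorKind::DeltaSetAggregator,
            AccumulatorKind::HllAccumulator,
        ],
    }
}
```

Keeping the exact set-based aggregators in the candidate list alongside the HLL option would let the configs choose between exact and approximate counting, if that distinction is wanted.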
