
perf(postgres): Optimizing Feast offline store for date-range multi-FV retrieval #6057

Open
Vperiodt wants to merge 9 commits into feast-dev:master from Vperiodt:patch-query

Conversation


Vperiodt commented Mar 4, 2026

What this PR does / why we need it:

  1. Replaces the date-range multi-FV path in the Postgres offline store (previously base_entities + one LEFT JOIN LATERAL per feature view) with a set-based LOCF (Last Observation Carried Forward) implementation

  2. Uses a single timeline: stack (UNION ALL of spine + feature rows), COUNT for group boundaries, FIRST_VALUE for forward-fill, then filter to spine and apply per-FV TTL.
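The stacked-timeline technique described above can be demonstrated end-to-end on any engine with window functions. Below is an illustrative, self-contained sketch using Python's built-in sqlite3 (the table and column names are invented for the demo and are not the actual Feast template): feature observations at t=1 and t=4 are forward-filled onto spine timestamps t=2, 3, and 5.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE spine (ts INTEGER);
CREATE TABLE feats (ts INTEGER, val INTEGER);
INSERT INTO spine VALUES (2), (3), (5);
INSERT INTO feats VALUES (1, 10), (4, 40);
""")

rows = con.execute("""
WITH stacked AS (
    -- Spine rows carry NULL values; feature rows carry observations.
    SELECT ts, NULL AS val, 1 AS is_spine FROM spine
    UNION ALL
    SELECT ts, val, 0 AS is_spine FROM feats
),
grouped AS (
    -- COUNT(val) only increments on non-NULL feature rows, so every
    -- spine row inherits the group of the latest observation at or
    -- before it (feature rows sort before spine rows at equal ts).
    SELECT *, COUNT(val) OVER (
        ORDER BY ts, is_spine
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS grp FROM stacked
),
filled AS (
    -- The first row of each group is the observation itself,
    -- so FIRST_VALUE forward-fills it across the group.
    SELECT *, FIRST_VALUE(val) OVER (
        PARTITION BY grp ORDER BY ts, is_spine
    ) AS filled_val FROM grouped
)
SELECT ts, filled_val FROM filled WHERE is_spine = 1 ORDER BY ts
""").fetchall()

print(rows)  # [(2, 10), (3, 10), (5, 40)]
```

The actual template additionally filters each feature view by TTL and joins the filled values back to the spine; this sketch only shows the core stack / count / first-value pipeline.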

Which issue(s) this PR fixes:

Fixes slow get_historical_features on the Postgres offline store for date-range retrieval with multiple feature views. The LATERAL/inequality joins had O(N×M) cost; this PR switches to a LOCF path with one stacked timeline of size L, an O(L log L) sort, and O(L) window passes, bringing total cost to O(L log L).

Misc



Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Vperiodt marked this pull request as ready for review March 4, 2026 07:06
Vperiodt requested a review from a team as a code owner March 4, 2026 07:06
devin-ai-integration[bot]

This comment was marked as resolved.

tokoko (Collaborator) commented Mar 4, 2026

@Vperiodt thanks for the PR. If it's not too much trouble, can you give a small example of result queries for, say, a 2-FV scenario and highlight the differences this PR introduces? The jinja templates are hard enough to read through even when one knows what the intent is 😄

Vperiodt (Author) commented Mar 4, 2026

> @Vperiodt thanks for the PR. If it's not too much trouble, can you give a small example of result queries for, say, a 2-FV scenario and highlight the differences this PR introduces? The jinja templates are hard enough to read through even when one knows what the intent is 😄

Sure! Will add it in the description itself.

Vperiodt and others added 3 commits March 4, 2026 16:38
Vperiodt marked this pull request as draft March 5, 2026 17:18
Vperiodt and others added 3 commits March 9, 2026 10:43
Vperiodt marked this pull request as ready for review March 9, 2026 21:20

YassinNouh21 (Collaborator) left a comment

The LOCF approach is a solid improvement over the LATERAL join path — an O(L log L) sort plus O(L) window passes vs O(N×M) inequality joins is a real win for large date-range queries. The per-FV independent pipeline is clean and avoids cross-FV interference.

A few issues to address before merge — one pre-existing bug that this PR should fix since it's modifying the non-entity path, dead code in the template, and some test coverage gaps.

else start_date
)

entity_df = pd.DataFrame(

Bug (pre-existing, should fix in this PR): This still uses pd.date_range(start=start_date, ...)[:1] which takes start_date as the entity timestamp, not end_date. This was identified in the PR #6066 review — the ClickHouse implementation correctly uses [end_date].

Since this PR is already modifying the non-entity retrieval path, it's a good opportunity to fix this:

entity_df = pd.DataFrame({"event_timestamp": [end_date]})

In the current code, the entity_df isn't used by the LOCF query path (the spine replaces it), but it's still uploaded to a temp table via _upload_entity_df — so at minimum it wastes a round-trip to Postgres. Even better: skip creating/uploading the entity_df entirely when start_date and end_date are set, since the LOCF path never references left_table_query_string.
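The suggested guard might look like the sketch below. This is purely illustrative: the real code path and helpers like `_upload_entity_df` live in the Feast Postgres offline store, and `build_entity_df` here is a hypothetical helper invented for the example.

```python
from datetime import datetime
from typing import Optional

def build_entity_df(start_date: Optional[datetime], end_date: Optional[datetime]):
    """Return entity rows to upload, or None when the upload can be skipped."""
    if start_date is not None and end_date is not None:
        # Date-range LOCF path: the query builds its own spine and never
        # references left_table_query_string, so skip the temp-table upload.
        return None
    # Point-in-time path: anchor the single-row frame at end_date,
    # not start_date (the pre-existing bug noted above).
    return {"event_timestamp": [end_date]}
```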

# Compute lookback_start_date for LOCF: pull feature data
# from (start_date - max_ttl) so window functions can
# forward-fill the last observation before start_date.
max_ttl_seconds = 0

Duplicate computation: When start_date is None, max_ttl_seconds is already computed in the block above (lines 147-151) to derive start_date. This block recomputes the exact same value. Consider hoisting max_ttl_seconds to a single computation before both uses:

max_ttl_seconds = max(
    (int(fv.ttl.total_seconds()) for fv in feature_views
     if fv.ttl and isinstance(fv.ttl, timedelta)),
    default=0,
)

{% endfor %}

{# Build list of output feature names per FV for consistent column ordering #}
{% set all_feature_cols = [] %}

Dead code: all_feature_cols is computed here but never referenced in the rest of the template. The final SELECT iterates over featureview.features directly and computes col_name inline. This entire block (lines 569-579) can be removed.

"{{ featureview.name }}__grouped" AS (
SELECT *,
COUNT(feature_anchor) OVER (
PARTITION BY {% if featureview.entities %}{% for entity in featureview.entities %}"{{ entity }}"{% if not loop.last %}, {% endif %}{% endfor %}{% else %}1{% endif %}

Entityless FV edge case: PARTITION BY 1 works in Postgres but is non-standard and some query planners may not optimize it well. An alternative is PARTITION BY (SELECT NULL) or simply omitting the PARTITION clause (which gives one partition over the whole result set).

Also worth adding a test case for entityless feature views to verify the PARTITION BY 1 and {% else %}1{% endif %} paths generate valid SQL.
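As a quick sanity check of the fallback semantics (shown in sqlite3 for convenience, though the window semantics match Postgres), a constant partition key and an omitted PARTITION clause produce identical windows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (ts INTEGER, v INTEGER);
INSERT INTO t VALUES (1, 10), (2, 20), (3, 30);
""")

# Constant-partition and no-partition windows both span the whole table,
# giving the same running counts.
a = con.execute(
    "SELECT ts, COUNT(v) OVER (PARTITION BY 1 ORDER BY ts) FROM t ORDER BY ts"
).fetchall()
b = con.execute(
    "SELECT ts, COUNT(v) OVER (ORDER BY ts) FROM t ORDER BY ts"
).fetchall()
print(a == b)  # True
```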

LEFT JOIN "{{ featureview.name }}__filled" AS "{{ featureview.name }}__f"
ON spine.event_timestamp = "{{ featureview.name }}__f".event_timestamp
{% for entity in all_entities %}
AND spine."{{ entity }}" IS NOT DISTINCT FROM "{{ featureview.name }}__f"."{{ entity }}"

Good use of IS NOT DISTINCT FROM here. This fixes a bug in the old LATERAL join path where NULL = NULL would fail the join for FVs with different entity sets. Worth calling this out in the PR description as a bug fix.
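The three-valued-logic pitfall is easy to reproduce. sqlite3 spells the null-safe comparison `IS` where Postgres uses `IS NOT DISTINCT FROM`, but the behavior of plain `=` is the same in both:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Plain equality on NULL operands yields UNKNOWN (surfaced as None),
# so a join predicate like a.entity = b.entity silently drops rows
# where both sides are NULL; the null-safe form treats them as equal.
eq, null_safe = con.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()
print(eq, null_safe)  # None 1
```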

{% set col_name = featureview.field_mapping.get(feature, feature) %}
{% endif %}
{% if featureview.ttl != 0 %}
CASE WHEN (spine.event_timestamp - "{{ featureview.name }}__f"."{{ featureview.name }}__filled_ts") <= {{ featureview.ttl }} * interval '1' second

The TTL enforcement via CASE is correct — when filled_ts is NULL (no prior observation), the subtraction yields NULL, the comparison is UNKNOWN, and CASE falls through to ELSE NULL.

One edge case: if featureview.ttl is a very large number, {{ featureview.ttl }} * interval '1' second could overflow on some Postgres versions. Consider using make_interval(secs => {{ featureview.ttl }}) for safety, though in practice TTLs are unlikely to overflow.
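The NULL fall-through described above can be checked directly (sqlite3 shown; Postgres CASE semantics are identical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# With no prior observation, filled_ts is NULL: the subtraction yields
# NULL, the <= comparison is UNKNOWN, and CASE hits the implicit ELSE NULL.
(ttl_result,) = con.execute(
    "SELECT CASE WHEN (100 - NULL) <= 60 THEN 'fresh' END"
).fetchone()
print(ttl_result)  # None
```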

WHERE "{{ featureview.timestamp_field }}" BETWEEN '{{ lookback_start_date or start_date }}' AND '{{ end_date }}'
),
{% if featureview.created_timestamp_column %}
"{{ featureview.name }}__data" AS (

nit: When there's no created_timestamp_column, this CTE is a no-op passthrough (SELECT * FROM __data_raw). Postgres will likely optimize it away, but you could simplify by using __data_raw as the alias directly:

{% if featureview.created_timestamp_column %}
"{{ featureview.name }}__data" AS (
    SELECT * FROM (
        ... dedup logic ...
    ) __dedup WHERE __rn = 1
),
{% endif %}

Then reference {{ featureview.name }}__data{% if not featureview.created_timestamp_column %}_raw{% endif %} downstream. Though I understand the current approach is simpler to maintain.
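For reference, the dedup shape this comment refers to (latest created timestamp wins per entity and event timestamp) can be sketched like this in sqlite3, with invented table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw (entity INTEGER, event_ts INTEGER, created_ts INTEGER, v TEXT);
INSERT INTO raw VALUES
    (1, 100, 5, 'stale'), (1, 100, 9, 'latest'), (2, 100, 3, 'only');
""")

# ROW_NUMBER ranks duplicates within each (entity, event_ts) group by
# created_ts descending; keeping rank 1 retains only the newest write.
rows = con.execute("""
SELECT entity, v FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY entity, event_ts ORDER BY created_ts DESC
    ) AS __rn FROM raw
) AS __dedup WHERE __rn = 1 ORDER BY entity
""").fetchall()
print(rows)  # [(1, 'latest'), (2, 'only')]
```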

def test_lateral_join_ttl_constraints(self):
"""Test that LATERAL JOINs include proper TTL constraints"""
from jinja2 import BaseLoader, Environment
def test_locf_template_multi_fv_date_range(self):

Good tests for the happy path and TTL=0 case. Consider adding coverage for:

  1. Different entity sets across FVs — e.g., fv1 has driver_id, fv2 has customer_id. This exercises the NULL AS "{{ entity }}" and IS NOT DISTINCT FROM paths which are the trickiest part of the LOCF approach.
  2. Entityless feature view — verifies the PARTITION BY 1 fallback generates valid SQL.
  3. created_timestamp_column set — verifies the dedup CTE (__data_raw → __data via ROW_NUMBER) is generated correctly.
  4. full_feature_names=True — verifies the featureview.name ~ '__' ~ feature column naming.

These are the main code paths that differ from the happy path and are most likely to produce malformed SQL.
