Skip to content

feat: MongoDB offline stores#6138

Open
caseyclements wants to merge 30 commits intofeast-dev:masterfrom
caseyclements:FEAST-OfflineStore-INTPYTHON-297
Open

feat: MongoDB offline stores#6138
caseyclements wants to merge 30 commits intofeast-dev:masterfrom
caseyclements:FEAST-OfflineStore-INTPYTHON-297

Conversation

@caseyclements
Copy link
Contributor

@caseyclements caseyclements commented Mar 20, 2026

What this PR does / why we need it:

Summary

This PR introduces two MongoDB offline store implementations with different schema designs, optimized for different use cases:

Implementation Schema Best For
MongoDBOfflineStoreMany One collection per FeatureView Performance, small-medium feature stores
MongoDBOfflineStoreOne Single shared collection Memory efficiency, large feature stores

Motivation

MongoDB users need flexibility in how they structure their feature data:

  1. Many (collection-per-FV): Simpler schema, faster queries via Ibis memtables, but loads entire collections into memory
  2. One (single collection): Memory-bounded via chunked processing, filters by entity_id, but slower at scale

Schema Comparison

Many Schema (one collection per FeatureView)

// Collection: driver_stats
{
  "driver_id": 1001,
  "event_timestamp": ISODate("2026-01-20T12:00:00Z"),
  "created_at": ISODate("2026-01-20T12:00:05Z"),
  "rating": 4.91,
  "trips_last_7d": 132
}

One Schema (single shared collection)

// Collection: feature_history (shared by all FVs)
{
  "entity_id": Binary("..."),           // Serialized entity key
  "feature_view": "driver_stats",       // Discriminator
  "features": {                         // Nested subdocument
    "rating": 4.91,
    "trips_last_7d": 132
  },
  "event_timestamp": ISODate("2026-01-20T12:00:00Z"),
  "created_at": ISODate("2026-01-20T12:00:05Z")
}

Performance Benchmarks

Benchmarks with 10 features, 3 historical rows per entity:

Entity Rows Many Time One Time Many Memory One Memory Winner
1,000 0.44s 0.19s 5.7 MB 7.2 MB One
10,000 0.57s 1.25s 49.8 MB 69.5 MB Many
100,000 4.82s 14.47s 490.8 MB 352.2 MB Many
1,000,000 56.45s 296.42s 4,906 MB 494 MB Many

Key insight: Many is faster but uses 10x more memory at 1M rows. Choose based on your constraints.

Features

Both Implementations

  • get_historical_features with point-in-time correctness
  • pull_latest_from_table_or_query for materialization
  • pull_all_from_table_or_query for batch retrieval
  • ✅ TTL support per FeatureView
  • ✅ Auto-index creation during materialization
  • ✅ Created timestamp tie-breaking

Many-Specific

  • ✅ Universal test framework integration (DataSourceCreator)
  • ✅ Flat document schema (easy to query directly)

One-Specific

  • ✅ Chunked processing (bounded memory)
  • ✅ Entity-filtered queries (doesn't load entire collection)
  • ✅ Schema matches online store pattern

Configuration

Many

offline_store:
  type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_many.MongoDBOfflineStoreMany
  connection_string: mongodb://localhost:27017
  database: feast

One

offline_store:
  type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_one.MongoDBOfflineStoreOne
  connection_string: mongodb://localhost:27017
  database: feast
  collection: feature_history

File Structure

feast/infra/offline_stores/contrib/mongodb_offline_store/
├── __init__.py          # Shared DRIVER_METADATA
├── README.md            # Documentation with decision flowchart
├── mongodb_many.py      # MongoDBOfflineStoreMany, MongoDBSourceMany
└── mongodb_one.py       # MongoDBOfflineStoreOne, MongoDBSourceOne

tests/unit/infra/offline_stores/contrib/mongodb_offline_store/
├── test_many.py         # Unit tests for Many
├── test_one.py          # Unit tests for One
└── benchmark.py         # Performance comparison

tests/universal/feature_repos/universal/data_sources/
└── mongodb.py           # DataSourceCreators for universal tests

Testing

Unit Tests

pytest sdk/python/tests/unit/infra/offline_stores/contrib/mongodb_offline_store/ -v

Universal Integration Tests (Many only)

FEAST_LOCAL_ONLINE_CONTAINER=True pytest \
  sdk/python/tests/integration/offline_store/test_universal_historical_retrieval.py \
  --integration -k "MongoDBMany"

Benchmarks

pytest sdk/python/tests/unit/infra/offline_stores/contrib/mongodb_offline_store/benchmark.py \
  -v -s -k "test_summary_comparison"

Known Limitations

  1. One + Universal Tests: The MongoDBOneDataSourceCreator is not registered for universal tests. The One schema requires entity/join key information at data creation time, but DataSourceCreator.create_data_source() doesn't receive entity definitions. See TODO in mongodb.py.

  2. LoggingDestination: Neither implementation has feature logging support yet.

  3. SavedDatasetStorage: Only Many has SavedDatasetMongoDBStorageMany.

Misc

Reviewers

Please pay attention to:

  1. Schema design decisions (Many vs One trade-offs)
  2. Index creation strategy
  3. Universal test framework integration

Open with Devin

- MongoDBSource: DataSource backed by a MongoDB collection, schema
  sampled via \ aggregation (default N=100)
- MongoDBOfflineStoreConfig: connection_string + default database
- MongoDBOfflineStore: delegates to ibis PIT join engine via
  in-memory memtable approach
- SavedDatasetMongoDBStorage: persist training datasets to MongoDB
- _build_data_source_reader/_build_data_source_writer closures capture
  config (connection_string, database) for MongoDB access

Signed-off-by: Casey Clements <[email protected]>
- Update copyright headers to 2026
- Move mongodb_to_feast_value_type to feast/type_map.py, consistent
  with pg_type_to_feast_value_type and cb_columnar_type_to_feast_value_type
- Add docstrings to MongoDBOptions.to_proto/from_proto, MongoDBSource
  class, and get_table_column_names_and_types
- Replace dead 'assert name' with cast(str, ...) for type-checker safety
- Add explanatory comment to validate() stub
- Remove module-level warnings.simplefilter('once', RuntimeWarning),
  which was a process-wide side effect; per-call warnings.warn is enough
- Convert all assert isinstance(data_source, MongoDBSource) guards to
  ValueError with descriptive messages in both public API methods and
  the reader/writer closures
- Fix bug: add tz_aware=True to MongoClient in the writer closure,
  matching the reader, to ensure consistent timezone-aware datetime
  handling across read and write paths

Signed-off-by: Casey Clements <[email protected]>
…reIbis and MongoDBOfflineStoreNative

Signed-off-by: Casey Clements <[email protected]>
…mongo, skipping as natural.

Signed-off-by: Casey Clements <[email protected]>
…ed_at tie-breaker in sort

Signed-off-by: Casey Clements <[email protected]>
Signed-off-by: Casey Clements <[email protected]>
- Eliminate -based PIT join which scaled poorly (O(n×m))
- Use single  query to fetch all matching feature data
- Batch entity_ids into chunks of 1000 for large queries
- Flatten features subdoc with pd.json_normalize
- Apply pd.merge_asof for efficient PIT join per FeatureView
- Handle TTL filtering in pandas instead of MQL \
- Remove unused _ttl_to_ms and _build_ttl_gte_expr helpers

Performance improvement:
- Before: 10k rows in ~188s (53 rows/s)
- After:  10k rows in ~7.4s (1,354 rows/s)
- Now competitive with Ibis implementation

Signed-off-by: Casey Clements <[email protected]>
…tity_df

- Add CHUNK_SIZE (5000) for entity_df processing to bound memory usage
- Extract _run_single helper function for processing each chunk
- Add _chunk_dataframe generator for yielding DataFrame slices
- Preserve original row ordering via _row_idx column
- Exclude internal columns (prefixed with _) from entity key serialization
- Concat chunk results and restore ordering at the end

This allows processing arbitrarily large entity_df while keeping
memory bounded by processing in 5000-row chunks.

Signed-off-by: Casey Clements <[email protected]>
… sizes

Performance optimizations:
- Reuse MongoClient across chunks (was creating new client per chunk)
- Increase CHUNK_SIZE from 5,000 to 50,000 rows
- Increase MONGO_BATCH_SIZE from 1,000 to 10,000 entity_ids
- Pass collection to _run_single instead of creating client each time
- Make index creation idempotent (check for existing index)

Results (100k rows):
- Before: 21.7s
- After: 5.2s (4.2x faster)

Results (1M rows):
- Before: 1664s (28 min)
- After: 212s (3.5 min) (7.8x faster)

Signed-off-by: Casey Clements <[email protected]>
The Native implementation now lives exclusively in mongodb_native.py
with the single-collection schema. This removes the confusing duplicate
that used the Ibis collection-per-FV schema.

Signed-off-by: Casey Clements <[email protected]>
- Move MongoDBSource, MongoDBOptions, SavedDatasetMongoDBStorage into mongodb.py
- Move _infer_python_type_str helper into mongodb.py
- Update imports in tests and benchmarks
- Remove mongodb_source.py

This consolidates the collection-per-FV implementation into a single file,
making the codebase easier to navigate.

Signed-off-by: Casey Clements <[email protected]>
- Rename module: mongodb_offline_store/ → mongodb/
- Rename files: mongodb.py → mongodb_many.py, mongodb_native.py → mongodb_one.py

Class renames:
- MongoDBSource → MongoDBSourceMany
- MongoDBOptions → MongoDBOptionsMany
- SavedDatasetMongoDBStorage → SavedDatasetMongoDBStorageMany
- MongoDBOfflineStoreIbis → MongoDBOfflineStoreMany
- MongoDBOfflineStoreIbisConfig → MongoDBOfflineStoreManyConfig
- MongoDBSourceNative → MongoDBSourceOne
- MongoDBOfflineStoreNative → MongoDBOfflineStoreOne
- MongoDBOfflineStoreNativeConfig → MongoDBOfflineStoreOneConfig
- MongoDBNativeRetrievalJob → MongoDBOneRetrievalJob

The One/Many naming reflects the core architectural difference:
- One: Single shared collection for all FeatureViews
- Many: One collection per FeatureView

Signed-off-by: Casey Clements <[email protected]>
- Rename module: mongodb/ → mongodb_offline_store/ (follows naming convention)
- Move tests to mongodb_offline_store/ subdirectory:
  - test_mongodb_offline_retrieval.py → mongodb_offline_store/test_many.py
  - test_mongodb_offline_retrieval_native.py → mongodb_offline_store/test_one.py
  - benchmark_mongodb_offline_stores.py → mongodb_offline_store/benchmark.py
- Update all imports to use mongodb_offline_store path

Signed-off-by: Casey Clements <[email protected]>
Signed-off-by: Casey Clements <[email protected]>
- Clarify that indexes should be on join keys + timestamp
- Show example for compound join keys
- Note that Many does not auto-create indexes

Signed-off-by: Casey Clements <[email protected]>
- Add _ensure_index_many() function with module-level cache
- Call during pull_latest_from_table_or_query (materialization)
- Creates index on join_keys + timestamp + created_timestamp
- Checks for existing index before creating
- Update README to reflect auto-create behavior

Signed-off-by: Casey Clements <[email protected]>
- Rename functions: _generate_ibis_data → _generate_many_data, etc.
- Rename fixtures: ibis_config → many_config, native_config → one_config
- Rename tests: test_scale_rows_ibis → test_scale_rows_many, etc.
- Update all docstrings and print statements
- Update summary comparison output format

Signed-off-by: Casey Clements <[email protected]>
Documents:
- Collection structure (one per FeatureView)
- Index creation (auto-created during materialization)
- Document schema (flat, top-level features)
- Point-in-time join strategy (Ibis memtables)
- Performance characteristics and memory considerations
- When to use vs MongoDBOfflineStoreOne
- Comparison table with One implementation

Signed-off-by: Casey Clements <[email protected]>
Add missing documentation sections:
- Feature Freshness Semantics: document-level freshness, not per-feature
- Schema Evolution ('Feature Creep'): flexible schema implications
- Notes: entity keys as native types, PIT correctness, TTL constraints

Signed-off-by: Casey Clements <[email protected]>
Add DataSourceCreator implementations for MongoDB offline stores:

- MongoDBManyDataSourceCreator: Fully functional, passes universal tests.
  Creates one collection per FeatureView with flat document schema.

- MongoDBOneDataSourceCreator: Implementation exists but NOT registered.
  The One schema requires knowing join keys vs features at data creation
  time, but DataSourceCreator.create_data_source() doesn't receive entity
  definitions. See TODO in mongodb.py for details on required interface
  changes.

Other changes:
- Fix data_source_class_type path in mongodb_one.py (mongodb_native -> mongodb_one)
- Improve datetime handling in mongodb_one.py for non-datetime columns
- Add 'mongodb' marker to pytest.ini
- Register MongoDBManyDataSourceCreator in repo_configuration.py

Signed-off-by: Casey Clements <[email protected]>
@caseyclements caseyclements requested a review from a team as a code owner March 20, 2026 20:48
@caseyclements caseyclements changed the title Feast offline store intpython 297 feat: MongoDB offline stores Mar 20, 2026
Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment on lines +668 to +673
# Get join keys (all columns except event_timestamp and internal columns)
entity_columns = [
c
for c in result.columns
if c != event_timestamp_col and not c.startswith("_")
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Entity columns inferred from entity_df columns instead of feature view definitions

In get_historical_features, entity columns used for entity_id serialization are derived from all non-timestamp, non-internal columns of entity_df (mongodb_one.py:669-673), rather than using get_expected_join_keys() from offline_utils like every other offline store in the codebase (BigQuery at bigquery.py:301, Couchbase at couchbase.py:174, Redshift at redshift.py:244, etc.). If entity_df contains any extra columns beyond join keys and event_timestamp (e.g., label columns, metadata), these get incorrectly included in the entity key serialization. The resulting _entity_id bytes won't match the entity_ids stored in MongoDB, causing ALL features to silently return as NULL with no error. This is particularly dangerous because the failure is silent — the query returns a valid DataFrame with the correct shape but all feature values are NULL.

Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb_one.py, the _run_single function at line 668-673 should use the Feast-standard get_expected_join_keys utility to determine entity columns, instead of inferring them from entity_df columns. 

1. Import get_expected_join_keys from feast.infra.offline_stores.offline_utils
2. In get_historical_features (around line 626), compute the expected join keys using:
   expected_join_keys = offline_utils.get_expected_join_keys(project, feature_views, registry)
3. Also add the standard validation:
   offline_utils.assert_expected_columns_in_entity_df(entity_schema, expected_join_keys, event_timestamp_col)
4. In _run_single, replace the entity_columns heuristic (lines 668-673) with:
   entity_columns = sorted(expected_join_keys)
   (the expected_join_keys variable is accessible as a closure variable)

This matches the pattern used by all other offline stores (BigQuery, Redshift, Couchbase, Postgres, etc.) and ensures correct entity_id serialization even when entity_df contains extra non-join-key columns.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

- ``created_at``: ingestion time
"""

_index_initialized: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 _index_initialized class boolean skips index creation for different MongoDB configs

MongoDBOfflineStoreOne._index_initialized is a class-level bool that, once set to True by _get_client_and_ensure_indexes, prevents index creation for ALL subsequent calls regardless of config. If the code is used with different RepoConfig values in the same process (different databases, collections, or connection strings), indexes are only created for the first config. Compare with mongodb_many.py:599-615 which uses a per-{db_name}.{collection_name} cache set — that approach is also imperfect (doesn't key on connection string) but much more granular than a single boolean.

Suggested change
_index_initialized: bool = False
_indexes_ensured: set = set()
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

def _serialize_entity_key(self, row: pd.Series, join_keys: list[str]) -> bytes:
"""Serialize entity key columns to bytes."""
entity_key = EntityKeyProto()
for key in join_keys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 _serialize_entity_key in data source creator doesn't sort join keys, causing entity_id mismatch

MongoDBOneDataSourceCreator._serialize_entity_key iterates join_keys without sorting (line 181: for key in join_keys:), but the query-side function _serialize_entity_key_from_row in mongodb_one.py:371 uses for key in sorted(join_keys). For compound join keys (e.g., ["user_id", "device_id"]), the entity_ids written during data ingestion may have keys in a different order than those generated at query time. Since entity key serialization is order-dependent, this produces different bytes and causes query mismatches (all features returned as NULL). Additionally, the isinstance(value, bool) check at line 191 is unreachable dead code because bool is a subclass of int in Python, so the isinstance(value, int) check at line 185 always matches first.

Suggested change
for key in join_keys:
for key in sorted(join_keys):
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Signed-off-by: Casey Clements <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant