feat: MongoDB offline stores #6138
Conversation
- MongoDBSource: DataSource backed by a MongoDB collection, schema sampled via aggregation (default N=100)
- MongoDBOfflineStoreConfig: connection_string + default database
- MongoDBOfflineStore: delegates to the ibis PIT join engine via an in-memory memtable approach
- SavedDatasetMongoDBStorage: persist training datasets to MongoDB
- _build_data_source_reader/_build_data_source_writer closures capture config (connection_string, database) for MongoDB access

Signed-off-by: Casey Clements <[email protected]>
- Update copyright headers to 2026
- Move mongodb_to_feast_value_type to feast/type_map.py, consistent
with pg_type_to_feast_value_type and cb_columnar_type_to_feast_value_type
- Add docstrings to MongoDBOptions.to_proto/from_proto, MongoDBSource
class, and get_table_column_names_and_types
- Replace dead 'assert name' with cast(str, ...) for type-checker safety
- Add explanatory comment to validate() stub
- Remove module-level warnings.simplefilter('once', RuntimeWarning),
which was a process-wide side effect; per-call warnings.warn is enough
- Convert all assert isinstance(data_source, MongoDBSource) guards to
ValueError with descriptive messages in both public API methods and
the reader/writer closures
- Fix bug: add tz_aware=True to MongoClient in the writer closure,
matching the reader, to ensure consistent timezone-aware datetime
handling across read and write paths
Signed-off-by: Casey Clements <[email protected]>
…reIbis and MongoDBOfflineStoreNative Signed-off-by: Casey Clements <[email protected]>
…mongo, skipping as natural. Signed-off-by: Casey Clements <[email protected]>
…ed_at tie-breaker in sort Signed-off-by: Casey Clements <[email protected]>
…tores Signed-off-by: Casey Clements <[email protected]>
- Eliminate the previous PIT join approach, which scaled poorly (O(n×m))
- Use a single query to fetch all matching feature data
- Batch entity_ids into chunks of 1000 for large queries
- Flatten the features subdocument with pd.json_normalize
- Apply pd.merge_asof for an efficient PIT join per FeatureView
- Handle TTL filtering in pandas instead of MQL
- Remove unused _ttl_to_ms and _build_ttl_gte_expr helpers

Performance improvement:
- Before: 10k rows in ~188s (53 rows/s)
- After: 10k rows in ~7.4s (1,354 rows/s)
- Now competitive with the Ibis implementation

Signed-off-by: Casey Clements <[email protected]>
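The `pd.merge_asof`-based point-in-time join described above can be sketched as follows. This is a minimal illustration, not the PR's code: the column names (`entity_id`, `event_timestamp`, `value`) and the `pit_join` helper are hypothetical.

```python
import pandas as pd

def pit_join(entity_df: pd.DataFrame, feature_df: pd.DataFrame,
             ts_col: str = "event_timestamp") -> pd.DataFrame:
    # merge_asof requires both frames to be sorted by the timestamp key
    entity_df = entity_df.sort_values(ts_col)
    feature_df = feature_df.sort_values(ts_col)
    # direction="backward" selects the most recent feature row at or
    # before each entity timestamp, which is what PIT correctness needs
    return pd.merge_asof(
        entity_df, feature_df, on=ts_col, by="entity_id", direction="backward"
    )

entities = pd.DataFrame({
    "entity_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-02", "2024-01-05"]),
})
features = pd.DataFrame({
    "entity_id": [1, 1],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-04"]),
    "value": [10.0, 20.0],
})
result = pit_join(entities, features)
```

Because `merge_asof` runs a single sorted scan per group, it avoids the O(n×m) per-row lookups of the old approach.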
…tity_df
- Add CHUNK_SIZE (5000) for entity_df processing to bound memory usage
- Extract _run_single helper function for processing each chunk
- Add _chunk_dataframe generator for yielding DataFrame slices
- Preserve original row ordering via _row_idx column
- Exclude internal columns (prefixed with _) from entity key serialization
- Concat chunk results and restore ordering at the end

This allows processing arbitrarily large entity_df while keeping memory bounded by processing in 5000-row chunks.

Signed-off-by: Casey Clements <[email protected]>
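The chunking scheme above can be sketched in a few lines. This is an illustrative sketch, not the PR's `_chunk_dataframe`; the `user_id` column is hypothetical, and the 5000-row CHUNK_SIZE is the value from the commit message.

```python
import pandas as pd

CHUNK_SIZE = 5000  # value stated in the commit message

def _chunk_dataframe(df: pd.DataFrame, chunk_size: int = CHUNK_SIZE):
    """Yield successive row slices so peak memory stays bounded."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

entity_df = pd.DataFrame({"user_id": range(12000)})
entity_df["_row_idx"] = entity_df.index  # remember original ordering

# Each chunk would be processed by a _run_single-style helper; here we
# just concatenate the slices and restore the original row order.
chunks = list(_chunk_dataframe(entity_df))
result = (
    pd.concat(chunks)
    .sort_values("_row_idx")
    .drop(columns="_row_idx")
    .reset_index(drop=True)
)
```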
… sizes

Performance optimizations:
- Reuse MongoClient across chunks (was creating a new client per chunk)
- Increase CHUNK_SIZE from 5,000 to 50,000 rows
- Increase MONGO_BATCH_SIZE from 1,000 to 10,000 entity_ids
- Pass the collection to _run_single instead of creating a client each time
- Make index creation idempotent (check for an existing index)

Results (100k rows):
- Before: 21.7s
- After: 5.2s (4.2x faster)

Results (1M rows):
- Before: 1664s (28 min)
- After: 212s (3.5 min) (7.8x faster)

Signed-off-by: Casey Clements <[email protected]>
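Idempotent index creation can be sketched as below. This is not the PR's code: `FakeCollection` is a test stub standing in for PyMongo's `Collection` (`index_information()` / `create_index()`), and the field names are hypothetical.

```python
class FakeCollection:
    """Minimal in-memory stand-in for pymongo's Collection index API."""
    def __init__(self):
        self._indexes = {"_id_": {"key": [("_id", 1)]}}
        self.create_calls = 0

    def index_information(self):
        return self._indexes

    def create_index(self, keys):
        self.create_calls += 1
        name = "_".join(f"{k}_{d}" for k, d in keys)
        self._indexes[name] = {"key": list(keys)}

def ensure_index(collection, field_names):
    """Create an ascending compound index only if it doesn't already exist."""
    wanted = [(f, 1) for f in field_names]  # 1 == pymongo.ASCENDING
    existing = [list(info["key"]) for info in collection.index_information().values()]
    if wanted not in existing:
        collection.create_index(wanted)

coll = FakeCollection()
ensure_index(coll, ["driver_id", "event_timestamp"])
ensure_index(coll, ["driver_id", "event_timestamp"])  # no-op the second time
```

Checking `index_information()` before `create_index()` keeps repeated materialization runs from re-issuing index builds.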
The Native implementation now lives exclusively in mongodb_native.py with the single-collection schema. This removes the confusing duplicate that used the Ibis collection-per-FV schema. Signed-off-by: Casey Clements <[email protected]>
- Move MongoDBSource, MongoDBOptions, SavedDatasetMongoDBStorage into mongodb.py
- Move _infer_python_type_str helper into mongodb.py
- Update imports in tests and benchmarks
- Remove mongodb_source.py

This consolidates the collection-per-FV implementation into a single file, making the codebase easier to navigate.

Signed-off-by: Casey Clements <[email protected]>
- Rename module: mongodb_offline_store/ → mongodb/
- Rename files: mongodb.py → mongodb_many.py, mongodb_native.py → mongodb_one.py

Class renames:
- MongoDBSource → MongoDBSourceMany
- MongoDBOptions → MongoDBOptionsMany
- SavedDatasetMongoDBStorage → SavedDatasetMongoDBStorageMany
- MongoDBOfflineStoreIbis → MongoDBOfflineStoreMany
- MongoDBOfflineStoreIbisConfig → MongoDBOfflineStoreManyConfig
- MongoDBSourceNative → MongoDBSourceOne
- MongoDBOfflineStoreNative → MongoDBOfflineStoreOne
- MongoDBOfflineStoreNativeConfig → MongoDBOfflineStoreOneConfig
- MongoDBNativeRetrievalJob → MongoDBOneRetrievalJob

The One/Many naming reflects the core architectural difference:
- One: a single shared collection for all FeatureViews
- Many: one collection per FeatureView

Signed-off-by: Casey Clements <[email protected]>
- Rename module: mongodb/ → mongodb_offline_store/ (follows naming convention)
- Move tests to mongodb_offline_store/ subdirectory:
  - test_mongodb_offline_retrieval.py → mongodb_offline_store/test_many.py
  - test_mongodb_offline_retrieval_native.py → mongodb_offline_store/test_one.py
  - benchmark_mongodb_offline_stores.py → mongodb_offline_store/benchmark.py
- Update all imports to use the mongodb_offline_store path

Signed-off-by: Casey Clements <[email protected]>
- Clarify that indexes should be on join keys + timestamp
- Show an example for compound join keys
- Note that Many does not auto-create indexes

Signed-off-by: Casey Clements <[email protected]>
- Add _ensure_index_many() function with a module-level cache
- Call during pull_latest_from_table_or_query (materialization)
- Create index on join_keys + timestamp + created_timestamp
- Check for an existing index before creating
- Update README to reflect the auto-create behavior

Signed-off-by: Casey Clements <[email protected]>
- Rename functions: _generate_ibis_data → _generate_many_data, etc.
- Rename fixtures: ibis_config → many_config, native_config → one_config
- Rename tests: test_scale_rows_ibis → test_scale_rows_many, etc.
- Update all docstrings and print statements
- Update the summary comparison output format

Signed-off-by: Casey Clements <[email protected]>
Documents:
- Collection structure (one per FeatureView)
- Index creation (auto-created during materialization)
- Document schema (flat, top-level features)
- Point-in-time join strategy (Ibis memtables)
- Performance characteristics and memory considerations
- When to use vs MongoDBOfflineStoreOne
- Comparison table with the One implementation

Signed-off-by: Casey Clements <[email protected]>
Add missing documentation sections:
- Feature Freshness Semantics: document-level freshness, not per-feature
- Schema Evolution ('Feature Creep'): flexible schema implications
- Notes: entity keys as native types, PIT correctness, TTL constraints
Signed-off-by: Casey Clements <[email protected]>
Add DataSourceCreator implementations for MongoDB offline stores:
- MongoDBManyDataSourceCreator: fully functional, passes universal tests. Creates one collection per FeatureView with a flat document schema.
- MongoDBOneDataSourceCreator: implementation exists but is NOT registered. The One schema requires knowing join keys vs features at data creation time, but DataSourceCreator.create_data_source() doesn't receive entity definitions. See TODO in mongodb.py for details on the required interface changes.

Other changes:
- Fix data_source_class_type path in mongodb_one.py (mongodb_native -> mongodb_one)
- Improve datetime handling in mongodb_one.py for non-datetime columns
- Add 'mongodb' marker to pytest.ini
- Register MongoDBManyDataSourceCreator in repo_configuration.py

Signed-off-by: Casey Clements <[email protected]>
```python
# Get join keys (all columns except event_timestamp and internal columns)
entity_columns = [
    c
    for c in result.columns
    if c != event_timestamp_col and not c.startswith("_")
]
```
🔴 Entity columns inferred from entity_df columns instead of feature view definitions
In get_historical_features, entity columns used for entity_id serialization are derived from all non-timestamp, non-internal columns of entity_df (mongodb_one.py:669-673), rather than using get_expected_join_keys() from offline_utils like every other offline store in the codebase (BigQuery at bigquery.py:301, Couchbase at couchbase.py:174, Redshift at redshift.py:244, etc.). If entity_df contains any extra columns beyond join keys and event_timestamp (e.g., label columns, metadata), these get incorrectly included in the entity key serialization. The resulting _entity_id bytes won't match the entity_ids stored in MongoDB, causing ALL features to silently return as NULL with no error. This is particularly dangerous because the failure is silent — the query returns a valid DataFrame with the correct shape but all feature values are NULL.
Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb_one.py, the _run_single function at line 668-673 should use the Feast-standard get_expected_join_keys utility to determine entity columns, instead of inferring them from entity_df columns.
1. Import get_expected_join_keys from feast.infra.offline_stores.offline_utils
2. In get_historical_features (around line 626), compute the expected join keys using:
expected_join_keys = offline_utils.get_expected_join_keys(project, feature_views, registry)
3. Also add the standard validation:
offline_utils.assert_expected_columns_in_entity_df(entity_schema, expected_join_keys, event_timestamp_col)
4. In _run_single, replace the entity_columns heuristic (lines 668-673) with:
entity_columns = sorted(expected_join_keys)
(the expected_join_keys variable is accessible as a closure variable)
This matches the pattern used by all other offline stores (BigQuery, Redshift, Couchbase, Postgres, etc.) and ensures correct entity_id serialization even when entity_df contains extra non-join-key columns.
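The failure mode this review describes is easy to reproduce in plain pandas. This is a minimal illustration with hypothetical column names (`user_id`, `label`), not code from the PR:

```python
import pandas as pd

entity_df = pd.DataFrame({
    "user_id": [1],
    "event_timestamp": pd.to_datetime(["2024-01-01"]),
    "label": [0],  # extra label column: present in entity_df, not a join key
})

# The heuristic flagged in the review: treat every non-timestamp,
# non-internal column as a join key.
inferred = [
    c for c in entity_df.columns
    if c != "event_timestamp" and not c.startswith("_")
]

# What get_expected_join_keys would return from the feature view
# definitions (assumed here for illustration).
expected_join_keys = ["user_id"]
```

Because `"label"` lands in `inferred`, it would be folded into the serialized entity key, producing `_entity_id` bytes that can never match what was written during materialization.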
```python
    - ``created_at``: ingestion time
    """

    _index_initialized: bool = False
```
🟡 _index_initialized class boolean skips index creation for different MongoDB configs
MongoDBOfflineStoreOne._index_initialized is a class-level bool that, once set to True by _get_client_and_ensure_indexes, prevents index creation for ALL subsequent calls regardless of config. If the code is used with different RepoConfig values in the same process (different databases, collections, or connection strings), indexes are only created for the first config. Compare with mongodb_many.py:599-615 which uses a per-{db_name}.{collection_name} cache set — that approach is also imperfect (doesn't key on connection string) but much more granular than a single boolean.
Suggested change:
```diff
-    _index_initialized: bool = False
+    _indexes_ensured: set = set()
```
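The suggested per-config cache can be sketched as follows. This is an illustrative sketch, not the PR's class; the method name `ensure_indexes` and the `create_fn` callback are hypothetical, and as the review notes, keying on `db.collection` still ignores the connection string.

```python
from typing import Callable, Set

class MongoOfflineStoreSketch:
    # Class-level cache keyed per "db.collection" rather than a single
    # bool, so each distinct config in the process gets its indexes.
    _indexes_ensured: Set[str] = set()

    def ensure_indexes(self, db_name: str, collection_name: str,
                       create_fn: Callable[[], None]) -> bool:
        key = f"{db_name}.{collection_name}"
        if key in self._indexes_ensured:
            return False  # already ensured for this config
        create_fn()
        self._indexes_ensured.add(key)
        return True

calls = []
store = MongoOfflineStoreSketch()
store.ensure_indexes("feast", "features", lambda: calls.append("a"))
store.ensure_indexes("feast", "features", lambda: calls.append("b"))  # skipped
store.ensure_indexes("other", "features", lambda: calls.append("c"))  # new config
```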
```python
def _serialize_entity_key(self, row: pd.Series, join_keys: list[str]) -> bytes:
    """Serialize entity key columns to bytes."""
    entity_key = EntityKeyProto()
    for key in join_keys:
```
🟡 _serialize_entity_key in data source creator doesn't sort join keys, causing entity_id mismatch
MongoDBOneDataSourceCreator._serialize_entity_key iterates join_keys without sorting (line 181: for key in join_keys:), but the query-side function _serialize_entity_key_from_row in mongodb_one.py:371 uses for key in sorted(join_keys). For compound join keys (e.g., ["user_id", "device_id"]), the entity_ids written during data ingestion may have keys in a different order than those generated at query time. Since entity key serialization is order-dependent, this produces different bytes and causes query mismatches (all features returned as NULL). Additionally, the isinstance(value, bool) check at line 191 is unreachable dead code because bool is a subclass of int in Python, so the isinstance(value, int) check at line 185 always matches first.
Suggested change:
```diff
-    for key in join_keys:
+    for key in sorted(join_keys):
```
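Both pitfalls from this comment are reproducible in plain Python. The `classify` and `serialize` helpers below are illustrative, not the PR's serialization code:

```python
# Pitfall 1: bool is a subclass of int, so an isinstance(value, int)
# branch placed before isinstance(value, bool) makes the bool branch
# unreachable dead code.
def classify(value):
    if isinstance(value, int):
        return "int"
    if isinstance(value, bool):  # never reached: True/False match int first
        return "bool"
    return "other"

# Pitfall 2: serialization that iterates join keys is order-dependent,
# so writer and reader must agree on an ordering (e.g. sorted()).
def serialize(join_keys, row):
    return b"|".join(f"{k}={row[k]}".encode() for k in join_keys)

row = {"user_id": 1, "device_id": 2}
unsorted_bytes = serialize(["user_id", "device_id"], row)
sorted_bytes = serialize(sorted(["user_id", "device_id"]), row)
```

With compound keys, the two orderings yield different bytes, which is exactly why the write path and query path must both sort.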
Signed-off-by: Casey Clements <[email protected]>
What this PR does / why we need it:
Summary
This PR introduces two MongoDB offline store implementations with different schema designs, optimized for different use cases:
Motivation
MongoDB users need flexibility in how they structure their feature data:
Schema Comparison
Many Schema (one collection per FeatureView)
One Schema (single shared collection)
Performance Benchmarks
Benchmarks with 10 features, 3 historical rows per entity:
Key insight: Many is faster but uses 10x more memory at 1M rows. Choose based on your constraints.
Features
Both Implementations
- `get_historical_features` with point-in-time correctness
- `pull_latest_from_table_or_query` for materialization
- `pull_all_from_table_or_query` for batch retrieval

Many-Specific
One-Specific
Configuration
Many
One
File Structure
Testing
Unit Tests
Universal Integration Tests (Many only)
```shell
FEAST_LOCAL_ONLINE_CONTAINER=True pytest \
  sdk/python/tests/integration/offline_store/test_universal_historical_retrieval.py \
  --integration -k "MongoDBMany"
```

Benchmarks
```shell
pytest sdk/python/tests/unit/infra/offline_stores/contrib/mongodb_offline_store/benchmark.py \
  -v -s -k "test_summary_comparison"
```

Known Limitations
One + Universal Tests: The `MongoDBOneDataSourceCreator` is not registered for universal tests. The One schema requires entity/join key information at data creation time, but `DataSourceCreator.create_data_source()` doesn't receive entity definitions. See TODO in `mongodb.py`.

LoggingDestination: Neither implementation has feature logging support yet.
SavedDatasetStorage: Only Many has `SavedDatasetMongoDBStorageMany`.

Misc
Reviewers
Please pay attention to: