feat: MongoDB offline stores by caseyclements · Pull Request #6138 · feast-dev/feast

caseyclements · 2026-03-20T20:48:05Z

What this PR does / why we need it:

Summary

This PR introduces two MongoDB offline store implementations with different schema designs, optimized for different use cases:

Implementation	Schema	Best For
MongoDBOfflineStoreMany	One collection per FeatureView	Performance, small-medium feature stores
MongoDBOfflineStoreOne	Single shared collection	Memory efficiency, large feature stores

Motivation

MongoDB users need flexibility in how they structure their feature data:

Many (collection-per-FV): Simpler schema, faster queries via Ibis memtables, but loads entire collections into memory
One (single collection): Memory-bounded via chunked processing, filters by entity_id, but slower at scale

Schema Comparison

Many Schema (one collection per FeatureView)

// Collection: driver_stats
{
  "driver_id": 1001,
  "event_timestamp": ISODate("2026-01-20T12:00:00Z"),
  "created_at": ISODate("2026-01-20T12:00:05Z"),
  "rating": 4.91,
  "trips_last_7d": 132
}

One Schema (single shared collection)

// Collection: feature_history (shared by all FVs)
{
  "entity_id": Binary("..."),           // Serialized entity key
  "feature_view": "driver_stats",       // Discriminator
  "features": {                         // Nested subdocument
    "rating": 4.91,
    "trips_last_7d": 132
  },
  "event_timestamp": ISODate("2026-01-20T12:00:00Z"),
  "created_at": ISODate("2026-01-20T12:00:05Z")
}

Performance Benchmarks

Benchmarks with 10 features, 3 historical rows per entity:

Entity Rows	Many Time	One Time	Many Memory	One Memory	Winner
1,000	0.44s	0.19s	5.7 MB	7.2 MB	One
10,000	0.57s	1.25s	49.8 MB	69.5 MB	Many
100,000	4.82s	14.47s	490.8 MB	352.2 MB	Many
1,000,000	56.45s	296.42s	4,906 MB	494 MB	Many

Key insight: Many is faster but uses 10x more memory at 1M rows. Choose based on your constraints.

Features

Both Implementations

✅ get_historical_features with point-in-time correctness
✅ pull_latest_from_table_or_query for materialization
✅ pull_all_from_table_or_query for batch retrieval
✅ TTL support per FeatureView
✅ Auto-index creation during materialization
✅ Created timestamp tie-breaking

Many-Specific

✅ Universal test framework integration (DataSourceCreator)
✅ Flat document schema (easy to query directly)

One-Specific

✅ Chunked processing (bounded memory)
✅ Entity-filtered queries (doesn't load entire collection)
✅ Schema matches online store pattern

Configuration

Many

offline_store:
  type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_many.MongoDBOfflineStoreMany
  connection_string: mongodb://localhost:27017
  database: feast

One

offline_store:
  type: feast.infra.offline_stores.contrib.mongodb_offline_store.mongodb_one.MongoDBOfflineStoreOne
  connection_string: mongodb://localhost:27017
  database: feast
  collection: feature_history

File Structure

feast/infra/offline_stores/contrib/mongodb_offline_store/
├── __init__.py          # Shared DRIVER_METADATA
├── README.md            # Documentation with decision flowchart
├── mongodb_many.py      # MongoDBOfflineStoreMany, MongoDBSourceMany
└── mongodb_one.py       # MongoDBOfflineStoreOne, MongoDBSourceOne

tests/unit/infra/offline_stores/contrib/mongodb_offline_store/
├── test_many.py         # Unit tests for Many
├── test_one.py          # Unit tests for One
└── benchmark.py         # Performance comparison

tests/universal/feature_repos/universal/data_sources/
└── mongodb.py           # DataSourceCreators for universal tests

Testing

Unit Tests

pytest sdk/python/tests/unit/infra/offline_stores/contrib/mongodb_offline_store/ -v

Universal Integration Tests (Many only)

FEAST_LOCAL_ONLINE_CONTAINER=True pytest \
  sdk/python/tests/integration/offline_store/test_universal_historical_retrieval.py \
  --integration -k "MongoDBMany"

Benchmarks

pytest sdk/python/tests/unit/infra/offline_stores/contrib/mongodb_offline_store/benchmark.py \
  -v -s -k "test_summary_comparison"

Known Limitations

One + Universal Tests: The MongoDBOneDataSourceCreator is not registered for universal tests. The One schema requires entity/join key information at data creation time, but DataSourceCreator.create_data_source() doesn't receive entity definitions. See TODO in mongodb.py.
LoggingDestination: Neither implementation has feature logging support yet.
SavedDatasetStorage: Only Many has SavedDatasetMongoDBStorageMany.

Misc

Reviewers

Please pay attention to:

Schema design decisions (Many vs One trade-offs)
Index creation strategy
Universal test framework integration

- MongoDBSource: DataSource backed by a MongoDB collection, schema sampled via \ aggregation (default N=100) - MongoDBOfflineStoreConfig: connection_string + default database - MongoDBOfflineStore: delegates to ibis PIT join engine via in-memory memtable approach - SavedDatasetMongoDBStorage: persist training datasets to MongoDB - _build_data_source_reader/_build_data_source_writer closures capture config (connection_string, database) for MongoDB access Signed-off-by: Casey Clements <[email protected]>

- Update copyright headers to 2026 - Move mongodb_to_feast_value_type to feast/type_map.py, consistent with pg_type_to_feast_value_type and cb_columnar_type_to_feast_value_type - Add docstrings to MongoDBOptions.to_proto/from_proto, MongoDBSource class, and get_table_column_names_and_types - Replace dead 'assert name' with cast(str, ...) for type-checker safety - Add explanatory comment to validate() stub - Remove module-level warnings.simplefilter('once', RuntimeWarning), which was a process-wide side effect; per-call warnings.warn is enough - Convert all assert isinstance(data_source, MongoDBSource) guards to ValueError with descriptive messages in both public API methods and the reader/writer closures - Fix bug: add tz_aware=True to MongoClient in the writer closure, matching the reader, to ensure consistent timezone-aware datetime handling across read and write paths Signed-off-by: Casey Clements <[email protected]>

…reIbis and MongoDBOfflineStoreNative Signed-off-by: Casey Clements <[email protected]>

Signed-off-by: Casey Clements <[email protected]>

…mongo, skipping as natural. Signed-off-by: Casey Clements <[email protected]>

Signed-off-by: Casey Clements <[email protected]>

…ed_at tie-breaker in sort Signed-off-by: Casey Clements <[email protected]>

Signed-off-by: Casey Clements <[email protected]>

…tores Signed-off-by: Casey Clements <[email protected]>

- Eliminate -based PIT join which scaled poorly (O(n×m)) - Use single query to fetch all matching feature data - Batch entity_ids into chunks of 1000 for large queries - Flatten features subdoc with pd.json_normalize - Apply pd.merge_asof for efficient PIT join per FeatureView - Handle TTL filtering in pandas instead of MQL \ - Remove unused _ttl_to_ms and _build_ttl_gte_expr helpers Performance improvement: - Before: 10k rows in ~188s (53 rows/s) - After: 10k rows in ~7.4s (1,354 rows/s) - Now competitive with Ibis implementation Signed-off-by: Casey Clements <[email protected]>

…tity_df - Add CHUNK_SIZE (5000) for entity_df processing to bound memory usage - Extract _run_single helper function for processing each chunk - Add _chunk_dataframe generator for yielding DataFrame slices - Preserve original row ordering via _row_idx column - Exclude internal columns (prefixed with _) from entity key serialization - Concat chunk results and restore ordering at the end This allows processing arbitrarily large entity_df while keeping memory bounded by processing in 5000-row chunks. Signed-off-by: Casey Clements <[email protected]>

… sizes Performance optimizations: - Reuse MongoClient across chunks (was creating new client per chunk) - Increase CHUNK_SIZE from 5,000 to 50,000 rows - Increase MONGO_BATCH_SIZE from 1,000 to 10,000 entity_ids - Pass collection to _run_single instead of creating client each time - Make index creation idempotent (check for existing index) Results (100k rows): - Before: 21.7s - After: 5.2s (4.2x faster) Results (1M rows): - Before: 1664s (28 min) - After: 212s (3.5 min) (7.8x faster) Signed-off-by: Casey Clements <[email protected]>

The Native implementation now lives exclusively in mongodb_native.py with the single-collection schema. This removes the confusing duplicate that used the Ibis collection-per-FV schema. Signed-off-by: Casey Clements <[email protected]>

- Move MongoDBSource, MongoDBOptions, SavedDatasetMongoDBStorage into mongodb.py - Move _infer_python_type_str helper into mongodb.py - Update imports in tests and benchmarks - Remove mongodb_source.py This consolidates the collection-per-FV implementation into a single file, making the codebase easier to navigate. Signed-off-by: Casey Clements <[email protected]>

- Rename module: mongodb_offline_store/ → mongodb/ - Rename files: mongodb.py → mongodb_many.py, mongodb_native.py → mongodb_one.py Class renames: - MongoDBSource → MongoDBSourceMany - MongoDBOptions → MongoDBOptionsMany - SavedDatasetMongoDBStorage → SavedDatasetMongoDBStorageMany - MongoDBOfflineStoreIbis → MongoDBOfflineStoreMany - MongoDBOfflineStoreIbisConfig → MongoDBOfflineStoreManyConfig - MongoDBSourceNative → MongoDBSourceOne - MongoDBOfflineStoreNative → MongoDBOfflineStoreOne - MongoDBOfflineStoreNativeConfig → MongoDBOfflineStoreOneConfig - MongoDBNativeRetrievalJob → MongoDBOneRetrievalJob The One/Many naming reflects the core architectural difference: - One: Single shared collection for all FeatureViews - Many: One collection per FeatureView Signed-off-by: Casey Clements <[email protected]>

Signed-off-by: Casey Clements <[email protected]>

- Rename module: mongodb/ → mongodb_offline_store/ (follows naming convention) - Move tests to mongodb_offline_store/ subdirectory: - test_mongodb_offline_retrieval.py → mongodb_offline_store/test_many.py - test_mongodb_offline_retrieval_native.py → mongodb_offline_store/test_one.py - benchmark_mongodb_offline_stores.py → mongodb_offline_store/benchmark.py - Update all imports to use mongodb_offline_store path Signed-off-by: Casey Clements <[email protected]>

Signed-off-by: Casey Clements <[email protected]>

- Clarify that indexes should be on join keys + timestamp - Show example for compound join keys - Note that Many does not auto-create indexes Signed-off-by: Casey Clements <[email protected]>

- Add _ensure_index_many() function with module-level cache - Call during pull_latest_from_table_or_query (materialization) - Creates index on join_keys + timestamp + created_timestamp - Checks for existing index before creating - Update README to reflect auto-create behavior Signed-off-by: Casey Clements <[email protected]>

- Rename functions: _generate_ibis_data → _generate_many_data, etc. - Rename fixtures: ibis_config → many_config, native_config → one_config - Rename tests: test_scale_rows_ibis → test_scale_rows_many, etc. - Update all docstrings and print statements - Update summary comparison output format Signed-off-by: Casey Clements <[email protected]>

Documents: - Collection structure (one per FeatureView) - Index creation (auto-created during materialization) - Document schema (flat, top-level features) - Point-in-time join strategy (Ibis memtables) - Performance characteristics and memory considerations - When to use vs MongoDBOfflineStoreOne - Comparison table with One implementation Signed-off-by: Casey Clements <[email protected]>

Add missing documentation sections: - Feature Freshness Semantics: document-level freshness, not per-feature - Schema Evolution ('Feature Creep'): flexible schema implications - Notes: entity keys as native types, PIT correctness, TTL constraints Signed-off-by: Casey Clements <[email protected]>

Add DataSourceCreator implementations for MongoDB offline stores: - MongoDBManyDataSourceCreator: Fully functional, passes universal tests. Creates one collection per FeatureView with flat document schema. - MongoDBOneDataSourceCreator: Implementation exists but NOT registered. The One schema requires knowing join keys vs features at data creation time, but DataSourceCreator.create_data_source() doesn't receive entity definitions. See TODO in mongodb.py for details on required interface changes. Other changes: - Fix data_source_class_type path in mongodb_one.py (mongodb_native -> mongodb_one) - Improve datetime handling in mongodb_one.py for non-datetime columns - Add 'mongodb' marker to pytest.ini - Register MongoDBManyDataSourceCreator in repo_configuration.py Signed-off-by: Casey Clements <[email protected]>

devin-ai-integration

Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-03-20T20:56:56Z

sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb_one.py

+            # Get join keys (all columns except event_timestamp and internal columns)
+            entity_columns = [
+                c
+                for c in result.columns
+                if c != event_timestamp_col and not c.startswith("_")
+            ]


🔴 Entity columns inferred from entity_df columns instead of feature view definitions

In get_historical_features, entity columns used for entity_id serialization are derived from all non-timestamp, non-internal columns of entity_df (mongodb_one.py:669-673), rather than using get_expected_join_keys() from offline_utils like every other offline store in the codebase (BigQuery at bigquery.py:301, Couchbase at couchbase.py:174, Redshift at redshift.py:244, etc.). If entity_df contains any extra columns beyond join keys and event_timestamp (e.g., label columns, metadata), these get incorrectly included in the entity key serialization. The resulting _entity_id bytes won't match the entity_ids stored in MongoDB, causing ALL features to silently return as NULL with no error. This is particularly dangerous because the failure is silent — the query returns a valid DataFrame with the correct shape but all feature values are NULL.

Prompt for agents

In sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb_one.py, the _run_single function at line 668-673 should use the Feast-standard get_expected_join_keys utility to determine entity columns, instead of inferring them from entity_df columns. 1. Import get_expected_join_keys from feast.infra.offline_stores.offline_utils 2. In get_historical_features (around line 626), compute the expected join keys using: expected_join_keys = offline_utils.get_expected_join_keys(project, feature_views, registry) 3. Also add the standard validation: offline_utils.assert_expected_columns_in_entity_df(entity_schema, expected_join_keys, event_timestamp_col) 4. In _run_single, replace the entity_columns heuristic (lines 668-673) with: entity_columns = sorted(expected_join_keys) (the expected_join_keys variable is accessible as a closure variable) This matches the pattern used by all other offline stores (BigQuery, Redshift, Couchbase, Postgres, etc.) and ensures correct entity_id serialization even when entity_df contains extra non-join-key columns.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-03-20T20:56:57Z

sdk/python/feast/infra/offline_stores/contrib/mongodb_offline_store/mongodb_one.py

+    - ``created_at``: ingestion time
+    """
+
+    _index_initialized: bool = False


🟡 _index_initialized class boolean skips index creation for different MongoDB configs

MongoDBOfflineStoreOne._index_initialized is a class-level bool that, once set to True by _get_client_and_ensure_indexes, prevents index creation for ALL subsequent calls regardless of config. If the code is used with different RepoConfig values in the same process (different databases, collections, or connection strings), indexes are only created for the first config. Compare with mongodb_many.py:599-615 which uses a per-{db_name}.{collection_name} cache set — that approach is also imperfect (doesn't key on connection string) but much more granular than a single boolean.

Suggested change

_index_initialized: bool = False

_indexes_ensured: set = set()

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-03-20T20:56:58Z

sdk/python/tests/universal/feature_repos/universal/data_sources/mongodb.py

+    def _serialize_entity_key(self, row: pd.Series, join_keys: list[str]) -> bytes:
+        """Serialize entity key columns to bytes."""
+        entity_key = EntityKeyProto()
+        for key in join_keys:


🟡 _serialize_entity_key in data source creator doesn't sort join keys, causing entity_id mismatch

MongoDBOneDataSourceCreator._serialize_entity_key iterates join_keys without sorting (line 181: for key in join_keys:), but the query-side function _serialize_entity_key_from_row in mongodb_one.py:371 uses for key in sorted(join_keys). For compound join keys (e.g., ["user_id", "device_id"]), the entity_ids written during data ingestion may have keys in a different order than those generated at query time. Since entity key serialization is order-dependent, this produces different bytes and causes query mismatches (all features returned as NULL). Additionally, the isinstance(value, bool) check at line 191 is unreachable dead code because bool is a subclass of int in Python, so the isinstance(value, int) check at line 185 always matches first.

Suggested change

for key in join_keys:

for key in sorted(join_keys):

Was this helpful? React with 👍 or 👎 to provide feedback.

Signed-off-by: Casey Clements <[email protected]>

caseyclements added 29 commits March 20, 2026 16:36

Started work on full Mongo/MQL implementation. Kept MongoDBOfflineSto…

62695aa

…reIbis and MongoDBOfflineStoreNative Signed-off-by: Casey Clements <[email protected]>

refactor: rename alpha to preview, clarify MQL pipeline comments

812d03d

Signed-off-by: Casey Clements <[email protected]>

Added unit tests for offline store retrieval, requiring docker and py…

c3401ea

…mongo, skipping as natural. Signed-off-by: Casey Clements <[email protected]>

Added test of multiple feature views and compound join keys

ec2e7ba

Signed-off-by: Casey Clements <[email protected]>

Initial implementation of native single-collection offline store

a4d2886

Signed-off-by: Casey Clements <[email protected]>

Added DriverInfo to MongoDBClients

e9de6f3

Signed-off-by: Casey Clements <[email protected]>

Optimized MQL. Applied FV-level TTL

81d194c

Signed-off-by: Casey Clements <[email protected]>

filter TTL by relevant FVs only, cautiously reset df index; add creat…

ad85385

…ed_at tie-breaker in sort Signed-off-by: Casey Clements <[email protected]>

Updated docstrings

4d02feb

Signed-off-by: Casey Clements <[email protected]>

Lazy index creation via _get_client_and_ensure_indexes

8d86cdd

Signed-off-by: Casey Clements <[email protected]>

Add performance benchmarks comparing Ibis vs Native MongoDB offline s…

a1e3c93

…tores Signed-off-by: Casey Clements <[email protected]>

Add README.md documenting MongoDB offline store implementations

2c25494

Signed-off-by: Casey Clements <[email protected]>

Update docstring in benchmark.py

bae2648

Signed-off-by: Casey Clements <[email protected]>

Update README to show created_at tie-breaker in Many schema

e4c79bf

Signed-off-by: Casey Clements <[email protected]>

Update README index recommendations for Many implementation

548698b

- Clarify that indexes should be on join keys + timestamp - Show example for compound join keys - Note that Many does not auto-create indexes Signed-off-by: Casey Clements <[email protected]>

caseyclements requested a review from a team as a code owner March 20, 2026 20:48

caseyclements changed the title ~~Feast offline store intpython 297~~ feat: MongoDB offline stores Mar 20, 2026

devin-ai-integration bot reviewed Mar 20, 2026

View reviewed changes

Add .secrets.baseline

9dc9162

Signed-off-by: Casey Clements <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: MongoDB offline stores#6138

feat: MongoDB offline stores#6138
caseyclements wants to merge 30 commits intofeast-dev:masterfrom
caseyclements:FEAST-OfflineStore-INTPYTHON-297

caseyclements commented Mar 20, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot Mar 20, 2026

Uh oh!

devin-ai-integration bot Mar 20, 2026

Uh oh!

devin-ai-integration bot Mar 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	_index_initialized: bool = False
	_indexes_ensured: set = set()

Conversation

caseyclements commented Mar 20, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Summary

Motivation

Schema Comparison

Many Schema (one collection per FeatureView)

One Schema (single shared collection)

Performance Benchmarks

Features

Both Implementations

Many-Specific

One-Specific

Configuration

Many

One

File Structure

Testing

Unit Tests

Universal Integration Tests (Many only)

Benchmarks

Known Limitations

Misc

Reviewers

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

caseyclements commented Mar 20, 2026 •

edited by devin-ai-integration bot

Loading