Here we provide tools for benchmarking the Python-based feature server with several online stores: Redis, AWS DynamoDB, GCP Datastore, Cassandra, and Astra DB. Follow the instructions below to reproduce the benchmarks.
Tested with: feast 0.25.1
You need to have the following installed:
- Python 3.8+
- Feast 0.25+
- Docker
- Docker Compose v2.x
- Vegeta
- parquet-tools
All these benchmarks are run on an EC2 instance (c5.4xlarge, 16 vCPU, 32 GiB memory) or a GCP GCE instance (c2-standard-16, 16 vCPU, 64 GiB memory), in the same region as the target online store.
Note: see here for details on how to provision the cloud instances to run the tests.
For all of the following benchmarks, you'll need to generate the data using data_generator.py in the top-level directory of this repo: cd to that directory and run python data_generator.py.
- Apply feature definitions to create a Feast repo.
cd python/feature_repos/redis
feast apply
- Deploy Redis & feature servers using docker-compose
cd ../../docker/redis
docker-compose up -d
If everything goes well, you should see an output like this:
Creating redis_redis_1 ... done
Creating redis_feast_1 ... done
Creating redis_feast_2 ... done
Creating redis_feast_3 ... done
Creating redis_feast_4 ... done
Creating redis_feast_5 ... done
Creating redis_feast_6 ... done
Creating redis_feast_7 ... done
Creating redis_feast_8 ... done
Creating redis_feast_9 ... done
Creating redis_feast_10 ... done
Creating redis_feast_11 ... done
Creating redis_feast_12 ... done
Creating redis_feast_13 ... done
Creating redis_feast_14 ... done
Creating redis_feast_15 ... done
Creating redis_feast_16 ... done
- Materialize data to Redis
cd ../../feature_repos/redis
# This is unfortunately necessary: inside Docker the feature servers resolve the
# Redis host name as `redis`, but since we're running materialization from the
# host shell, Redis is reachable at localhost:
sed -i 's/redis:6379/localhost:6379/g' feature_store.yaml
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml
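The end timestamp passed to `feast materialize-incremental` above is simply the current UTC time in ISO-8601 form; a quick sanity check of what `date` produces (the grep pattern below is our own addition, not part of the workflow):

```shell
# Build the UTC end-timestamp exactly as the materialize command above does
END_TS=$(date -u +"%Y-%m-%dT%H:%M:%S")
echo "$END_TS"

# Hypothetical sanity check: confirm it matches the ISO-8601 shape Feast expects
if echo "$END_TS" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$'; then
    echo "timestamp OK"
fi
```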
- Check that the feature servers are working and have materialized data
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
This should return something like this:
+----------+
| entity |
|----------|
| 94 |
| 1992 |
| 4475 |
Put these numbers into an env variable with:
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
(which should output something like 94 , 1992 , 4475 )
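To see what that pipeline actually does, here it is run against a hand-written copy of the parquet-tools table (the temp file is our own stand-in for the real output):

```shell
# Stand-in for the first six lines of the parquet-tools output shown above
cat > /tmp/entity_table.txt <<'EOF'
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
EOF

# Same pipeline as above: keep the three data rows, strip pipes, join with commas
TEST_ENTITY_IDS=$(cat /tmp/entity_table.txt | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s)
echo "$TEST_ENTITY_IDS"
```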
Query the feature server with
curl -X POST \
"http://127.0.0.1:6566/get-online-features" \
-H "accept: application/json" \
-d "{
\"feature_service\": \"feature_service_0\",
\"entities\": {
\"entity\": [$TEST_ENTITY_IDS]
}
}" | jq
In the output, make sure that the "values" fields contain no null values. It should look something like this:
{
"values": [
4475,
1551,
9889,
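The null check can be automated with a plain grep; shown here against a hypothetical saved response standing in for the real curl output (with jq available, a filter like `[.. | nulls] | length` would be an alternative):

```shell
# Hypothetical saved response standing in for the real curl output
cat > /tmp/response.json <<'EOF'
{"values": [4475, 1551, 9889]}
EOF

# Fail loudly if a literal null appears anywhere in the response
if grep -qw 'null' /tmp/response.json; then
    echo "WARNING: response contains nulls - materialization likely incomplete"
else
    echo "no nulls found"
fi
```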
- Run Benchmarks
cd python
./run-benchmark.sh
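run-benchmark.sh drives the load with Vegeta; the shape of a Vegeta target for this endpoint looks roughly like the sketch below (the file names and rates are illustrative, not what the script actually uses):

```shell
TEST_ENTITY_IDS="94,1992,4475"   # from the verification step above

# Request body, same shape as the curl example
cat > /tmp/payload.json <<EOF
{"feature_service": "feature_service_0", "entities": {"entity": [$TEST_ENTITY_IDS]}}
EOF

# Vegeta target file: HTTP verb and URL, then an @-reference to the body file
cat > /tmp/targets.txt <<'EOF'
POST http://127.0.0.1:6566/get-online-features
@/tmp/payload.json
EOF

cat /tmp/targets.txt
# With the servers up, you would then run something like:
#   vegeta attack -targets=/tmp/targets.txt -rate=100 -duration=30s | vegeta report
```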
For this benchmark, you'll need to have AWS credentials configured in ~/.aws/credentials.
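If you haven't configured them yet, the credentials file follows the standard AWS INI layout; an example with placeholder values (written to a temp path here so as not to touch a real ~/.aws/credentials):

```shell
# Example ~/.aws/credentials layout with placeholder values
cat > /tmp/aws_credentials.example <<'EOF'
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
EOF
cat /tmp/aws_credentials.example
```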
- Apply feature definitions to create a Feast repo.
cd feature_repos/dynamo
feast apply
- Deploy feature servers using docker-compose
cd ../../docker/dynamo
docker-compose up -d
If everything goes well, you should see an output like this:
Creating dynamo_feast_1 ... done
Creating dynamo_feast_2 ... done
Creating dynamo_feast_3 ... done
Creating dynamo_feast_4 ... done
Creating dynamo_feast_5 ... done
Creating dynamo_feast_6 ... done
Creating dynamo_feast_7 ... done
Creating dynamo_feast_8 ... done
Creating dynamo_feast_9 ... done
Creating dynamo_feast_10 ... done
Creating dynamo_feast_11 ... done
Creating dynamo_feast_12 ... done
Creating dynamo_feast_13 ... done
Creating dynamo_feast_14 ... done
Creating dynamo_feast_15 ... done
Creating dynamo_feast_16 ... done
- Materialize data to DynamoDB
cd ../../feature_repos/dynamo
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
- Check that the feature servers are working and have materialized data
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
This should return something like this:
+----------+
| entity |
|----------|
| 94 |
| 1992 |
| 4475 |
Put these numbers into an env variable with:
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
(which should output something like 94 , 1992 , 4475 )
Query the feature server with
curl -X POST \
"http://127.0.0.1:6566/get-online-features" \
-H "accept: application/json" \
-d "{
\"feature_service\": \"feature_service_0\",
\"entities\": {
\"entity\": [$TEST_ENTITY_IDS]
}
}" | jq
In the output, make sure that the "values" fields contain no null values. It should look something like this:
{
"values": [
4475,
1551,
9889,
- Run Benchmarks
cd python
./run-benchmark.sh
For this benchmark, you need GCP credentials to be accessible. It is assumed here that they all live in
${HOME}/.config/gcloud, which will be made available to the Docker containers running
the feature servers. (Adjust as needed by inspecting the docker-compose.yml.)
- Apply feature definitions to create a Feast repo.
cd feature_repos/datastore
feast apply
- Deploy feature servers using docker-compose
cd ../../docker/datastore
docker-compose up -d
If everything goes well, you should see an output like this:
Creating datastore_feast_1 ... done
Creating datastore_feast_2 ... done
Creating datastore_feast_3 ... done
Creating datastore_feast_4 ... done
Creating datastore_feast_5 ... done
Creating datastore_feast_6 ... done
Creating datastore_feast_7 ... done
Creating datastore_feast_8 ... done
Creating datastore_feast_9 ... done
Creating datastore_feast_10 ... done
Creating datastore_feast_11 ... done
Creating datastore_feast_12 ... done
Creating datastore_feast_13 ... done
Creating datastore_feast_14 ... done
Creating datastore_feast_15 ... done
Creating datastore_feast_16 ... done
Note: The Python google package requires not only the credentials to be accessible (in read-write mode, as can be seen in the Datastore docker-compose.yml), but also the Google Cloud SDK to be installed. For this reason there is an additional step in the Dockerfile for Datastore, which handles the installation. Reference.
- Materialize data to Datastore
cd ../../feature_repos/datastore
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
- Check that the feature servers are working and have materialized data
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
This should return something like this:
+----------+
| entity |
|----------|
| 94 |
| 1992 |
| 4475 |
Put these numbers into an env variable with:
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
(which should output something like 94 , 1992 , 4475 )
Query the feature server with
curl -X POST \
"http://127.0.0.1:6566/get-online-features" \
-H "accept: application/json" \
-d "{
\"feature_service\": \"feature_service_0\",
\"entities\": {
\"entity\": [$TEST_ENTITY_IDS]
}
}" | jq
In the output, make sure that the "values" fields contain no null values. It should look something like this:
{
"values": [
4475,
1551,
9889,
- Run Benchmarks
cd python
./run-benchmark.sh
This runs on a single-node Cassandra cluster running in Docker alongside the benchmarking containers.
- Start the docker containers:
cd docker/cassandra
docker-compose up -d
If everything goes well, you should see an output like this:
⠿ Network cassandra_default Created 0.0s
⠿ Container cassandra-cassandra-1 Started 0.6s
⠿ Container cassandra-feast-16 Started 1.0s
⠿ Container cassandra-feast-1 Started 1.5s
⠿ Container cassandra-feast-8 Started 3.0s
⠿ Container cassandra-feast-4 Started 2.4s
⠿ Container cassandra-feast-2 Started 2.4s
⠿ Container cassandra-feast-14 Started 2.2s
⠿ Container cassandra-feast-5 Started 1.5s
⠿ Container cassandra-feast-3 Started 2.8s
⠿ Container cassandra-feast-13 Started 0.8s
⠿ Container cassandra-feast-9 Started 1.3s
⠿ Container cassandra-feast-11 Started 1.7s
⠿ Container cassandra-feast-15 Started 0.9s
⠿ Container cassandra-feast-6 Started 2.8s
⠿ Container cassandra-feast-12 Started 2.0s
⠿ Container cassandra-feast-7 Started 2.5s
⠿ Container cassandra-feast-10 Started 1.8s
Wait about 60-90 seconds for Cassandra to fully start, then proceed (if it is not ready yet, the next command will error and you can retry it a little later).
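Instead of a fixed wait, you can poll until Cassandra answers. A sketch follows; the `probe` function is a runnable stand-in, and in the real setup it would wrap the `docker exec ... cqlsh` command from the next step:

```shell
# Stand-in probe so this sketch runs anywhere; in the real setup use:
#   probe() { docker exec cassandra-cassandra-1 cqlsh -e "DESC KEYSPACES"; }
probe() { true; }

for i in $(seq 1 30); do
    if probe >/dev/null 2>&1; then
        echo "cassandra is ready"
        break
    fi
    echo "not ready yet, retrying..."
    sleep 5
done
```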
- Create the destination keyspace in Cassandra. Check the output of this command to make sure feast_test is now listed:
docker exec -it cassandra-cassandra-1 cqlsh -e \
"CREATE KEYSPACE feast_test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}; DESC KEYSPACES;"
- From the host machine, provision the feature store:
cd ../../feature_repos/cassandra/
# This is unfortunately necessary: inside Docker the feature servers resolve the
# Cassandra host name as `cassandra`, but since we're running `feast apply` from the
# host shell, Cassandra is reachable at localhost:
sed -i 's/- cassandra/- localhost/g' feature_store.yaml
feast apply
# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/- localhost/- cassandra/g' feature_store.yaml
- Similarly, materialize from the host machine:
# This is unfortunately necessary: inside Docker the feature servers resolve the
# Cassandra host name as `cassandra`, but since we're running materialization from the
# host shell, Cassandra is reachable at localhost:
sed -i 's/- cassandra/- localhost/g' feature_store.yaml
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/- localhost/- cassandra/g' feature_store.yaml
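Since a failed `feast` command would leave feature_store.yaml pointing at localhost, a shell trap can make the restore unconditional. A sketch on a throwaway copy of the file (the host line mimics this repo's layout but is re-created here, so nothing real is touched):

```shell
# Throwaway copy with the same host entry as feature_repos/cassandra/feature_store.yaml
printf 'hosts:\n    - cassandra\n' > /tmp/fs.yaml

# Restore on ANY shell exit, so an aborted materialization can't leave `localhost` behind
trap 'sed -i "s/- localhost/- cassandra/g" /tmp/fs.yaml' EXIT

sed -i 's/- cassandra/- localhost/g' /tmp/fs.yaml
grep -q 'localhost' /tmp/fs.yaml && echo "host swapped for the duration of this shell"
```

Note that BSD/macOS sed requires `sed -i ''` instead of `sed -i`; the GNU form is used here to match the commands above.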
3b. A workaround to make the Dockerized Feast instances work
The Docker containers have a copy of the registry directory, including data/registry.db.
But the image is built before the apply step above (this is unavoidable if we want
to create the keyspace and have Cassandra be part of the docker-compose setup),
so the Dockerized Feast instances do not have the updated registry.db. For the time being,
the workaround is as follows:
for i in $(seq 1 16); do
  docker cp data/registry.db "cassandra-feast-${i}:/feature_repo/data/registry.db"
done
cd ../../docker/cassandra/
docker-compose restart
cd ../../feature_repos/cassandra/
- Check that the feature servers are working and have materialized data
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
This should return something like this:
+----------+
| entity |
|----------|
| 94 |
| 1992 |
| 4475 |
Put these numbers into an env variable with:
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
(which should output something like 94 , 1992 , 4475 )
Query the feature server with
curl -X POST \
"http://127.0.0.1:6566/get-online-features" \
-H "accept: application/json" \
-d "{
\"feature_service\": \"feature_service_0\",
\"entities\": {
\"entity\": [$TEST_ENTITY_IDS]
}
}" | jq
In the output, make sure that the "values" fields contain no null values. It should look something like this:
{
"values": [
4475,
1551,
9889,
- Run Benchmarks
cd python
./run-benchmark.sh
Ensure you have an Astra DB instance in the same AWS region as your benchmarking
client. To connect to it you need the Client ID and the Client Secret from a
database token, as well as the "secure connect bundle" zip file, which should
be placed in the python/feature_repos/astra_db/ directory.
Adjust the feature_store.yaml file in that directory to reflect the Client ID, Client
Secret, database keyspace name, AWS region name, and secure-connect-bundle filename.
Note: in order to share the same feature_store.yaml between the Dockerized
Feast instances and the one on the host machine, put the secure connect bundle
in the python/feature_repos/astra_db/ directory itself and refer to it as
./secure-connect-DATABASENAME.zip (i.e. with a relative path).
- Apply feature definitions to create a Feast repo.
cd feature_repos/astra_db
feast apply
- Deploy feature servers using docker-compose
cd ../../docker/astra_db
docker-compose up -d
If everything goes well, you should see an output like this:
⠿ Network astra_db_default Created 0.0s
⠿ Container astra_db-feast-1 Started 2.7s
⠿ Container astra_db-feast-16 Started 2.8s
⠿ Container astra_db-feast-3 Started 2.4s
⠿ Container astra_db-feast-5 Started 1.4s
⠿ Container astra_db-feast-11 Started 1.8s
⠿ Container astra_db-feast-4 Started 1.6s
⠿ Container astra_db-feast-2 Started 1.2s
⠿ Container astra_db-feast-6 Started 0.8s
⠿ Container astra_db-feast-12 Started 2.1s
⠿ Container astra_db-feast-7 Started 3.0s
⠿ Container astra_db-feast-8 Started 0.8s
⠿ Container astra_db-feast-10 Started 2.8s
⠿ Container astra_db-feast-14 Started 1.2s
⠿ Container astra_db-feast-15 Started 2.9s
⠿ Container astra_db-feast-13 Started 1.8s
⠿ Container astra_db-feast-9 Started 2.3s
- Materialize data to Astra DB
cd ../../feature_repos/astra_db
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
- Check that the feature servers are working and have materialized data
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
This should return something like this:
+----------+
| entity |
|----------|
| 94 |
| 1992 |
| 4475 |
Put these numbers into an env variable with:
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
(which should output something like 94 , 1992 , 4475 )
Query the feature server with
curl -X POST \
"http://127.0.0.1:6566/get-online-features" \
-H "accept: application/json" \
-d "{
\"feature_service\": \"feature_service_0\",
\"entities\": {
\"entity\": [$TEST_ENTITY_IDS]
}
}" | jq
In the output, make sure that the "values" fields contain no null values. It should look something like this:
{
"values": [
4475,
1551,
9889,
- Run Benchmarks
cd python
./run-benchmark.sh