# Benchmarking Python Feature Server

Here we provide tools for benchmarking the Python-based feature server with several online stores: Redis, AWS DynamoDB, GCP Datastore, Cassandra, and Astra DB. Follow the instructions below to reproduce the benchmarks.

Tested with: feast 0.25.1

## Prerequisites

You need to have the following installed:

  • Python 3.8+
  • Feast 0.25+
  • Docker
  • Docker Compose v2.x
  • Vegeta
  • parquet-tools

All these benchmarks are run on an EC2 instance (c5.4xlarge, 16 vCPU, 32 GiB memory) or a GCP GCE instance (c2-standard-16, 16 vCPU, 64 GiB memory), in the same region as the target online store.

Note: see here for details on how to provision the cloud instances to run the tests.

## Generate Data

For all of the following benchmarks, you'll need to generate the data using data_generator.py in the top-level directory of this repo: cd to that directory and run `python data_generator.py`.

## Redis

1. Apply feature definitions to create a Feast repo:

   ```shell
   cd python/feature_repos/redis
   feast apply
   ```

2. Deploy Redis & the feature servers using docker-compose:

   ```shell
   cd ../../docker/redis
   docker-compose up -d
   ```

If everything goes well, you should see an output like this:

```
Creating redis_redis_1 ... done
Creating redis_feast_1  ... done
Creating redis_feast_2  ... done
Creating redis_feast_3  ... done
Creating redis_feast_4  ... done
Creating redis_feast_5  ... done
Creating redis_feast_6  ... done
Creating redis_feast_7  ... done
Creating redis_feast_8  ... done
Creating redis_feast_9  ... done
Creating redis_feast_10 ... done
Creating redis_feast_11 ... done
Creating redis_feast_12 ... done
Creating redis_feast_13 ... done
Creating redis_feast_14 ... done
Creating redis_feast_15 ... done
Creating redis_feast_16 ... done
```
3. Materialize data to Redis:

   ```shell
   cd ../../feature_repos/redis
   # This is unfortunately necessary because inside docker the feature servers
   # resolve the Redis host name as `redis`, but since we're running
   # materialization from the shell, Redis is accessible on localhost:
   sed -i 's/redis:6379/localhost:6379/g' feature_store.yaml
   feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
   # Make sure to change this back, since it can interfere with the feature
   # servers if you run another docker-compose command later:
   sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml
   ```
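For reference, the value the sed commands toggle lives in the online_store section of feature_store.yaml; a minimal sketch (other keys omitted, so check the actual file) looks like:

```yaml
online_store:
  type: redis
  # swapped to "localhost:6379" for host-side commands, and back again after
  connection_string: "redis:6379"
```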
4. Check that the feature servers are working and have materialized data:

   ```shell
   cd ../../..
   parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
   ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an env variable with:

```shell
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475`)
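If you want to see what each stage of that pipeline does without a live parquet-tools run, you can feed it the sample table shown above:

```shell
# Demonstrate the TEST_ENTITY_IDS pipeline on the sample table above,
# standing in for live `parquet-tools show` output:
sample='+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |'
# head keeps the header plus first three rows, tail drops the header,
# sed strips the pipes, paste joins the remaining lines with commas:
TEST_ENTITY_IDS=$(echo "$sample" | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s)
echo "$TEST_ENTITY_IDS"
```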

Query the feature server with:

```shell
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the "values" field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```
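If you'd rather not eyeball the JSON for nulls, a quick programmatic check is possible; the payload below is a simplified stand-in for the real server response (which nests these arrays under a "results" list, so adapt the path accordingly):

```python
import json

# Simplified stand-in for a feature server response; the real payload nests
# these values deeper, so adjust the lookup path for actual output.
response = json.loads('{"values": [4475, 1551, 9889]}')

# A null in "values" means the entity was not materialized correctly.
has_null = any(v is None for v in response["values"])
print("contains nulls:", has_null)
```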
5. Run the benchmarks:

   ```shell
   cd python
   ./run-benchmark.sh
   ```

## AWS DynamoDB

For this benchmark, you'll need to have AWS credentials configured in ~/.aws/credentials.

1. Apply feature definitions to create a Feast repo:

   ```shell
   cd feature_repos/dynamo
   feast apply
   ```

2. Deploy the feature servers using docker-compose:

   ```shell
   cd ../../docker/dynamo
   docker-compose up -d
   ```

If everything goes well, you should see an output like this:

```
Creating dynamo_feast_1  ... done
Creating dynamo_feast_2  ... done
Creating dynamo_feast_3  ... done
Creating dynamo_feast_4  ... done
Creating dynamo_feast_5  ... done
Creating dynamo_feast_6  ... done
Creating dynamo_feast_7  ... done
Creating dynamo_feast_8  ... done
Creating dynamo_feast_9  ... done
Creating dynamo_feast_10 ... done
Creating dynamo_feast_11 ... done
Creating dynamo_feast_12 ... done
Creating dynamo_feast_13 ... done
Creating dynamo_feast_14 ... done
Creating dynamo_feast_15 ... done
Creating dynamo_feast_16 ... done
```
3. Materialize data to DynamoDB:

   ```shell
   cd ../../feature_repos/dynamo
   feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
   ```

4. Check that the feature servers are working and have materialized data:

   ```shell
   cd ../../..
   parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
   ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an env variable with:

```shell
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475`)

Query the feature server with:

```shell
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the "values" field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```
5. Run the benchmarks:

   ```shell
   cd python
   ./run-benchmark.sh
   ```

## GCP Datastore

For this benchmark, you need GCP credentials accessible. It is assumed here that they all live in ${HOME}/.config/gcloud, which will be made available to the docker containers running the feature servers. (Adjust as needed by inspecting the docker-compose.yml.)
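As a hypothetical sketch of what that mount looks like in docker/datastore/docker-compose.yml (the service name and container path here are assumptions; check the actual file):

```yaml
services:
  feast:
    volumes:
      # mount host gcloud credentials into the container (read-write)
      - ${HOME}/.config/gcloud:/root/.config/gcloud
```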

1. Apply feature definitions to create a Feast repo:

   ```shell
   cd feature_repos/datastore
   feast apply
   ```

2. Deploy the feature servers using docker-compose:

   ```shell
   cd ../../docker/datastore
   docker-compose up -d
   ```

If everything goes well, you should see an output like this:

```
Creating datastore_feast_1  ... done
Creating datastore_feast_2  ... done
Creating datastore_feast_3  ... done
Creating datastore_feast_4  ... done
Creating datastore_feast_5  ... done
Creating datastore_feast_6  ... done
Creating datastore_feast_7  ... done
Creating datastore_feast_8  ... done
Creating datastore_feast_9  ... done
Creating datastore_feast_10 ... done
Creating datastore_feast_11 ... done
Creating datastore_feast_12 ... done
Creating datastore_feast_13 ... done
Creating datastore_feast_14 ... done
Creating datastore_feast_15 ... done
Creating datastore_feast_16 ... done
```

Note: the Python google package requires not only that the credentials be accessible (in read-write mode, as can be seen in the Datastore docker-compose.yml), but also that the Google Cloud SDK be installed. For this reason there is an additional step in the Dockerfile for Datastore, which handles the installation. Reference.

3. Materialize data to Datastore:

   ```shell
   cd ../../feature_repos/datastore
   feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
   ```

4. Check that the feature servers are working and have materialized data:

   ```shell
   cd ../../..
   parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
   ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an env variable with:

```shell
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475`)

Query the feature server with:

```shell
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the "values" field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```
5. Run the benchmarks:

   ```shell
   cd python
   ./run-benchmark.sh
   ```

## Cassandra

This runs on a single-node Cassandra cluster running in Docker alongside the benchmarking containers.

1. Start the docker containers:

   ```shell
   cd docker/cassandra
   docker-compose up -d
   ```

If everything goes well, you should see an output like this:

```
 ⠿ Network cassandra_default        Created       0.0s
 ⠿ Container cassandra-cassandra-1  Started       0.6s
 ⠿ Container cassandra-feast-16     Started       1.0s
 ⠿ Container cassandra-feast-1      Started       1.5s
 ⠿ Container cassandra-feast-8      Started       3.0s
 ⠿ Container cassandra-feast-4      Started       2.4s
 ⠿ Container cassandra-feast-2      Started       2.4s
 ⠿ Container cassandra-feast-14     Started       2.2s
 ⠿ Container cassandra-feast-5      Started       1.5s
 ⠿ Container cassandra-feast-3      Started       2.8s
 ⠿ Container cassandra-feast-13     Started       0.8s
 ⠿ Container cassandra-feast-9      Started       1.3s
 ⠿ Container cassandra-feast-11     Started       1.7s
 ⠿ Container cassandra-feast-15     Started       0.9s
 ⠿ Container cassandra-feast-6      Started       2.8s
 ⠿ Container cassandra-feast-12     Started       2.0s
 ⠿ Container cassandra-feast-7      Started       2.5s
 ⠿ Container cassandra-feast-10     Started       1.8s
```

Wait about 60-90 seconds for Cassandra to fully start, then proceed (if Cassandra is not ready yet, the next command will error out and you can retry it a little later).

2. Create the destination keyspace in Cassandra; check the output of this command to make sure feast_test is now listed:

   ```shell
   docker exec -it cassandra-cassandra-1 cqlsh -e \
     "CREATE KEYSPACE feast_test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}; DESC KEYSPACES;"
   ```

3. From the host machine, provision the feature store:

   ```shell
   cd ../../feature_repos/cassandra/

   # This is unfortunately necessary because inside docker the feature servers
   # resolve the Cassandra host name as `cassandra`, but since we're running
   # the commands from the shell, Cassandra is accessible on localhost:
   sed -i 's/- cassandra/- localhost/g' feature_store.yaml
   feast apply
   # Make sure to change this back, since it can interfere with the feature
   # servers if you run another docker-compose command later:
   sed -i 's/- localhost/- cassandra/g' feature_store.yaml
   ```
4. Similarly, materialize from the host machine:

   ```shell
   # As in the previous step, swap the host name to localhost before
   # materializing from the shell:
   sed -i 's/- cassandra/- localhost/g' feature_store.yaml
   feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
   # ...and change it back afterwards so the feature servers keep working:
   sed -i 's/- localhost/- cassandra/g' feature_store.yaml
   ```

4b. A workaround to make the Dockerized Feast work

The Docker containers have a copy of the registry directory, including data/registry.db. But the image is built before the apply step above (this is unavoidable if we want to create the keyspace and have the Cassandra part of the docker-compose), so the Dockerized Feast instances do not have the updated registry.db. For the time being, a workaround is as follows:

```shell
# Copy the updated registry into each of the 16 feature-server containers:
for i in $(seq 1 16); do
  docker cp data/registry.db cassandra-feast-$i:/feature_repo/data/registry.db
done

cd ../../docker/cassandra/
docker-compose restart
cd ../../feature_repos/cassandra/
```
5. Check that the feature servers are working and have materialized data:

   ```shell
   cd ../../..
   parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
   ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an env variable with:

```shell
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475`)

Query the feature server with:

```shell
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the "values" field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```
6. Run the benchmarks:

   ```shell
   cd python
   ./run-benchmark.sh
   ```

## Astra DB

Ensure you have an Astra DB instance in the same AWS region as your benchmarking client. To connect to it, you need the Client ID and Client Secret from a database token, as well as the "secure connect bundle" zip file, which should be placed inside the python/feature_repos/astra_db/ directory.

Adjust the file feature_store.yaml in that directory to reflect the Client ID, Client Secret, database keyspace name, AWS region name, and secure-connect-bundle filename.

Note: in order to share the same feature_store.yaml between the Dockerized Feast instances and the one on the host machine, put the secure connect bundle in the python/feature_repos/astra_db/ directory itself and refer to it as ./secure-connect-DATABASENAME.zip (i.e. with a relative path).
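A hypothetical sketch of the fields to fill in (key names follow the Cassandra/Astra online store configuration; the angle-bracket placeholders and keyspace value are yours to replace, and the repo's own feature_store.yaml is authoritative):

```yaml
online_store:
  type: cassandra
  # relative path, so both the host and the containers can resolve it:
  secure_bundle_path: ./secure-connect-DATABASENAME.zip
  keyspace: <your keyspace>
  username: <Client ID>
  password: <Client Secret>
```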

1. Apply feature definitions to create a Feast repo:

   ```shell
   cd feature_repos/astra_db
   feast apply
   ```

2. Deploy the feature servers using docker-compose:

   ```shell
   cd ../../docker/astra_db
   docker-compose up -d
   ```

If everything goes well, you should see an output like this:

```
 ⠿ Network astra_db_default     Created        0.0s
 ⠿ Container astra_db-feast-1   Started        2.7s
 ⠿ Container astra_db-feast-16  Started        2.8s
 ⠿ Container astra_db-feast-3   Started        2.4s
 ⠿ Container astra_db-feast-5   Started        1.4s
 ⠿ Container astra_db-feast-11  Started        1.8s
 ⠿ Container astra_db-feast-4   Started        1.6s
 ⠿ Container astra_db-feast-2   Started        1.2s
 ⠿ Container astra_db-feast-6   Started        0.8s
 ⠿ Container astra_db-feast-12  Started        2.1s
 ⠿ Container astra_db-feast-7   Started        3.0s
 ⠿ Container astra_db-feast-8   Started        0.8s
 ⠿ Container astra_db-feast-10  Started        2.8s
 ⠿ Container astra_db-feast-14  Started        1.2s
 ⠿ Container astra_db-feast-15  Started        2.9s
 ⠿ Container astra_db-feast-13  Started        1.8s
 ⠿ Container astra_db-feast-9   Started        2.3s
```
3. Materialize data to Astra DB:

   ```shell
   cd ../../feature_repos/astra_db
   feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
   ```

4. Check that the feature servers are working and have materialized data:

   ```shell
   cd ../../..
   parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
   ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an env variable with:

```shell
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475`)

Query the feature server with:

```shell
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the "values" field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```
5. Run the benchmarks:

   ```shell
   cd python
   ./run-benchmark.sh
   ```