Benchmarking for data wrangling frameworks

Currently supported APIs: Pandas, Modin, Polars, DuckDB

Operations Tested:

read csv
read parquet
add column of np arrays
get date range
filter rows based on random value
groupby
merge
concatenation
fill/drop nulls
create dataframe
save to csv
save to parquet
convert to pandas/numpy/arrow (duckdb only)

Tracked statistics of an operation:

Peak memory utilization
Time consumption
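
As a rough illustration of how these two statistics can be collected for a single call, the sketch below uses only the standard library. The helper name `measure` is hypothetical, and `tracemalloc` only sees allocations made through Python's own allocator, so the project's profiling may rely on a different tool and report different numbers.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, seconds, peak_bytes).

    Hypothetical helper for illustration, not the project's profiler.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always collect the numbers and stop tracing, even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


# Example: profile a list comprehension standing in for a dataframe operation.
result, seconds, peak = measure(lambda n: [i * 2 for i in range(n)], 100_000)
print(f"time: {seconds:.4f} s, peak memory: {peak / 1e6:.2f} MB")
```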

Usage

Make sure the data being read has at least 1 string column and 1 float column
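
If no suitable dataset is at hand, a small test file that meets this requirement can be generated with the standard library. This is only a sketch: the function name `write_sample_csv` and the column names `label` and `value` are placeholders, not names the benchmark expects.

```python
import csv
import random
import string


def write_sample_csv(path, rows=1000, seed=0):
    """Write a CSV with one string column and one float column."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "value"])  # placeholder column names
        for _ in range(rows):
            label = "".join(rng.choices(string.ascii_lowercase, k=8))
            writer.writerow([label, rng.uniform(0.0, 100.0)])


# Usage (writes into the current directory):
# write_sample_csv("sample.csv", rows=100_000)
```

The resulting file can then be passed to the benchmark through --data_path.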

Run with Docker (Recommended)

1. docker build -t [image-name/tag] .
2. docker run -v [local/data/storage/dir]:[/container/data/dir] [image-name] \
   --data_path [/container/data/dir/data.csv] \
   --save_dir [/container/data/dir] \
   --iterations [number of times to run benchmarking] \
   --frameworks [supported framework names, e.g. pandas modin polars]

To access the output of the tests from the host, make sure that save_dir is the same as the directory mounted in the container.

Run on venv/conda

1. pip install -r requirements.txt
2. python benchmark.py --data_path [data/dir/file.csv]

Check here for all arguments
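
The flags shown in the Docker command above can be pictured as an argparse parser along the following lines. This is a sketch, not the parser in benchmark.py: the `build_parser` name, the defaults, the help strings, and the inclusion of `duckdb` among the choices are assumptions. Only the flag names come from the commands above.

```python
import argparse

# Lowercase framework names; assumed to mirror the supported APIs.
FRAMEWORKS = ["pandas", "modin", "polars", "duckdb"]


def build_parser():
    parser = argparse.ArgumentParser(
        description="Benchmark data wrangling frameworks"
    )
    parser.add_argument("--data_path", required=True,
                        help="CSV file to benchmark against")
    parser.add_argument("--save_dir", default=".",
                        help="directory for results (assumed default)")
    parser.add_argument("--iterations", type=int, default=1,
                        help="number of benchmark runs (assumed default)")
    parser.add_argument("--frameworks", nargs="+", choices=FRAMEWORKS,
                        default=FRAMEWORKS,
                        help="frameworks to test (assumed default: all)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--data_path", "data/file.csv", "--frameworks", "pandas", "polars"]
)
print(args.data_path, args.iterations, args.frameworks)
```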

Further Testing

To add new operations for all frameworks:

Go to operations.py
Underneath the PerformanceBenchmark base class, add an abstract method for your operation
Add the same method to all subclasses, with the corresponding framework-specific functionality underneath
Pass the operation name, the method, and any arguments to the get_operation_stat method, and place that call inside the run_operations method of the base class
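
The steps above can be sketched roughly as follows. The class and method names `PerformanceBenchmark`, `get_operation_stat`, and `run_operations` come from this README, but their signatures and bodies here are simplified assumptions, and `sort_values` and `ListBench` are hypothetical names used only for illustration.

```python
import time
from abc import ABC, abstractmethod


class PerformanceBenchmark(ABC):
    """Simplified stand-in for the base class in operations.py."""

    def __init__(self, data):
        self.data = data
        self.stats = {}

    def get_operation_stat(self, name, method, *args):
        # Assumed behaviour: time the call and record it under the name.
        start = time.perf_counter()
        method(*args)
        elapsed = time.perf_counter() - start
        self.stats[name] = elapsed
        return elapsed

    @abstractmethod
    def sort_values(self, column):
        """New operation: every subclass must implement it."""

    def run_operations(self):
        total = 0.0
        # Register the new operation alongside the existing ones.
        total += self.get_operation_stat("sort_values", self.sort_values, "value")
        return total


class ListBench(PerformanceBenchmark):
    """Toy subclass on a list of dicts; a real one would wrap a framework."""

    def sort_values(self, column):
        return sorted(self.data, key=lambda row: row[column])


bench = ListBench([{"value": 3.0}, {"value": 1.0}])
total = bench.run_operations()
print(f"sort_values took {total:.6f} s")
```

Because the method is abstract, a subclass that forgets to implement it cannot be instantiated, which surfaces the omission before any benchmark runs.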

To only add operations for a specific framework:

Define an operation within the framework's class with the @profile decorator
Within the framework's class, define a run_operations method
Structure the method to call your operations plus all common operations (via super().run_operations())
Return the total time taken to run operations
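
A minimal sketch of this pattern is shown below. The `profile` decorator here is a stand-in that simply returns the elapsed seconds of the wrapped call. The project's real @profile decorator may behave differently, and `ExampleBench`, `groupby`, and `to_tuples` are hypothetical names used only for illustration.

```python
import functools
import time


def profile(method):
    """Stand-in decorator: returns the elapsed seconds of the wrapped call."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        method(self, *args, **kwargs)
        return time.perf_counter() - start
    return wrapper


class PerformanceBenchmark:
    """Simplified stand-in for the base class with one common operation."""

    def __init__(self, data):
        self.data = data

    @profile
    def groupby(self):
        return {value: self.data.count(value) for value in set(self.data)}

    def run_operations(self):
        return self.groupby()


class ExampleBench(PerformanceBenchmark):
    @profile
    def to_tuples(self):
        # Framework-specific operation (placeholder logic).
        return [(value,) for value in self.data]

    def run_operations(self):
        total = super().run_operations()  # all common operations
        total += self.to_tuples()         # plus the framework-specific one
        return total


total = ExampleBench([1, 2, 2, 3]).run_operations()
print(f"total time: {total:.6f} s")
```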

To add new frameworks to test:

Go to operations.py
Create a class named after your framework followed by Bench, for example: class FrameworkBench:
Inherit this class from PerformanceBenchmark
Add all the methods used in run_operations method
Include framework specific functionality for each method
Add framework name in lowercase as one of the choices here
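
Put together, a new framework class might look like the sketch below. The base class is a reduced stand-in with only two of the tested operations, the "framework" is plain Python lists, and the `BENCHMARKS` registry is an assumption about how a lowercase name could be mapped to its class. The real base class defines more methods, all of which a subclass would need to implement.

```python
import random
from abc import ABC, abstractmethod


class PerformanceBenchmark(ABC):
    """Reduced stand-in for the base class in operations.py."""

    @abstractmethod
    def create_dataframe(self, n_rows): ...

    @abstractmethod
    def filter_rows(self, threshold): ...

    def run_operations(self):
        self.create_dataframe(1_000)
        self.filter_rows(0.5)


class FrameworkBench(PerformanceBenchmark):
    """Hypothetical framework backed by plain Python lists."""

    def create_dataframe(self, n_rows):
        rng = random.Random(0)
        self.frame = [{"value": rng.random()} for _ in range(n_rows)]

    def filter_rows(self, threshold):
        self.filtered = [row for row in self.frame if row["value"] > threshold]


# Assumed registry keyed by the lowercase name passed to --frameworks.
BENCHMARKS = {"framework": FrameworkBench}

bench = BENCHMARKS["framework"]()
bench.run_operations()
print(len(bench.frame), len(bench.filtered))
```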

