muneeb-ds/data_benchmark

Benchmarking for data wrangling frameworks

Currently supported APIs: Pandas, Modin, Polars, DuckDB

Operations Tested:

  • read csv
  • read parquet
  • add column of np arrays
  • get date range
  • filter rows based on random value
  • groupby
  • merge
  • concatenation
  • fill/drop nulls
  • create dataframe
  • save to csv
  • save to parquet
  • convert to pandas/numpy/arrow (duckdb only)

Tracked statistics of an operation:

  • Peak memory utilization
  • Time consumption

Usage

Make sure the data being read has at least one string column and one float column.

Run with Docker (Recommended)

1. docker build -t [image-name/tag] .
2. docker run -v [local/data/storage/dir]:[/container/data/dir] [image-name] \
   --data_path [/container/data/dir/data.csv] \
   --save_dir [/container/data/dir] \
   --iterations [number of times to run benchmarking] \
   --frameworks [supported framework names, e.g. pandas modin polars]

To access the test output from the host, make sure that save_dir is the same directory that is mounted into the container.

Run on venv/conda

1. pip install -r requirements.txt
2. python benchmark.py --data_path [data/dir/file.csv]

Check here for all arguments

Further Testing

To add new operations for all frameworks:

  1. Go to operations.py
  2. Underneath the PerformanceBenchmark base class, add an abstract method for your operation
  3. Add the same method to all subclasses, with the corresponding framework-specific functionality in each
  4. Pass the operation name, the method and any args to the get_operation_stat method, and make this call inside the run_operations method of the base class
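
The steps above can be sketched roughly as follows. This is a simplified stand-in, not the code in operations.py: the signature of get_operation_stat, the stats dictionary, the sort_values operation and the toy ListBench subclass are all assumptions made for illustration.

```python
import time
from abc import ABC, abstractmethod


class PerformanceBenchmark(ABC):
    """Simplified stand-in for the base class in operations.py."""

    def __init__(self):
        self.stats = {}

    def get_operation_stat(self, name, method, *args):
        # Assumed signature: operation name, bound method, then its arguments.
        start = time.perf_counter()
        method(*args)
        self.stats[name] = time.perf_counter() - start
        return self.stats[name]

    @abstractmethod
    def sort_values(self, column):
        """New (hypothetical) operation every subclass must implement."""

    def run_operations(self):
        total = 0.0
        # Register the new operation alongside the existing ones.
        total += self.get_operation_stat("sort values", self.sort_values, "price")
        return total


class ListBench(PerformanceBenchmark):
    """Toy subclass backed by a list of dicts instead of a real framework."""

    def __init__(self, rows):
        super().__init__()
        self.rows = rows

    def sort_values(self, column):
        self.rows = sorted(self.rows, key=lambda row: row[column])


bench = ListBench([{"price": 3.0}, {"price": 1.0}, {"price": 2.0}])
total_time = bench.run_operations()
print(bench.stats)
```

Declaring the operation as an abstract method means a subclass that forgets to implement it fails at instantiation rather than midway through a benchmark run.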

To only add operations for a specific framework:

  1. Define the operation within the framework's class with the @profile decorator
  2. Define a run_operations method in the framework's class
  3. Structure the method to call your operations plus all common operations (via super().run_operations())
  4. Return the total time taken to run operations
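
A minimal sketch of this pattern, under stated assumptions: the profile decorator and base class below are simplified stand-ins (here the decorator just returns elapsed seconds), and to_tuple is a hypothetical framework-specific operation. The repository's own decorator and base class may behave differently.

```python
import functools
import time


def profile(func):
    """Stand-in for the project's @profile decorator; returns elapsed seconds."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        func(*args, **kwargs)
        return time.perf_counter() - start

    return wrapper


class PerformanceBenchmark:
    """Minimal stand-in for the base class with one common operation."""

    def __init__(self, values):
        self.values = values

    @profile
    def fill_nulls(self):
        self.values = [0.0 if v is None else v for v in self.values]

    def run_operations(self):
        # Common operations shared by all frameworks.
        return self.fill_nulls()


class FrameworkBench(PerformanceBenchmark):
    @profile
    def to_tuple(self):
        # Hypothetical framework-specific operation (e.g. a format conversion).
        self.converted = tuple(self.values)

    def run_operations(self):
        total = super().run_operations()  # all common operations
        total += self.to_tuple()          # plus the framework-specific one
        return total


bench = FrameworkBench([1.0, None, 2.0])
total = bench.run_operations()
print(f"total time: {total:.6f}s")
```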

To add new frameworks to test:

  1. Go to operations.py
  2. Create a class named after your framework followed by Bench, for example: class FrameworkBench:
  3. Inherit this class from PerformanceBenchmark
  4. Add all the methods used in run_operations method
  5. Include framework specific functionality for each method
  6. Add framework name in lowercase as one of the choices here
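
For illustration, the last step might look like the sketch below. It assumes the framework choices are defined through argparse on the --frameworks argument; the BENCHMARKS mapping and the empty class bodies are hypothetical placeholders, not the project's actual code.

```python
import argparse


class PerformanceBenchmark:
    """Placeholder for the base class in operations.py."""


class FrameworkBench(PerformanceBenchmark):
    """Placeholder for the new framework's benchmark class."""


# Hypothetical mapping from lowercase framework name to benchmark class.
BENCHMARKS = {"framework": FrameworkBench}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--frameworks",
    nargs="+",
    choices=sorted(BENCHMARKS),  # the new lowercase name must be listed here
    default=sorted(BENCHMARKS),
)

# Parse an explicit argument list so the sketch runs without a command line.
args = parser.parse_args(["--frameworks", "framework"])
selected = [BENCHMARKS[name] for name in args.frameworks]
print(selected)
```

With choices set, argparse rejects any framework name that has not been registered, so a typo fails immediately with a usage error.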
