Currently supported APIs: Pandas, Modin, Polars, DuckDB
- read csv
- read parquet
- add column of np arrays
- get date range
- filter rows based on random value
- groupby
- merge
- concatenation
- fill/drop nulls
- create dataframe
- save to csv
- save to parquet
- convert to pandas/numpy/arrow (duckdb only)
- Peak memory utilization
- Time consumption
Make sure the data being read has at least 1 string column and 1 float column
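A quick way to check this requirement before running the benchmark is sketched below. It is only a rough pre-check written with the standard library, not part of the benchmark itself: the helper names (`classify`, `column_kinds`, `has_required_columns`) are hypothetical, and the type guess is based on parsing sampled cell values, which may differ from the dtypes each framework infers.

```python
import csv
import io

# Ordering used to widen a column's guessed type: int < float < string.
RANK = {"int": 0, "float": 1, "string": 2}


def classify(value):
    """Guess the type of a single non-empty CSV cell."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def column_kinds(lines, sample_rows=1000):
    """Return {column name: 'int' | 'float' | 'string'} from a sample of rows.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    """
    kinds = {}
    for i, row in enumerate(csv.DictReader(lines)):
        if i >= sample_rows:
            break
        for name, value in row.items():
            if name is None or not value:
                continue  # skip overflow fields and empty (null) cells
            kind = classify(value)
            previous = kinds.get(name, "int")
            # Keep the widest type seen so far for this column.
            kinds[name] = max(previous, kind, key=RANK.get)
    return kinds


def has_required_columns(lines):
    """True if the sample contains at least one string and one float column."""
    kinds = set(column_kinds(lines).values())
    return {"string", "float"} <= kinds


# Tiny in-memory example; a real file would be passed as an open file object.
sample = "city,temperature\nOslo,3.5\nCairo,29.1\n"
print(has_required_columns(io.StringIO(sample)))
```

For a file on disk, the same check would be `with open(path, newline="") as f: has_required_columns(f)`.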
1. docker build -t [image-name/tag] .
2. docker run -v [local/data/storage/dir]:[/container/data/dir] [image-name] \
--data_path [/container/data/dir/data.csv] \
--save_dir [/container/data/dir] \
--iterations [number of times to run benchmarking] \
--frameworks [supported framework names e.g. pandas modin polars]
To access the test output from the host, make sure save_dir points to the directory mounted in the container
1. pip install -r requirements.txt
2. python benchmark.py --data_path [data/dir/file.csv]
Check here for all arguments
- Go to operations.py
- Underneath the PerformanceBenchmark base class, add an abstract method for your operation
- Add the same method to all subclasses, along with their corresponding functionality
- Pass the operation name, method, and any args to the get_operation_stat method, and make this call inside the run_operations method of the base class
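The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual code: only the names `PerformanceBenchmark`, `get_operation_stat` and `run_operations` come from this README, while the `get_operation_stat` signature, the `sort_values` operation and the `ListBench` subclass are hypothetical stand-ins.

```python
import time
from abc import ABC, abstractmethod


class PerformanceBenchmark(ABC):
    """Trimmed-down stand-in for the base class in operations.py."""

    def __init__(self):
        self.stats = {}

    @abstractmethod
    def sort_values(self, column):
        """Hypothetical new common operation every subclass must implement."""

    def get_operation_stat(self, name, method, *args):
        # Assumed signature: time one call and record it under the given name.
        start = time.perf_counter()
        method(*args)
        elapsed = time.perf_counter() - start
        self.stats[name] = elapsed
        return elapsed

    def run_operations(self):
        total = 0.0
        # The new operation is registered here so every framework runs it.
        total += self.get_operation_stat("sort values", self.sort_values, "price")
        return total


class ListBench(PerformanceBenchmark):
    """Hypothetical subclass using plain Python lists instead of a dataframe."""

    def __init__(self, rows):
        super().__init__()
        self.rows = rows

    def sort_values(self, column):
        # Framework-specific implementation of the common operation.
        self.rows = sorted(self.rows, key=lambda row: row[column])


bench = ListBench([{"price": 3.0}, {"price": 1.5}])
total = bench.run_operations()
print(f"total time: {total:.6f}s")
```

Because the method is abstract in the base class, any subclass that forgets to implement it fails at instantiation time rather than midway through a benchmark run.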
- Define an operation within the framework's class with the @profile decorator
- Underneath the framework's class define a run_operations method
- Structure the method to call your operations plus all common operations with (super().run_operations())
- Return the total time taken to run operations
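A rough sketch of a framework-specific operation is shown below. The `profile` decorator here is a pass-through stand-in that only records which functions were called; how the project's real `@profile` decorator measures memory is not shown in this README. The `ExampleBench` class, the `convert_to_list` operation and the `get_operation_stat` signature are likewise assumptions made for illustration.

```python
import functools
import time

PROFILED_CALLS = []  # filled in by the stand-in decorator below


def profile(func):
    """Hypothetical stand-in for the project's @profile decorator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        PROFILED_CALLS.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


class PerformanceBenchmark:
    """Trimmed-down stand-in for the base class in operations.py."""

    def get_operation_stat(self, name, method, *args):
        # Assumed signature: time a single call and return the elapsed seconds.
        start = time.perf_counter()
        method(*args)
        return time.perf_counter() - start

    def create_dataframe(self):
        self.data = {"value": [1.0, 2.0, 3.0]}

    def run_operations(self):
        # Common operations shared by every framework run here.
        return self.get_operation_stat("create dataframe", self.create_dataframe)


class ExampleBench(PerformanceBenchmark):
    """Hypothetical framework class with one extra operation."""

    @profile
    def convert_to_list(self):
        # Framework-specific operation that the other frameworks do not have.
        self.converted = list(self.data["value"])

    def run_operations(self):
        # Run the common operations first, then add the framework-specific one.
        total = super().run_operations()
        total += self.get_operation_stat("convert to list", self.convert_to_list)
        return total


bench = ExampleBench()
total = bench.run_operations()
print(f"total time: {total:.6f}s")
```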
- Go to operations.py
- Create a class with the name of your framework followed by Bench, for example: class FrameworkBench
- Inherit this class from PerformanceBenchmark
- Add all the methods used in run_operations method
- Include framework specific functionality for each method
- Add framework name in lowercase as one of the choices here
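Put together, adding a framework might look roughly like the sketch below. It assumes a hypothetical framework called `mylib` and a hypothetical `BENCHES` lookup table; the real base class has more methods than the two shown, and the way the project actually wires the `--frameworks` choices may differ.

```python
import argparse
from abc import ABC, abstractmethod


class PerformanceBenchmark(ABC):
    """Trimmed-down stand-in for the base class in operations.py."""

    @abstractmethod
    def create_dataframe(self):
        ...

    @abstractmethod
    def fill_nulls(self, value):
        ...

    def run_operations(self):
        self.create_dataframe()
        self.fill_nulls(0.0)


class MylibBench(PerformanceBenchmark):
    """Hypothetical framework 'mylib': framework name followed by Bench."""

    def create_dataframe(self):
        self.data = {"value": [1.0, None, 3.0]}

    def fill_nulls(self, value):
        # Framework-specific way of replacing nulls.
        self.data["value"] = [value if v is None else v for v in self.data["value"]]


# Lowercase framework name mapped to its benchmark class (assumed layout).
BENCHES = {"mylib": MylibBench}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--frameworks",
    nargs="+",
    choices=sorted(BENCHES),
    default=sorted(BENCHES),
)
# Arguments are passed explicitly so the sketch runs without a command line.
args = parser.parse_args(["--frameworks", "mylib"])

for name in args.frameworks:
    bench = BENCHES[name]()
    bench.run_operations()
    print(name, bench.data)
```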