Forked from datatable
This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to [pandas][] or [SFrame][]; however, we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's [data.table][] and attempts to mimic its core algorithms and API.
Requirements: Python 3.6+ (64 bit) and pip 20.3+.
datatable started in 2017 as a toolkit for performing big data (up to 100GB)
operations on a single-node machine, at the maximum speed possible. Such
requirements are dictated by modern machine-learning applications, which need
to process large volumes of data and generate many features in order to
achieve the best model accuracy. The first user of datatable was
[Driverless.ai][].
The set of features that we want to implement with datatable is at least
the following:
- Column-oriented data storage.
- Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.
- Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.
- All types should support null values, with as little overhead as possible.
- Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.
- Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.
- Fast data reading from CSV and other formats.
- Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.
- Efficient algorithms for sorting/grouping/joining.
- Expressive query syntax (similar to [data.table][]).
- Minimal amount of data copying, copy-on-write semantics for shared data.
- Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.
- Interoperability with pandas / numpy / pyarrow / pure python: users should be able to convert to another data-processing framework with ease.
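
To make the "rowindex" idea from the list above more concrete, here is a minimal pure-Python sketch. It is not how datatable implements it internally: the `Column` class and its `filter`, `sort`, and `to_list` methods are hypothetical names chosen for illustration only. The point it tries to show is that filtering and sorting can produce a small array of row positions over a shared buffer instead of copying the values.

```python
from array import array


class Column:
    """Hypothetical column: a shared data buffer plus an optional rowindex."""

    def __init__(self, data, rowindex=None):
        self.data = data          # underlying buffer, shared by all views
        self.rowindex = rowindex  # None means "all rows, in natural order"

    def __len__(self):
        return len(self.data) if self.rowindex is None else len(self.rowindex)

    def __getitem__(self, i):
        # Translate a view-relative row number into a position in the buffer.
        return self.data[i if self.rowindex is None else self.rowindex[i]]

    def _view(self, positions):
        # Compose with any existing rowindex so the new view always
        # points directly at the base buffer.
        if self.rowindex is not None:
            positions = [self.rowindex[p] for p in positions]
        return Column(self.data, array("q", positions))

    def filter(self, predicate):
        # Keep only the positions whose values satisfy the predicate.
        return self._view([i for i in range(len(self)) if predicate(self[i])])

    def sort(self):
        # Order the positions by value; the values themselves do not move.
        return self._view(sorted(range(len(self)), key=self.__getitem__))

    def to_list(self):
        # Materialize the view into a plain Python list.
        return [self[i] for i in range(len(self))]


prices = Column(array("d", [9.5, 3.0, 7.25, 1.0]))
cheap_sorted = prices.filter(lambda x: x < 8.0).sort()

print(cheap_sorted.to_list())         # [1.0, 3.0, 7.25]
print(list(cheap_sorted.rowindex))    # [3, 1, 2]
print(cheap_sorted.data is prices.data)  # True
```

In this sketch the chained filter and sort allocate only integer index arrays; the original `array("d", ...)` buffer is reused by every view, which is the property the rowindex bullet is after.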