PySpark Version to Implement PSM & DML

Open-source Python packages such as EconML and CausalML are excellent, comprehensive tools for causal inference, and combining propensity score matching (PSM) with double machine learning (DML) often gives better results. In industry, where datasets are large, Spark is a common processing engine. This project is therefore a simple reimplementation of PSM and DML in PySpark.

Example

Propensity Score Matching

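from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession \
    .builder \
    .getOrCreate()

def add_id(df):
    """Append a 0-based LongType "id" column to df."""
    # Build a new schema; df.schema.add() would mutate the StructType in place.
    schema = StructType(df.schema.fields + [StructField("id", LongType())])
    # zipWithIndex turns every Row into a (Row, index) pair.
    rdd = df.rdd.zipWithIndex()

    def flat(l):
        # Recursively flatten nested lists/tuples (a Row is a tuple) so that
        # each (Row, index) pair becomes one flat sequence of values.
        for k in l:
            if not isinstance(k, (list, tuple)):
                yield k
            else:
                yield from flat(k)

    rdd = rdd.map(lambda x: list(flat(x)))
    # fillna(0) replaces nulls in numeric columns with 0.
    df_with_id = spark.createDataFrame(rdd, schema).fillna(0)

    return df_with_id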
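
The nested flat helper can be tried on its own in plain Python, without Spark, to see what it does to a (row, index) pair. This is a minimal sketch: plain tuples stand in for Spark Row objects, and the sample values are made up for illustration.

```python
def flat(l):
    """Recursively yield the scalar values of nested lists/tuples."""
    for k in l:
        if not isinstance(k, (list, tuple)):
            yield k
        else:
            yield from flat(k)


# zipWithIndex pairs each row with its position: (row, index).
# A plain tuple stands in for a Spark Row here (Row is a tuple subclass).
pair = (("alice", 3.5), 0)
flat_row = list(flat(pair))
print(flat_row)  # ['alice', 3.5, 0]

# Caveat: list-valued cells are flattened too, which changes the row width.
nested = list(flat((("bob", [1, 2]), 1)))
print(nested)  # ['bob', 1, 2, 1]
```

The second case suggests that add_id, as written, is best suited to DataFrames whose columns hold scalar values only.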

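# PSM is defined in this repository's psm.py.
from psm import PSM

treatment_group_ = spark.read.parquet('path to treatment file')
control_group_ = spark.read.parquet('path to control file')
# union() matches columns by position, so both files must share the same column order.
all_group = treatment_group_.union(control_group_)
psm_df = add_id(all_group)

psm = PSM(spark, psm_df)

# T is the name of the treatment indicator column.
treatment_group, control_group = psm.fit(T='treatment column name')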
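
The README does not describe how PSM.fit pairs treated and control units, so the following is only a sketch of one common approach: greedy 1:1 nearest-neighbour matching on the propensity score, without replacement, with a caliper. The function match_nearest, its parameters, and the scores are hypothetical and are not taken from psm.py.

```python
def match_nearest(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: dicts mapping unit id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls that are still unmatched
    pairs = []
    # Process treated units in order of increasing propensity score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score distance.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        # Accept the pair only if it lies within the caliper.
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Made-up propensity scores, keyed by the "id" column.
treated = {1: 0.80, 2: 0.55, 3: 0.30}
control = {10: 0.52, 11: 0.31, 12: 0.10, 13: 0.95}
pairs = match_nearest(treated, control)
print(pairs)  # [(3, 11), (2, 10)]
```

Treated unit 1 stays unmatched because its closest control (score 0.95) is farther away than the caliper allows. Dropping such units trades sample size for better covariate balance.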

Standardized Mean Difference (SMD) Table

smd_table = psm.get_smd_table()
smd_table
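
For reference, the SMD of a covariate is commonly computed as the difference in group means divided by a pooled standard deviation. The sketch below assumes that definition. Whether get_smd_table uses exactly this formula is not stated here, and the function name smd and the sample values are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def smd(treated, control):
    """Standardized mean difference of one covariate between two groups."""
    # Pooled standard deviation: root of the average of the two sample variances.
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Made-up ages for three treated and three control units.
age_treated = [30, 32, 34]
age_control = [26, 28, 30]
value = smd(age_treated, age_control)
print(value)  # 2.0
```

A common rule of thumb treats an absolute SMD below 0.1 as adequately balanced, so a value of 2.0 would indicate a badly imbalanced covariate.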

Density Plot of Propensity Score

psm.get_propensity_plot(False)
psm.get_propensity_plot(True)

Double Machine Learning

from dml import LinearDML

df = spark.read.parquet('path to data')
df = add_id(df)

est = LinearDML(spark, df, model_y='rf', model_t='rf', discrete_treatment=True, cv=2)
est.fit(Y='outcome column name', T='treatment column name')
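
To illustrate the idea behind DML, the sketch below runs the partialling-out estimator with two-fold cross-fitting in plain Python. It rests on simplifying assumptions that differ from the call above: one covariate, a continuous treatment, and ordinary least squares as the nuisance model instead of 'rf'. The names fit_line and dml_effect and the data are hypothetical and are not part of dml.py.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def dml_effect(x, t, y, n_folds=2):
    """Cross-fitted partialling-out estimate of a constant treatment effect."""
    n = len(x)
    res_t, res_y = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        x_train = [x[i] for i in train]
        # Nuisance models are fitted on the other fold(s) only.
        a_t, b_t = fit_line(x_train, [t[i] for i in train])
        a_y, b_y = fit_line(x_train, [y[i] for i in train])
        # Residualize treatment and outcome on the held-out fold.
        for i in test:
            res_t[i] = t[i] - (a_t + b_t * x[i])
            res_y[i] = y[i] - (a_y + b_y * x[i])
    # Final stage: regress outcome residuals on treatment residuals.
    num = sum(rt * ry for rt, ry in zip(res_t, res_y))
    den = sum(rt * rt for rt in res_t)
    return num / den


# Synthetic, noise-free data with a true treatment effect of 2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
t = [1.0, 0.0, 3.0, 1.0, 2.0, 4.0, 2.0, 5.0]
y = [2.0 * ti + 3.0 * xi + 1.0 for ti, xi in zip(t, x)]
theta = dml_effect(x, t, y)
print(round(theta, 6))  # 2.0
```

Because the outcome here is an exact linear function of the treatment and the covariate, the estimate recovers the true effect up to floating-point error. With real data and flexible nuisance models, cross-fitting is what keeps overfitting in the first stage from biasing the final estimate.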

Average Treatment Effect

est.get_ate(decimals=4)

Individual Treatment Effect

est.get_ite(decimals=4)
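
The two quantities are related by definition: the average treatment effect is the mean of the individual treatment effects. A minimal sketch of that relationship, assuming that decimals controls rounding of the result; the helper average_effect and the effect values are made up for illustration.

```python
def average_effect(ite, decimals=4):
    """Average treatment effect as the rounded mean of individual effects."""
    return round(sum(ite) / len(ite), decimals)


# Made-up individual treatment effects for three units.
ite = [1.0, 2.0, 4.0]
ate = average_effect(ite)
print(ate)  # 2.3333
```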
