Demo for the DoWhy causal API

We show a simple example of adding a causal extension to any pandas DataFrame.

[1]:
import dowhy.datasets
import dowhy.api

import numpy as np
import pandas as pd

from statsmodels.api import OLS
[2]:
data = dowhy.datasets.linear_dataset(beta=5,
        num_common_causes=1,
        num_instruments=0,
        num_samples=1000,
        treatment_is_binary=True)
df = data['df']
df['y'] = df['y'] + np.random.normal(size=len(df))  # Add noise to the data. Without it, the variance of Y|X,Z is zero and MCMC sampling fails.
#data['dot_graph'] = 'digraph { v ->y;X0-> v;X0-> y;}'

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
[2]:
           W0     v0         y
0    0.883800   True  6.032204
1    2.151652   True  8.274524
2    0.565327   True  5.615925
3    1.943763   True  9.943361
4    1.025108   True  6.526274
..        ...    ...       ...
995 -1.452482  False -1.081065
996  1.532964  False  2.817101
997 -0.350965   True  1.761346
998 -1.642007  False -3.852874
999  0.641072  False  2.620429

1000 rows × 3 columns
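
The generated dataset also carries the ground-truth causal graph (the commented-out line above sketches its shape), which we pass to do() in a later cell. As a quick illustrative check (not an original cell), it can be inspected directly:

[ ]:
print(data['dot_graph'])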

[3]:
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[3]:
<AxesSubplot:xlabel='v0'>
[Figure: bar plot of the mean outcome y for each treatment level v0]
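
The do() call returns a new DataFrame sampled from the interventional distribution, so the same comparison can also be read off numerically rather than plotted. A minimal illustrative sketch (not an original cell; interventional_df is a hypothetical name):

[ ]:
interventional_df = df.causal.do(x=treatment,
                                 variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                                 outcome=outcome,
                                 common_causes=[common_cause],
                                 proceed_when_unidentifiable=True)
interventional_df.groupby(treatment)[outcome].mean()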
[4]:
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
[4]:
<AxesSubplot:xlabel='v0'>
[Figure: bar plot of the mean outcome y for the do(v0=1) sample from the weighting estimator]
[5]:
cdf_1 = df.causal.do(x={treatment: 1},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
              variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
              outcome=outcome,
              dot_graph=data['dot_graph'],
              proceed_when_unidentifiable=True)

[6]:
cdf_0
[6]:
           W0     v0         y  propensity_score    weight
0    1.117996  False  2.497059          0.333225  3.000976
1    0.238462  False  0.509070          0.455258  2.196556
2    0.213945  False  2.279505          0.458815  2.179528
3    0.425245  False -0.297296          0.428336  2.334615
4    2.301323  False  3.431520          0.200139  4.996525
..        ...    ...       ...               ...       ...
995  0.621286  False  1.458936          0.400531  2.496683
996  1.642750  False  3.409962          0.268860  3.719407
997  0.409933  False -0.635569          0.430529  2.322722
998 -0.063154  False  0.953209          0.499221  2.003123
999  1.532964  False  2.817101          0.281662  3.550353

1000 rows × 5 columns

[7]:
cdf_1
[7]:
           W0    v0         y  propensity_score    weight
0    1.055191  True  5.738184          0.658568  1.518446
1    1.676538  True  7.187013          0.735005  1.360535
2    0.111945  True  4.353155          0.526346  1.899889
3   -0.840229  True  3.503758          0.389082  2.570151
4    0.556640  True  5.125285          0.590361  1.693878
..        ...   ...       ...               ...       ...
995  0.018338  True  6.145009          0.512687  1.950507
996  0.094625  True  5.329991          0.523821  1.909048
997  0.711712  True  4.394407          0.612092  1.633741
998  1.928415  True  8.157832          0.762678  1.311169
999 -0.280647  True  5.377214          0.469032  2.132052

1000 rows × 5 columns
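
Note the extra propensity_score and weight columns in both frames: in these outputs, each row's weight is simply the reciprocal of its propensity score, which a quick illustrative check (not an original cell) confirms:

[ ]:
np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])  # expected: True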

Comparing the estimate to Linear Regression

First, we estimate the effect using the causal data frame, along with its 95% confidence interval.

[8]:
(cdf_1['y'] - cdf_0['y']).mean()
[8]:
5.092590494260132
[9]:
1.96*(cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
[9]:
0.17478007931409847
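
Combining the two cells above, the full 95% interval is the point estimate plus or minus this half-width. An illustrative sketch (not an original cell):

[ ]:
ate = (cdf_1['y'] - cdf_0['y']).mean()
halfwidth = 1.96 * (cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
(ate - halfwidth, ate + halfwidth)  # roughly (4.92, 5.27), covering the true effect of 5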

Comparing to the estimate from OLS.

[10]:
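# Note: the model is fit without a constant, hence the "uncentered" R-squared in the summary below.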
model = OLS(np.asarray(df[outcome]), np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
[10]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.970
Model:                            OLS   Adj. R-squared (uncentered):              0.970
Method:                 Least Squares   F-statistic:                          1.620e+04
Date:                Tue, 02 Mar 2021   Prob (F-statistic):                        0.00
Time:                        20:29:26   Log-Likelihood:                         -1414.1
No. Observations:                1000   AIC:                                      2832.
Df Residuals:                     998   BIC:                                      2842.
Df Model:                           2
Covariance Type:            nonrobust
=======================================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
x1             1.7444      0.031     55.843      0.000       1.683       1.806
x2             5.0338      0.052     97.742      0.000       4.933       5.135
=======================================================================================
Omnibus:                        4.501   Durbin-Watson:                          1.900
Prob(Omnibus):                  0.105   Jarque-Bera (JB):                       4.538
Skew:                           0.147   Prob(JB):                               0.103
Kurtosis:                       2.851   Cond. No.                                2.50
=======================================================================================


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
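
As a final illustrative check (not an original cell), the treatment coefficient can be read off the fitted result and compared with the causal estimate above:

[ ]:
result.params[1]  # coefficient on the treatment column (x2): about 5.03, close to the do-based estimate of about 5.09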