Demo for the DoWhy causal API
We show a simple example of adding the causal extension to any `pandas.DataFrame`.
[1]:
```python
import dowhy.datasets
import dowhy.api  # registers the `.causal` accessor on pandas DataFrames

import numpy as np
import pandas as pd
from statsmodels.api import OLS
```
[2]:
```python
data = dowhy.datasets.linear_dataset(beta=5,
                                     num_common_causes=1,
                                     num_instruments=0,
                                     num_samples=1000,
                                     treatment_is_binary=True)
df = data['df']
# Add noise to the outcome. Without it, the variance of Y | X, Z is zero
# and the MCMC sampler fails.
df['y'] = df['y'] + np.random.normal(size=len(df))

treatment = data["treatment_name"][0]
outcome = data["outcome_name"][0]
common_cause = data["common_causes_names"][0]
df
```
[2]:
|     | W0        | v0    | y         |
|-----|-----------|-------|-----------|
| 0   | 0.495676  | False | 0.114497  |
| 1   | 0.271500  | True  | 6.111279  |
| 2   | 0.405087  | True  | 4.467770  |
| 3   | 1.437909  | True  | 10.856139 |
| 4   | 0.853315  | False | 5.036457  |
| ... | ...       | ...   | ...       |
| 995 | 0.683802  | False | 0.712246  |
| 996 | -0.649195 | False | -3.760712 |
| 997 | 1.223760  | False | 4.286952  |
| 998 | 2.577242  | True  | 11.527125 |
| 999 | 0.762801  | True  | 6.146325  |

1000 rows × 3 columns
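The returned dictionary also carries the ground-truth causal graph used to simulate the data, which this demo later passes to the do-sampler as `dot_graph`. A quick way to inspect it:

```python
# Print the DOT string for the generated graph: treatment -> outcome,
# with the common cause pointing into both. Exact node names depend on
# the dowhy version.
print(data['dot_graph'])
```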
[3]:
```python
# data['df'] is just a regular pandas.DataFrame
df.causal.do(x=treatment,
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
```
[3]:
<AxesSubplot: xlabel='v0'>
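`do` returns an ordinary `pandas.DataFrame` sampled from the interventional distribution, so the effect can also be read off numerically rather than from the bar plot. A minimal sketch (the `interventional_df` and `means` names are ours):

```python
# Sample from the interventional distribution, then compare outcome
# means under treatment and control; their difference approximates the
# ATE (the data were simulated with beta=5).
interventional_df = df.causal.do(x=treatment,
                                 variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                                 outcome=outcome,
                                 common_causes=[common_cause],
                                 proceed_when_unidentifiable=True)
means = interventional_df.groupby(treatment)[outcome].mean()
print(means.loc[True] - means.loc[False])
```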
[4]:
```python
df.causal.do(x={treatment: 1},
             variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
             outcome=outcome,
             method='weighting',
             common_causes=[common_cause],
             proceed_when_unidentifiable=True).groupby(treatment).mean().plot(y=outcome, kind='bar')
```
[4]:
<AxesSubplot: xlabel='v0'>
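Passing a dictionary `x={treatment: 1}` requests samples from the interventional distribution under do(v0 = 1) specifically, and `method='weighting'` selects the propensity-weighting do-sampler backend explicitly (the comment in cell [2] hints that an MCMC backend also exists).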
[5]:
```python
cdf_1 = df.causal.do(x={treatment: 1},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)

cdf_0 = df.causal.do(x={treatment: 0},
                     variable_types={treatment: 'b', outcome: 'c', common_cause: 'c'},
                     outcome=outcome,
                     dot_graph=data['dot_graph'],
                     proceed_when_unidentifiable=True)
```
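Here identification uses the full causal graph via `dot_graph` instead of an explicit `common_causes` list. As the tables below show, each sampled row also records the fitted propensity score and the corresponding weight.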
[6]:
```python
cdf_0
```
[6]:
|     | W0        | v0    | y         | propensity_score | weight   |
|-----|-----------|-------|-----------|------------------|----------|
| 0   | 0.962677  | False | 2.731163  | 0.488649         | 2.046457 |
| 1   | 0.963060  | False | 1.285152  | 0.488636         | 2.046512 |
| 2   | 1.333450  | False | 2.819011  | 0.476049         | 2.100623 |
| 3   | 1.113292  | False | 4.793211  | 0.483528         | 2.068131 |
| 4   | -0.076380 | False | 1.492980  | 0.523985         | 1.908450 |
| ... | ...       | ...   | ...       | ...              | ...      |
| 995 | 0.495474  | False | 1.407595  | 0.504545         | 1.981984 |
| 996 | 0.093358  | False | 2.586551  | 0.518220         | 1.929683 |
| 997 | -0.649195 | False | -3.760712 | 0.543386         | 1.840313 |
| 998 | -0.313608 | False | -1.064077 | 0.532032         | 1.879586 |
| 999 | -1.104398 | False | -3.100139 | 0.558712         | 1.789830 |

1000 rows × 5 columns
[7]:
```python
cdf_1
```
[7]:
|     | W0        | v0   | y         | propensity_score | weight   |
|-----|-----------|------|-----------|------------------|----------|
| 0   | 1.800934  | True | 9.604396  | 0.539792         | 1.852566 |
| 1   | 0.908676  | True | 7.848732  | 0.509514         | 1.962655 |
| 2   | 1.486907  | True | 8.214336  | 0.529158         | 1.889796 |
| 3   | 2.860181  | True | 11.082506 | 0.575341         | 1.738098 |
| 4   | 0.605184  | True | 8.075637  | 0.499188         | 2.003253 |
| ... | ...       | ...  | ...       | ...              | ...      |
| 995 | 0.762801  | True | 6.146325  | 0.504551         | 1.981959 |
| 996 | -1.252360 | True | 2.651436  | 0.436328         | 2.291852 |
| 997 | -0.702970 | True | 1.747041  | 0.454799         | 2.198774 |
| 998 | 1.964816  | True | 8.846990  | 0.545328         | 1.833759 |
| 999 | 1.906254  | True | 6.299404  | 0.543351         | 1.840431 |

1000 rows × 5 columns
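In these samples the `weight` column is simply the inverse of the row's propensity score, which is easy to verify; a small sanity-check sketch:

```python
# In this run the do-sampler's weight equals 1 / propensity_score
# (inverse-propensity weighting); check it numerically.
assert np.allclose(cdf_1['weight'], 1.0 / cdf_1['propensity_score'])
assert np.allclose(cdf_0['weight'], 1.0 / cdf_0['propensity_score'])
```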
Comparing the estimate to Linear Regression
First, we estimate the effect using the causal data frames, together with a 95% confidence interval.
[8]:
```python
(cdf_1['y'] - cdf_0['y']).mean()
```
[8]:
4.70464995857471
[9]:
```python
1.96 * (cdf_1['y'] - cdf_0['y']).std() / np.sqrt(len(df))
```
[9]:
0.22224585666512
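The point estimate and interval half-width can also be printed together; a small convenience sketch (the `diff` and `half_width` names are ours):

```python
# Effect estimate from the two interventional samples, with a
# normal-approximation 95% interval.
diff = cdf_1['y'] - cdf_0['y']
half_width = 1.96 * diff.std() / np.sqrt(len(df))
print(f"ATE estimate: {diff.mean():.3f} +/- {half_width:.3f}")
```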
Now compare with the estimate from OLS. The regression below omits the intercept (hence the uncentered R² in the summary); `x1` is the common cause `W0` and `x2` is the treatment `v0`.
[10]:
```python
# Regress the outcome on the common cause and the treatment (no constant term).
model = OLS(np.asarray(df[outcome]),
            np.asarray(df[[common_cause, treatment]], dtype=np.float64))
result = model.fit()
result.summary()
```
[10]:
| Dep. Variable:    | y                | R-squared (uncentered):      | 0.968     |
|-------------------|------------------|------------------------------|-----------|
| Model:            | OLS              | Adj. R-squared (uncentered): | 0.968     |
| Method:           | Least Squares    | F-statistic:                 | 1.518e+04 |
| Date:             | Mon, 27 Mar 2023 | Prob (F-statistic):          | 0.00      |
| Time:             | 12:34:22         | Log-Likelihood:              | -1439.3   |
| No. Observations: | 1000             | AIC:                         | 2883.     |
| Df Residuals:     | 998              | BIC:                         | 2892.     |
| Df Model:         | 2                |                              |           |
| Covariance Type:  | nonrobust        |                              |           |
|    | coef   | std err | t      | P>\|t\| | [0.025 | 0.975] |
|----|--------|---------|--------|---------|--------|--------|
| x1 | 2.3676 | 0.029   | 82.056 | 0.000   | 2.311  | 2.424  |
| x2 | 4.9885 | 0.052   | 96.714 | 0.000   | 4.887  | 5.090  |
| Omnibus:       | 2.708  | Durbin-Watson:    | 1.914 |
|----------------|--------|-------------------|-------|
| Prob(Omnibus): | 0.258  | Jarque-Bera (JB): | 2.655 |
| Skew:          | -0.079 | Prob(JB):         | 0.265 |
| Kurtosis:      | 3.197  | Cond. No.         | 2.21  |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
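The treatment coefficient `x2` (about 4.99) is close to the simulated effect beta = 5 and broadly agrees with the do-sampler estimate above. To pull out just that coefficient and its confidence interval, a short sketch:

```python
# With the design matrix built above, index 1 corresponds to x2, the
# treatment column.
print(result.params[1])      # OLS estimate of the treatment effect
print(result.conf_int()[1])  # its 95% confidence interval
```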