Lalonde Pandas API Example
by Adam Kelleher
We’ll run through a quick example using the high-level Python API for the DoSampler. The DoSampler is different from most classic causal effect estimators. Instead of estimating statistics under interventions, it aims to provide the generality of Pearlian causal inference. In that context, the joint distribution of the variables under an intervention is the quantity of interest. It’s hard to represent a joint distribution nonparametrically, so instead we provide a sample from that distribution, which we call a “do” sample.
Here, when you specify an outcome, that is the variable you’re sampling under an intervention. We still have to do the usual process of making sure the quantity (the conditional interventional distribution of the outcome) is identifiable. We leverage the familiar components of the rest of the package to do that “under the hood”. You’ll notice some similarity in the kwargs for the DoSampler.
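To make the idea of a “do” sample concrete, here is a minimal sketch on synthetic data. It assumes one simple way of drawing such a sample: resampling rows with weights inversely proportional to the estimated propensity of the treatment each row actually received. This illustrates the concept only and is not necessarily how the DoSampler is implemented; the column names z, t, and y and all numeric values are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000

# Synthetic data (hypothetical): z confounds both the treatment t and the outcome y.
z = rng.integers(0, 2, size=n)
t = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)
y = 2.0 * t + 3.0 * z + rng.normal(size=n)  # true effect of t on y is 2.0
data = pd.DataFrame({"z": z, "t": t, "y": y})

# Propensity P(t = 1 | z), estimated as the treated share within each z group.
p_treated = data.groupby("z")["t"].transform("mean")
# Probability of the treatment value each row actually received.
p_received = np.where(data["t"] == 1, p_treated, 1.0 - p_treated)

# Resample rows with probability proportional to 1 / P(t | z).
weights = pd.Series(1.0 / p_received, index=data.index)
do_sample = data.sample(n=n, replace=True, weights=weights, random_state=0)


def mean_gap(df):
    """Difference in mean outcome between treated and untreated rows."""
    means = df.groupby("t")["y"].mean()
    return means.loc[1] - means.loc[0]


naive_effect = mean_gap(data)     # biased upward by the confounder z
do_effect = mean_gap(do_sample)   # should land near the true effect of 2.0
```

In the resampled frame the treatment is approximately independent of z, so ordinary summaries of it behave like summaries of the interventional distribution.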
Getting the Data
First, download the data from the LaLonde example.
[1]:
from rpy2.robjects import r as R
%load_ext rpy2.ipython
%R install.packages("Matching", repos="https://cloud.r-project.org")
%R library(Matching)
%R data(lalonde)
%R -o lalonde
lalonde.to_csv("lalonde.csv",index=False)
R[write to console]: Installing package into ‘/home/amit/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
R[write to console]: Error in contrib.url(repos, type) :
trying to use CRAN without setting a mirror
Calls: <Anonymous> ... withVisible -> install.packages -> grep -> contrib.url
R[write to console]: Loading required package: MASS
R[write to console]: ##
## Matching (Version 4.9-5, Build Date: 2019-03-05)
## See http://sekhon.berkeley.edu/matching for additional documentation.
## Please cite software as:
## Jasjeet S. Sekhon. 2011. ``Multivariate and Propensity Score Matching
## Software with Automated Balance Optimization: The Matching package for R.''
## Journal of Statistical Software, 42(7): 1-52.
##
[2]:
# The data was already loaded in the previous cell. We read it back from the
# CSV here so you don't have to keep re-downloading it.
import pandas as pd
lalonde=pd.read_csv("lalonde.csv")
The causal Namespace
We’ve created a “namespace” for pandas.DataFrames containing causal inference methods. You can access it here with lalonde.causal, where lalonde is our pandas.DataFrame, and causal contains all our new methods! These methods are magically loaded into your existing (and future) dataframes when you import dowhy.api.
[3]:
import dowhy.api
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-641fb1855e44> in <module>()
----> 1 import dowhy.api
/mnt/c/Users/amshar/code/dowhy/dowhy/api/__init__.py in <module>()
----> 1 import dowhy.api.causal_data_frame
/mnt/c/Users/amshar/code/dowhy/dowhy/api/causal_data_frame.py in <module>()
5
6
----> 7 @pd.api.extensions.register_dataframe_accessor("causal")
8 class CausalAccessor(object):
9 def __init__(self, pandas_obj):
AttributeError: module 'pandas.api' has no attribute 'extensions'
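The traceback shows that the namespace is registered with pandas’ register_dataframe_accessor decorator, so the import fails on a pandas version that lacks pandas.api.extensions. The sketch below shows how that mechanism attaches methods to every dataframe. The accessor name demo_stats and its mean_gap method are hypothetical and are not part of dowhy.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("demo_stats")
class DemoStatsAccessor:
    """Hypothetical accessor, reachable as df.demo_stats on any dataframe."""

    def __init__(self, pandas_obj):
        # pandas passes in the dataframe the accessor was reached from.
        self._obj = pandas_obj

    def mean_gap(self, group, value):
        """Mean of `value` where `group` is 1, minus the mean where it is 0."""
        means = self._obj.groupby(group)[value].mean()
        return means.loc[1] - means.loc[0]


frame = pd.DataFrame({"treat": [0, 0, 1, 1], "re78": [1.0, 3.0, 5.0, 9.0]})
gap = frame.demo_stats.mean_gap("treat", "re78")
```

Once the decorator has run, every dataframe, including ones created earlier, exposes the new attribute. That is why importing the module is enough to make the namespace available.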
Now that we have the causal namespace, let’s give it a try!
The do Operation
The key feature here is the do method, which produces a new dataframe in which the treatment variable is replaced with the values you specify and the outcome is replaced with a sample from its interventional distribution. If you don’t specify a value for the treatment, it leaves the treatment untouched:
[ ]:
do_df = lalonde.causal.do(x='treat',
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ': 'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd', 're78': 'c', 'treat': 'b'})
Notice you get the usual output and prompts about identifiability. This is all dowhy under the hood!
We now have an interventional sample in do_df. It looks very similar to the original dataframe. Compare them:
[ ]:
lalonde.head()
[ ]:
do_df.head()
Treatment Effect Estimation
Without the do sample, we could get a naive estimate of the treatment effect from the original data by doing
[ ]:
(lalonde[lalonde['treat'] == 1].mean() - lalonde[lalonde['treat'] == 0].mean())['re78']
We can do the same with our new sample from the interventional distribution to get a causal effect estimate:
[ ]:
(do_df[do_df['treat'] == 1].mean() - do_df[do_df['treat'] == 0].mean())['re78']
We could get some rough error bars on the outcome using the normal approximation for a 95% confidence interval, like
[ ]:
import numpy as np
treated = do_df[do_df['treat'] == 1]
control = do_df[do_df['treat'] == 0]
# Half-width of a normal-approximation 95% interval for the difference in means.
1.96 * np.sqrt(treated['re78'].var() / len(treated) + control['re78'].var() / len(control))
but note that these DO NOT contain propensity score estimation error. For that, a bootstrapping procedure might be more appropriate.
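As a rough sketch of the bootstrapping idea, the helper below resamples rows with replacement and recomputes an estimate on each resample. To capture propensity score estimation error, the estimator passed in would need to redo the whole pipeline on each resample, including drawing a fresh do sample. Here a plain difference in means on synthetic data stands in for that step. The function names and the synthetic numbers are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def bootstrap_interval(df, estimator, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(df)."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        # Draw row positions with replacement and re-run the estimator.
        idx = rng.integers(0, len(df), size=len(df))
        estimates.append(estimator(df.iloc[idx].reset_index(drop=True)))
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper


def mean_difference(df, treatment="treat", outcome="re78"):
    """Stand-in estimator: treated mean minus control mean."""
    means = df.groupby(treatment)[outcome].mean()
    return means.loc[1] - means.loc[0]


# Synthetic stand-in data (not the LaLonde data): true difference is 1000.
rng = np.random.default_rng(1)
treat = rng.integers(0, 2, size=2000)
demo = pd.DataFrame({"treat": treat,
                     "re78": 1000.0 * treat + rng.normal(0.0, 500.0, size=2000)})

lower, upper = bootstrap_interval(demo, mean_difference)
```

With the notebook’s data, the estimator could instead call lalonde.causal.do on the resampled frame with the same arguments as above and return the difference in means of the resulting do sample, at the cost of one sampler run per bootstrap replicate.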
This is just one statistic we can compute from the interventional distribution of 're78'. We can get all of the interventional moments as well, including functions of 're78'. We can leverage the full power of pandas, like
[ ]:
do_df['re78'].describe()
[ ]:
lalonde['re78'].describe()
and even plot aggregations, like
[ ]:
%matplotlib inline
[ ]:
import seaborn as sns
sns.barplot(data=lalonde, x='treat', y='re78')
[ ]:
sns.barplot(data=do_df, x='treat', y='re78')
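The remark above about functions of 're78' can be made concrete with a small sketch: derive new columns from the outcome and average them by treatment group. The tiny frame below is a hypothetical stand-in for do_df, and the derived columns employed and log_re78 are illustrative choices only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an interventional sample such as do_df.
sample = pd.DataFrame({
    "treat": [0, 0, 0, 1, 1, 1],
    "re78": [0.0, 2000.0, 4000.0, 0.0, 6000.0, 9000.0],
})

# Functions of the outcome: an indicator for positive earnings and log(1 + earnings).
sample["employed"] = (sample["re78"] > 0).astype(float)
sample["log_re78"] = np.log1p(sample["re78"])

# Group means of each column; on a real do sample these would estimate
# interventional expectations of the corresponding functions.
moments = sample.groupby("treat")[["re78", "employed", "log_re78"]].mean()
```

Each column of moments is the group-wise mean of a function of the outcome, so the same pattern extends to any transformation you can write as a pandas expression.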
Specifying Interventions
You can also find the distribution of the outcome under an intervention that sets the treatment to a specific value.
[ ]:
do_df = lalonde.causal.do(x={'treat': 1},
                          outcome='re78',
                          common_causes=['nodegr', 'black', 'hisp', 'age', 'educ', 'married'],
                          variable_types={'age': 'c', 'educ': 'c', 'black': 'd', 'hisp': 'd',
                                          'married': 'd', 'nodegr': 'd', 're78': 'c', 'treat': 'b'})
[ ]:
do_df.head()
This new dataframe gives the distribution of 're78' when 'treat' is set to 1.
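One natural use of explicit interventions is to contrast two of them. The sketch below assumes a callable that returns an interventional sample for a given treatment value and takes the difference in mean outcome between the two samples. The helper contrast_interventions and the stand-in fake_sampler are hypothetical; neither is part of dowhy.

```python
import pandas as pd


def contrast_interventions(sample_under, outcome="re78", treated=1, control=0):
    """Difference in mean outcome between two interventional samples.

    `sample_under` is a hypothetical callable: given a treatment value, it
    returns a dataframe sampled under that intervention.
    """
    return sample_under(treated)[outcome].mean() - sample_under(control)[outcome].mean()


def fake_sampler(value):
    """Stand-in sampler with made-up numbers: treatment adds 1500 to earnings."""
    base = pd.Series([1000.0, 3000.0, 5000.0])
    return pd.DataFrame({"treat": value, "re78": base + 1500.0 * value})


effect = contrast_interventions(fake_sampler)
```

In this notebook, sample_under could plausibly be a small function that calls lalonde.causal.do with x={'treat': value} and the same outcome, common_causes, and variable_types arguments used above.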
For much more detail on how the do method works, check the docstring:
[ ]:
help(lalonde.causal.do)