Estimating Confidence Intervals

Our trained generative causal models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. Confidence intervals help us quantify the uncertainty of our results introduced by these inaccuracies and approximations.

We therefore recommend to use confidence intervals as a default for all causal queries. This avoids a false sense of confidence in a specific value that might be due to a skewed data set or models that are not fitted optimally.

To compute confidence intervals, we use a method called bootstrapping and it’s implemented in the function dowhy.gcm.confidence_intervals().

How to use it

Let’s say we have a function that computes Pi using the Monte Carlo method with 1000 trials:

>>> import numpy as np
>>>
>>> def compute_pi_monte_carlo():
>>>     trials = 1000
>>>     return 4*(np.random.default_rng().uniform(-1, 1, (trials,))**2+
>>>               np.random.default_rng().uniform(-1, 1, (trials,))**2 <= 1).sum() / trials

Executing this function will give us a result:

>>> compute_pi_monte_carlo()
3.188

This looks like a reasonable number, but we can’t tell how reliable that result is, given that it was computed using a stochastic method. Now, let’s use dowhy.gcm.confidence_intervals() to compute the confidence interval:

>>> from dowhy import gcm
>>>
>>> median, intervals = gcm.confidence_intervals(compute_pi_monte_carlo,
>>>                                              num_bootstrap_resamples=1000)
>>> median, intervals
(array([3.136]), array([[3.0518, 3.224 ]]))

The first item in the result is the median of all results calculated. The second is the interval in which we are 95% confident that computed values will fall into. By increasing the number of trials we can improve the reliability of compute_pi_monte_carlo and hence, narrow the confidence interval. E.g. with trials = 10_000 we get an interval of [3.1132 , 3.16602], with trials = 100_000 we get [3.13256 , 3.150016].

Computing confidence intervals for causal queries

With this understanding, we can now apply confidence_intervals to one of our causal queries, e.g. distribution_change. Let’s generate some data first:

>>> import pandas as pd
>>> from scipy.stats import halfnorm
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = np.maximum(X, Y) + np.random.normal(loc=0, scale=1, size=1000)
>>> data_old = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
>>>
>>> X = halfnorm.rvs(size=1000, loc=0.5, scale=0.2)
>>> Y = halfnorm.rvs(size=1000, loc=1.0, scale=0.2)
>>> Z = X + Y + np.random.normal(loc=0, scale=1, size=1000)
>>> data_new = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))

Here, the causal mechanism for Z changes between the old and new data. As always, next, we’ll model cause-effect relationships as an SCM and assign causal mechanisms:

>>> import networkx as nx
>>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Z'), ('Y', 'Z')]))  # X -> Z <- Y
>>> gcm.auto.assign_causal_mechanisms(causal_model, data_old)

And now, instead of calling distribution_change directly we call it using confidence_intervals. However, since confidence_intervals takes a function that takes no parameters, we must first define a function without parameters:

>>> def f():
>>>     return gcm.distribution_change(causal_model, data_old, data_new, 'Z')

And now:

>>> gcm.confidence_intervals(f)
({'X': 0.0005804787010800414, 'Y': 0.0030143299612704804, 'Z': 0.1724206687905231},
{'X': array([-0.00427554,  0.00891714]),
'Y': array([-0.00605089,  0.01782684]),
'Z': array([0.15441771, 0.19062329])})

The result again contains a first item, which is the median values of the contribution scores. The second item contains the intervals of the contribution scores for each variable.

To avoid defining a new function, we can streamline that call by using a lambda:

>>> gcm.confidence_intervals(lambda: gcm.distribution_change(causal_model,
>>>                                                          data_old, data_new,
>>>                                                          target_node='Z'))

Conveniently bootstrapping graph training on random subsets of training data

Many of the causal queries in the GCM package require a trained causal graph as a first argument. To compute confidence intervals for these methods, we need to explicitly re-train our causal graph multiple times with different random subsets of data and also run our causal query with each newly trained graph. To do this conveniently, the GCM package provides a function fit_and_compute. Assuming that we have data and a causal graph:

>>> Z = np.random.normal(loc=0, scale=1, size=1000)
>>> X = 2*Z + np.random.normal(loc=0, scale=1, size=1000)
>>> Y = 3*X + 4*Z + np.random.normal(loc=0, scale=1, size=1000)
>>> data = pd.DataFrame(dict(X=X, Y=Y, Z=Z))
>>>
>>> causal_model = gcm.StructuralCausalModel(nx.DiGraph([('Z', 'Y'), ('Z', 'X'), ('X', 'Y')]))
>>> gcm.auto.assign_causal_mechanisms(causal_model, data_old)

we can now use fit_and_compute as follows:

>>> strength_median, strength_intervals = gcm.confidence_intervals(
>>>     gcm.fit_and_compute(gcm.arrow_strength,
>>>                         causal_model,
>>>                         bootstrap_training_data=data,
>>>                         target_node='Y'))
>>> strength_median, strength_intervals
({('X', 'Y'): 45.90886398636573, ('Z', 'Y'): 15.47129383737619},
{('X', 'Y'): array([42.88319632, 50.43890079]), ('Z', 'Y'): array([13.44202416, 17.74266107])})

Runtime cost versus confidence

In certain scenarios it is prohibitely expensive to re-train causal graphs multiple times. E.g. when using AutoGluon as prediction models for the additive noise model, a single fit execution can quickly go into hours. Bootstrapping with 20 resamples will then quickly go into days depending on how much we can parallelize.

For that reason, sometimes the tradeoff is to only bootstrap on the causal query, not on the training. To make this analogous to the approach we used above, there is the function bootstrap_sampling(). This function assumes that the causal graph is already trained. Then it can be used as follows:

>>> gcm.fit(causal_model, data)
>>>
>>> strength_median, strength_intervals = gcm.confidence_intervals(
>>>     gcm.bootstrap_sampling(gcm.arrow_strength,
>>>                            causal_model,
>>>                            target_node='Y'))
>>>
>>> strength_median, strength_intervals
({('X', 'Y'): 46.07299374572871, ('Z', 'Y'): 15.358850280972195},
{('X', 'Y'): array([44.95914495, 47.63918151]), ('Z', 'Y'): array([15.04323069, 15.72570547])})

dowhy.gcm.bootstrap_sampling() accomplishes the same thing as the lambda we’ve seen earlier. However, using bootstrap_sampling is more expressive. We therefore recommend its use over a lambda for all causal queries that take a causal graph as the first argument.

Understanding the need for confidence intervals

As explained in the beginnging, the models and specific sample set might not accurately represent the ground truth and there are typically numerical approximations in our algorithms. The error comes from three sources:

Variance of the “optimal” parameters for a model, i.e. running fit twice with the same/slightly different data can yield two different models (becomes even worse when there is a stochastic element in the fit process of the prediction models as well). For instance, training an AutoGluon model twice on even exactly the same data would return two different models with slightly different performance.
Approximations in our algorithms. For instance, when estimating distribution change with 6+ upstream nodes, we will only approximate the Shapley values (the approximation has a stochastic factor). Therefore, running distribution change twice with exactly the same input and generated samples will return two different results.
Even if we do not approximate the Shapley values, the algorithms are typically based on some specific samples from the generative models, i.e. variance can also come from the variance between two sets of drawn samples. Therefore, even if the algorithm itself is deterministic (e.g. evaluate all possible subsets for the Shapley values), we would use randomly generated samples which yields different results when calling an algorithm twice.