Basic Example for generating samples from a GCM
A graphical causal model (GCM) describes the data generation process of the modeled variables. Therefore, after we fit a GCM, we can also generate completely new samples from it and thus treat it as a generator of synthetic data based on the underlying models. Generating new samples can generally be done by sorting the nodes in topological order, randomly sampling from the root nodes, and then propagating the data through the graph by evaluating the downstream causal mechanisms with randomly sampled noise. The dowhy.gcm package provides a simple helper function that does this automatically and thereby offers a simple API to draw samples from a GCM.
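The sampling procedure described above can be sketched by hand. The following is a minimal illustration, not the dowhy.gcm implementation: the `mechanisms` dictionary and the `draw_samples_by_hand` function are hypothetical names, and the linear mechanisms for the chain X→Y→Z are written out manually instead of being fitted.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, hand-written mechanisms for the chain X -> Y -> Z.
# Each one receives the already-sampled parent values and the sample count.
graph = nx.DiGraph([("X", "Y"), ("Y", "Z")])
mechanisms = {
    "X": lambda parents, n: rng.normal(0, 1, n),                     # root node: sample directly
    "Y": lambda parents, n: 2 * parents["X"] + rng.normal(0, 1, n),  # function of parents plus noise
    "Z": lambda parents, n: 3 * parents["Y"] + rng.normal(0, 1, n),
}

def draw_samples_by_hand(graph, mechanisms, num_samples):
    """Sample every node in topological order so parents are always available."""
    samples = {}
    for node in nx.topological_sort(graph):
        parents = {p: samples[p] for p in graph.predecessors(node)}
        samples[node] = mechanisms[node](parents, num_samples)
    return samples

samples = draw_samples_by_hand(graph, mechanisms, 1000)
print({name: values.shape for name, values in samples.items()})
```

Visiting the nodes in topological order guarantees that the values of all parents exist before a child's mechanism is evaluated.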
Let's take a look at the following example:
[1]:
import numpy as np, pandas as pd
X = np.random.normal(loc=0, scale=1, size=1000)
Y = 2 * X + np.random.normal(loc=0, scale=1, size=1000)
Z = 3 * Y + np.random.normal(loc=0, scale=1, size=1000)
data = pd.DataFrame(data=dict(X=X, Y=Y, Z=Z))
data.head()
[1]:
 | X | Y | Z |
---|---|---|---|
0 | 1.151194 | 3.228380 | 9.037347 |
1 | -1.225548 | -0.782950 | -2.265345 |
2 | 0.053397 | 0.745569 | 2.205155 |
3 | -0.670017 | -0.243348 | 0.001651 |
4 | 1.200729 | 2.767712 | 9.732209 |
As in the introduction, we generate data for the simple linear DAG X→Y→Z. Let's define the GCM and fit it to the data:
[2]:
import networkx as nx
import dowhy.gcm as gcm
causal_model = gcm.StructuralCausalModel(nx.DiGraph([('X', 'Y'), ('Y', 'Z')]))
gcm.auto.assign_causal_mechanisms(causal_model, data) # Automatically assigns additive noise models to non-root nodes
gcm.fit(causal_model, data)
Fitting causal mechanism of node Z: 100%|██████████| 3/3 [00:00<00:00, 375.32it/s]
We have now learned the generative models of the variables, based on the defined causal graph and the additive noise model assumption. To generate new samples from this model, we can simply call:
[3]:
generated_data = gcm.draw_samples(causal_model, num_samples=1000)
generated_data.head()
[3]:
 | X | Y | Z |
---|---|---|---|
0 | 0.251671 | 0.479329 | -0.372255 |
1 | 0.668491 | 1.293569 | 4.319180 |
2 | 1.559009 | 3.997305 | 14.070869 |
3 | 0.178214 | -0.034033 | 1.258398 |
4 | 1.266308 | 3.959506 | 10.893031 |
If our modeling assumptions are correct, the generated data should now resemble the observed data distribution, i.e., the generated samples correspond to the joint distribution we defined for our example data at the beginning. A quick way to check this is to estimate the KL-divergence between the observed and generated distributions:
[4]:
gcm.divergence.auto_estimate_kl_divergence(data.to_numpy(), generated_data.to_numpy())
Here, we expect the divergence to be (very) small.
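As a complementary sanity check, one can also compare simple summary statistics of the two data sets. The sketch below uses only NumPy and does not rely on dowhy.gcm; `sample_chain` is a hypothetical stand-in that draws both the "observed" and the "generated" array from the same X→Y→Z process, so in practice the two arrays would be `data.to_numpy()` and `generated_data.to_numpy()`.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_chain(n):
    """Hypothetical stand-in: one draw of n rows from the X -> Y -> Z process."""
    x = rng.normal(0, 1, n)
    y = 2 * x + rng.normal(0, 1, n)
    z = 3 * y + rng.normal(0, 1, n)
    return np.column_stack([x, y, z])

observed = sample_chain(5000)
generated = sample_chain(5000)

# Largest absolute difference between the column means of both data sets.
mean_gap = np.abs(observed.mean(axis=0) - generated.mean(axis=0)).max()
# Largest absolute difference between the pairwise correlation matrices.
corr_gap = np.abs(
    np.corrcoef(observed, rowvar=False) - np.corrcoef(generated, rowvar=False)
).max()

print(f"largest mean gap: {mean_gap:.3f}, largest correlation gap: {corr_gap:.3f}")
```

Small gaps are consistent with matching distributions, but matching means and correlations alone do not prove that the full joint distributions agree.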
Note: We cannot validate the correctness of a causal graph this way, since any graph from the same Markov equivalence class would be sufficient to generate data that is consistent with the observed data, but only one particular graph would generate the correct interventional and counterfactual distributions. That is, in the example above, X→Y→Z and X←Y←Z would generate the same observational distribution (since they encode the same conditional independencies), but only X→Y→Z would generate the correct interventional distribution (e.g., when intervening on Y).
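To make the note concrete, the following NumPy sketch simulates the intervention do(Y = 2) under both graphs. It does not use dowhy.gcm. The coefficients of the reversed model are not from the original example; they are derived here under the assumption of the linear Gaussian process used above (Var(Y) = 5, Var(Z) = 46, E[X|Y] = 0.4·Y with residual variance 0.2), so that both models share the same observational distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Graph A (data-generating graph of the example): X -> Y -> Z under do(Y = 2).
x_a = rng.normal(0, 1, n)               # X is upstream of Y and stays unaffected
y_a = np.full(n, 2.0)                   # Y is fixed by the intervention
z_a = 3 * y_a + rng.normal(0, 1, n)     # Z responds to the forced value of Y

# Graph B (reversed, same Markov equivalence class): X <- Y <- Z under do(Y = 2).
z_b = rng.normal(0, np.sqrt(46), n)     # Z is now a root node and keeps its marginal
y_b = np.full(n, 2.0)                   # Y is fixed by the intervention
x_b = 0.4 * y_b + rng.normal(0, np.sqrt(0.2), n)  # X now responds to Y instead

print(f"graph A: mean X = {x_a.mean():.2f}, mean Z = {z_a.mean():.2f}")
print(f"graph B: mean X = {x_b.mean():.2f}, mean Z = {z_b.mean():.2f}")
```

Under graph A the intervention shifts the mean of Z to roughly 6 and leaves X centered at 0, whereas under graph B it shifts the mean of X to roughly 0.8 and leaves Z centered at 0. Observational data alone cannot tell us which of these two predictions is correct.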