dowhy.gcm.independence_test package
Submodules
dowhy.gcm.independence_test.kernel module
Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.
- dowhy.gcm.independence_test.kernel.approx_kernel_based(X: numpy.ndarray, Y: numpy.ndarray, Z: typing.Optional[numpy.ndarray] = None, num_random_features_X: int = 50, num_random_features_Y: int = 50, num_random_features_Z: int = 50, num_permutations: int = 100, approx_kernel: typing.Callable[[numpy.ndarray], numpy.ndarray] = <function approximate_rbf_kernel_features>, scale_data: bool = False, use_bootstrap: bool = True, bootstrap_num_runs: int = 10, bootstrap_num_samples: int = 1000, bootstrap_n_jobs: typing.Optional[int] = None, p_value_adjust_func: typing.Callable[[typing.Union[numpy.ndarray, typing.List[float]]], float] = <function quantile_based_fwer>) -> float
Implementation of the Randomized Conditional Independence Test. The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed.
If Z is given, RCIT is used as the conditional independence test. If Z is not given, RIT is used as the pairwise independence test.
Note:
- The data can be multivariate, i.e. the given input matrices can have multiple columns.
- Categorical data need to be represented as strings.
- It is possible to apply a different kernel to each column in the matrices. For instance, an RBF kernel for the first dimension in X and a delta kernel for the second.
Based on the work:
- Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference 7.1 (2019).
- Parameters
X – Data matrix for observations from X.
Y – Data matrix for observations from Y.
Z – Optional data matrix for observations from Z. This is the conditional variable.
num_random_features_X – Number of features sampled from the approximated kernel map for X.
num_random_features_Y – Number of features sampled from the approximated kernel map for Y.
num_random_features_Z – Number of features sampled from the approximated kernel map for Z.
num_permutations – Number of permutations for estimating the test statistic.
approx_kernel – The approximated kernel map. The expected input is an n x d numpy array and the output is expected to be an n x k numpy array with k << d. By default, the Nystroem method with an RBF kernel is used.
scale_data – If set to True, the data will be standardized. If set to False, the data is taken as it is. Standardizing the data helps in identifying weak dependencies. If one is only interested in stronger ones, consider setting this to False.
use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.
bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).
bootstrap_num_samples – Maximum number of samples used per bootstrap run.
bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.
p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.
- Returns
The p-value for the null hypothesis that X and Y are independent (given Z).
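How use_bootstrap and p_value_adjust_func interact can be illustrated with a small numpy sketch. This is not the library's code: the names twice_median and bootstrap_p_value are hypothetical, sampling without replacement is an assumption, and twice the median is only one possible way of combining several p-values into one.

```python
import numpy as np


def twice_median(p_values):
    """Hypothetical adjust function: many p-values in, one p-value out."""
    p = np.asarray(p_values, dtype=float)
    return float(min(1.0, 2.0 * np.median(p)))


def bootstrap_p_value(test_fn, X, Y, num_runs=10, num_samples=1000,
                      adjust=twice_median, seed=0):
    """Run test_fn on random subsets of the rows and combine the p-values."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = min(n, num_samples)  # never draw more rows than are available
    p_values = []
    for _ in range(num_runs):
        # Assumption: each run uses a subset drawn without replacement.
        idx = rng.choice(n, size=size, replace=False)
        p_values.append(test_fn(X[idx], Y[idx]))
    return adjust(p_values)
```

Any callable with the same shape as twice_median (array of p-values in, single float out) fits the description of p_value_adjust_func given above.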
- dowhy.gcm.independence_test.kernel.kernel_based(X: numpy.ndarray, Y: numpy.ndarray, Z: typing.Optional[numpy.ndarray] = None, kernel: typing.Callable[[numpy.ndarray], numpy.ndarray] = <function apply_rbf_kernel_with_adaptive_precision>, scale_data: bool = True, use_bootstrap: bool = True, bootstrap_num_runs: int = 20, bootstrap_num_samples_per_run: int = 2000, bootstrap_n_jobs: typing.Optional[int] = None, p_value_adjust_func: typing.Callable[[typing.Union[numpy.ndarray, typing.List[float]]], float] = <function quantile_based_fwer>) -> float
Prepares the data and runs a kernel-based (conditional) independence test. The independence test estimates a p-value for the null hypothesis that X and Y are independent (given Z). Depending on whether Z is given, a conditional or pairwise independence test is performed.
If Z is given, KCI is used as the conditional independence test. If Z is not given, HSIC is used as the pairwise independence test.
Note:
- The data can be multivariate, i.e. the given input matrices can have multiple columns.
- Categorical data need to be represented as strings.
- It is possible to apply a different kernel to each column in the matrices. For instance, an RBF kernel for the first dimension in X and a delta kernel for the second.
Based on the work:
- Conditional: K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.
- Pairwise: A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.
- Parameters
X – Data matrix for observations from X.
Y – Data matrix for observations from Y.
Z – Optional data matrix for observations from Z. This is the conditional variable.
kernel – A kernel for estimating the pairwise similarities between samples. The expected input is an n x d numpy array and the output is expected to be an n x n numpy array. By default, the RBF kernel is used.
scale_data – If set to True, the data will be standardized. If set to False, the data is taken as it is. Standardizing the data helps in identifying weak dependencies. If one is only interested in stronger ones, consider setting this to False.
use_bootstrap – If True, the independence tests are performed on multiple subsets of the data and the final p-value is constructed based on the provided p_value_adjust_func function.
bootstrap_num_runs – Number of bootstrap runs (only relevant if use_bootstrap is True).
bootstrap_num_samples_per_run – Number of samples used in a bootstrap run (only relevant if use_bootstrap is True).
bootstrap_n_jobs – Number of parallel jobs for the bootstrap runs.
p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.
- Returns
The p-value for the null hypothesis that X and Y are independent (given Z).
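For intuition about the pairwise case, here is a minimal numpy sketch of an HSIC test. It is not dowhy's implementation: the helper names (rbf_gram, hsic_statistic, hsic_permutation_test), the median-heuristic bandwidth, and the use of permutations to obtain the p-value are all assumptions made for illustration.

```python
import numpy as np


def rbf_gram(A):
    """n x n RBF Gram matrix; the median-heuristic bandwidth is an assumption."""
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    med = np.median(sq[sq > 0]) if np.any(sq > 0) else 1.0
    return np.exp(-sq / med)


def hsic_statistic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2 with centering matrix H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n**2


def hsic_permutation_test(X, Y, num_permutations=200, seed=0):
    """p-value for H0 'X and Y are independent' from a permutation null."""
    rng = np.random.default_rng(seed)
    K, L = rbf_gram(X), rbf_gram(Y)
    observed = hsic_statistic(K, L)
    count = 0
    for _ in range(num_permutations):
        # Shuffling the samples of Y breaks any dependence on X.
        perm = rng.permutation(L.shape[0])
        if hsic_statistic(K, L[np.ix_(perm, perm)]) >= observed:
            count += 1
    # The +1 terms keep the p-value strictly positive.
    return (count + 1) / (num_permutations + 1)
```

A small p-value is evidence against independence. Because the sketch builds full n x n Gram matrices for every permutation, it is only practical for a few hundred samples.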
dowhy.gcm.independence_test.kernel_operation module
Functions in this module should be considered experimental, meaning there might be breaking API changes in the future.
- dowhy.gcm.independence_test.kernel_operation.apply_delta_kernel(X: ndarray) -> ndarray
Applies the delta kernel, i.e. the kernel value is 1 if two entries are equal and 0 otherwise.
- Parameters
X – Input data.
- Returns
The outcome of the delta kernel, a binary matrix with one row and one column per sample.
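The delta kernel can be written in a few lines of numpy. The sketch below assumes a single categorical column; delta_kernel_sketch is a hypothetical name and the code is not taken from the library.

```python
import numpy as np


def delta_kernel_sketch(x):
    """Return an n x n matrix with 1.0 where two entries are equal, else 0.0."""
    x = np.asarray(x).reshape(-1)  # assumption: one categorical column
    return (x[:, None] == x[None, :]).astype(float)
```

For the input ["a", "b", "a"], the first and third rows match each other, so the result has ones on the diagonal and at positions (0, 2) and (2, 0).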
- dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel(X: ndarray, precision: Optional[float] = None) -> ndarray
Estimates the RBF (Gaussian) kernel for the given input data.
- Parameters
X – Input data.
precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.
- Returns
The outcome of applying an RBF (Gaussian) kernel on the data.
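As a rough illustration of an RBF kernel with a precision parameter, consider the following numpy sketch. The parameterization exp(-precision * squared distance) and the median-based default are assumptions; the library may scale or infer the precision differently, and rbf_kernel_sketch is a hypothetical name.

```python
import numpy as np


def rbf_kernel_sketch(X, precision=None):
    """n x n Gaussian kernel matrix for the rows of X."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a flat array as a single column
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if precision is None:
        # Assumption: median heuristic over the non-zero squared distances.
        positive = sq_dists[sq_dists > 0]
        precision = 1.0 / np.median(positive) if positive.size else 1.0
    return np.exp(-precision * sq_dists)
```

The result is symmetric with ones on the diagonal, and entries decay towards 0 as the samples move further apart.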
- dowhy.gcm.independence_test.kernel_operation.apply_rbf_kernel_with_adaptive_precision(X: ndarray) -> ndarray
Estimates the RBF (Gaussian) kernel for the given input data. Here, each column is scaled by an individual precision parameter which is automatically inferred from the data.
- Parameters
X – Input data.
- Returns
The outcome of applying an RBF (Gaussian) kernel on the data.
- dowhy.gcm.independence_test.kernel_operation.approximate_delta_kernel_features(X: ndarray, num_random_components: int) -> ndarray
Applies the Nystroem method to create an NxD (D << N) approximated delta kernel map using a subset of the data, where N is the number of samples in X and D the number of components. The delta kernel gives 1 if two entries are equal and 0 otherwise.
- Parameters
X – Input data.
num_random_components – Number of components D for the approximated kernel map.
- Returns
An NxD approximated delta kernel map, where N is the number of samples in X and D the number of components.
- dowhy.gcm.independence_test.kernel_operation.approximate_rbf_kernel_features(X: ndarray, num_random_components: int, precision: Optional[float] = None) -> ndarray
Applies the Nystroem method to create an NxD (D << N) approximated RBF kernel map using a subset of the data, where N is the number of samples in X and D the number of components.
- Parameters
X – Input data.
num_random_components – Number of components D for the approximated kernel map.
precision – Specific precision matrix for the RBF kernel. If None is given, this is inferred from the data.
- Returns
An NxD approximated RBF kernel map, where N is the number of samples in X and D the number of components.
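The Nystroem idea behind these feature maps can be sketched directly in numpy. This is an illustration under stated assumptions, not the library's code: rbf and nystroem_features are hypothetical names, landmarks are drawn uniformly without replacement, and the kernel uses exp(-precision * squared distance).

```python
import numpy as np


def rbf(A, B, precision):
    """RBF kernel values between the rows of A and the rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-precision * sq)


def nystroem_features(X, num_components, precision=1.0, seed=0):
    """Map N x d data to N x D features whose inner products approximate the kernel."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    d = min(num_components, n)
    # Pick D landmark samples (a random subset of the data).
    landmarks = X[rng.choice(n, size=d, replace=False)]
    K_mm = rbf(landmarks, landmarks, precision)  # D x D
    K_nm = rbf(X, landmarks, precision)          # N x D
    # Pseudo-inverse square root of K_mm via its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(K_mm)
    scale = np.zeros_like(eigvals)
    keep = eigvals > 1e-10  # ignore numerically zero directions
    scale[keep] = 1.0 / np.sqrt(eigvals[keep])
    return K_nm @ (eigvecs * scale)
```

With features Phi, the product Phi @ Phi.T approximates the full N x N kernel matrix while only N x D numbers need to be stored. When every sample is used as a landmark, the approximation becomes exact up to numerical error.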
dowhy.gcm.independence_test.regression module
Regression-based (conditional) independence test. Testing independence via regression, i.e. if a variable has information about another variable, then they are dependent.
- dowhy.gcm.independence_test.regression.regression_based(X: numpy.ndarray, Y: numpy.ndarray, Z: typing.Optional[numpy.ndarray] = None, num_components_all_inputs: int = 40, num_runs: int = 20, p_value_adjust_func: typing.Callable[[typing.Union[numpy.ndarray, typing.List[float]]], float] = <function quantile_based_fwer>, f_test_samples_ratio: typing.Optional[float] = 0.3, max_samples_per_run: int = 10000) -> float
The main idea is that if X and Y are dependent, then X should help in predicting Y. If there is no dependency, then X should not help. When Z is given, the idea remains the same, but here X and Y are conditionally independent given Z if X does not help in predicting Y when knowing Z. That is, X has no additional information about Y given Z. In the pairwise case (Z is not given), the performance (in terms of squared error) of predicting Y based on X is compared with that of predicting Y by returning its mean (the best estimator without any inputs). Note that categorical inputs are transformed via the sklearn one-hot-encoder.
Here, we use the sklearn.kernel_approximation.Nystroem approach to approximate a kernel map of the inputs that serves as new input features. These new features allow modeling complex non-linear relationships. In case of categorical data, we first apply a one-hot encoding and then map it into the kernel feature space. Afterwards, we use linear regression as a prediction model based on the non-linear input features. The idea is then the same as in Granger causality, where we apply an F-test to see if the additional input features significantly help in predicting the target or not.
Note:
- As compared to kernel_based(), this method is quite fast and provides reasonably good results. However, there are types of dependencies that this test cannot detect. For instance, if X determines the variance of Y, then this cannot be captured. For these more complex dependencies, consider using the kernel_based() independence test instead.
- This test is motivated by Granger causality, the approx_kernel_based test and the following paper: K. Chalupka, P. Perona, F. Eberhardt. Fast Conditional Independence Test for Vector Variables with Large Sample Sizes. arXiv:1804.02747, 2018.
- Parameters
X – Input data for X.
Y – Input data for Y.
Z – Input data for Z. The set of variables to (optionally) condition on.
num_components_all_inputs – Number of kernel features when combining X and Z. If Z is not given, it will be replaced with an empty array. If Z is given, half of the number is used to generate features for Z.
num_runs – Number of runs. This equals the number of estimated p-values, which get adjusted by the p_value_adjust_func.
p_value_adjust_func – A callable that expects a numpy array of multiple p-values and returns one p-value. This is typically a family-wise error rate control method.
f_test_samples_ratio – Ratio for splitting the data into test and training data sets for calculating the f-statistic. A ratio of 0.3 means that 30% of the samples are used for the f-test (test samples) and 70% are used for training the prediction model (training samples). If set to None, training and test data set are the same, which could help in settings where only a few samples are available.
max_samples_per_run – Maximum number of samples used per run.
- Returns
The p-value for the null hypothesis that X and Y are independent given Z. If Z is not given, then for the hypothesis that X and Y are independent.
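The comparison at the heart of this test can be illustrated with the F-statistic for two nested linear models. The sketch below is a simplification and not the library's code: it works on the raw inputs instead of Nystroem features, fits and evaluates on the same samples (comparable to f_test_samples_ratio=None), and stops at the statistic rather than converting it into a p-value. The names residual_sum_of_squares and nested_f_statistic are hypothetical.

```python
import numpy as np


def residual_sum_of_squares(features, y):
    """Least-squares fit with intercept; returns (RSS, number of coefficients)."""
    design = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2)), design.shape[1]


def nested_f_statistic(X, y, Z=None):
    """F-statistic for 'does adding X to Z reduce the squared error for y?'."""
    n = len(y)
    # Without Z, the restricted model only predicts the mean of y.
    restricted = Z if Z is not None else np.empty((n, 0))
    full = np.column_stack([restricted, X])
    rss_r, p_r = residual_sum_of_squares(restricted, y)
    rss_f, p_f = residual_sum_of_squares(full, y)
    # Error reduction per added coefficient, relative to the remaining noise.
    return ((rss_r - rss_f) / (p_f - p_r)) / (rss_f / (n - p_f))
```

A large value means X adds predictive power beyond Z, which speaks against (conditional) independence. Under the usual linear-model assumptions, a p-value would follow from the F distribution with (p_f - p_r, n - p_f) degrees of freedom.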
Module contents
- dowhy.gcm.independence_test.independence_test(X, Y, conditioned_on=None, method='kernel')
Performs a (conditional) independence test. Three methods are currently supported:
- kernel: Kernel-based (conditional) independence test.
K. Zhang, J. Peters, D. Janzing, B. Schölkopf. Kernel-based Conditional Independence Test and Application in Causal Discovery. UAI’11, Pages 804–813, 2011.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, A. Smola. A Kernel Statistical Test of Independence. NIPS 21, 2007.
- approx_kernel: Approximate kernel-based (conditional) independence test.
E. V. Strobl, K. Zhang, S. Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 2019.
- regression: Regression-based (conditional) independence test using an F-test. See regression_based() for more details.
- Parameters
X – Observations of X.
Y – Observations of Y.
conditioned_on – Observations of the conditioning variable if a conditional independence test should be performed. By default (None), an unconditional independence test is carried out.
method – Method for the (conditional) independence test. The choices are:
- kernel (default): kernel_based() (conditional) independence test.
- approx_kernel: approx_kernel_based() (conditional) independence test.
- regression: regression_based() (conditional) independence test.
For more information about these methods, see above.
- Returns
p-value of the (conditional) independence test. (Conditional) Independence is the null hypothesis.