Non-linear Causal Inference Using Gaussianity Measures

Chapter in Cause Effect Pairs in Machine Learning

Abstract

We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by a reduction in the magnitude of the high-order cumulants and by an increase in the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method to discriminate between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference.

The vast majority of the work was done while at the Max Planck Institute for Intelligent Systems and at Cambridge University.
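
To make the idea in the abstract concrete, the following is a minimal one-dimensional sketch of the decision rule: fit a linear model in each direction and prefer as causal the direction whose residuals look less Gaussian. This is only an illustration, not the chapter's full procedure (which uses kernel feature expansions and several Gaussianity measures); the use of the absolute excess kurtosis as the non-Gaussianity score, and all function and variable names, are assumptions made for the example.

    # Minimal sketch of the residual-Gaussianity decision rule described in the
    # abstract.  Illustrative only: |excess kurtosis| is used as a crude
    # non-Gaussianity score; the chapter also considers entropy- and
    # energy-distance-based measures and a kernelized feature expansion.
    import numpy as np
    from scipy.stats import kurtosis

    def residuals_of_linear_fit(a, b):
        """Residuals of the least-squares fit of b on a (both standardized)."""
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        slope = np.dot(a, b) / np.dot(a, a)
        return b - slope * a

    def infer_direction(x, y):
        """Return 'x->y' if the x->y residuals look less Gaussian, else 'y->x'."""
        score_xy = abs(kurtosis(residuals_of_linear_fit(x, y)))  # candidate causal direction
        score_yx = abs(kurtosis(residuals_of_linear_fit(y, x)))  # reverse direction
        return "x->y" if score_xy > score_yx else "y->x"

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(-1.0, 1.0, size=5000)                  # non-Gaussian cause
        y = 0.8 * x + 0.5 * rng.uniform(-1.0, 1.0, size=5000)  # linear effect, non-Gaussian noise
        print(infer_direction(x, y))                           # expected to print 'x->y'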

Notes

  1. We assume that such a matrix exists and that it is positive definite.

  2. See https://www.codalab.org/competitions/1381 for more information.

  3. See http://webdav.tuebingen.mpg.de/cause-effect/ for more details.

References

  1. J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van Der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6 (1): 17–39, 1997.

  2. Z. Chen, K. Zhang, and L. Chan. Nonlinear causal discovery for high dimensional data: A kernelized trace method. In IEEE 13th International Conference on Data Mining, pages 1003–1008, 2013.

  3. Z. Chen, K. Zhang, L. Chan, and B. Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26 (7): 1484–1517, 2014.

  4. E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 5 (4): 307–320, 1938.

  5. H. Cramér. Mathematical methods of statistics. PhD thesis, 1946.

  6. D. Entner and P. O. Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91, 2012.

  7. A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, 2008.

  8. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13: 723–773, 2012.

  9. J. M. Hernández-Lobato, P. Morales-Mombiela, and A. Suárez. Gaussianity measures for detecting the direction of causal time series. In International Joint Conference on Artificial Intelligence, pages 1318–1323, 2011.

  10. P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, pages 689–696, 2009.

  11. A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems 10, pages 273–279. MIT Press, 1998.

  12. A. Hyvärinen and S. M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14 (1): 111–152, 2013.

  13. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2004.

  14. D. Janzing, P. O. Hoyer, and B. Schölkopf. Telling cause from effect based on high-dimensional observations. In International Conference on Machine Learning, pages 479–486, 2010.

  15. D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182–183: 1–31, 2012.

  16. Y. Kawahara, K. Bollen, S. Shimizu, and T. Washio. GroupLiNGAM: Linear non-Gaussian acyclic models for sets of variables. 2012. arXiv:1006.5041.

  17. S. Kpotufe, E. Sgouritsa, D. Janzing, and B. Schölkopf. Consistency of causal inference under the additive noise model. In International Conference on Machine Learning, pages 478–486, 2014.

  18. A. J. Laub. Matrix Analysis For Scientists And Engineers. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004. ISBN 0898715768.

  19. J. T. Marcinkiewicz. Sur une propriété de la loi de Gauss. Mathematische Zeitschrift, 44: 612–618, 1938.

  20. P. McCullagh. Tensor methods in statistics. Chapman and Hall, 1987.

  21. J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems 23, pages 1687–1695, 2010.

  22. P. Morales-Mombiela, D. Hernández-Lobato, and A. Suárez. Statistical tests for the detection of the arrow of time in vector autoregressive models. In International Joint Conference on Artificial Intelligence, 2013.

  23. K. Murphy. Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.

  24. J. K. Patel and C. B. Read. Handbook of the normal distribution, volume 150. CRC Press, 1996.

  25. J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8.

  26. B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002. ISBN 0262194759.

  27. B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.

  28. S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7: 2003–2030, 2006.

  29. H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23 (3–4): 301–321, 2003.

  30. L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30: 98–111, 2013.

  31. G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93 (1): 58–80, 2005.

  32. G. J. Székely and M. L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143 (8): 1249–1272, 2013.

  33. K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In International Conference on Uncertainty in Artificial Intelligence, pages 647–655, 2009.

Acknowledgements

Daniel Hernández-Lobato and Alberto Suárez gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. These authors also acknowledge financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. David Lopez-Paz acknowledges support from Fundación la Caixa.

Corresponding author

Correspondence to Daniel Hernández-Lobato.

Appendices

Appendix 1

In this appendix we show that, if \(\mathscr {X}\) and \(\mathscr {Y}\) follow the same distribution and have been centered, the determinant of the covariance matrix of the random variable corresponding to 𝜖 i, denoted by Cov(𝜖 i), coincides with the determinant of the covariance matrix corresponding to the random variable \(\tilde {\boldsymbol {\epsilon }}_i\), denoted by \(\text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)\).

From the causal model, i.e., y i = Ax i + 𝜖 i, we have that:

$$\displaystyle \begin{aligned} \text{Cov}(\mathscr{Y}) &= \mathbf{A} \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}} + \text{Cov}(\boldsymbol{\epsilon}_i)\,. \end{aligned} $$
(8.43)

Since \(\mathscr {X}\) and \(\mathscr {Y}\) follow the same distribution we have that \(\text{Cov}(\mathscr {Y})=\text{Cov}(\mathscr {X})\). Furthermore, we know from the causal model that \(\mathbf {A}=\text{Cov}(\mathscr {Y},\mathscr {X})\text{Cov}(\mathscr {X})^{-1}\). Then,

$$\displaystyle \begin{aligned} \text{Cov}(\boldsymbol{\epsilon}_i) & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{Y},\mathscr{X}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{X},\mathscr{Y}) \,. \end{aligned} $$
(8.44)

In the case of \(\tilde {\boldsymbol {\epsilon }}_i\) we know that the relation \(\tilde {\boldsymbol {\epsilon }}_i = (\mathbf {I}-\tilde {\mathbf {A}} \mathbf {A}){\mathbf {x}}_i - \tilde {\mathbf {A}} \boldsymbol {\epsilon }_i\) must be satisfied, where \(\tilde {\mathbf {A}}=\text{Cov}(\mathscr {X},\mathscr {Y})\text{Cov}(\mathscr {Y})^{-1}= \text{Cov}(\mathscr {X},\mathscr {Y})\text{Cov}(\mathscr {X})^{-1}\). Thus, we have that:

$$\displaystyle \begin{aligned} \text{Cov}(\tilde{\boldsymbol{\epsilon}}_i) &= (\mathbf{I}-\tilde{\mathbf{A}} \mathbf{A}) \text{Cov}(\mathscr{X}) (\mathbf{I}- {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}}) + \tilde{\mathbf{A}} \text{Cov}(\boldsymbol{\epsilon}_i) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.45)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}}\mathbf{A} \text{Cov}(\mathscr{X}) + \tilde{\mathbf{A}}\mathbf{A}\text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.46)
$$\displaystyle \begin{aligned} & \quad + \tilde{\mathbf{A}} \text{Cov}(\mathscr{X}) \tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}} \text{Cov}(\mathscr{Y},\mathscr{X}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{X},\mathscr{Y}) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.47)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}}\mathbf{A} \text{Cov}(\mathscr{X}) + \tilde{\mathbf{A}} \text{Cov}(\mathscr{X}) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.48)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.49)
$$\displaystyle \begin{aligned} & \quad - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.50)
$$\displaystyle \begin{aligned} & \quad + \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.51)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \,. \end{aligned} $$
(8.52)

Both (8.44) and (8.52) are Schur complements of the joint covariance matrix of \((\mathscr {X}, \mathscr {Y})\), whose determinant factorizes, by the determinant formula for partitioned matrices, either as \(\text{det}\, \text{Cov}(\mathscr {X}) \cdot \text{det}\, \text{Cov}(\boldsymbol {\epsilon }_i)\) or as \(\text{det}\, \text{Cov}(\mathscr {Y}) \cdot \text{det}\, \text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)\). Since \(\text{det}\, \text{Cov}(\mathscr {X}) = \text{det}\, \text{Cov}(\mathscr {Y})\), we have that \(\text{det}\, \text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)= \text{det}\, \text{Cov}(\boldsymbol {\epsilon }_i)\). See [23, p. 117] for further details.
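
As a quick numerical sanity check of this identity, the short sketch below (synthetic matrices, illustrative names, not part of the chapter) evaluates the right-hand sides of (8.44) and (8.52) for a random symmetric positive-definite Cov(X), taken equal to Cov(Y), and an arbitrary cross-covariance, and verifies that the two determinants coincide.

    # Numerical check that the matrices in (8.44) and (8.52) have the same
    # determinant when Cov(X) = Cov(Y).  Purely algebraic check on synthetic
    # matrices; the cross-covariance need not come from an actual causal model.
    import numpy as np

    rng = np.random.default_rng(1)
    d = 4
    S = rng.standard_normal((d, d))
    cov_x = S @ S.T + d * np.eye(d)        # Cov(X) = Cov(Y), symmetric positive definite
    cov_yx = rng.standard_normal((d, d))   # Cov(Y, X), arbitrary
    cov_xy = cov_yx.T                      # Cov(X, Y)
    inv_cov_x = np.linalg.inv(cov_x)

    cov_eps = cov_x - cov_yx @ inv_cov_x @ cov_xy        # Eq. (8.44), causal residuals
    cov_eps_tilde = cov_x - cov_xy @ inv_cov_x @ cov_yx  # Eq. (8.52), anti-causal residuals

    print(np.linalg.det(cov_eps), np.linalg.det(cov_eps_tilde))  # equal up to rounding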

Appendix 2

In this appendix we argue that, if the distribution of the residuals is not Gaussian but close to Gaussian, one should also expect more Gaussian residuals in the anti-causal direction in terms of the energy distance described in Sect. 8.3.4. For simplicity, we consider the univariate case. We use the fact that, in one dimension, the energy distance is the squared distance between the cumulative distribution function of the residuals and that of a Gaussian distribution [32]. Thus,

$$\displaystyle \begin{aligned} \tilde{D}^2 &= \int_{-\infty}^\infty \left[\tilde{F}(x) - \varPhi(x)\right]^2 d x\,, & D^2 &= \int_{-\infty}^\infty \left[F(x) - \varPhi(x)\right]^2 d x\,, \end{aligned} $$
(8.53)

where \(\tilde {D}^2\) and D 2 are the energy distances to the Gaussian distribution in the anti-causal and the causal direction respectively; \(\tilde {F}(x)\) and F(x) are the c.d.f. of the residuals in the anti-causal and the causal direction, respectively; and finally, Φ(x) is the c.d.f. of a standard Gaussian.
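
In practice, the integrals in (8.53) can be approximated from a sample of residuals by plugging in their empirical c.d.f.; a possible sketch is given below (the function name, the standardization step and the quadrature limits are illustrative choices, not taken from the chapter).

    # Empirical version of the squared c.d.f. distance in Eq. (8.53): plug the
    # empirical c.d.f. of the standardized residuals into the integral and
    # evaluate it by numerical quadrature.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def squared_cdf_distance_to_gaussian(residuals):
        """Approximate the integral of [F_hat(x) - Phi(x)]^2 over the real line."""
        r = (residuals - residuals.mean()) / residuals.std()
        r_sorted = np.sort(r)

        def integrand(x):
            f_hat = np.searchsorted(r_sorted, x, side="right") / r.size  # empirical c.d.f.
            return (f_hat - norm.cdf(x)) ** 2

        value, _ = quad(integrand, -10.0, 10.0, limit=200)  # tails beyond +-10 are negligible
        return value

    rng = np.random.default_rng(2)
    print(squared_cdf_distance_to_gaussian(rng.standard_normal(2000)))  # close to zero
    print(squared_cdf_distance_to_gaussian(rng.uniform(-1, 1, 2000)))   # clearly larger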

One should expect that \(\tilde {D}^2 \leq D^2\). To motivate this, we use the Gram-Charlier series to compute an expansion of \(\tilde {F}(x)\) and F(x) around the standard Gaussian distribution [24]. Such an expansion only converges for distributions that are close to Gaussian (see Sect. 17.6.6a of [5] for further details). Namely,

$$\displaystyle \begin{aligned} \tilde{F}(x) & = \varPhi(x) - \phi(x) \left( \frac{\tilde{a}_3}{3!} H_2(x) + \frac{\tilde{a}_4}{4!} H_3(x) + \cdots \right)\,, \end{aligned} $$
(8.54)
$$\displaystyle \begin{aligned} F(x) & = \varPhi(x) - \phi(x) \left( \frac{a_3}{3!} H_2(x) + \frac{a_4}{4!} H_3(x) + \cdots \right)\,, \end{aligned} $$
(8.55)

where ϕ(x) is the p.d.f. of a standard Gaussian, H n(x) are Hermite polynomials and \(\tilde {a}_n\) and a n are coefficients that depend on the cumulants, e.g., a 3 = κ 3, a 4 = κ 4, \(\tilde {a}_3=\tilde {\kappa }_3\), \(\tilde {a}_4=\tilde {\kappa }_4\). Note, however, that coefficients a n and \(\tilde {a}_n\) for n > 5 depend on combinations of the cumulants. Using such an expansion we find:

$$\displaystyle \begin{aligned} \tilde{D}^2 & = \int_{-\infty}^\infty \phi(x)^2 \left( \frac{\tilde{a}_3}{3!} H_2(x) + \frac{\tilde{a}_4}{4!} H_3(x) + \cdots \right)^2 d x \end{aligned} $$
(8.56)
$$\displaystyle \begin{aligned} & \approx \frac{\tilde{\kappa}_3^2}{36} \mathbb{E}[H_2(x)^2\phi(x)] + \frac{\tilde{\kappa}_4^2}{576} \mathbb{E}[H_3(x)^2\phi(x)] \,, \end{aligned} $$
(8.57)

where \(\mathbb {E}[\cdot ]\) denotes expectation with respect to a standard Gaussian and we have truncated the Gram-Charlier expansion after n = 4. Truncating the expansion after n = 4 is standard practice in the ICA literature for approximating the entropy; see, for example, Sect. 5.5.1 of [13]. We have also used the fact that \(\mathbb {E}[H_3(x)H_2(x)\phi (x)] = 0\). The same approach can be followed for D 2, the energy distance in the causal direction, which yields \(D^2\approx \kappa _3^2/36 \cdot \mathbb {E}[H_2(x)^2\phi (x)] + \kappa _4^2/576 \cdot \mathbb {E}[H_3(x)^2\phi (x)]\). Finally, the fact that one should expect \(\tilde {D}^2\leq D^2\) follows from noting that \(\tilde {\kappa }_n = c_n \kappa _n\), where c n is a constant that lies in the interval (−1, 1), as indicated in Sect. 8.2.1. We expect this result to extend to the multivariate case.
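
The argument can be illustrated numerically with the short sketch below: the constants \(\mathbb {E}[H_2(x)^2\phi (x)]\) and \(\mathbb {E}[H_3(x)^2\phi (x)]\) are computed by quadrature, and shrinking the cumulants by a common factor c ∈ (−1, 1), a simplification of the per-cumulant factors c n used in the text, can only decrease the truncated approximation of the energy distance. The specific cumulant values are arbitrary.

    # Numeric illustration of the truncated Gram-Charlier argument in Eq. (8.57).
    # A single shrinkage factor c is used for both cumulants for simplicity.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def H2(x):
        return x ** 2 - 1.0          # probabilists' Hermite polynomial H_2

    def H3(x):
        return x ** 3 - 3.0 * x      # probabilists' Hermite polynomial H_3

    e_h2 = quad(lambda x: H2(x) ** 2 * norm.pdf(x) ** 2, -np.inf, np.inf)[0]  # E[H_2(x)^2 phi(x)]
    e_h3 = quad(lambda x: H3(x) ** 2 * norm.pdf(x) ** 2, -np.inf, np.inf)[0]  # E[H_3(x)^2 phi(x)]

    def approx_energy_distance(kappa3, kappa4):
        """Truncated Gram-Charlier approximation of the energy distance."""
        return kappa3 ** 2 / 36.0 * e_h2 + kappa4 ** 2 / 576.0 * e_h3

    kappa3, kappa4, c = 0.4, 0.3, 0.6    # arbitrary cumulants and shrinkage factor
    print(approx_energy_distance(kappa3, kappa4))          # causal direction
    print(approx_energy_distance(c * kappa3, c * kappa4))  # anti-causal: strictly smaller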

Appendix 3

In this appendix we argue that one should also expect more Gaussian residuals in the anti-causal direction, based on a reduction of the cumulants, when the residuals in feature space are projected onto the first principal component, that is, when they are multiplied by the first eigenvector of the covariance matrix of the residuals and scaled by the corresponding eigenvalue. Recall from Sect. 8.2.2 that these covariance matrices are C = I −AA T and \(\tilde {\mathbf {C}}=\mathbf {I} - {\mathbf {A}}^{\text{T}} \mathbf {A}\), in the causal and anti-causal direction, respectively. Note that both matrices have the same eigenvalues.

If A is symmetric, both C and \(\tilde {\mathbf {C}}\) have the same matrix of eigenvectors P. Let \({\mathbf {p}}_1^n\) denote the Kronecker product of the first eigenvector with itself n times. The cumulants in the anti-causal and the causal direction, after projecting the data onto the first eigenvector, are \(\tilde {\kappa }_n^{\text{proj}} = ({\mathbf {p}}_1^{n})^{\text{T}} {\mathbf {M}}_n \text{vect}(\kappa _n) = c({\mathbf {p}}_1^{n})^{\text{T}} \text{vect}(\kappa _n)\) and \(\kappa _n^{\text{proj}} = ({\mathbf {p}}_1^{n})^{\text{T}} \text{vect}(\kappa _n)\), respectively, where M n is the matrix that relates the cumulants in the causal and the anti-causal direction (see Sect. 8.2.2) and c is one of the eigenvalues of M n. In particular, if A is symmetric, it is not difficult to show that \({\mathbf {p}}_1^{n}\) is one of the eigenvectors of M n. Furthermore, we showed that in that case ||M n||op < 1 for n ≥ 3 (see Sect. 8.2.2). The consequence is that c ∈ (−1, 1), which, combined with the fact that \(||{\mathbf {p}}_1^n||=1\), leads to cumulants of smaller magnitude for the projected residuals in the anti-causal direction.

If A is not symmetric, we argue that one should still expect more Gaussian residuals in the anti-causal direction due to a reduction in the magnitude of the cumulants. For this, we derive a smaller upper bound on their magnitude, based on the operator norm of vectors.

Definition 8.2

The operator norm of a vector w induced by the p norm is ||w||op = min{c ≥ 0 : ||w T v||p ≤ c||v||p, ∀v}.

The consequence is that ||w||op ≥ ||w T v||p∕||v||p, ∀v. Thus, the smaller the operator norm of w, the smaller the values that can be obtained when multiplying a vector by w. Furthermore, it is clear that ||w||op = ||w||2 in the case of the 2-norm. From the previous paragraph, in the anti-causal direction we have \(||\tilde {\kappa }_n^{\text{proj}}||{ }_2= ||(\tilde {\mathbf {p}}_1^n)^{\text{T}} {\mathbf {M}}_n \text{vect}(\kappa _n)||{ }_2\), where \(\tilde {\mathbf {p}}_1\) is the first eigenvector of \(\tilde {\mathbf {C}}\), while in the causal direction we have \(||\kappa _n^{\text{proj}}||{ }_2= ||({\mathbf {p}}_1^n)^{\text{T}} \text{vect}(\kappa _n)||{ }_2\), where p 1 is the first eigenvector of C. Because the norm of each vector \(\tilde {\mathbf {p}}_1^n\) and \({\mathbf {p}}_1^n\) is one, we have that \(||{\mathbf {p}}_1^n||{ }_{\text{op}}=1\). However, because we expect M n to reduce the norm of \((\tilde {\mathbf {p}}_1^n)^{\text{T}}\), as motivated in Sect. 8.2.2, \(||(\tilde {\mathbf {p}}_1^n)^{\text{T}}{\mathbf {M}}_n||{ }_{\text{op}} < 1\) should follow. This is expected to lead to cumulants of smaller magnitude in the anti-causal direction.
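
The two structural facts used above, namely that C = I −AA T and \(\tilde {\mathbf {C}}=\mathbf {I} - {\mathbf {A}}^{\text{T}} \mathbf {A}\) always share their eigenvalues, and that for symmetric A one has \({\mathbf {A}}{\mathbf {A}}^{\text{T}} = {\mathbf {A}}^{\text{T}}{\mathbf {A}}\), so the two matrices coincide and trivially share their eigenvectors, can be verified numerically with the sketch below. The random matrices are purely illustrative.

    # Quick numerical check of two facts used in Appendix 3: C = I - A A^T and
    # C_tilde = I - A^T A share their eigenvalues for any A, and for symmetric A
    # the two matrices are identical (hence share eigenvectors as well).
    import numpy as np

    rng = np.random.default_rng(3)
    d = 5
    A = rng.standard_normal((d, d))
    A /= 1.5 * np.linalg.norm(A, 2)     # keep the spectral norm of A below one

    C = np.eye(d) - A @ A.T
    C_tilde = np.eye(d) - A.T @ A
    print(np.allclose(np.linalg.eigvalsh(C), np.linalg.eigvalsh(C_tilde)))  # True: same spectrum

    A_sym = 0.5 * (A + A.T)             # symmetric case discussed above
    print(np.allclose(np.eye(d) - A_sym @ A_sym.T,
                      np.eye(d) - A_sym.T @ A_sym))                         # True: same matrix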

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hernández-Lobato, D., Morales-Mombiela, P., Lopez-Paz, D., Suárez, A. (2019). Non-linear Causal Inference Using Gaussianity Measures. In: Guyon, I., Statnikov, A., Batu, B. (eds) Cause Effect Pairs in Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-21810-2_8

  • DOI: https://doi.org/10.1007/978-3-030-21810-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21809-6

  • Online ISBN: 978-3-030-21810-2
