Non-linear Causal Inference Using Gaussianity Measures

Chapter in Cause Effect Pairs in Machine Learning

Abstract

We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by a reduction in the magnitude of the high-order cumulants and by an increase in the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method to discriminate between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference.

The vast majority of the work was done while at the Max Planck Institute for Intelligent Systems and at Cambridge University.
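
To make the idea in the abstract concrete, the following is a minimal one-dimensional sketch of the decision rule: fit a linear model in each direction and prefer as causal the direction whose residuals look less Gaussian. This is only an illustration, not the chapter's full procedure (which uses kernel feature expansions and several Gaussianity measures); the use of the absolute excess kurtosis as the non-Gaussianity score, and all function and variable names, are assumptions made for the example.

    # Minimal sketch of the residual-Gaussianity decision rule described in the
    # abstract.  Illustrative only: |excess kurtosis| is used as a crude
    # non-Gaussianity score; the chapter also considers entropy- and
    # energy-distance-based measures and a kernelized feature expansion.
    import numpy as np
    from scipy.stats import kurtosis

    def residuals_of_linear_fit(a, b):
        """Residuals of the least-squares fit of b on a (both standardized)."""
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        slope = np.dot(a, b) / np.dot(a, a)
        return b - slope * a

    def infer_direction(x, y):
        """Return 'x->y' if the x->y residuals look less Gaussian, else 'y->x'."""
        score_xy = abs(kurtosis(residuals_of_linear_fit(x, y)))  # candidate causal direction
        score_yx = abs(kurtosis(residuals_of_linear_fit(y, x)))  # reverse direction
        return "x->y" if score_xy > score_yx else "y->x"

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.uniform(-1.0, 1.0, size=5000)                  # non-Gaussian cause
        y = 0.8 * x + 0.5 * rng.uniform(-1.0, 1.0, size=5000)  # linear effect, non-Gaussian noise
        print(infer_direction(x, y))                           # expected to print 'x->y'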

Notes

  1. We assume that such a matrix exists and that it is positive definite.

  2. See https://www.codalab.org/competitions/1381 for more information.

  3. See http://webdav.tuebingen.mpg.de/cause-effect/ for more details.

References

  1. J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van Der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6 (1): 17–39, 1997.

  2. Z. Chen, K. Zhang, and L. Chan. Nonlinear causal discovery for high dimensional data: A kernelized trace method. In IEEE 13th International Conference on Data Mining, pages 1003–1008, 2013.

  3. Z. Chen, K. Zhang, L. Chan, and B. Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26 (7): 1484–1517, 2014.

  4. E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 5 (4): 307–320, 1938.

  5. H. Cramér. Mathematical methods of statistics. PhD thesis, 1946.

  6. D. Entner and P. O. Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91, 2012.

  7. A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, 2008.

  8. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13: 723–773, 2012.

  9. J. M. Hernández-Lobato, P. Morales-Mombiela, and A. Suárez. Gaussianity measures for detecting the direction of causal time series. In International Joint Conference on Artificial Intelligence, pages 1318–1323, 2011.

  10. P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, pages 689–696, 2009.

  11. A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems 10, pages 273–279. MIT Press, 1998.

  12. A. Hyvärinen and S. M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14 (1): 111–152, 2013.

  13. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2004.

  14. D. Janzing, P. O. Hoyer, and B. Schölkopf. Telling cause from effect based on high-dimensional observations. In International Conference on Machine Learning, pages 479–486, 2010.

  15. D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182–183: 1–31, 2012.

  16. Y. Kawahara, K. Bollen, S. Shimizu, and T. Washio. GroupLiNGAM: Linear non-Gaussian acyclic models for sets of variables. 2012. arXiv:1006.5041.

  17. S. Kpotufe, E. Sgouritsa, D. Janzing, and B. Schölkopf. Consistency of causal inference under the additive noise model. In International Conference on Machine Learning, pages 478–486, 2014.

  18. A. J. Laub. Matrix Analysis For Scientists And Engineers. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004. ISBN 0898715768.

  19. J. T. Marcinkiewicz. Sur une propriété de la loi de Gauss. Mathematische Zeitschrift, 44: 612–618, 1938.

  20. P. McCullagh. Tensor methods in statistics. Chapman and Hall, 1987.

  21. J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems 23, pages 1687–1695, 2010.

  22. P. Morales-Mombiela, D. Hernández-Lobato, and A. Suárez. Statistical tests for the detection of the arrow of time in vector autoregressive models. In International Joint Conference on Artificial Intelligence, 2013.

  23. K. Murphy. Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.

  24. J. K. Patel and C. B. Read. Handbook of the normal distribution, volume 150. CRC Press, 1996.

  25. J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8.

  26. B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002. ISBN 0262194759.

  27. B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.

  28. S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7: 2003–2030, 2006.

  29. H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23 (3–4): 301–321, 2003.

  30. L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30: 98–111, 2013.

  31. G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93 (1): 58–80, 2005.

  32. G. J. Székely and M. L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143 (8): 1249–1272, 2013.

  33. K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In International Conference on Uncertainty in Artificial Intelligence, pages 647–655, 2009.

Acknowledgements

Daniel Hernández-Lobato and Alberto Suárez gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. These authors also acknowledge financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. David Lopez-Paz acknowledges support from Fundación la Caixa.

Corresponding author

Correspondence to Daniel Hernández-Lobato.

Appendices

Appendix 1

In this appendix we show that, if \(\mathscr {X}\) and \(\mathscr {Y}\) follow the same distribution and have been centered, the determinant of the covariance matrix of the random variable corresponding to 𝜖 i, denoted by Cov(𝜖 i), coincides with the determinant of the covariance matrix corresponding to the random variable \(\tilde {\boldsymbol {\epsilon }}_i\), denoted by \(\text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)\).

From the causal model, i.e., y i = Ax i + 𝜖 i, we have that:

$$\displaystyle \begin{aligned} \text{Cov}(\mathscr{Y}) &= \mathbf{A} \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}} + \text{Cov}(\boldsymbol{\epsilon}_i)\,. \end{aligned} $$
(8.43)

Since \(\mathscr {X}\) and \(\mathscr {Y}\) follow the same distribution we have that \(\text{Cov}(\mathscr {Y})=\text{Cov}(\mathscr {X})\). Furthermore, we know from the causal model that \(\mathbf {A}=\text{Cov}(\mathscr {Y},\mathscr {X})\text{Cov}(\mathscr {X})^{-1}\). Then,

$$\displaystyle \begin{aligned} \text{Cov}(\boldsymbol{\epsilon}_i) & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{Y},\mathscr{X}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{X},\mathscr{Y}) \,. \end{aligned} $$
(8.44)

In the case of \(\tilde {\boldsymbol {\epsilon }}_i\) we know that the relation \(\tilde {\boldsymbol {\epsilon }}_i = (\mathbf {I}-\tilde {\mathbf {A}} \mathbf {A}){\mathbf {x}}_i - \tilde {\mathbf {A}} \boldsymbol {\epsilon }_i\) must be satisfied, where \(\tilde {\mathbf {A}}=\text{Cov}(\mathscr {X},\mathscr {Y})\text{Cov}(\mathscr {Y})^{-1}= \text{Cov}(\mathscr {X},\mathscr {Y})\text{Cov}(\mathscr {X})^{-1}\). Thus, we have that:

$$\displaystyle \begin{aligned} \text{Cov}(\tilde{\boldsymbol{\epsilon}}_i) &= (\mathbf{I}-\tilde{\mathbf{A}} \mathbf{A}) \text{Cov}(\mathscr{X}) (\mathbf{I}- {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}}) + \tilde{\mathbf{A}} \text{Cov}(\boldsymbol{\epsilon}_i) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.45)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}}\mathbf{A} \text{Cov}(\mathscr{X}) + \tilde{\mathbf{A}}\mathbf{A}\text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.46)
$$\displaystyle \begin{aligned} & \quad + \tilde{\mathbf{A}} \text{Cov}(\mathscr{X}) \tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}} \text{Cov}(\mathscr{Y},\mathscr{X}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{X},\mathscr{Y}) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.47)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X}) {\mathbf{A}}^{\text{T}}\tilde{\mathbf{A}}^{\text{T}} - \tilde{\mathbf{A}}\mathbf{A} \text{Cov}(\mathscr{X}) + \tilde{\mathbf{A}} \text{Cov}(\mathscr{X}) \tilde{\mathbf{A}}^{\text{T}} \end{aligned} $$
(8.48)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.49)
$$\displaystyle \begin{aligned} & \quad - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.50)
$$\displaystyle \begin{aligned} & \quad + \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \end{aligned} $$
(8.51)
$$\displaystyle \begin{aligned} & = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X},\mathscr{Y}) \text{Cov}(\mathscr{X})^{-1} \text{Cov}(\mathscr{Y},\mathscr{X}) \,. \end{aligned} $$
(8.52)

Both (8.44) and (8.52) are Schur complements of the joint covariance matrix of \((\mathscr {X}, \mathscr {Y})\), whose determinant factorizes, by the determinant formula for partitioned matrices, either as \(\text{det}\, \text{Cov}(\mathscr {X}) \cdot \text{det}\, \text{Cov}(\boldsymbol {\epsilon }_i)\) or as \(\text{det}\, \text{Cov}(\mathscr {Y}) \cdot \text{det}\, \text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)\). Since \(\text{det}\, \text{Cov}(\mathscr {X}) = \text{det}\, \text{Cov}(\mathscr {Y})\), we have that \(\text{det}\, \text{Cov}(\tilde {\boldsymbol {\epsilon }}_i)= \text{det}\, \text{Cov}(\boldsymbol {\epsilon }_i)\). See [23, p. 117] for further details.
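
As a quick numerical sanity check of this identity, the short sketch below (synthetic matrices, illustrative names, not part of the chapter) evaluates the right-hand sides of (8.44) and (8.52) for a random symmetric positive-definite Cov(X), taken equal to Cov(Y), and an arbitrary cross-covariance, and verifies that the two determinants coincide.

    # Numerical check that the matrices in (8.44) and (8.52) have the same
    # determinant when Cov(X) = Cov(Y).  Purely algebraic check on synthetic
    # matrices; the cross-covariance need not come from an actual causal model.
    import numpy as np

    rng = np.random.default_rng(1)
    d = 4
    S = rng.standard_normal((d, d))
    cov_x = S @ S.T + d * np.eye(d)        # Cov(X) = Cov(Y), symmetric positive definite
    cov_yx = rng.standard_normal((d, d))   # Cov(Y, X), arbitrary
    cov_xy = cov_yx.T                      # Cov(X, Y)
    inv_cov_x = np.linalg.inv(cov_x)

    cov_eps = cov_x - cov_yx @ inv_cov_x @ cov_xy        # Eq. (8.44), causal residuals
    cov_eps_tilde = cov_x - cov_xy @ inv_cov_x @ cov_yx  # Eq. (8.52), anti-causal residuals

    print(np.linalg.det(cov_eps), np.linalg.det(cov_eps_tilde))  # equal up to rounding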

Appendix 2

In this appendix we argue that, if the distribution of the residuals is not Gaussian but close to Gaussian, one should also expect more Gaussian residuals in the anti-causal direction in terms of the energy distance described in Sect. 8.3.4. For simplicity, we consider the univariate case. We use the fact that, in one dimension, the energy distance is the squared distance between the cumulative distribution function of the residuals and that of a Gaussian distribution [32]. Thus,

$$\displaystyle \begin{aligned} \tilde{D}^2 &= \int_{-\infty}^\infty \left[\tilde{F}(x) - \varPhi(x)\right]^2 d x\,, & D^2 &= \int_{-\infty}^\infty \left[F(x) - \varPhi(x)\right]^2 d x\,, \end{aligned} $$
(8.53)

where \(\tilde {D}^2\) and D 2 are the energy distances to the Gaussian distribution in the anti-causal and the causal direction respectively; \(\tilde {F}(x)\) and F(x) are the c.d.f. of the residuals in the anti-causal and the causal direction, respectively; and finally, Φ(x) is the c.d.f. of a standard Gaussian.
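
In practice, the integrals in (8.53) can be approximated from a sample of residuals by plugging in their empirical c.d.f.; a possible sketch is given below (the function name, the standardization step and the quadrature limits are illustrative choices, not taken from the chapter).

    # Empirical version of the squared c.d.f. distance in Eq. (8.53): plug the
    # empirical c.d.f. of the standardized residuals into the integral and
    # evaluate it by numerical quadrature.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def squared_cdf_distance_to_gaussian(residuals):
        """Approximate the integral of [F_hat(x) - Phi(x)]^2 over the real line."""
        r = (residuals - residuals.mean()) / residuals.std()
        r_sorted = np.sort(r)

        def integrand(x):
            f_hat = np.searchsorted(r_sorted, x, side="right") / r.size  # empirical c.d.f.
            return (f_hat - norm.cdf(x)) ** 2

        value, _ = quad(integrand, -10.0, 10.0, limit=200)  # tails beyond +-10 are negligible
        return value

    rng = np.random.default_rng(2)
    print(squared_cdf_distance_to_gaussian(rng.standard_normal(2000)))  # close to zero
    print(squared_cdf_distance_to_gaussian(rng.uniform(-1, 1, 2000)))   # clearly larger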

One should expect that \(\tilde {D}^2 \leq D^2\). To motivate this, we use the Gram-Charlier series to compute an expansion of \(\tilde {F}(x)\) and F(x) around the standard Gaussian distribution [24]. Such an expansion only converges for distributions that are close to Gaussian (see Sect. 17.6.6a of [5] for further details). Namely,

$$\displaystyle \begin{aligned} \tilde{F}(x) & = \varPhi(x) - \phi(x) \left( \frac{\tilde{a}_3}{3!} H_2(x) + \frac{\tilde{a}_4}{4!} H_3(x) + \cdots \right)\,, \end{aligned} $$
(8.54)
$$\displaystyle \begin{aligned} F(x) & = \varPhi(x) - \phi(x) \left( \frac{a_3}{3!} H_2(x) + \frac{a_4}{4!} H_3(x) + \cdots \right)\,, \end{aligned} $$
(8.55)

where ϕ(x) is the p.d.f. of a standard Gaussian, H n(x) are Hermite polynomials and \(\tilde {a}_n\) and a n are coefficients that depend on the cumulants, e.g., a 3 = κ 3, a 4 = κ 4, \(\tilde {a}_3=\tilde {\kappa }_3\), \(\tilde {a}_4=\tilde {\kappa }_4\). Note, however, that coefficients a n and \(\tilde {a}_n\) for n > 5 depend on combinations of the cumulants. Using such an expansion we find:

$$\displaystyle \begin{aligned} \tilde{D}^2 & = \int_{-\infty}^\infty \phi(x)^2 \left( \frac{\tilde{a}_3}{3!} H_2(x) + \frac{\tilde{a}_4}{4!} H_3(x) + \cdots \right)^2 d x \end{aligned} $$
(8.56)
$$\displaystyle \begin{aligned} & \approx \frac{\tilde{\kappa}_3^2}{36} \mathbb{E}[H_2(x)^2\phi(x)] + \frac{\tilde{\kappa}_4^2}{576} \mathbb{E}[H_3(x)^2\phi(x)] \,, \end{aligned} $$
(8.57)

where \(\mathbb {E}[\cdot ]\) denotes expectation with respect to a standard Gaussian and we have truncated the Gram-Charlier expansion after n = 4. Truncating the expansion after n = 4 is standard practice in the ICA literature for approximating the entropy; see, for example, Sect. 5.5.1 of [13]. We have also used the fact that \(\mathbb {E}[H_3(x)H_2(x)\phi (x)] = 0\). The same approach can be followed for D 2, the energy distance in the causal direction, which yields \(D^2\approx \kappa _3^2/36 \cdot \mathbb {E}[H_2(x)^2\phi (x)] + \kappa _4^2/576 \cdot \mathbb {E}[H_3(x)^2\phi (x)]\). Finally, the fact that one should expect \(\tilde {D}^2\leq D^2\) follows from noting that \(\tilde {\kappa }_n = c_n \kappa _n\), where c n is a constant that lies in the interval (−1, 1), as indicated in Sect. 8.2.1. We expect this result to extend to the multivariate case.
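
The argument can be illustrated numerically with the short sketch below: the constants \(\mathbb {E}[H_2(x)^2\phi (x)]\) and \(\mathbb {E}[H_3(x)^2\phi (x)]\) are computed by quadrature, and shrinking the cumulants by a common factor c ∈ (−1, 1), a simplification of the per-cumulant factors c n used in the text, can only decrease the truncated approximation of the energy distance. The specific cumulant values are arbitrary.

    # Numeric illustration of the truncated Gram-Charlier argument in Eq. (8.57).
    # A single shrinkage factor c is used for both cumulants for simplicity.
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def H2(x):
        return x ** 2 - 1.0          # probabilists' Hermite polynomial H_2

    def H3(x):
        return x ** 3 - 3.0 * x      # probabilists' Hermite polynomial H_3

    e_h2 = quad(lambda x: H2(x) ** 2 * norm.pdf(x) ** 2, -np.inf, np.inf)[0]  # E[H_2(x)^2 phi(x)]
    e_h3 = quad(lambda x: H3(x) ** 2 * norm.pdf(x) ** 2, -np.inf, np.inf)[0]  # E[H_3(x)^2 phi(x)]

    def approx_energy_distance(kappa3, kappa4):
        """Truncated Gram-Charlier approximation of the energy distance."""
        return kappa3 ** 2 / 36.0 * e_h2 + kappa4 ** 2 / 576.0 * e_h3

    kappa3, kappa4, c = 0.4, 0.3, 0.6    # arbitrary cumulants and shrinkage factor
    print(approx_energy_distance(kappa3, kappa4))          # causal direction
    print(approx_energy_distance(c * kappa3, c * kappa4))  # anti-causal: strictly smaller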

Appendix 3

In this appendix we argue that one should also expect more Gaussian residuals in the anti-causal direction, based on a reduction of the cumulants, when the residuals in feature space are projected onto the first principal component, that is, when they are multiplied by the first eigenvector of the covariance matrix of the residuals and scaled by the corresponding eigenvalue. Recall from Sect. 8.2.2 that these covariance matrices are C = I −AA T and \(\tilde {\mathbf {C}}=\mathbf {I} - {\mathbf {A}}^{\text{T}} \mathbf {A}\), in the causal and anti-causal direction, respectively. Note that both matrices have the same eigenvalues.

If A is symmetric, both C and \(\tilde {\mathbf {C}}\) have the same matrix of eigenvectors P. Let \({\mathbf {p}}_1^n\) denote the Kronecker product of the first eigenvector with itself n times. The cumulants in the anti-causal and the causal direction, after projecting the data onto the first eigenvector, are \(\tilde {\kappa }_n^{\text{proj}} = ({\mathbf {p}}_1^{n})^{\text{T}} {\mathbf {M}}_n \text{vect}(\kappa _n) = c({\mathbf {p}}_1^{n})^{\text{T}} \text{vect}(\kappa _n)\) and \(\kappa _n^{\text{proj}} = ({\mathbf {p}}_1^{n})^{\text{T}} \text{vect}(\kappa _n)\), respectively, where M n is the matrix that relates the cumulants in the causal and the anti-causal direction (see Sect. 8.2.2) and c is one of the eigenvalues of M n. In particular, if A is symmetric, it is not difficult to show that \({\mathbf {p}}_1^{n}\) is one of the eigenvectors of M n. Furthermore, we showed that in that case ||M n||op < 1 for n ≥ 3 (see Sect. 8.2.2). The consequence is that c ∈ (−1, 1), which, combined with the fact that \(||{\mathbf {p}}_1^n||=1\), leads to cumulants of smaller magnitude for the projected residuals in the anti-causal direction.

If A is not symmetric, we argue that one should still expect more Gaussian residuals in the anti-causal direction due to a reduction in the magnitude of the cumulants. For this, we derive a smaller upper bound on their magnitude, based on the operator norm of vectors.

Definition 8.2

The operator norm of a vector w induced by the p norm is ||w||op = min{c ≥ 0 : ||w T v||p ≤ c||v||p, ∀v}.

The consequence is that ||w||op ≥ ||w T v||p∕||v||p, ∀v. Thus, the smaller the operator norm of w, the smaller the values that can be obtained when multiplying a vector by w. Furthermore, it is clear that ||w||op = ||w||2 in the case of the 2-norm. From the previous paragraph, in the anti-causal direction we have \(||\tilde {\kappa }_n^{\text{proj}}||{ }_2= ||(\tilde {\mathbf {p}}_1^n)^{\text{T}} {\mathbf {M}}_n \text{vect}(\kappa _n)||{ }_2\), where \(\tilde {\mathbf {p}}_1\) is the first eigenvector of \(\tilde {\mathbf {C}}\), while in the causal direction we have \(||\kappa _n^{\text{proj}}||{ }_2= ||({\mathbf {p}}_1^n)^{\text{T}} \text{vect}(\kappa _n)||{ }_2\), where p 1 is the first eigenvector of C. Because the norm of each vector \(\tilde {\mathbf {p}}_1^n\) and \({\mathbf {p}}_1^n\) is one, we have that \(||{\mathbf {p}}_1^n||{ }_{\text{op}}=1\). However, because we expect M n to reduce the norm of \((\tilde {\mathbf {p}}_1^n)^{\text{T}}\), as motivated in Sect. 8.2.2, \(||(\tilde {\mathbf {p}}_1^n)^{\text{T}}{\mathbf {M}}_n||{ }_{\text{op}} < 1\) should follow. This is expected to lead to cumulants of smaller magnitude in the anti-causal direction.
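
The two structural facts used above, namely that C = I −AA T and \(\tilde {\mathbf {C}}=\mathbf {I} - {\mathbf {A}}^{\text{T}} \mathbf {A}\) always share their eigenvalues, and that for symmetric A one has \({\mathbf {A}}{\mathbf {A}}^{\text{T}} = {\mathbf {A}}^{\text{T}}{\mathbf {A}}\), so the two matrices coincide and trivially share their eigenvectors, can be verified numerically with the sketch below. The random matrices are purely illustrative.

    # Quick numerical check of two facts used in Appendix 3: C = I - A A^T and
    # C_tilde = I - A^T A share their eigenvalues for any A, and for symmetric A
    # the two matrices are identical (hence share eigenvectors as well).
    import numpy as np

    rng = np.random.default_rng(3)
    d = 5
    A = rng.standard_normal((d, d))
    A /= 1.5 * np.linalg.norm(A, 2)     # keep the spectral norm of A below one

    C = np.eye(d) - A @ A.T
    C_tilde = np.eye(d) - A.T @ A
    print(np.allclose(np.linalg.eigvalsh(C), np.linalg.eigvalsh(C_tilde)))  # True: same spectrum

    A_sym = 0.5 * (A + A.T)             # symmetric case discussed above
    print(np.allclose(np.eye(d) - A_sym @ A_sym.T,
                      np.eye(d) - A_sym.T @ A_sym))                         # True: same matrix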

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Hernández-Lobato, D., Morales-Mombiela, P., Lopez-Paz, D., Suárez, A. (2019). Non-linear Causal Inference Using Gaussianity Measures. In: Guyon, I., Statnikov, A., Batu, B. (eds) Cause Effect Pairs in Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-21810-2_8

  • DOI: https://doi.org/10.1007/978-3-030-21810-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-21809-6

  • Online ISBN: 978-3-030-21810-2
