Variational Approach for Learning Markov Processes from Time Series Data

Journal of Nonlinear Science

Abstract

Inference, prediction, and control of complex dynamical systems from time series are important in many areas, including financial markets, power grid management, climate and weather modeling, and molecular dynamics. The analysis of such highly nonlinear dynamical systems is facilitated by the fact that we can often find a (generally nonlinear) transformation of the system coordinates to features in which the dynamics can be excellently approximated by a linear Markovian model. Moreover, the large number of system variables often changes collectively on large time and length scales, facilitating a low-dimensional analysis in feature space. In this paper, we introduce a variational approach for Markov processes (VAMP) that allows us to find optimal feature mappings and optimal Markovian models of the dynamics from given time series data. The key insight is that the best linear model can be obtained from the top singular components of the Koopman operator. This leads to the definition of a family of score functions called VAMP-r, which can be calculated from data and employed to optimize a Markovian model. In addition, based on the relationship between the variational scores and approximation errors of Koopman operators, we propose a new VAMP-E score, which can be applied to cross-validation for hyper-parameter optimization and model selection in VAMP. VAMP is valid for both reversible and nonreversible processes, and for stationary and nonstationary processes or realizations.


References

  • Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255 (2013)

  • Arlot, S., Celisse, A.: A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010)

  • Bollt, E.M., Santitissadeekorn, N.: Applied and Computational Measurable Dynamics. SIAM (2013)

  • Boninsegna, L., Gobbo, G., Noé, F., Clementi, C.: Investigating molecular kinetics by variationally optimized diffusion maps. J. Chem. Theory Comput. 11, 5947–5960 (2015)

  • Bowman, G.R., Pande, V.S., Noé, F. (eds.): An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation. Volume 797 of Advances in Experimental Medicine and Biology. Springer, Heidelberg (2014)

  • Brunton, S.L., Brunton, B.W., Proctor, J.L., Kutz, J.N.: Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control. PLoS ONE 11(2), e0150171 (2016a)

  • Brunton, S.L., Proctor, J.L., Kutz, J.N.: Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. 113(15), 3932–3937 (2016b)

  • Chekroun, M.D., Simonnet, E., Ghil, M.: Stochastic climate dynamics: random attractors and time-dependent invariant measures. Physica D Nonlinear Phenom. 240(21), 1685–1700 (2011)

  • Chodera, J.D., Noé, F.: Markov state models of biomolecular conformational dynamics. Curr. Opin. Struct. Biol. 25, 135–144 (2014)

  • Conrad, N.D., Weber, M., Schütte, C.: Finding dominant structures of nonreversible Markov processes. Multiscale Model. Simul. 14(4), 1319–1340 (2016)

  • Dellnitz, M., Froyland, G., Junge, O.: The algorithms behind gaio–set oriented numerical methods for dynamical systems. In: Fiedler, B. (ed.) Ergodic Theory, Analysis, and Efficient Simulation of Dynamical Systems, pp. 145–174. Springer, Berlin (2001)

  • Deuflhard, P., Weber, M.: Robust Perron cluster analysis in conformation dynamics. In: Dellnitz, M., Kirkland, S., Neumann, M., Schütte, C. (eds.) Linear Algebra Application, vol. 398C, pp. 161–184. Elsevier, New York (2005)

  • Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer, New York (2001)

  • Froyland, G.: An analytic framework for identifying finite-time coherent sets in time-dependent dynamical systems. Physica D Nonlinear Phenom. 250, 1–19 (2013)

  • Froyland, G., Padberg, K.: Almost-invariant sets and invariant manifolds—connecting probabilistic and geometric descriptions of coherent structures in flows. Physica D Nonlinear Phenom. 238(16), 1507–1523 (2009)

  • Froyland, G., Padberg-Gehle, K.: Almost-invariant and finite-time coherent sets: directionality, duration, and diffusion. In: Bahsoun, W., Bose, C., Froyland, G. (eds.) Ergodic Theory, Open Dynamics, and Coherent Structures, pp. 171–216. Springer, Berlin (2014)

  • Froyland, G., Gottwald, G.A., Hammerlindl, A.: A computational method to extract macroscopic variables and their dynamics in multiscale systems. SIAM J. Appl. Dyn. Syst. 13(4), 1816–1846 (2014)

  • Froyland, G., González-Tokman, C., Watson, T.M.: Optimal mixing enhancement by local perturbation. SIAM Rev. 58(3), 494–513 (2016)

  • Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)

  • Harmeling, S., Ziehe, A., Kawanabe, M., Müller, K.-R.: Kernel-based nonlinear blind source separation. Neural Comput. 15(5), 1089–1124 (2003)

  • Hsing, T., Eubank, R.: Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley, Amsterdam (2015)

  • Klus, S., Schütte, C.: Towards tensor-based methods for the numerical approximation of the Perron–Frobenius and Koopman operator (2015). arXiv:1512.06527

  • Klus, S., Koltai, P., Schütte, C.: On the numerical approximation of the Perron–Frobenius and Koopman operator (2015). arXiv:1512.05997

  • Klus, S., Gelß, P., Peitz, S., Schütte, C.: Tensor-based dynamic mode decomposition. Nonlinearity 31(7), 3359 (2018)

  • Koltai, P., Wu, H., Noe, F., Schütte, C.: Optimal data-driven estimation of generalized Markov state models for non-equilibrium dynamics. Computation 6(1), 22 (2018)

  • Konrad, A., Zhao, B.Y., Joseph, A.D., Ludwig, R.: A Markov-based channel model algorithm for wireless networks. In: Proceedings of the 4th ACM International Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 28–36. ACM (2001)

  • Koopman, B.O.: Hamiltonian systems and transformations in Hilbert space. Proc. Natl. Acad. Sci. U.S.A. 17, 315–318 (1931)

  • Korda, M., Mezić, I.: On convergence of extended dynamic mode decomposition to the Koopman operator. J. Nonlinear Sci. 28(2), 687–710 (2018)

  • Kurebayashi, W., Shirasaka, S., Nakao, H.: Optimal parameter selection for kernel dynamic mode decomposition. In: Proceedings of the International Symposium NOLTA, volume 370, p. 373 (2016)

  • Li, Q., Dietrich, F., Bollt, E.M., Kevrekidis, I.G.: Extended dynamic mode decomposition with dictionary learning: a data-driven adaptive spectral decomposition of the Koopman operator. Chaos 27(10), 103111 (2017)

  • Lusch, B., Kutz, J.N., Brunton, S.L.: Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9(1), 4950 (2018)

  • Ma, Y., Han, J.J., Trivedi, K.S.: Composite performance and availability analysis of wireless communication networks. IEEE Trans. Veh. Technol. 50(5), 1216–1223 (2001)

  • Mardt, A., Pasquali, L., Wu, H., Noé, F.: Vampnets for deep learning of molecular kinetics. Nat. Commun. 9(1), 5 (2018)

  • Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, vol. 143. Springer, Berlin (1979)

  • McGibbon, R.T., Pande, V.S.: Variational cross-validation of slow dynamical modes in molecular kinetics. J. Chem. Phys. 142, 124105 (2015)

  • Mezić, I.: Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dyn. 41, 309–325 (2005)

  • Mezić, I.: Analysis of fluid flows via spectral properties of the Koopman operator. Annu. Rev. Fluid Mech. 45, 357–378 (2013)

  • Molgedey, L., Schuster, H.G.: Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett. 72, 3634–3637 (1994)

  • Noé, F.: Probability distributions of molecular observables computed from Markov models. J. Chem. Phys. 128, 244103 (2008)

  • Noé, F., Clementi, C.: Kinetic distance and kinetic maps from molecular dynamics simulation. J. Chem. Theory Comput. 11, 5002–5011 (2015)

  • Noé, F., Nüske, F.: A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul. 11, 635–655 (2013)

  • Nüske, F., Keller, B.G., Pérez-Hernández, G., Mey, A.S.J.S., Noé, F.: Variational approach to molecular kinetics. J. Chem. Theory Comput. 10, 1739–1752 (2014)

  • Nüske, F., Schneider, R., Vitalini, F., Noé, F.: Variational tensor approach for approximating the rare-event kinetics of macromolecular systems. J. Chem. Phys. 144, 054105 (2016)

  • Otto, S.E., Rowley, C.W.: Linearly recurrent autoencoder networks for learning dynamics. SIAM J. Appl. Dyn. Syst. 18(1), 558–593 (2019)

  • Paul, F., Wu, H., Vossel, M., Groot, B., Noe, F.: Identification of kinetic order parameters for non-equilibrium dynamics. J. Chem. Phys. 150, 164120 (2018)

  • Perez-Hernandez, G., Paul, F., Giorgino, T., Fabritiis, G.D., Noé, F.: Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013)

  • Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge (2007)

  • Prinz, J.-H., Wu, H., Sarich, M., Keller, B.G., Senne, M., Held, M., Chodera, J.D., Schütte, C., Noé, F.: Markov models of molecular kinetics: generation and validation. J. Chem. Phys. 134, 174105 (2011)

  • Renardy, M., Rogers, R.C.: An Introduction to Partial Differential Equations. Springer, New York (2004)

  • Rowley, C.W., Mezić, I., Bagheri, S., Schlatter, P., Henningson, D.S.: Spectral analysis of nonlinear flows. J. Fluid Mech. 641, 115 (2009)

  • Schmid, P.J.: Dynamic mode decomposition of numerical and experimental data. J. Fluid Mech. 656, 5–28 (2010)

  • Schütte, C., Fischer, A., Huisinga, W., Deuflhard, P.: A direct approach to conformational dynamics based on hybrid Monte Carlo. J. Comput. Phys. 151, 146–168 (1999)

  • Schwantes, C.R., Pande, V.S.: Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9. J. Chem. Theory Comput. 9, 2000–2009 (2013)

  • Schwantes, C.R., Pande, V.S.: Modeling molecular kinetics with tICA and the kernel trick. J. Chem. Theory Comput. 11, 600–608 (2015)

  • Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)

  • Song, L., Fukumizu, K., Gretton, A.: Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 30(4), 98–111 (2013)

  • Sparrow, C.: The Lorenz Equations: Bifurcations, Chaos, and Strange Attractors. Springer, New York (1982)

  • Takeishi, N., Kawahara, Y., Yairi, T.: Learning Koopman invariant subspaces for dynamic mode decomposition. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, pp. 1130–1140 (2017)

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)

  • Tu, J.H., Rowley, C.W., Luchtenburg, D.M., Brunton, S.L., Kutz, J.N.: On dynamic mode decomposition: theory and applications. J. Comput. Dyn. 1(2), 391–421 (2014)

  • Williams, M.O., Kevrekidis, I.G., Rowley, C.W.: A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J. Nonlinear Sci. 25, 1307–1346 (2015a)

  • Williams, M.O., Rowley, C.W., Kevrekidis, I.G.: A kernel-based method for data-driven Koopman spectral analysis. J. Comput. Dyn. 2(2), 247–265 (2015b)

  • Wu, H., Noé, F.: Gaussian Markov transition models of molecular kinetics. J. Chem. Phys. 142, 084104 (2015)

  • Wu, H., Nüske, F., Paul, F., Klus, S., Koltai, P., Noé, F.: Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations. J. Chem. Phys. 146, 154104 (2017)

  • Ziehe, A., Müller, K.-R.: TDSEP —an efficient algorithm for blind separation using time structure. In: ICANN 98, pp. 675–680. Springer (1998)

Author information

Corresponding authors

Correspondence to Hao Wu or Frank Noé.

Additional information

Communicated by Dr. Paul Newton.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was funded by Deutsche Forschungsgemeinschaft (SFB 1114/A4) and the European Research Council (ERC StG 307494 “pcCell”).

Appendices

For convenience of notation, we denote by \(p_{\tau }({\mathbf {x}},{\mathbf {y}})={\mathbb {P}}({\mathbf {x}}_{t+\tau }={\mathbf {y}}|{\mathbf {x}}_{t}={\mathbf {x}})\) the transition density which satisfies

$$\begin{aligned} \int _{A}p_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {y}}={\mathbb {P}}({\mathbf {x}}_{t+\tau }\in A|{\mathbf {x}}_{t}={\mathbf {x}}) \end{aligned}$$
(45)

for every measurable set A, and define the matrix of scalar products and the elementwise application of an operator to a vector of functions:

$$\begin{aligned} \left\langle {\mathbf {a}},{\mathbf {b}}^{\top }\right\rangle _{\rho }= & {} \left[ \left\langle a_{i},b_{j}\right\rangle _{\rho }\right] \in {\mathbb {R}}^{m\times n} \end{aligned}$$
(46)
$$\begin{aligned} {\mathcal {K}}{\mathbf {g}}= & {} ({\mathcal {K}}g_{1},{\mathcal {K}}g_{2},\ldots )^{\top } \end{aligned}$$
(47)

for \({\mathbf {a}}=(a_{1},a_{2},\ldots ,a_{m})^{\top }\), \({\mathbf {b}}=(b_{1},b_{2},\ldots ,b_{n})^{\top }\) and \({\mathbf {g}}=(g_{1},g_{2},\ldots )^{\top }\). In addition, \({\mathcal {N}}(\cdot |c,\sigma ^{2})\) denotes the probability density function of the normal distribution with mean c and variance \(\sigma ^{2}\).

Analysis of Koopman Operators

1.1 Definition of Empirical Distributions

We first consider the case where the simulation data consist of S independent trajectories \(\{{\mathbf {x}}_{t}^{1}\}_{t=1}^{T},\ldots ,\{{\mathbf {x}}_{t}^{S}\}_{t=1}^{T}\) of length T, whose initial states satisfy \(x_{0}^{s}{\mathop {\sim }\limits ^{\mathrm {iid}}}p_{0}\left( {\mathbf {x}}\right) \). In this case, \(\rho _{0}\) and \(\rho _{1}\) can be defined by

$$\begin{aligned} \rho _{0}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }{\mathcal {P}}_{t}p_{0}, \quad \rho _{1}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }{\mathcal {P}}_{t+\tau }p_{0}, \end{aligned}$$
(48)

and they satisfy

$$\begin{aligned} \rho _{1}={\mathcal {P}}_{\tau }\rho _{0}, \end{aligned}$$
(49)

where \({\mathcal {P}}_{t}\) denotes the Markov propagator defined in (63). We can then conclude that the estimates of \({\mathbf {C}}_{00},{\mathbf {C}}_{11},{\mathbf {C}}_{01}\) given by (16)–(18) are unbiased and consistent as \(S\rightarrow \infty \).

In more general cases where trajectories \(\{{\mathbf {x}}_{t}^{1}\}_{t=1}^{T_{1}},\ldots ,\{{\mathbf {x}}_{t}^{S}\}_{t=1}^{T_{S}}\) are generated with different initial conditions and different lengths, similar conclusions can be obtained by defining \(\rho _{0},\rho _{1}\) as the averages of the marginal distributions of \(\{{\mathbf {x}}_{t}^{s}|1\le t\le T_{s}-\tau ,1\le s\le S\}\) and \(\{{\mathbf {x}}_{t}^{s}|1+\tau \le t\le T_{s},1\le s\le S\}\), respectively.
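
For readers who prefer code, the following minimal NumPy sketch (function and variable names are ours, not from the paper) shows how the empirical distributions in (48) translate into covariance estimates of the type referred to in (16)–(18): every trajectory contributes its time-lagged pairs \(({\mathbf {x}}_{t},{\mathbf {x}}_{t+\tau })\), which are averaged uniformly.

```python
import numpy as np

def empirical_covariances(trajs, chi_0, chi_1, lag):
    """Estimate C00, C01, C11 from a list of trajectories.

    Each trajectory contributes the time-lagged pairs (x_t, x_{t+lag}),
    t = 1, ..., T_s - lag, corresponding to the empirical distributions
    rho_0 and rho_1 defined in (48).
    """
    X_blocks, Y_blocks = [], []
    for traj in trajs:
        X_blocks.append(np.asarray([chi_0(x) for x in traj[:-lag]]))
        Y_blocks.append(np.asarray([chi_1(x) for x in traj[lag:]]))
    X = np.concatenate(X_blocks)   # instantaneous features, shape (N, m)
    Y = np.concatenate(Y_blocks)   # time-lagged features,  shape (N, n)
    N = X.shape[0]
    return X.T @ X / N, X.T @ Y / N, Y.T @ Y / N

# Example: a linear toy process with monomial features (chi_0 = chi_1).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    trajs = []
    for _ in range(5):                       # S = 5 independent trajectories
        x, traj = 0.0, []
        for _ in range(1000):                # T = 1000 steps each
            x = 0.9 * x + 0.1 * rng.standard_normal()
            traj.append(x)
        trajs.append(np.asarray(traj))
    feat = lambda x: np.array([1.0, x, x * x])
    C00, C01, C11 = empirical_covariances(trajs, feat, feat, lag=10)
    print(C00.shape, C01.shape, C11.shape)
```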

1.2 Proof of Theorem 1

Because \({\mathcal {K}}_{\tau }\) is a Hilbert–Schmidt operator from \({\mathcal {L}}_{\rho _{1}}^{2}\) to \({\mathcal {L}}_{\rho _{0}}^{2}\), there exists the following SVD of \({\mathcal {K}}_{\tau }\):

$$\begin{aligned} {\mathcal {K}}_{\tau }g=\sum _{i=1}^{\infty }\sigma _{i}\left\langle g,\phi _{i}\right\rangle _{\rho _{1}}\psi _{i}. \end{aligned}$$
(50)

Due to the orthonormality of the right singular functions, the projection of any function \(g\in {\mathcal {L}}_{\rho _{1}}^{2}\) onto the space spanned by \(\{\phi _{1},\ldots ,\phi _{k}\}\) can be written as \(\sum _{i=1}^{k}\left\langle g,\phi _{i}\right\rangle _{\rho _{1}}\phi _{i}\). Then \(\hat{{\mathcal {K}}}_{\tau }\) defined by (5) is the approximate Koopman operator deduced from model (4), and it is the best rank-k approximation to \({\mathcal {K}}_{\tau }\) in the Hilbert–Schmidt norm according to the generalized Eckart–Young theorem (see Theorem 4.4.7 in Hsing and Eubank (2015)).

Since the adjoint operator \({\mathcal {K}}_{\tau }^{*}\) of \({\mathcal {K}}_{\tau }\) satisfies

$$\begin{aligned} \left\langle f,{\mathcal {K}}_{\tau }^{*}\mathbb {1}\right\rangle _{\rho _{1}}= & {} \left\langle {\mathcal {K}}_{\tau }f,\mathbb {1}\right\rangle _{\rho _{0}}\\= & {} \int {\mathbb {E}}[f({\mathbf {x}}_{t+\tau })|{\mathbf {x}}_{t}={\mathbf {x}}]\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\\= & {} \int {\mathbb {E}}[f({\mathbf {x}})]\rho _{1}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\\= & {} \left\langle f,\mathbb {1}\right\rangle _{\rho _{1}} \end{aligned}$$

for all f, we can obtain

$$\begin{aligned} {\mathcal {K}}_{\tau }^{*}\mathbb {1}={\mathcal {K}}_{\tau }\mathbb {1}=\mathbb {1}, \end{aligned}$$
(51)

and conclude from Proposition 2 in Froyland (2013) that \((\sigma _{1},\phi _{1},\psi _{1})=(1,\mathbb {1},\mathbb {1})\).

1.3 Transition Densities Deduced from Koopman Operators

The Koopman operator can also be written as

$$\begin{aligned} {\mathcal {K}}_{\tau }g({\mathbf {x}})=\int p_{\tau }({\mathbf {x}},{\mathbf {y}})g({\mathbf {y}})\mathrm {d}{\mathbf {y}} \end{aligned}$$
(52)

if the transition density is given, which implies that

$$\begin{aligned} {\mathcal {K}}_{\tau }\delta _{{\mathbf {y}}}({\mathbf {x}})=p_{\tau }({\mathbf {x}},{\mathbf {y}}). \end{aligned}$$
(53)

Then the transition density deduced from the approximate Koopman operator \(\hat{{\mathcal {K}}}_{\tau }\) defined by (5) is

$$\begin{aligned} {\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})= & {} \hat{{\mathcal {K}}}_{\tau }\delta _{{\mathbf {y}}}({\mathbf {x}})\nonumber \\= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}({\mathbf {x}})\phi _{i}({\mathbf {y}})\rho _{1}({\mathbf {y}}). \end{aligned}$$
(54)

From (52), we can show that

$$\begin{aligned} \left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{i}\left\langle {\mathcal {K}}_{\tau }\phi _{i},{\mathcal {K}}_{\tau }\phi _{i}\right\rangle _{\rho _{0}}\nonumber \\= & {} \int \sum _{i}\left( \int p({\mathbf {x}},{\mathbf {y}})\phi _{i}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) ^{2}\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \int \sum _{i}\left( \int \frac{p({\mathbf {x}},{\mathbf {y}})}{\rho _{1}({\mathbf {y}})}\cdot \phi _{i}({\mathbf {y}})\cdot \rho _{1}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) ^{2}\rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \int \left( \int \left( \frac{p({\mathbf {x}},{\mathbf {y}})}{\rho _{1}({\mathbf {y}})}\right) ^{2}\cdot \rho _{1}({\mathbf {y}})\mathrm {d}{\mathbf {y}}\right) \rho _{0}({\mathbf {x}})\mathrm {d}{\mathbf {x}}\nonumber \\= & {} \iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}p({\mathbf {x}},{\mathbf {y}})^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}, \end{aligned}$$
(55)

and

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}=\iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}\left( {\hat{p}}({\mathbf {x}},{\mathbf {y}})-p({\mathbf {x}},{\mathbf {y}})\right) ^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}, \end{aligned}$$
(56)

i.e., the operator error between \(\hat{{\mathcal {K}}}_{\tau }\) and \({\mathcal {K}}_{\tau }\) can be represented by the error between \({\hat{p}}_{\tau }\) and \(p_{\tau }\).

It is worth pointing out that the approximate transition density in (54) satisfies the normalization constraint with

$$\begin{aligned} \int {\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {y}}= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}({\mathbf {x}})\left\langle \phi _{i},\mathbb {1}\right\rangle _{\rho _{1}}\nonumber \\= & {} \sigma _{1}\psi _{1}({\mathbf {x}})\nonumber \\\equiv & {} 1, \end{aligned}$$
(57)

but \({\hat{p}}_{\tau }({\mathbf {x}},{\mathbf {y}})\) is possibly negative for some \({\mathbf {x}},{\mathbf {y}}\). Thus, the approximate Koopman operators and transition densities are not guaranteed to yield valid probabilistic models, although they can still be utilized for the quantitative analysis of Markov processes.

1.4 Sufficient Conditions for Theorem 1

We show here that \({\mathcal {L}}_{\rho _{0}}^{2}\) and \({\mathcal {L}}_{\rho _{1}}^{2}\) are separable Hilbert spaces and that \({\mathcal {K}}_{\tau }:{\mathcal {L}}_{\rho _{1}}^{2}\mapsto {\mathcal {L}}_{\rho _{0}}^{2}\) is Hilbert–Schmidt if one of the following conditions is satisfied:

Condition 1

The state space of the Markov process is a finite set.

Proof

The proof is trivial, since \({\mathcal {K}}_{\tau }\) is a linear operator between finite-dimensional spaces, and is thus omitted. \(\square \)

Condition 2

The state space of the Markov process is \({\mathbb {R}}^{d}\), \(\rho _{0}({\mathbf {x}}),\rho _{1}({\mathbf {y}})\) are positive for all \({\mathbf {x}},{\mathbf {y}}\in {\mathbb {R}}^{d}\), and there exists a constant M so that

$$\begin{aligned} p_{\tau }({\mathbf {x}},{\mathbf {y}})\le M\rho _{1}({\mathbf {y}}),\quad \forall {\mathbf {x}},{\mathbf {y}} \end{aligned}$$
(58)

Proof

Let \(\{e_{1},e_{2},\ldots \}\) be an orthonormal basis of \({\mathcal {L}}^{2}({\mathbb {R}}^{d})\). Then \({\mathcal {L}}_{\rho _{0}}^{2},{\mathcal {L}}_{\rho _{1}}^{2}\) are separable because they have the countable orthonormal bases \(\{\rho _{0}^{-\frac{1}{2}}e_{1},\rho _{0}^{-\frac{1}{2}}e_{2},\ldots \}\) and \(\{\rho _{1}^{-\frac{1}{2}}e_{1},\rho _{1}^{-\frac{1}{2}}e_{2},\ldots \}\).

Now we prove that \(\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}<\infty \). Because

$$\begin{aligned} \iint \frac{\rho _{0}({\mathbf {x}})}{\rho _{1}({\mathbf {y}})}p_{\tau }({\mathbf {x}},{\mathbf {y}})^{2}\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}\le & {} \iint M\rho _{0}({\mathbf {x}})p_{\tau }({\mathbf {x}},{\mathbf {y}})\mathrm {d}{\mathbf {x}}\mathrm {d}{\mathbf {y}}\nonumber \\= & {} M, \end{aligned}$$
(59)

the operator \({\mathcal {S}}\) defined by

$$\begin{aligned} {\mathcal {S}}f({\mathbf {x}})=\int \sqrt{\frac{\rho _{0}({ \mathbf {x}})}{\rho _{1}({\mathbf {y}})}}p_{\tau }({\mathbf {x}},{ \mathbf {y}})f({\mathbf {y}})\mathrm {d}{\mathbf {y}} \end{aligned}$$
(60)

is a Hilbert–Schmidt integral operator from \({\mathcal {L}}^{2}({\mathbb {R}}^{d})\) to \({\mathcal {L}}^{2}({\mathbb {R}}^{d})\) with \(\left\| {\mathcal {S}}\right\| _{\mathrm {HS}}^{2}\le M\) (Renardy and Rogers 2004). Therefore,

$$\begin{aligned} \left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{i}\left\langle {\mathcal {K}}_{\tau }\rho _{1}^{-\frac{1}{2}}e_{i},{\mathcal {K}}_{\tau }\rho _{1}^{-\frac{1}{2}}e_{i}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{i}\left\langle {\mathcal {S}}e_{i},{\mathcal {S}}e_{i}\right\rangle \nonumber \\= & {} \left\| {\mathcal {S}}\right\| _{\mathrm {HS}}^{2}\le M, \end{aligned}$$
(61)

where \(\left\langle f,g\right\rangle =\int f({\mathbf {x}})g({\mathbf {x}})\mathrm {d}{\mathbf {x}}\). \(\square \)

1.5 Koopman Operators of Deterministic Systems

For completeness of the paper, we prove here the following proposition by contradiction: the Koopman operator \({\mathcal {K}}_{\tau }\) of the deterministic system \({\mathbf {x}}_{t+\tau }=F({\mathbf {x}}_{t})\) defined by

$$\begin{aligned} {\mathcal {K}}_{\tau }g({\mathbf {x}})=g(F({\mathbf {x}})) \end{aligned}$$
(62)

is not a compact operator from \({\mathcal {L}}_{\rho _{1}}^{2}\) to \({\mathcal {L}}_{\rho _{0}}^{2}\) if \({\mathcal {L}}_{\rho _{1}}^{2}\) is infinite-dimensional.

Assume that \({\mathcal {K}}_{\tau }\) is compact. Then, the SVD (50) of \({\mathcal {K}}_{\tau }\) exists with \(\sigma _{i}\rightarrow 0\) as \(i\rightarrow \infty \), and there is j so that \(0\le \sigma _{j}<1\). This implies \(\left\langle {\mathcal {K}}_{\tau }\psi _{j},{\mathcal {K}}_{\tau }\psi _{j}\right\rangle _{\rho _{0}}=\sigma _{j}^{2}<1\). However, according to the definition of the Koopman operator, \(\left\langle {\mathcal {K}}_{\tau }\psi _{j},{\mathcal {K}}_{\tau }\psi _{j}\right\rangle _{\rho _{0}}=\left\langle \psi _{j},\psi _{j}\right\rangle _{\rho _{1}}=1\), which leads to a contradiction. We can conclude that \({\mathcal {K}}_{\tau }\) is not compact and hence not Hilbert–Schmidt.

Markov Propagators

The Markov propagator \({\mathcal {P}}_{\tau }\) is defined by

$$\begin{aligned} p_{t+\tau }\left( {\mathbf {x}}\right)= & {} {\mathcal {P}}_{\tau }p_{t}\left( {\mathbf {x}}\right) \nonumber \\\triangleq & {} \int p_{\tau }\left( {\mathbf {y}},{\mathbf {x}}\right) p_{t}\left( {\mathbf {y}}\right) \mathrm {d}{\mathbf {y}}, \end{aligned}$$
(63)

with \(p_{t}\left( {\mathbf {x}}\right) ={\mathbb {P}}({\mathbf {x}}_{t}={\mathbf {x}})\) being the probability density of \({\mathbf {x}}_{t}\). According to the SVD of the Koopman operator given in (50), we have

$$\begin{aligned} p_{\tau }\left( {\mathbf {x}},{\mathbf {y}}\right) ={\mathcal {K}}_{\tau }\delta _{{\mathbf {y}}}\left( {\mathbf {x}}\right) =\sum _{i=1}^{\infty }\sigma _{i}\psi _{i}\left( {\mathbf {x}}\right) \phi _{i}\left( {\mathbf {y}}\right) \rho _{1}\left( {\mathbf {y}}\right) . \end{aligned}$$
(64)

Then

$$\begin{aligned} {\mathcal {P}}_{\tau }p_{t}\left( {\mathbf {x}}\right)= & {} \int p_{\tau }\left( {\mathbf {y}},{\mathbf {x}}\right) p_{t}\left( {\mathbf {y}}\right) \mathrm {d}{\mathbf {y}}\nonumber \\= & {} \sum _{i=1}^{\infty }\sigma _{i}\left\langle p_{t},\rho _{0}\psi _{i}\right\rangle _{\rho _{0}^{-1}}\rho _{1}\left( {\mathbf {x}}\right) \phi _{i}\left( {\mathbf {x}}\right) . \end{aligned}$$
(65)

Here, the following orthonormality relations were used:

$$\begin{aligned} \left\langle \rho _{0}\psi _{i},\rho _{0}\psi _{j}\right\rangle _{\rho _{0}^{-1}}= & {} \left\langle \psi _{i},\psi _{j}\right\rangle _{\rho _{0}}=1_{i=j}\end{aligned}$$
(66)
$$\begin{aligned} \left\langle \rho _{1}\phi _{i},\rho _{1}\phi _{j}\right\rangle _{\rho _{1}^{-1}}= & {} \left\langle \phi _{i},\phi _{j}\right\rangle _{\rho _{1}}=1_{i=j}, \end{aligned}$$
(67)

The SVD of \({\mathcal {P}}_{\tau }\) can be written as

$$\begin{aligned} {\mathcal {P}}_{\tau }p_{t}=\sum _{i=1}^{\infty }\sigma _{i}\left\langle p_{t},\rho _{0}\psi _{i}\right\rangle _{\rho _{0}^{-1}}\rho _{1}\phi _{i}. \end{aligned}$$
(68)

Proof of the Variational Principle

Notice that \({\mathbf {f}}\) and \({\mathbf {g}}\) can be expressed as

$$\begin{aligned} {\mathbf {f}}={\mathbf {D}}_{0}^{\top } \varvec{\psi },\quad {\mathbf {g}}={\mathbf {D}}_{1}^{\top }\varvec{\phi } \end{aligned}$$
(69)

where \(\varvec{\psi }=(\psi _{1},\psi _{2},\ldots )^{\top }\), \(\varvec{\phi }=(\phi _{1},\phi _{2},\ldots )^{\top }\) and \({\mathbf {D}}_{0},{\mathbf {D}}_{1}\in {\mathbb {R}}^{\infty \times k}\).

Since

$$\begin{aligned} \left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}= & {} {\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}\end{aligned}$$
(70)
$$\begin{aligned} \left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}= & {} {\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1} \end{aligned}$$
(71)

and

$$\begin{aligned} \left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}= & {} {\mathbf {D}}_{0}^{\top }\left\langle \varvec{\psi },{\mathcal {K}}_{\tau }\varvec{\phi }^{\top }\right\rangle _{\rho _{0}}{\mathbf {D}}_{1}\\= & {} {\mathbf {D}}_{0}^{\top }\left\langle \varvec{\psi },\varvec{\psi }^{\top }\right\rangle _{\rho _{0}}\varvec{\varSigma }{\mathbf {D}}_{1}\\= & {} {\mathbf {D}}_{0}^{\top }\varvec{\varSigma }{\mathbf {D}}_{1}, \end{aligned}$$

the optimization problem can be equivalently written as

$$\begin{aligned} \max _{{\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}={\mathbf {I}},{\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1}={\mathbf {I}}}\sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}, \end{aligned}$$
(72)

where \(\varvec{\varSigma }=\mathrm {diag}(\sigma _{1},\sigma _{2},\ldots )\). According to the Cauchy–Schwarz inequality and the conclusion in Section I.3.C of Marshall et al. (1979), we have

$$\begin{aligned} \sum _{i=1}^{k}\left| \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right| \le \sum _{i=1}^{k}\sigma _{i} \end{aligned}$$
(73)

and

$$\begin{aligned} \sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}\le \sum _{i=1}^{k}\left| \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right| ^{r}\le \sum _{i=1}^{k}\sigma _{i}^{r} \end{aligned}$$
(74)

under the constraint \({\mathbf {D}}_{0}^{\top }{\mathbf {D}}_{0}={\mathbf {I}},{\mathbf {D}}_{1}^{\top }{\mathbf {D}}_{1}={\mathbf {I}}\). The variational principle can then be proven by considering

$$\begin{aligned} \sum _{i=1}^{k}\left( \sigma _{i}{\mathbf {d}}_{0,i}^{\top }{\mathbf {d}}_{1,i}\right) ^{r}=\sum _{i=1}^{k}\sigma _{i}^{r} \end{aligned}$$
(75)

when the first k rows of \({\mathbf {D}}_{0}\) and \({\mathbf {D}}_{1}\) form the \(k\times k\) identity matrix.

Variational Principle of Reversible Markov Processes

The variational principle of reversible Markov processes can be summarized as follows: If the Markov process \(\{{\mathbf {x}}_{t}\}\) is time-reversible with respect to its stationary distribution \(\mu \) and all eigenvalues of \({\mathcal {K}}_{\tau }\) are nonnegative, then

$$\begin{aligned} \sum _{i=1}^{k}\lambda _{i}^{r}&=\max \sum _{i=1}^{k}\left\langle f_{i},{\mathcal {K}}_{\tau }f_{i}\right\rangle _{\mu }^{r}\nonumber \\ s.t.&\left\langle f_{i},f_{j}\right\rangle _{\mu }=1_{i=j} \end{aligned}$$
(76)

for \(r\ge 1\), and the maximal value is achieved with \(f_{i}=\psi _{i}\), where \(\psi _{i}\) denotes the eigenfunction with the ith largest eigenvalue \(\lambda _{i}\). The proof is trivial by using the variational principle of general Markov processes and considering that the eigendecomposition of \({\mathcal {K}}_{\tau }\) is equivalent to its SVD if \(\{{\mathbf {x}}_{t}\}\) is time-reversible and \(\rho _{0}=\rho _{1}=\mu \).
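
A quick numerical illustration of (76), as a sketch (the example chain and all names below are ours): for a finite reversible transition matrix with nonnegative eigenvalues, \(\mu \)-orthonormal eigenfunctions attain the bound \(\sum _{i=1}^{k}\lambda _{i}^{r}\), while randomly chosen \(\mu \)-orthonormal functions stay below it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, r = 6, 3, 2

# Reversible transition matrix with nonnegative spectrum:
# symmetric weights give a reversible P0, and P = (I + P0)/2 shifts
# all eigenvalues into [0, 1].
W = rng.random((n, n))
W = W + W.T
P0 = W / W.sum(axis=1, keepdims=True)
mu = W.sum(axis=1) / W.sum()              # stationary distribution of P0 and P
P = 0.5 * (np.eye(n) + P0)

D_half = np.diag(np.sqrt(mu))
D_half_inv = np.diag(1.0 / np.sqrt(mu))
lam, Phi = np.linalg.eigh(D_half @ P @ D_half_inv)   # symmetrized operator
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order]

def score(F):
    """sum_i <f_i, K_tau f_i>_mu^r for the columns f_i of F."""
    return np.sum(np.array([f @ np.diag(mu) @ P @ f for f in F.T]) ** r)

Psi = D_half_inv @ Phi[:, :k]             # mu-orthonormal eigenfunctions
Q, _ = np.linalg.qr(D_half @ rng.standard_normal((n, k)))
F_rand = D_half_inv @ Q                   # random mu-orthonormal functions
print(score(Psi), score(F_rand), np.sum(lam[:k] ** r))   # first equals third
```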

Analysis of Estimation Algorithms

1.1 Correctness of Feature TCCA

We show in this appendix that the feature TCCA algorithm described in Sect. 3.1 solves the optimization problem (19).

Let \({\mathbf {U}}^{\prime }={\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}=({\mathbf {u}}_{1}^{\prime },\ldots ,{\mathbf {u}}_{k}^{\prime })\) and \({\mathbf {V}}^{\prime }={\mathbf {C}}_{11}^{\frac{1}{2}}{\mathbf {V}}=({\mathbf {v}}_{1}^{\prime },\ldots ,{\mathbf {v}}_{k}^{\prime })\); then (19) can be equivalently expressed as

$$\begin{aligned} \max _{{\mathbf {U}}^{\prime },{\mathbf {V}}^{\prime }}&\sum _{i=1}^{k}\left( {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right) ^{r}\nonumber \\ \mathrm {s.t.}&{\mathbf {U}}^{\prime \top }{\mathbf {U}}^{\prime }={\mathbf {I}}\nonumber \\&{\mathbf {V}}^{\prime \top }{\mathbf {V}}^{\prime }={\mathbf {I}}. \end{aligned}$$
(77)

According to the Cauchy–Schwarz inequality and the conclusion in Section I.3.C of Marshall et al. (1979), we have

$$\begin{aligned} \sum _{i=1}^{k}\left( {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right) ^{r}\le & {} \sum _{i=1}^{k}\left| {\mathbf {u}}_{i}^{\prime \top }{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {v}}_{i}^{\prime }\right| ^{r}\nonumber \\\le & {} \sum _{i=1}^{k}s_{i}^{r} \end{aligned}$$
(78)

under the constraints \({\mathbf {U}}^{\prime \top }{\mathbf {U}}^{\prime }={\mathbf {I}},{\mathbf {V}}^{\prime \top }{\mathbf {V}}^{\prime }={\mathbf {I}}\), where \(s_{i}\) is the ith largest singular value of \({\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\). Considering that the equalities above hold when \({\mathbf {U}}^{\prime },{\mathbf {V}}^{\prime }\) are the first k left and right singular vectors of \({\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\), we get

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}&{\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})=\sum _{i=1}^{k}s_{i}^{r}\nonumber \\ \mathrm {s.t.}&{\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}={\mathbf {I}}\nonumber \\&{\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}={\mathbf {I}}, \end{aligned}$$
(79)

and the correctness of the feature TCCA algorithm then follows.

Furthermore, if \(k=\min \{\mathrm {dim}\left( \varvec{\chi }_{0}\right) ,\mathrm {dim}\left( \varvec{\chi }_{1}\right) \}\), we can get

$$\begin{aligned} \max _{{\mathbf {U}},{\mathbf {V}}}{\mathcal {R}}_{r}({\mathbf {U}},{\mathbf {V}})=\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$
(80)

under the orthonormality constraints.
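
In code, the construction behind (77)–(80) amounts to a truncated SVD of the half-weighted matrix \({\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\) followed by a back-transformation of the singular vectors. A minimal NumPy sketch (names are ours; it assumes reasonably well-conditioned covariance matrices, otherwise the decorrelation procedure described later should be used):

```python
import numpy as np

def inv_sqrt_sym(C, eps=1e-12):
    """Symmetric (pseudo-)inverse square root C^{-1/2} via eigendecomposition."""
    w, Q = np.linalg.eigh(C)
    w = np.where(w > eps, w, np.inf)        # numerically null directions are dropped
    return Q @ np.diag(w ** -0.5) @ Q.T

def feature_tcca(C00, C01, C11, k):
    """Top-k singular components of C00^{-1/2} C01 C11^{-1/2}.

    Returns the singular values s (estimates of sigma_1, ..., sigma_k) and the
    coefficient matrices U, V, so that f = U^T chi_0 and g = V^T chi_1 satisfy
    U^T C00 U = I and V^T C11 V = I.
    """
    W0, W1 = inv_sqrt_sym(C00), inv_sqrt_sym(C11)
    Up, s, Vpt = np.linalg.svd(W0 @ C01 @ W1)
    return s[:k], W0 @ Up[:, :k], W1 @ Vpt[:k].T
```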

1.2 Feature TCCA of Projected Koopman Operators

Define projection operators

$$\begin{aligned} {\mathcal {Q}}_{\varvec{\chi }_{0}}f\triangleq & {} \mathop {\mathrm{arg\,min}}\limits _{f^{\prime }\in \mathrm {span}\{\chi _{0,1},\chi _{0,2},\ldots \}}\left\langle f^{\prime }-f,f^{\prime }-f\right\rangle _{\rho _{0}}\nonumber \\= & {} \left\langle f,\varvec{\chi }_{0}^{\top }\right\rangle _{\rho _{0}}{\mathbf {C}}_{00}^{-1}\varvec{\chi }_{0},\end{aligned}$$
(81)
$$\begin{aligned} {\mathcal {Q}}_{\varvec{\chi }_{1}}g\triangleq & {} \mathop {\mathrm{arg\,min}}\limits _{g^{\prime }\in \mathrm {span}\{\chi _{1,1},\chi _{1,2},\ldots \}}\left\langle g^{\prime }-g,g^{\prime }-g\right\rangle _{\rho _{1}}\nonumber \\= & {} \left\langle g,\varvec{\chi }_{1}^{\top }\right\rangle _{\rho _{1}}{\mathbf {C}}_{11}^{-1}\varvec{\chi }_{1}, \end{aligned}$$
(82)

and let \({\mathcal {K}}_{\tau }^{\mathrm {proj}}={\mathcal {Q}}_{\varvec{\chi }_{0}}{\mathcal {K}}_{\tau }{\mathcal {Q}}_{\varvec{\chi }_{1}}\) be the projection of the Koopman operator \({\mathcal {K}}_{\tau }\) onto the subspaces of \(\varvec{\chi }_{0},\varvec{\chi }_{1}\). Then for any \(f={\mathbf {u}}^{\top }\varvec{\chi }_{0}\in \mathrm {span}\{\chi _{0,1},\chi _{0,2},\ldots \}\) and \(g={\mathbf {v}}^{\top }\varvec{\chi }_{1}\in \mathrm {span}\{\chi _{1,1},\chi _{1,2},\ldots \}\),

$$\begin{aligned} \left\langle f,{\mathcal {K}}_{\tau }^{\mathrm {proj}}g\right\rangle _{\rho _{0}}= & {} \left\langle g,\varvec{\chi }_{1}^{\top }\right\rangle _{\rho _{1}}{\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}\left\langle \varvec{\chi }_{0},f\right\rangle _{\rho _{0}}\nonumber \\= & {} {\mathbf {u}}^{\top }{\mathbf {C}}_{01}{\mathbf {v}}\nonumber \\= & {} \left\langle f,{\mathcal {K}}_{\tau }g\right\rangle _{\rho _{0}}, \end{aligned}$$
(83)

which implies that Eq. (19) can also be interpreted as the variational problem for the feature TCCA of \({\mathcal {K}}_{\tau }^{\mathrm {proj}}\).

Ignoring statistical noise, we can conclude from Theorem 2 that the \(\{(s_{i},f_{i},g_{i})\}\) provided by the feature TCCA are exactly the singular components of \({\mathcal {K}}_{\tau }^{\mathrm {proj}}\), and the optimality of the estimation result is therefore independent of the choice of \(r\ge 1\). In addition, the sum of the rth powers of all singular values of \({\mathcal {K}}_{\tau }^{\mathrm {proj}}\) is

$$\begin{aligned} \sum _{i}s_{i}^{r}=\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{r}^{r}. \end{aligned}$$
(84)

1.3 An Example of Nonlinear TCCA

Consider a stochastic system

$$\begin{aligned} x_{t+1}=\frac{1}{2}x_{t}+u_{t}, \end{aligned}$$
(85)

where \(u_{t}\) is Gaussian white noise with mean zero and variance 1. By setting

$$\begin{aligned} \rho _{0}(x)=\rho _{1}(x)={\mathcal {N}}\left( x|0,\frac{4}{3}\right) \end{aligned}$$
(86)

to be the stationary distribution and basis functions

$$\begin{aligned} \varvec{\chi }_{0}(x)=\varvec{\chi }_{1}(x)=\left( 1,\exp (-wx^{2})-\sqrt{\frac{3}{8w+3}},x\exp (-(1-w^{\frac{1}{10}})x^{2})\right) ^{\top } \end{aligned}$$
(87)

with parameter \(w\in [0.01,1]\), we can obtain

$$\begin{aligned} {\mathbf {C}}_{00}={\mathbf {C}}_{11}= & {} \mathrm {diag}\left( 1,\left( \frac{16}{3}w+1\right) ^{-\frac{1}{2}}-\frac{3}{8w+3},4\sqrt{3}\left( -16w^{\frac{1}{10}}+19\right) ^{-\frac{3}{2}}\right) ,\nonumber \\ {\mathbf {C}}_{01}= & {} \mathrm {diag}\Bigg (1,\left( \frac{16}{3}w^{2}+\frac{16}{3}w+1\right) ^{-\frac{1}{2}}-\frac{3}{8w+3},\nonumber \\&\quad \quad \quad 2\sqrt{3}\left( 16(1-w^{\frac{1}{10}})^{2}-16w^{\frac{1}{10}}+19\right) ^{-\frac{3}{2}}\Bigg ) \end{aligned}$$
(88)

The maximal VAMP-r score for a given w can then be analytically computed by

$$\begin{aligned} {\mathcal {R}}_{r}(w)=\mathrm {tr}\left[ \left( {\mathbf {C}}_{00}(w)^{-1}{\mathbf {C}}_{01}(w)\right) ^{r}\right] \end{aligned}$$
(89)

according to (24). We evaluate \({\mathcal {R}}_{r}(w)\) at 9901 equally spaced points of w in the interval [0.01, 1] for \(r=1,2\), and the maximal values of \({\mathcal {R}}_{1},{\mathcal {R}}_{2}\) are achieved at \(w=0.3157\) and \(w=0.7069\) respectively.
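
The direct search described above can be reproduced with a few lines of NumPy; the sketch below simply evaluates the analytic expressions (88) and the score (89) on the same grid of w values (variable names are ours):

```python
import numpy as np

def C00_diag(w):
    # diagonal of C00(w) = C11(w), Eq. (88)
    return np.array([1.0,
                     (16.0 / 3.0 * w + 1.0) ** -0.5 - 3.0 / (8.0 * w + 3.0),
                     4.0 * np.sqrt(3.0) * (19.0 - 16.0 * w ** 0.1) ** -1.5])

def C01_diag(w):
    # diagonal of C01(w), Eq. (88)
    return np.array([1.0,
                     (16.0 / 3.0 * w ** 2 + 16.0 / 3.0 * w + 1.0) ** -0.5
                     - 3.0 / (8.0 * w + 3.0),
                     2.0 * np.sqrt(3.0)
                     * (16.0 * (1.0 - w ** 0.1) ** 2 - 16.0 * w ** 0.1 + 19.0) ** -1.5])

def vamp_r(w, r):
    # R_r(w) = tr[(C00(w)^{-1} C01(w))^r] for diagonal matrices, Eq. (89)
    return np.sum((C01_diag(w) / C00_diag(w)) ** r)

ws = np.linspace(0.01, 1.0, 9901)          # the grid used in the text
for r in (1, 2):
    scores = np.array([vamp_r(w, r) for w in ws])
    print(f"r = {r}: maximum at w = {ws[np.argmax(scores)]:.4f}")
```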

Implementation of Estimation Algorithms

1.1 Decorrelation of Basis Functions

For convenience of notation, here we define

$$\begin{aligned} {\mathbf {X}}= & {} \left( \varvec{\chi }_{0}({\mathbf {x}}_{1}),\ldots ,\varvec{\chi }_{0}({\mathbf {x}}_{T-\tau })\right) ^{\top } \end{aligned}$$
(90)
$$\begin{aligned} {\mathbf {Y}}= & {} \left( \varvec{\chi }_{1}({\mathbf {x}}_{1+\tau }),\ldots ,\varvec{\chi }_{1}({\mathbf {x}}_{T})\right) ^{\top }. \end{aligned}$$
(91)

In this paper, we utilize principal component analysis (PCA) to explicitly reduce correlations between basis functions as follows: First, we compute the empirical means of basis functions and the covariance matrices of mean-centered basis functions:

$$\begin{aligned} \varvec{\pi }_{0}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {1}} \end{aligned}$$
(92)
$$\begin{aligned} \varvec{\pi }_{1}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {1}} \end{aligned}$$
(93)
$$\begin{aligned} \mathrm {COV}_{0}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {X}}-\varvec{\pi }_{0}\varvec{\pi }_{0}^{\top } \end{aligned}$$
(94)
$$\begin{aligned} \mathrm {COV}_{1}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {Y}}-\varvec{\pi }_{1}\varvec{\pi }_{1}^{\top }. \end{aligned}$$
(95)

Next, we perform the truncated eigendecomposition of the covariance matrices as

$$\begin{aligned} \mathrm {COV}_{0}\approx & {} {\mathbf {Q}}_{0,d}^{\top }{\mathbf {S}}_{0,d}{\mathbf {Q}}_{0,d} \end{aligned}$$
(96)
$$\begin{aligned} \mathrm {COV}_{1}\approx & {} {\mathbf {Q}}_{1,d}^{\top }{\mathbf {S}}_{1,d}{\mathbf {Q}}_{1,d}, \end{aligned}$$
(97)

where the diagonals of the matrices \({\mathbf {S}}_{0,d},{\mathbf {S}}_{1,d}\) contain all positive eigenvalues that are larger than \(\epsilon _{0}\) and the absolute values of all negative eigenvalues (\(\epsilon _{0}=10^{-10}\) in our applications). Last, the new basis functions are given by

$$\begin{aligned} \varvec{\chi }_{0}^{\mathrm {new}}=\left[ \begin{array}{c} {\mathbf {S}}_{0,d}^{-\frac{1}{2}}{\mathbf {Q}}_{0,d}\left( \varvec{\chi }_{0}-\varvec{\pi }_{0}\right) \\ \mathbb {1} \end{array}\right] ,\quad \varvec{\chi }_{1}^{\mathrm {new}}=\left[ \begin{array}{c} {\mathbf {S}}_{1,d}^{-\frac{1}{2}}{\mathbf {Q}}_{1,d}\left( \varvec{\chi }_{1}-\varvec{\pi }_{1}\right) \\ \mathbb {1} \end{array}\right] \end{aligned}$$
(98)

We denote the transformation (98) by

$$\begin{aligned} \varvec{\chi }_{0}^{\mathrm {new}},\varvec{\chi }_{1}^{\mathrm {new}}=\mathrm {DC}\left[ \varvec{\chi }_{0},\varvec{\chi }_{1}|\varvec{\pi }_{0},\varvec{\pi }_{1},\mathrm {COV}_{0},\mathrm {COV}_{1}\right] \end{aligned}$$
(99)

Then the feature TCCA algorithm with decorrelation of basis functions can be summarized as follows (a code sketch of these steps is given after the list):

  1. Compute \(\varvec{\pi }_{0},\varvec{\pi }_{1}\) and \(\mathrm {COV}_{0},\mathrm {COV}_{1}\) by (92)–(95).

  2. Let \(\varvec{\chi }_{0},\varvec{\chi }_{1}:=\mathrm {DC}\left[ \varvec{\chi }_{0},\varvec{\chi }_{1}|\varvec{\pi }_{0},\varvec{\pi }_{1},\mathrm {COV}_{0},\mathrm {COV}_{1}\right] \), and recalculate \({\mathbf {X}}\) and \({\mathbf {Y}}\) according to the new basis functions.

  3. Compute covariance matrices \({\mathbf {C}}_{00},{\mathbf {C}}_{01},{\mathbf {C}}_{11}\) by

    $$\begin{aligned} {\mathbf {C}}_{00}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {X}}\\ {\mathbf {C}}_{01}= & {} \frac{1}{T-\tau }{\mathbf {X}}^{\top }{\mathbf {Y}}\\ {\mathbf {C}}_{11}= & {} \frac{1}{T-\tau }{\mathbf {Y}}^{\top }{\mathbf {Y}} \end{aligned}$$
  4. Perform the truncated SVD \({\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}={\mathbf {U}}_{k}^{\prime }\hat{\varvec{\varSigma }}_{k}{\mathbf {V}}_{k}^{\prime \top }\).

  5. Output estimated singular components \(\hat{\varvec{\varSigma }}_{k}=\mathrm {diag}({\hat{\sigma }}_{1},\ldots ,{\hat{\sigma }}_{k})\), \({\mathbf {U}}_{k}^{\top }\varvec{\chi }_{0}=({\hat{\psi }}_{1},\ldots ,{\hat{\psi }}_{k})^{\top }\) and \({\mathbf {V}}_{k}^{\top }\varvec{\chi }_{1}=({\hat{\phi }}_{1},\ldots ,{\hat{\phi }}_{k})^{\top }\) with \({\mathbf {U}}_{k}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {U}}_{k}^{\prime }\) and \({\mathbf {V}}_{k}={\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {V}}_{k}^{\prime }\).
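
The following NumPy sketch (helper names are ours) implements steps 1–5 for given feature matrices \({\mathbf {X}},{\mathbf {Y}}\), with the decorrelation transform \(\mathrm {DC}[\cdot ]\) realized as PCA whitening of the mean-centered features plus an appended constant function:

```python
import numpy as np

def decorrelate(F, pi, COV, eps0=1e-10):
    """PCA whitening of mean-centered features, then append the constant 1, Eq. (98)."""
    w, Q = np.linalg.eigh(COV)
    keep = w > eps0                          # truncated eigendecomposition, (96)-(97)
    T = Q[:, keep] / np.sqrt(w[keep])        # whitening transformation
    F_new = (F - pi) @ T
    return np.hstack([F_new, np.ones((F.shape[0], 1))])

def feature_tcca_decorrelated(X, Y, k):
    """Feature TCCA with decorrelation of basis functions (steps 1-5)."""
    N = X.shape[0]
    # step 1: means and covariances of the raw features, (92)-(95)
    pi0, pi1 = X.mean(axis=0), Y.mean(axis=0)
    COV0 = X.T @ X / N - np.outer(pi0, pi0)
    COV1 = Y.T @ Y / N - np.outer(pi1, pi1)
    # step 2: recalculate the feature matrices in the decorrelated basis
    X, Y = decorrelate(X, pi0, COV0), decorrelate(Y, pi1, COV1)
    # step 3: covariance matrices of the new features
    C00, C01, C11 = X.T @ X / N, X.T @ Y / N, Y.T @ Y / N
    # step 4: truncated SVD of the half-weighted correlation matrix
    def inv_sqrt(C):
        w, Q = np.linalg.eigh(C)
        return Q @ np.diag(w ** -0.5) @ Q.T
    Up, s, Vpt = np.linalg.svd(inv_sqrt(C00) @ C01 @ inv_sqrt(C11))
    # step 5: singular values and coefficient matrices U_k, V_k
    return s[:k], inv_sqrt(C00) @ Up[:, :k], inv_sqrt(C11) @ Vpt[:k].T
```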

Notice that the estimated \({\mathbf {C}}_{00}\), \({\mathbf {C}}_{01}\) and \({\mathbf {C}}_{11}\) in the above algorithm satisfy

$$\begin{aligned} \left[ \begin{array}{cc} {\mathbf {C}}_{00} &{} {\mathbf {C}}_{01}\\ {\mathbf {C}}_{01}^{\top } &{} {\mathbf {C}}_{11} \end{array}\right]= & {} \frac{1}{T-\tau }\left[ \begin{array}{cc} {\mathbf {X}}^{\top }{\mathbf {X}} &{} {\mathbf {X}}^{\top }{\mathbf {Y}}\\ {\mathbf {Y}}^{\top }{\mathbf {X}} &{} {\mathbf {Y}}^{\top }{\mathbf {Y}} \end{array}\right] \nonumber \\= & {} \frac{1}{T-\tau }\left( {\mathbf {X}},{\mathbf {Y}}\right) ^{\top }\left( {\mathbf {X}},{\mathbf {Y}}\right) \nonumber \\\succeq & {} 0 \end{aligned}$$
(100)

where \({\mathbf {C}}\succeq 0\) means \({\mathbf {C}}\) is a positive semi-definite matrix. According to the Schur complement lemma, we have

$$\begin{aligned}&{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top } \preceq {\mathbf {C}}_{00}\nonumber \\&\quad \Rightarrow \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) ^{\top } \preceq {\mathbf {I}} \end{aligned}$$
(101)

where \({\mathbf {I}}\) denotes an identity matrix of appropriate size. So the estimated \(\sigma _{1}\le 1\).

Furthermore, since \({\mathbf {v}}_{0}^{\top }\varvec{\chi }_{0}={\mathbf {v}}_{1}^{\top }\varvec{\chi }_{1}=\mathbb {1}\) for \({\mathbf {v}}_{0}=(0,\ldots ,0,1)^{\top }\) and \({\mathbf {v}}_{1}=(0,\ldots ,0,1)^{\top }\),

$$\begin{aligned} \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) ^{\top }{\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {v}}_{0}= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}\left( {\mathbf {X}}^{\top }{\mathbf {X}}\right) ^{-1}{\mathbf {X}}^{\top }{\mathbf {Y}}\left( {\mathbf {Y}}^{\top }{\mathbf {Y}}\right) ^{-1}{\mathbf {Y}}^{\top }{\mathbf {X}}{\mathbf {v}}_{0}\nonumber \\= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {X}}^{+}{\mathbf {Y}}{\mathbf {Y}}^{+}{\mathbf {1}}\nonumber \\= & {} {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {v}}_{0} \end{aligned}$$
(102)

which implies that 1 is the largest singular value of \({\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\).

1.2 Parameter Optimization in Nonlinear TCCA

The optimization problem

$$\begin{aligned} \max _{{\mathbf {w}}}{\mathcal {R}}_{r}({\mathbf {w}})=\left\| {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$
(103)

can be solved by direct search as in our examples (see Appendix K.1). But for a high-dimensional parameter vector \({\mathbf {w}}\), it is more efficient to perform the optimization by the gradient descent method in the form of

$$\begin{aligned} {\mathbf {w}}\leftarrow {\mathbf {w}}+\eta \frac{\partial {\mathcal {R}}_{r}({\mathbf {w}})}{\partial {\mathbf {w}}}, \end{aligned}$$
(104)

where \(\eta \) is the step size. When \(r=2\), the gradient of \({\mathcal {R}}_{r}\) with respect to an element \(w_{i}\) in \({\mathbf {w}}\) can be written as

$$\begin{aligned} \frac{\partial {\mathcal {R}}_{r}}{\partial w_{i}}= & {} \frac{2}{T-\tau }\mathrm {tr}\left[ {\mathbf {C}}_{00}^{-1}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}\left( {\mathbf {Y}}-{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}{\mathbf {X}}\right) \left( \frac{\partial {\mathbf {X}}}{\partial w_{i}}\right) ^{\top }\right] \nonumber \\&+\frac{2}{T-\tau }\mathrm {tr}\left[ {\mathbf {C}}_{11}^{-1}{\mathbf {C}}_{01}^{\top }{\mathbf {C}}_{00}^{-1}\left( {\mathbf {X}}-{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-1}{\mathbf {Y}}\right) \left( \frac{\partial {\mathbf {Y}}}{\partial w_{i}}\right) ^{\top }\right] , \end{aligned}$$
(105)

where \({\mathbf {X}},{\mathbf {Y}}\) have the same definitions as in Appendix F.1. If the data size is too large, we can approximate the gradient based on a random subset of data in each iteration, and update \({\mathbf {w}}\) in a stochastic gradient descent manner (Andrew et al. 2013; Mardt et al. 2018).

Like feature TCCA, nonlinear TCCA also suffers from numerical singularity when \({\mathbf {C}}_{00}\) or \({\mathbf {C}}_{11}\) is not of full rank. This problem can be addressed by the decorrelation of basis functions when performing direct search. For the gradient descent method (or the stochastic gradient descent method), we can replace the objective function \({\mathcal {R}}_{r}({\mathbf {w}})\) by a regularized one

$$\begin{aligned} {\mathcal {R}}_{r}({\mathbf {w}};\epsilon )=\left\| \left( {\mathbf {C}}_{00}\left( {\mathbf {w}}\right) +\epsilon { \mathbf {I}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( {\mathbf {w}}\right) \left( {\mathbf {C}}_{11}\left( {\mathbf {w}}\right) + \epsilon {\mathbf {I}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}, \end{aligned}$$
(106)

where \(\epsilon >0\) is a hyper-parameter that can be selected by cross-validation.
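
As a sketch of how (104) and (106) can be combined in practice (all names are ours; the finite-difference gradient below is only a stand-in for the analytic gradient (105) or for backpropagation in a stochastic gradient descent setting):

```python
import numpy as np

def inv_sqrt_sym(C):
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(w ** -0.5) @ Q.T

def vamp_score_regularized(C00, C01, C11, r=2, eps=1e-6):
    """Regularized VAMP-r score, Eq. (106)."""
    M = inv_sqrt_sym(C00 + eps * np.eye(C00.shape[0])) @ C01 \
        @ inv_sqrt_sym(C11 + eps * np.eye(C11.shape[0]))
    return np.sum(np.linalg.svd(M, compute_uv=False) ** r)

def optimize_w(score_of_w, w0, step=1e-2, n_iter=200, h=1e-5):
    """Plain gradient ascent (104); score_of_w(w) should build the feature
    matrices for parameters w and return the regularized score above."""
    w = np.atleast_1d(np.asarray(w0, dtype=float))
    for _ in range(n_iter):
        grad = np.array([(score_of_w(w + h * e) - score_of_w(w - h * e)) / (2 * h)
                         for e in np.eye(w.size)])
        w = w + step * grad
    return w
```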

Relationship Between VAMP and EDMD

The proof of (21) is trivial. Here, we only show that the eigenvalue problem of \(\hat{{\mathcal {K}}}_{\tau }\) given by the feature TCCA is equivalent to that of matrix \({\mathbf {K}}_{\chi }\) as

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\lambda g\Longleftrightarrow {\mathbf {K}}_{\chi }{\mathbf {b}}=\lambda {\mathbf {b}}\text { with }g={\mathbf {b}}^{\top }\varvec{\chi } \end{aligned}$$
(107)

under the assumption that \(\varvec{\chi }_{0}=\varvec{\chi }_{1}=\varvec{\chi }\) and \({\mathbf {C}}_{00}\) is invertible, which is consistent with the spectral approximation theory in EDMD. First, if g and \(\lambda \) satisfy \(\hat{{\mathcal {K}}}_{\tau }g=\lambda g\), there must exist a vector \({\mathbf {b}}\) so that \(g={\mathbf {b}}^{\top }\varvec{\chi }\). Then

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g= & {} \lambda g\nonumber \\ \Rightarrow {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }\varvec{\chi }= & {} \lambda {\mathbf {b}}^{\top }\varvec{\chi }\nonumber \\ \Rightarrow {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }{\mathbf {C}}_{00}= & {} \lambda {\mathbf {b}}^{\top }{\mathbf {C}}_{00}\nonumber \\ \Rightarrow {\mathbf {K}}_{\chi }{\mathbf {b}}= & {} \lambda {\mathbf {b}}. \end{aligned}$$
(108)

Second, if \({\mathbf {K}}_{\chi }{\mathbf {b}}=\lambda {\mathbf {b}}\),

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }{\mathbf {b}}^{\top }\varvec{\chi }= & {} {\mathbf {b}}^{\top }{\mathbf {K}}_{\chi }^{\top }\varvec{\chi }\nonumber \\= & {} \lambda {\mathbf {b}}^{\top }\varvec{\chi }. \end{aligned}$$
(109)
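
The equivalence can also be checked numerically. The sketch below assumes the standard EDMD estimator \({\mathbf {K}}_{\chi }={\mathbf {C}}_{00}^{-1}{\mathbf {C}}_{01}\) and that, in the \(\varvec{\chi }\) basis, the untruncated feature TCCA estimate acts on coefficient vectors as \({\mathbf {b}}\mapsto {\mathbf {U}}\varvec{\varSigma }{\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {b}}\); under these assumptions both matrices have identical eigenvalues.

```python
import numpy as np

def inv_sqrt(C):
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(w ** -0.5) @ Q.T

def compare_vamp_edmd(C00, C01, C11):
    """Eigenvalues of the EDMD matrix vs. the untruncated feature TCCA model.

    Assumes chi_0 = chi_1 = chi, invertible C00, and K_chi = C00^{-1} C01.
    """
    K_chi = np.linalg.solve(C00, C01)                     # EDMD matrix
    Up, s, Vpt = np.linalg.svd(inv_sqrt(C00) @ C01 @ inv_sqrt(C11))
    U, V = inv_sqrt(C00) @ Up, inv_sqrt(C11) @ Vpt.T      # full rank, no truncation
    K_tcca = U @ np.diag(s) @ V.T @ C11                   # coefficient matrix of K-hat
    return np.sort_complex(np.linalg.eigvals(K_chi)), \
           np.sort_complex(np.linalg.eigvals(K_tcca))     # should coincide
```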

Analysis of the VAMP-E Score

1.1 Proof of (28)

Here we define

$$\begin{aligned} {\mathbf {C}}_{ff}= & {} \left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}={\mathbf {U}}^{\top }{\mathbf {C}}_{00}{\mathbf {U}}, \end{aligned}$$
(110)
$$\begin{aligned} {\mathbf {C}}_{gg}= & {} \left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}={\mathbf {V}}^{\top }{\mathbf {C}}_{11}{\mathbf {V}}, \end{aligned}$$
(111)
$$\begin{aligned} {\mathbf {C}}_{fg}= & {} \left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}={\mathbf {U}}^{\top }{\mathbf {C}}_{01}{\mathbf {V}}. \end{aligned}$$
(112)

Considering that \(\{\phi _{i}\}\) is an orthonormal basis of \({\mathcal {L}}_{\rho _{1}}^{2}\), we have

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \sum _{j}\left\langle \hat{{\mathcal {K}}}_{\tau }\phi _{j},\hat{{\mathcal {K}}}_{\tau }\phi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}{\mathbf {f}},{\mathbf {f}}^{\top }{\mathbf {K}}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\sum _{j}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle \sum _{j}\left\langle {\mathbf {g}},\phi _{j}\right\rangle _{\rho _{1}}\phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}{\mathbf {K}}\left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{ff}{\mathbf {K}}{\mathbf {C}}_{gg}\right] \end{aligned}$$
(113)

and

$$\begin{aligned} \left\langle \hat{{\mathcal {K}}}_{\tau },{\mathcal {K}}_{\tau }\right\rangle _{\mathrm {HS}}= & {} \sum _{j}\left\langle \hat{{\mathcal {K}}}_{\tau }\phi _{j},{\mathcal {K}}_{\tau }\phi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\left\langle \left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}{\mathbf {f}},\sigma _{j}\psi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \sum _{j}\sigma _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}\left\langle {\mathbf {f}},\psi _{j}\right\rangle _{\rho _{0}}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\sum _{j}\sigma _{j}\left\langle {\mathbf {f}},\psi _{j}\right\rangle _{\rho _{0}}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},\sum _{j}\sigma _{j}\psi _{j}\left\langle \phi _{j},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\right\rangle _{\rho _{0}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}\left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}\right] \nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{fg}\right] , \end{aligned}$$
(114)

where \(\left\langle \cdot ,\cdot \right\rangle _{\mathrm {HS}}\) denotes the Hilbert–Schmidt inner product of operators. Then, according to the definition of Hilbert–Schmidt norm,

$$\begin{aligned} \left\| \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}= & {} \left\| \hat{{\mathcal {K}}}_{\tau }\right\| _{\mathrm {HS}}^{2}-2\left\langle \hat{{\mathcal {K}}}_{\tau },{\mathcal {K}}_{\tau }\right\rangle _{\mathrm {HS}}+\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2}\nonumber \\= & {} \mathrm {tr}\left[ {\mathbf {K}}{\mathbf {C}}_{ff}{\mathbf {K}}{\mathbf {C}}_{gg}-2{\mathbf {K}}{\mathbf {C}}_{fg}\right] +\left\| {\mathcal {K}}_{\tau }\right\| _{\mathrm {HS}}^{2} \end{aligned}$$
(115)

1.2 Relationship Between VAMP-2 and VAMP-E

We first show that the feature TCCA algorithm maximizes VAMP-E. Notice that

$$\begin{aligned} {\mathcal {R}}_{E}({\mathbf {K}},{\mathbf {U}},{\mathbf {V}})= & {} \mathrm {tr}\left[ 2\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) ^{\top }\left( {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right) \right. \nonumber \\&\left. -\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) ^{\top }\left( {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}\right) \right] \nonumber \\= & {} -\left\| {\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\frac{1}{2}}-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}\nonumber \\= & {} -\left\| {\mathbf {U}}^{\prime }{\mathbf {K}}{\mathbf {V}}^{\prime \top }-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}, \end{aligned}$$
(116)

where \(\left\| \cdot \right\| _{F}\) denotes the Frobenius norm and \({\mathbf {U}}^{\prime }={\mathbf {C}}_{00}^{\frac{1}{2}}{\mathbf {U}}\), \({\mathbf {V}}^{\prime }={\mathbf {C}}_{11}^{\frac{1}{2}}{\mathbf {V}}\). It can be seen that the feature TCCA algorithm maximizes the first term on the right-hand side of (116) and therefore maximizes VAMP-E.

For the optimal model generated by the nonlinear TCCA, the first term on the right-hand side of (116) is equal to zero, and the second term is maximized as a function of \({\mathbf {w}}\). Thus, the nonlinear TCCA also maximizes VAMP-E.

In addition, for \({\mathbf {K}},{\mathbf {U}},{\mathbf {V}}\) provided by both feature TCCA and nonlinear TCCA,

$$\begin{aligned} {\mathcal {R}}_{E}({\mathbf {K}},{\mathbf {U}},{\mathbf {V}})= & {} -\left\| {\mathbf {U}}^{\prime }{\mathbf {K}}{\mathbf {V}}^{\prime \top }-{\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}+\left\| {\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{\mathbf {C}}_{11}^{-\frac{1}{2}}\right\| _{F}^{2}\nonumber \\= & {} -\sum _{i=k+1}^{\min \{m,n\}}K_{ii}^{2}+\sum _{i=1}^{\min \{m,n\}}K_{ii}^{2}\nonumber \\= & {} \sum _{i=1}^{k}K_{ii}^{2}\nonumber \\= & {} {\mathcal {R}}_{2}({\mathbf {U}},{\mathbf {V}}). \end{aligned}$$
(117)
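
Identity (117) can be checked numerically: for the rank-\(k\) truncated SVD model returned by feature TCCA, the VAMP-E score (116) and the VAMP-2 score coincide by the Eckart–Young theorem. The sketch below is ours and only illustrative; the covariance matrices are assumed given as NumPy arrays, in the notation of (116).

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric positive (semi)definite matrix."""
    w, Q = np.linalg.eigh(C)
    return Q @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ Q.T

def vamp_e_and_vamp_2(C00, C01, C11, k):
    """For the rank-k truncated SVD model, R_E (Eq. 116) equals R_2 (Eq. 117)."""
    Kbar = inv_sqrt(C00) @ C01 @ inv_sqrt(C11)        # whitened correlation matrix
    Up, s, Vpt = np.linalg.svd(Bar := Kbar) if False else np.linalg.svd(Kbar)
    approx = Up[:, :k] @ np.diag(s[:k]) @ Vpt[:k]     # U' K V'^T, rank-k truncation
    R_E = np.linalg.norm(Kbar, 'fro')**2 - np.linalg.norm(approx - Kbar, 'fro')**2
    R_2 = np.sum(s[:k]**2)
    return R_E, R_2                                   # identical up to round-off
```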

Subspace Variational Principle

The variational principle proposed in Sect. 2.2 can be further extended to singular subspaces of the Koopman operator as follows:

$$\begin{aligned} \sum _{i=1}^{k}\sigma _{i}^{r}\ge {\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] =\left\| {\mathbf {C}}_{ff}^{-\frac{1}{2}}{\mathbf {C}}_{fg}{\mathbf {C}}_{gg}^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$
(118)

for \(r\ge 1\), and the equality holds if \(\mathrm {span}\{\psi _{1},\ldots ,\psi _{k}\}=\mathrm {span}\{f_{1},\ldots ,f_{k}\}\) and \(\mathrm {span}\{\phi _{1},\ldots ,\phi _{k}\}=\mathrm {span}\{g_{1},\ldots ,g_{k}\}\), where \({\mathbf {C}}_{ff}=\left\langle {\mathbf {f}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}}\), \({\mathbf {C}}_{fg}=\left\langle {\mathbf {f}},{\mathcal {K}}_{\tau }{\mathbf {g}}^{\top }\right\rangle _{\rho _{0}}\) and \({\mathbf {C}}_{gg}=\left\langle {\mathbf {g}},{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}\). This statement can be proven by implementing the feature TCCA algorithm with feature functions \({\mathbf {f}}\) and \({\mathbf {g}}\).

\({\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] \) is a relaxation of VAMP-r that measures the consistency between the subspaces spanned by \({\mathbf {f}},{\mathbf {g}}\) and the dominant singular subspaces; we therefore call it the subspace VAMP-r score. It is invariant under invertible linear transformations of \({\mathbf {f}}\) and \({\mathbf {g}}\), i.e., \({\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {f}},{\mathbf {g}}\right] ={\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {A}}_{f}{\mathbf {f}},{\mathbf {A}}_{g}{\mathbf {g}}\right] \) for any invertible matrices \({\mathbf {A}}_{f},{\mathbf {A}}_{g}\).
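
A possible empirical realization of the subspace score (118) is sketched below. It estimates the covariance matrices from instantaneous features \(f(x_{t})\) and time-lagged features \(g(x_{t+\tau })\) and sums the \(r\)-th powers of the singular values of the whitened cross-covariance. The estimator and all names are ours, not from the paper.

```python
import numpy as np

def subspace_vamp_r(F0, G1, r=2, eps=1e-10):
    """Subspace VAMP-r score (Eq. 118) from data matrices
    F0[t] = f(x_t) and G1[t] = g(x_{t+tau}), both of shape (T, k)."""
    T = F0.shape[0]
    C_ff = F0.T @ F0 / T                      # <f, f^T>_{rho_0}, simple estimator
    C_gg = G1.T @ G1 / T                      # <g, g^T>_{rho_1}
    C_fg = F0.T @ G1 / T                      # <f, K_tau g^T>_{rho_0}
    w_f, Qf = np.linalg.eigh(C_ff)
    w_g, Qg = np.linalg.eigh(C_gg)
    C_ff_is = Qf @ np.diag(1.0 / np.sqrt(np.maximum(w_f, eps))) @ Qf.T
    C_gg_is = Qg @ np.diag(1.0 / np.sqrt(np.maximum(w_g, eps))) @ Qg.T
    s = np.linalg.svd(C_ff_is @ C_fg @ C_gg_is, compute_uv=False)
    return np.sum(s**r)                       # ||.||_r^r = sum of r-th powers
```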

In the cross-validation for feature TCCA, we can utilize \({\mathcal {R}}_{r}^{\mathrm {space}}\) to calculate the validation score by

$$\begin{aligned} \mathrm {CV}\left( {\mathbf {K}},{\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right)= & {} {\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) \nonumber \\= & {} {\mathcal {R}}_{r}^{\mathrm {space}}\left[ {\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {V}}^{\top }\varvec{\chi }_{1}|{\mathcal {D}}_{\mathrm {test}}\right] \nonumber \\= & {} \left\| \left( {\mathbf {U}}^{\top }{\mathbf {C}}_{00}^{\mathrm {test}}{\mathbf {U}}\right) ^{-\frac{1}{2}}\left( {\mathbf {U}}^{\top }{\mathbf {C}}_{01}^{\mathrm {test}}{\mathbf {V}}\right) \left( {\mathbf {V}}^{\top }{\mathbf {C}}_{11}^{\mathrm {test}}{\mathbf {V}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r}.\nonumber \\ \end{aligned}$$
(119)

We now analyze the difficulties of applying \({\mathcal {R}}_{r}^{\mathrm {space}}\) to the cross-validation. First, for given basis functions \(\varvec{\chi }_{0},\varvec{\chi }_{1}\), \({\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}},{\mathbf {V}}|{\mathcal {D}}_{\mathrm {test}}\right) \) is monotonically increasing with respect to k and

$$\begin{aligned} {\mathcal {R}}_{r}^{\mathrm {space}}\left( {\mathbf {U}}_{k},{\mathbf {V}}_{k}|{\mathcal {D}}_{\mathrm {test}}\right) =\left\| \left( {\mathbf {C}}_{00}^{\mathrm {test}}\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}^{\mathrm {test}}\left( {\mathbf {C}}_{11}^{\mathrm {test}}\right) ^{-\frac{1}{2}}\right\| _{r}^{r} \end{aligned}$$
(120)

is independent of the estimated singular components if \(k=\max \{\mathrm {dim}(\varvec{\chi }_{0}),\mathrm {dim}(\varvec{\chi }_{1})\}\). Therefore, k is a new hyper-parameter that cannot be determined by cross-validation. Second, on the training set, \({\mathbf {U}}_{k}^{\top }{\mathbf {C}}_{00}^{\mathrm {train}}{\mathbf {U}}_{k}={\mathbf {V}}_{k}^{\top }{\mathbf {C}}_{11}^{\mathrm {train}}{\mathbf {V}}_{k}={\mathbf {I}}\), but on the test set, \({\mathbf {U}}_{k}^{\top }{\mathbf {C}}_{00}^{\mathrm {test}}{\mathbf {U}}_{k}\) and \({\mathbf {V}}_{k}^{\top }{\mathbf {C}}_{11}^{\mathrm {test}}{\mathbf {V}}_{k}\) are possibly singular, so the validation score cannot be reliably computed.

Computation of \(\hat{{\mathcal {K}}}_{\tau }^{n}\)

The approximate Koopman operator in the form of (27) can also be written as

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }g=\left\langle g,{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}{\mathbf {f}}. \end{aligned}$$
(121)

Hence,

$$\begin{aligned} \hat{{\mathcal {K}}}_{\tau }^{n}g=\left\langle g,{\mathbf {g}}^{\top }\right\rangle _{\rho _{1}}{\mathbf {K}}\left( {\mathbf {R}}^{n-1}\right) ^{\top }{\mathbf {f}}, \end{aligned}$$
(122)

and we have

$$\begin{aligned} \left\langle f,\hat{{\mathcal {K}}}_{\tau }^{n}g\right\rangle _{\rho _{0}(n\tau )}=\left\langle f,{\mathbf {f}}^{\top }\right\rangle _{\rho _{0}(n\tau )}{\mathbf {R}}^{n-1}{\mathbf {K}}\left\langle {\mathbf {g}},g\right\rangle _{\rho _{1}} \end{aligned}$$
(123)

and

$$\begin{aligned} {\hat{p}}_{n\tau }({\mathbf {x}},{\mathbf {y}})= & {} \hat{{\mathcal {K}}}_{\tau }^{n}\delta _{{\mathbf {y}}}({\mathbf {x}})\nonumber \\= & {} {\mathbf {f}}({\mathbf {x}})^{\top }{\mathbf {R}}^{n-1}{\mathbf {K}}{\mathbf {g}}({\mathbf {y}})\rho _{1}({\mathbf {y}}), \end{aligned}$$
(124)

where

$$\begin{aligned} {\mathbf {R}}={\mathbf {K}}\left\langle {\mathbf {g}},{\mathbf {f}}^{\top }\right\rangle _{\rho _{1}}. \end{aligned}$$
(125)

Notice that substituting \({\mathbf {f}}={\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {g}}={\mathbf {V}}^{\top }\varvec{\chi }_{1}\) into (123) yields (37).
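
In matrix form, (122)–(125) state that one application of \(\hat{{\mathcal {K}}}_{\tau }\) uses \({\mathbf {K}}\) and each further application multiplies by \({\mathbf {R}}\). A small sketch of the \(n\)-step transition density (124), with all names ours and the required matrices assumed precomputed:

```python
import numpy as np

def n_step_transition_density(f_x, g_y, rho1_y, K, C_gf, n):
    """Approximate transition density p_hat_{n tau}(x, y) from Eq. (124).
    f_x = f(x) and g_y = g(y) are length-k feature vectors, rho1_y = rho_1(y),
    K is the estimated singular value matrix, and C_gf = <g, f^T>_{rho_1}."""
    R = K @ C_gf                                          # Eq. (125)
    return f_x @ np.linalg.matrix_power(R, n - 1) @ K @ g_y * rho1_y
```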

Details of Numerical Examples

1.1 One-Dimensional System

For convenience of analysis and computation, we partition the state space \([-20,20]\) into 2000 bins \(S_{1},\ldots ,S_{2000}\) uniformly, and discretize the one-dimensional dynamical system described in Example 1 as

$$\begin{aligned} {\mathbb {P}}(x_{t+1}\in S_{j}|x_{t}\in S_{i})\propto {\mathcal {N}}\left( s_{j}|\frac{s_{i}}{2}+\frac{7s_{i}}{1+0.12s_{i}^{2}}+6\cos s_{i},10\right) , \end{aligned}$$
(126)

where \(s_{i}\) is the center of the bin \(S_{i}\), and the distribution of \(x_{t}\) within any bin is taken to be uniform. All numerical computations and simulations in Examples 1, 2 and 3 are based on (126), and the initial state \(x_{0}\) is distributed according to the stationary distribution \(\rho _{0}=\rho _{1}=\mu \).
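
A sketch of the discretization (126) in NumPy follows; the bin layout and drift are as given above, and we read the second argument of \({\mathcal {N}}(\cdot |\cdot ,10)\) as a variance, which is an assumption on our part. All variable names are illustrative.

```python
import numpy as np

n_bins = 2000
centers = np.linspace(-20, 20, n_bins, endpoint=False) + 40.0 / n_bins / 2  # bin centers s_i

def drift(s):
    """Conditional mean s/2 + 7s/(1 + 0.12 s^2) + 6 cos(s) from Eq. (126)."""
    return s / 2 + 7 * s / (1 + 0.12 * s**2) + 6 * np.cos(s)

# Row-stochastic transition matrix P_ij = P(x_{t+1} in S_j | x_t in S_i)
mean = drift(centers)[:, None]                          # conditional mean for each source bin
P = np.exp(-(centers[None, :] - mean)**2 / (2 * 10.0))  # Gaussian weights, variance 10 (assumed)
P /= P.sum(axis=1, keepdims=True)                       # normalize over target bins j
```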

In Example 1, the stationary distribution and singular components of the Koopman operator are computed analytically by feature TCCA with basis functions \(\chi _{0,i}(x)=\chi _{1,i}(x)=1_{x\in S_{i}}\) as follows (a code sketch of these steps is given after the list):

  1. Compute the transition matrix \({\mathbf {P}}=[P_{ij}]=[{\mathbb {P}}(x_{t+1}\in S_{j}|x_{t}\in S_{i})]\) and the stationary vector \(\varvec{\pi }=[\pi _{i}]\) satisfying

    $$\begin{aligned} \varvec{\pi }^{\top }{\mathbf {P}}=\varvec{\pi }^{\top },\quad \sum _{i}\pi _{i}=1. \end{aligned}$$
  2. Compute the covariance matrices \({\mathbf {C}}_{00}={\mathbf {C}}_{11}=\mathrm {diag}(\varvec{\pi })\) and \({\mathbf {C}}_{01}=\mathrm {diag}(\varvec{\pi }){\mathbf {P}}\).

  3. Perform the SVD

    $$\begin{aligned} \bar{{\mathbf {K}}}={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {C}}_{01}{ \mathbf {C}}_{11}^{-\frac{1}{2}}={\mathbf {U}}^{\prime }{\mathbf {K}}{ \mathbf {V}}^{\prime \top } \end{aligned}$$

    with \({\mathbf {K}}=\mathrm {diag}(\sigma _{1},\ldots ,\sigma _{2000})\) and \(\sigma _{1}\ge \sigma _{2}\ge \ldots \ge \sigma _{2000}\).

  4. Compute \({\mathbf {U}}=[U_{ij}]={\mathbf {C}}_{00}^{-\frac{1}{2}}{\mathbf {U}}^{\prime }\) and \({\mathbf {V}}=[V_{ij}]={\mathbf {C}}_{11}^{-\frac{1}{2}}{\mathbf {V}}^{\prime }\).

  5. Output the stationary distribution \(\mu (x)=\sum _{i}50\pi _{i}\cdot 1_{x\in S_{i}}\) and the singular components \((\sigma _{i},\psi _{i}(x),\phi _{i}(x))=(\sigma _{i},\sum _{j}U_{ji}\cdot 1_{x\in S_{j}},\sum _{j}V_{ji}\cdot 1_{x\in S_{j}})\).
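
The five steps translate directly into a few lines of linear algebra. The sketch below continues the hypothetical arrays of the previous snippet (the matrix P and bin layout) and is ours, not part of the paper.

```python
# Continues the previous sketch: P is the 2000 x 2000 row-stochastic matrix.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                                    # stationary vector pi, Step 1

C00 = np.diag(pi)                                     # Step 2
C01 = np.diag(pi) @ P

Kbar = np.diag(pi**-0.5) @ C01 @ np.diag(pi**-0.5)    # whitened matrix, Step 3
Uprime, sigma, Vprime_t = np.linalg.svd(Kbar)

U = np.diag(pi**-0.5) @ Uprime                        # Step 4
V = np.diag(pi**-0.5) @ Vprime_t.T
# sigma[i], U[:, i], V[:, i] give the singular components of Step 5,
# and mu(x) = 50 * pi_i on bin S_i is the stationary density.
```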

The transition density of the projected Koopman operator \(\hat{{\mathcal {K}}}_{\tau }=\sum _{i=1}^{k}\sigma _{i}\left\langle \cdot ,\phi _{i}\right\rangle _{\rho _{1}}\psi _{i}\) is obtained by

$$\begin{aligned} {\hat{p}}_{\tau }(x,y)= & {} \hat{{\mathcal {K}}}_{\tau }\delta _{y}(x)\nonumber \\= & {} \sum _{i=1}^{k}\sigma _{i}\psi _{i}(x)\phi _{i}(y)\mu (y) \end{aligned}$$
(127)

(see Appendix A.3) and the corresponding approximate transition matrix is

$$\begin{aligned} \hat{{\mathbf {P}}}={\mathbf {U}}_{k}{\mathbf {K}}_{k}{\mathbf {V}}_{k}^{\top }\mathrm {diag}(\varvec{\pi }), \end{aligned}$$
(128)

where \({\mathbf {U}}_{k},{\mathbf {V}}_{k}\) consist of the first k columns of \({\mathbf {U}},{\mathbf {V}}\), and \({\mathbf {K}}_{k}=\mathrm {diag}(\sigma _{1},\ldots ,\sigma _{k})\). Then the relative error of \(\hat{{\mathcal {K}}}_{\tau }\) in Fig. 1e can be calculated by

$$\begin{aligned} \frac{\Vert \hat{{\mathcal {K}}}_{\tau }-{\mathcal {K}}_{\tau } \Vert _{\mathrm {HS}}}{\Vert {\mathcal {K}}_{\tau }\Vert _{\mathrm {HS}}}= \frac{\sqrt{\sum _{i=k+1}^{2000}\sigma _{i}^{2}}}{ \sqrt{\sum _{i=1}^{2000}\sigma _{i}^{2}}}, \end{aligned}$$
(129)

the long-time transition density in Fig. 2 is given by

$$\begin{aligned} {\hat{p}}_{n\tau }(x,y)=50\sum _{j}\left[ \hat{{\mathbf {P}}}^{n} \right] _{ij}\cdot 1_{y\in S_{j}}, \end{aligned}$$
(130)

and the cumulative error of \({\hat{p}}_{n\tau }(x,y)\) is

$$\begin{aligned} \mathrm {error}= & {} \sum _{n=1}^{256}\int \mu (y)^{-1}\left( {\hat{p}}_{n\tau } (x,y)-p_{n\tau }(x,y)\right) ^{2}\mathrm {d}y\nonumber \\= & {} \sum _{n=1}^{256}\sum _{j=1}^{2000}\pi _{j}^{-1}\left( \left[ \hat{{\mathbf {P}}}^{n}\right] _{ij}-\left[ {\mathbf {P}}^{n}\right] _{ij}\right) ^{2} \end{aligned}$$
(131)

for \(x\in S_{i}\).
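
Continuing the sketches above, the error measures (129) and (131) can be evaluated as follows; the truncation rank k and the initial bin index i are illustrative choices of ours.

```python
k, i = 3, 1000                                        # illustrative choices
Uk, Vk, Kk = U[:, :k], V[:, :k], np.diag(sigma[:k])

# Relative Hilbert-Schmidt error, Eq. (129)
rel_error = np.sqrt(np.sum(sigma[k:]**2) / np.sum(sigma**2))

# Approximate transition matrix, Eq. (128), and cumulative error, Eq. (131)
P_hat = Uk @ Kk @ Vk.T @ np.diag(pi)
cum_error = 0.0
Pn_hat, Pn = np.eye(n_bins), np.eye(n_bins)
for n in range(1, 257):
    Pn_hat, Pn = Pn_hat @ P_hat, Pn @ P
    cum_error += np.sum((Pn_hat[i] - Pn[i])**2 / pi)
```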

In Examples 2 and 3, the smoothing parameter w is optimized for nonlinear TCCA by the golden-section search algorithm (Press et al. 2007) as follows (a Python sketch is given after the list):

  1. Let \(a=-6\), \(b=6\), \(c=0.618a+0.382b\), \(d=0.382a+0.618b\).

  2. Compute \({\mathcal {R}}_{2}(\exp a)\), \({\mathcal {R}}_{2}(\exp b)\), \({\mathcal {R}}_{2}(\exp c)\) and \({\mathcal {R}}_{2}(\exp d)\), where \({\mathcal {R}}_{2}(w)=\left\| {\mathbf {C}}_{00}\left( w\right) ^{-\frac{1}{2}}{\mathbf {C}}_{01}\left( w\right) {\mathbf {C}}_{11}\left( w\right) ^{-\frac{1}{2}}\right\| _{F}^{2}\) and \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm.

  3. If \(\max \{{\mathcal {R}}_{2}(\exp a),{\mathcal {R}}_{2}(\exp b),{\mathcal {R}}_{2}(\exp c)\}>\max \{{\mathcal {R}}_{2}(\exp b),{\mathcal {R}}_{2}(\exp c),{\mathcal {R}}_{2}(\exp d)\}\), let \((a,b,c,d):=(a,d,0.618a+0.382d,c)\). Otherwise, let \((a,b,c,d):=(c,b,d,0.618b+0.382c)\).

  4. If \(|a-b|<10^{-3}\), output \(w=\exp z\) for the \(z\in \{a,b,c,d\}\) with the largest value of \({\mathcal {R}}_{2}(\exp z)\). Otherwise, go back to Step 2.

Furthermore, w is computed in the same way when performing nonlinear TCCA in Sects. 5.1 and 5.2.
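
The search above translates into a short routine; the sketch below is ours, with `vamp2_score` standing in for an evaluation of \({\mathcal {R}}_{2}(w)\) on the training data (its implementation depends on the kernel construction and is not shown).

```python
import numpy as np

def golden_section_max(vamp2_score, a=-6.0, b=6.0, tol=1e-3):
    """Maximize R_2(exp(z)) over z in [a, b] by the golden-section variant
    described above; returns the optimal smoothing parameter w."""
    c, d = 0.618 * a + 0.382 * b, 0.382 * a + 0.618 * b
    while abs(a - b) >= tol:
        Ra, Rb = vamp2_score(np.exp(a)), vamp2_score(np.exp(b))
        Rc, Rd = vamp2_score(np.exp(c)), vamp2_score(np.exp(d))
        if max(Ra, Rb, Rc) > max(Rb, Rc, Rd):
            a, b, c, d = a, d, 0.618 * a + 0.382 * d, c   # shrink to [a, d]
        else:
            a, b, c, d = c, b, d, 0.618 * b + 0.382 * c   # shrink to [c, b]
    best = max([a, b, c, d], key=lambda z: vamp2_score(np.exp(z)))
    return np.exp(best)
```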

1.2 Double-Gyre System

For the double-gyre system in Sect. 5.1, we first perform the temporal discretization by the Euler–Maruyama scheme as

$$\begin{aligned} {\mathbb {P}}(x_{t+\varDelta }|{\mathbf {x}}_{t})= & {} {\mathcal {N}}(x_{t+\varDelta }|x_{t}-\pi A\sin (\pi x_{t})\cos (\pi y_{t})\varDelta ,\epsilon ^{2}(x_{t}/4+1)),\nonumber \\ {\mathbb {P}}(y_{t+\varDelta }|{\mathbf {x}}_{t})= & {} {\mathcal {N}}(y_{t+\varDelta }|y_{t}+\pi A\cos (\pi x_{t})\sin (\pi y_{t})\varDelta ,\epsilon ^{2}), \end{aligned}$$
(132)

where \({\mathbf {x}}_{t}=(x_{t},y_{t})^{\top }\) and \(\varDelta =0.02\) is the step size. Then we perform the spatial discretization as

$$\begin{aligned} {\mathbb {P}}({\mathbf {x}}_{t+\varDelta }\in S_{j}|{\mathbf {x}}_{t}\in S_{i})\propto & {} {\mathcal {N}}(s_{j,x}|s_{i,x}-\pi A\sin (\pi s_{i,x})\cos (\pi s_{i,y})\varDelta ,\epsilon ^{2}(s_{i,x}/4+1))\nonumber \\&\cdot {\mathcal {N}}(s_{j,y}|s_{i,y}+\pi A\cos (\pi s_{i,x})\sin (\pi s_{i,y})\varDelta ,\epsilon ^{2}). \end{aligned}$$
(133)

Here \(S_{1},\ldots ,S_{1250}\) are the \(50\times 25\) bins that form a uniform partition of the state space \([0,2]\times [0,1]\), and \((s_{i,x},s_{i,y})\) denotes the center of \(S_{i}\). Simulation data and the “true” singular components are all computed using (133), with the initial distribution of \((x_{0},y_{0})\) being the stationary one.
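
A sketch of the spatial discretization (133) follows. The parameter values for A and \(\epsilon \) below are placeholders (the ones actually used are given in Sect. 5.1), and we read the second arguments of \({\mathcal {N}}\) in (132)–(133) as variances, which is an assumption on our part.

```python
import numpy as np

A, eps, Delta = 0.25, 0.1, 0.02        # A and eps are illustrative; Delta = 0.02 as in the text
nx, ny = 50, 25
sx = np.linspace(0, 2, nx, endpoint=False) + 1.0 / nx     # bin centers in x
sy = np.linspace(0, 1, ny, endpoint=False) + 0.5 / ny     # bin centers in y
SX, SY = np.meshgrid(sx, sy, indexing='ij')
cx, cy = SX.ravel(), SY.ravel()                            # centers (s_{i,x}, s_{i,y}) of the 1250 bins

mean_x = cx - np.pi * A * np.sin(np.pi * cx) * np.cos(np.pi * cy) * Delta
mean_y = cy + np.pi * A * np.cos(np.pi * cx) * np.sin(np.pi * cy) * Delta
var_x = eps**2 * (cx / 4 + 1)                              # x-noise variance, assumed reading of (132)
var_y = eps**2 * np.ones_like(cy)

# Unnormalized transition weights from Eq. (133), then row normalization
P = np.exp(-(cx[None, :] - mean_x[:, None])**2 / (2 * var_x[:, None])) \
  * np.exp(-(cy[None, :] - mean_y[:, None])**2 / (2 * var_y[:, None]))
P /= P.sum(axis=1, keepdims=True)
```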

In Fig. 6, the transition density at lag time \(n\tau \) is computed from the estimated singular components \(({\mathbf {K}},{\mathbf {U}}^{\top }\varvec{\chi }_{0},{\mathbf {V}}^{\top }\varvec{\chi }_{1})\) as

$$\begin{aligned} {\hat{p}}_{n\tau }({\mathbf {x}},{\mathbf {y}})=625\sum _{j}\left[ \hat{{\mathbf {P}}}^{n}\right] _{ij}\cdot 1_{{\mathbf {y}}\in S_{j}},\quad \text {for }{\mathbf {x}}\in S_{i}, \end{aligned}$$
(134)

where

$$\begin{aligned} \hat{{\mathbf {P}}}={\mathbf {U}}{\mathbf {K}}{\mathbf {V}}^{\top }\mathrm {diag}(\varvec{\rho }_{1}) \end{aligned}$$
(135)

is the approximate transition matrix, and \(\varvec{\rho }_{1}=[\varvec{\rho }_{1i}]\) with

$$\begin{aligned} \varvec{\rho }_{1i}=\frac{1}{T-\tau }\sum _{t=1}^{T-\tau }1_{{\mathbf {x}}_{t+\tau }\in S_{i}}. \end{aligned}$$
(136)
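
The empirical weight vector (136) and the matrix (135) can be assembled as in the following sketch; the trajectory is assumed to be given as an integer array of visited bin indices, the estimated `U`, `V`, `K` come from feature TCCA, and all names are ours.

```python
import numpy as np

def approximate_transition_matrix(traj_bins, U, V, K, tau, n_bins=1250):
    """Empirical rho_1 from Eq. (136) and the matrix P_hat from Eq. (135).
    traj_bins[t] is the index of the bin containing x_{t+1}."""
    T = len(traj_bins)
    rho1 = np.bincount(traj_bins[tau:], minlength=n_bins) / (T - tau)
    return U @ K @ V.T @ np.diag(rho1)
```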


Cite this article

Wu, H., Noé, F. Variational Approach for Learning Markov Processes from Time Series Data. J Nonlinear Sci 30, 23–66 (2020). https://doi.org/10.1007/s00332-019-09567-y
