1 Introduction

Many real-world data mining applications, including those in finance, entail modeling event occurrences in a continuous time setting. Examples of such data abound in finance, including order flows [3], trades [1], news [12], price jumps, and volatility spikes. Temporal point processes, statistical models of points scattered along the real line, are often the primary models used to address these data sets.

The Poisson process (PP) is one such statistical model that assumes independence among occurrences. Points are assumed to occur without any interaction, a property sometimes described as complete randomness [6]. PPs have been used in finance for modeling discrete event systems, e.g. limit orders [3]. While PPs lead to convenient mathematics for computing many quantities of interest analytically, they contradict the simple intuition that financial events are seldom independent of one another, i.e. that they excite each other.

Self-exciting point processes, specifically Hawkes processes (HPs) [7], have recently grown more common in quantitative finance [2] as well as in the machine learning literature [8, 9]. First explored in the context of seismology, HPs assume causal, linear, non-negative excitation behavior among occurrences. For this reason, they have been considered especially well suited to modeling financial discrete events.

Typically, HPs are applied to prediction tasks. Maximum likelihood estimates of model parameters are fit to an observation, a collection of occurrence timestamps assumed to arise from the process. Model validation or selection is then performed through predictive likelihood, or some other cross-validation metric, used to determine how good the fit is on a held-out sample. Here, instead, we present a method of model selection (or, equivalently, hypothesis testing) for self-exciting point process models. We take a Bayesian approach, and describe approximate inference and marginal likelihood estimation schemes. We present preliminary experiments on high-frequency currency, cryptocurrency, and equity limit order book data. Among a family of Bayesian inference methods, we posit that the Laplace approximation to model evidence is best suited to the problem at hand.

In Sect. 2 we first give a brief overview of self-exciting processes and Bayesian model selection before describing our inference scheme. In Sect. 3, we present a set of preliminary findings on currency price, equity order book, and cryptocurrency event sets, before concluding in Sect. 4.

2 Model

2.1 Hawkes Process

Let \(\{N(t)\}_{t \in \mathbb{R}_+}\) denote a counting process, a jump process where jump sizes are \(+1\) and \(N(0)=0\). Furthermore, we will use the overloaded notation \(N(a, b]\) to refer to the number of jumps (or, equivalently, points) in the interval \((a, b]\), also a random variable. In correspondence to a temporal point process, we think of \(N(t)\) as the number of points (event occurrences such as orders or transactions) until time \(t\).

Homogeneous Poisson processes are characterized by complete independence and stationarity assumptions. We have that \(N(a, b]\) and \(N(c, d]\) are independent random variables whenever \((a, b]\) and \((c, d]\) are disjoint intervals on the real line. Furthermore, by stationarity we have that \(\langle N(a, b] \rangle = \langle N(a + \tau, b + \tau] \rangle\) for all \(\tau\), where we let \(\langle \cdot \rangle\) denote the expectation operator. However, it is these two assumptions that preclude realistic modeling of sequences of events that may well have influenced each other.

Working with general classes of point processes where point occurrences are interdependent is difficult, both theoretically and computationally [6]. One alternative that leads to both mathematical and computational convenience is the class of temporal point processes (or, equivalently, counting processes) determined by a conditional intensity function [6]. Concretely, let \(\lambda^*\) denote the conditional intensity function of a self-exciting point process, defined by

$$\begin{aligned} \lambda ^*(t) \triangleq \lim _{\delta \downarrow 0} \delta ^{-1} \langle N(t, t + \delta ] | \mathcal {H}_t \rangle . \end{aligned}$$

Here we use \(\mathcal{H}_t\) to denote the history of events up to time \(t\). Note that setting \(\lambda^*(t) = \nu(t)\), a deterministic measurable function of \(t\), would simply yield a (nonhomogeneous) Poisson process.

HPs arise as one of the simplest examples of point processes defined through a conditional intensity [4, 6]. They model linear self-excitation behavior, where the instantaneous probability of an event occurrence is given by a linear combination of the effects of past events. A (univariate) HP is a point process determined by the conditional intensity function [6, 7].

$$\begin{aligned} \lambda ^*(t) = \mu + \sum _{t_j < t} \varphi (t - t_j). \end{aligned}$$
(1)

Here \(\mu > 0\) is the constant background (exogenous) intensity. \(\varphi: \mathbb{R}_{+} \rightarrow \mathbb{R}_{+}\) is the triggering kernel, an often monotonically decreasing function that governs self-excitation.

Fig. 1. Intensity function of a Hawkes process with exponential delay density.

We will be concerned with the case \(\varphi(x) = \alpha \theta \exp(-\theta x)\), where \(\alpha \in [0, 1)\), \(\theta > 0\). Since \(\int_0^\infty \theta \exp(-\theta x)\, dx = 1\), we can interpret the triggering kernel in terms of its parameters: \(\alpha\) governs the infectivity, i.e. the average number of new events triggered by a single event, while the remaining factor \(\theta \exp(-\theta x)\) is the exponential density of the delay between an event and the events it triggers. Note that \(\alpha < 1\) is required for stationarity.

One can think of the intensity as a stochastic process itself, excited every time a jump occurs in the underlying process \(N(t)\). That is, a jump in \(N(t)\) leads to a jump of size \(\varphi(0) = \alpha\theta\) in \(\lambda^*\). This effect then decays according to a schedule determined by the decay factor in \(\varphi\), which, in the case above, was taken as an exponential decay proportional to \(\exp(-\theta \varDelta t)\). We illustrate this effect in Fig. 1.
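To make this excite-and-decay dynamic concrete, the conditional intensity is straightforward to evaluate directly. The following is a minimal Python sketch, assuming the exponential kernel above; the function name and signature are our own illustration, not from a particular library:

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, theta):
    """Evaluate lambda*(t) = mu + sum_{t_j < t} alpha * theta * exp(-theta * (t - t_j))."""
    events = np.asarray(events)
    past = events[events < t]  # only events strictly before t contribute excitation
    return mu + alpha * theta * np.exp(-theta * (t - past)).sum()

# Example: the intensity jumps by alpha * theta at each event, then decays.
times = [0.5, 1.2, 1.3]
print(hawkes_intensity(2.0, times, mu=0.1, alpha=0.5, theta=1.0))
```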

We refer the reader to the review by Bacry et al. [2] for further details on HP and their varied applications in quantitative finance.

Finally, let us note that for any conditional intensity point process, the likelihood of finitely many points \(\varPi = \{t_i\}_{i=1}^N\), where \(0 < t_1 < \dots < t_N < T\), on a bounded interval \((0, T]\) is given by

$$ p(\varPi | \lambda^*) = \exp\left(-\int_0^T \lambda^*(s)\, ds\right) \prod_{i=1}^N \lambda^*(t_i), $$

where the conditional intensity function \(\lambda^*\) uniquely determines the process. For Poisson processes, granted that the compensator \(\int_0^T \lambda(s)\, ds\) can be computed, the evaluation of the likelihood is trivial. This is not the case in general, however. Note that the computation of the likelihood for a general HP defined as in (1) would take time \(O(N^2)\), as each intensity evaluation takes time linear in the number of events. This crucial aspect prohibits likelihood-based inference, including many Bayesian methods, in general. In the exponential-kernel HP case, however, both the log likelihood and its gradient can be computed in linear time owing to the memoryless property. In the sequel, we restrict our attention to HPs parameterized as such.
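To spell out the linear-time computation: writing \(A_i = \sum_{j < i} e^{-\theta(t_i - t_j)}\), the memoryless property yields the recursion \(A_1 = 0\), \(A_i = e^{-\theta(t_i - t_{i-1})}(1 + A_{i-1})\), so that \(\lambda^*(t_i) = \mu + \alpha\theta A_i\). A minimal sketch of the resulting \(O(N)\) log likelihood (our own illustration) follows:

```python
import numpy as np

def hawkes_loglik(times, T, mu, alpha, theta):
    """O(N) log likelihood of an exponential-kernel Hawkes process on (0, T]."""
    times = np.asarray(times)
    n = len(times)
    # Recursion A_i = exp(-theta * dt_i) * (1 + A_{i-1}) avoids the O(N^2) double sum.
    A = np.zeros(n)
    for i in range(1, n):
        A[i] = np.exp(-theta * (times[i] - times[i - 1])) * (1.0 + A[i - 1])
    log_sum = np.log(mu + alpha * theta * A).sum()
    # Compensator: mu * T + alpha * sum_i (1 - exp(-theta * (T - t_i))).
    compensator = mu * T + alpha * (1.0 - np.exp(-theta * (T - times))).sum()
    return log_sum - compensator
```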

2.2 Bayesian Model Comparison

As mentioned previously, point processes are mainly used as models of discrete events occurring asynchronously in continuous time. Compared to the discrete-time models often used in econometrics or time series forecasting, the methods for comparing and selecting models are less obvious.

Although HPs have been explored widely in finance, existing works often use cross-validation, basing model comparison on predictive likelihood or other domain-driven measures of error on held-out data. On the other hand, there is earlier work on frequentist hypothesis testing of HP vs. PP [5]. In this paper, we present work in progress on a Bayesian approach, bringing the advantages (and potential pitfalls) of encoding prior assumptions on model parameters and deriving intuitive tests of model validity.

In Bayesian model comparison, one judges models through their marginal (integrated) likelihoods, using the same calculus of probability with which one judges parameter configurations of a fixed model. Let \(p(\varPi | \varTheta)\) denote the data likelihood, and \(p(\varTheta)\) a prior distribution under a certain model. Our aim is to compute the marginal likelihood

$$ p(\varPi ) = \int p(\varPi | \varTheta ) p(\varTheta ) d\varTheta , $$

where we let \(\varTheta\) denote the vector of all model parameters. Intuitively, this quantity can be read as \(\langle p(\varPi |\varTheta) \rangle_{p(\varTheta)}\), i.e. the expected likelihood that a given model will assign to the data \(\varPi\) as parameters are drawn from the prior \(p(\varTheta)\). Note that this quantity comes with “Occam's razor” included: high-dimensional models with diffuse priors are automatically penalized. One can then use the marginal likelihoods of two different models to compare them.

Let \(p_1, p_0\) denote marginal likelihoods under two different models. The ratio

$$\begin{aligned} BF = \dfrac{p_1(\varPi )}{p_0(\varPi )} \end{aligned}$$
(2)

is known as the Bayes factor. Bayesian hypothesis tests are performed by calculating the marginal likelihood under the null (\(p_0\)) and the alternative (\(p_1\)) hypotheses, and computing \(BF\). \(BF > 10\) is conventionally taken as strong evidence that the first model (\(p_1\)) better explains the observations. Similarly, many models (or prior configurations) can be compared on the same footing.
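In practice, marginal likelihoods of observations with thousands of events underflow double precision, so the Bayes factor is best handled in the log domain. A minimal sketch (the function and the numeric values are our own illustration):

```python
import numpy as np

def log_bayes_factor(log_evidence_alt, log_evidence_null):
    """log BF = ln p_1(Pi) - ln p_0(Pi); BF > 10 corresponds to log BF > ln 10 ~ 2.3."""
    return log_evidence_alt - log_evidence_null

log_bf = log_bayes_factor(-5123.4, -5240.1)  # hypothetical log evidences
print("strong evidence for p_1" if log_bf > np.log(10) else "inconclusive")
```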

2.3 Proposed Method

Here we propose a simple hypothesis test for “self-excitation” behavior in financial events. We calculate the Bayes factor (2) by taking a homogeneous PP as the null hypothesis (\(p_0\)), and an exponential-decay HP as given in (1) as the alternative (\(p_1\)). In doing so, we explore methods of marginal likelihood estimation for HP, which also paves the way to comparing HP models.

We equip both models (\(p_0, p_1\)) with appropriate prior distributions. For the former, we choose a Gamma distribution for the constant intensity parameter. The Gamma distribution is conjugate to the PP likelihood, making marginal likelihood computation analytically tractable. For the HP, the parameters \(\mu, \alpha, \theta\) are given Gamma, Beta and Gamma priors, respectively.
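To make the conjugacy concrete: with a Gamma shape–rate prior \(\lambda \sim \text{Gamma}(a, b)\) on the constant intensity, the PP likelihood \(\lambda^N e^{-\lambda T}\) integrates in closed form to \(p_0(\varPi) = \frac{b^a}{\Gamma(a)} \frac{\Gamma(N + a)}{(T + b)^{N + a}}\). A minimal sketch of this computation (our own helper, assuming the shape–rate convention):

```python
import numpy as np
from scipy.special import gammaln

def pp_log_evidence(n_events, T, a=1.0, b=1.0):
    """Closed-form log marginal likelihood of a homogeneous PP on (0, T]
    under a Gamma(shape=a, rate=b) prior on the constant intensity."""
    return (a * np.log(b) - gammaln(a)
            + gammaln(n_events + a)
            - (n_events + a) * np.log(T + b))
```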

The marginal likelihood for HPs is intractable under any choice of prior, and we must resort to an approximation. Yet even approximation is made difficult by the computational challenges related to the likelihood outlined above. For example, one sampling-based alternative for marginal likelihood estimation, annealed importance sampling [11], requires a large number of likelihood computations before a single weighted sample can be drawn. This prohibits a realistic application of the method to HPs with large observed samples.

However, especially in the high-frequency context, we can invoke another approximation method. Financial continuous-time data sets, unlike earthquake catalogs, are characterized by large sample sizes. We find that this leads to peaked, unimodal posteriors, for which we can turn to the Laplace approximation to the marginal likelihood [10].

We approximate the posterior with a multivariate Gaussian distribution centered at the posterior mode, \(\varTheta^* = \arg\max_\varTheta p(\varTheta | \varPi)\). Given the posterior potential \(\psi(\varTheta) = p(\varPi | \varTheta) p(\varTheta)\), we approximate \(p(\varPi) = \int \psi(\varTheta)\, d\varTheta\) via

$$\begin{aligned} \ln p(\varPi) \approx \ln \psi(\varTheta^*) + \frac{3}{2} \ln 2\pi - \frac{1}{2} \ln |H|, \end{aligned}$$

where \(H = -\nabla^2 \ln \psi(\varTheta)\big|_{\varTheta = \varTheta^*}\) is the Hessian of \(-\ln \psi\) evaluated at the mode.

This method reduces marginal likelihood estimation to a series of simple steps. First, maximum a posteriori (MAP) estimates of the HP parameters are obtained. This can be achieved via expectation maximization, as well as gradient-based methods in the simple case of the univariate HP. The Hessian \(H\) can be approximated numerically or computed exactly. Software for estimating the marginal likelihood, as well as for other tasks such as posterior inference under the univariate Bayesian HP, is made available online.
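A minimal sketch of the procedure, assuming a generic negative log potential \(-\ln\psi\) (the optimizer choice and the finite-difference Hessian are our illustrative choices, not a prescription):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(neg_log_psi, theta0, eps=1e-4):
    """Laplace approximation: ln p(Pi) ~ ln psi(theta*) + (d/2) ln 2 pi - 0.5 ln |H|,
    where H is the Hessian of -ln psi at the mode theta*."""
    res = minimize(neg_log_psi, theta0, method="Nelder-Mead")  # MAP estimate
    mode, d = res.x, len(res.x)
    # Central-difference Hessian of -ln psi at the mode.
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (neg_log_psi(mode + ei + ej) - neg_log_psi(mode + ei - ej)
                       - neg_log_psi(mode - ei + ej) + neg_log_psi(mode - ei - ej)) / (4 * eps**2)
    _, logdet = np.linalg.slogdet(H)  # H should be positive definite at the mode
    return -res.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```

In our setting, `neg_log_psi` would be the negative of `hawkes_loglik` from the sketch above plus the negative log prior densities; the constraints \(\mu, \theta > 0\) and \(\alpha \in [0, 1)\) need to be respected, e.g. by a bounded optimizer.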

3 Experiments

Table 1. Results of experiments on financial data sets. \(N\) denotes the number of occurrences in the data set, and BF the computed Bayes factor. Bayesian credible intervals at the 95% level are given for \(\alpha\) and \(\theta\)

Our experiments cover a range of financial event sets. FX are high-frequency (millisecond-range) tick events in an interbank currency exchange, previously investigated using HPs [13]. We model three large-volume currency pairs selected at random. Crypto are price increase events on three large-cap cryptocurrencies on the cryptocurrency exchange Bittrex, sampled at five-minute (low) frequency. Finally, LOB are limit order arrivals for a large-cap bank stock on the Turkish equity exchange, Borsa Istanbul, sampled at very high (nanosecond-range) frequency. Samples of each data set are given in Fig. 2. In FX and LOB, we limit event sets to 1000 events, corresponding to roughly 10 min of trading. Observe that in both data sets, the data cluster around certain points in time. This effect is less pronounced in Crypto.

Fig. 2. Data samples from the three data sets. The x-axis denotes time of occurrence; the y-axis is random noise added for better visibility.

We report the results of our tests, where we calculate the Bayes factor as described in Sect. 2.3. We further present 95% Bayesian credible intervals for the triggering kernel parameters, where we use a simple random walk Metropolis (RWM) [10] algorithm to draw from the posterior.
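A minimal sketch of such a sampler (the proposal scale, seed, and log-posterior callable are our illustrative choices, not the paper's exact settings):

```python
import numpy as np

def rwm_sample(log_post, theta0, n_samples=10000, step=0.05, seed=0):
    """Random walk Metropolis: propose theta' = theta + N(0, step^2 I),
    accept with probability min(1, exp(log_post(theta') - log_post(theta))).
    log_post should return -np.inf outside the support (e.g. alpha >= 1)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta.copy())
    return np.array(chain)

# 95% credible interval for parameter i, after discarding burn-in:
# np.percentile(chain[1000:, i], [2.5, 97.5])
```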

We present the results in Table 1. The test accurately captures that low-frequency price jumps do not present sufficient evidence in favor of self-excitation. In FX and LOB, however, we find overwhelming evidence that the HP outperforms the PP. Note, however, that if one were to register only large return jumps as events, HPs could still fit the data at lower frequencies. This is not surprising; its analogue in the discrete-time setting is known as volatility clustering.

There are, however, two issues we must address. First, Bayesian analysis is well known to be sensitive to the choice of priors. In our analyses, we find that large data sets easily mitigate this effect. In Fig. 3, we change the scale hyperparameter of the prior for \(\theta\), which parameterizes the delay distribution. We find that, except for unrealistic choices of priors which set the average delay to less than 0.01 ms, the conclusion is largely unaffected. Varying other hyperparameters leads to similar conclusions.

Finally, let us note that this paper, like many others in the field, assumes a constant background intensity \(\mu\). The test in this paper also assumes a homogeneous PP as the null hypothesis. However, the exogenous process that governs financial events is often not stationary. For example, financial events follow intraday, weekly and yearly cycles. Our test, and many other investigations of HPs, are prone to capturing this effect and explaining it away using the endogenous component of the HP. We test this effect using a toy data set drawn from a nonhomogeneous PP with intensity \(\lambda(t) \propto \exp(\sin t)\) (see Fig. 4). On these data, our test rejects the PP null, although the nonstationarity is purely exogenous. In our experiments, we mitigate the potential effect of periodicity by sampling short time intervals.
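For reference, such a toy draw can be generated by Lewis thinning, using that \(\lambda(t) = c\, e^{\sin t}\) is bounded above by \(c\, e\); the sketch below is our own illustration, with an arbitrary constant \(c\) and horizon \(T\):

```python
import numpy as np

def sample_nhpp(lam, lam_max, T, seed=0):
    """Draw from a nonhomogeneous PP on (0, T] by Lewis thinning:
    simulate a rate-lam_max homogeneous PP, keep each point t
    independently with probability lam(t) / lam_max."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(lam_max * T)
    cand = np.sort(rng.uniform(0.0, T, n))
    keep = rng.random(n) < lam(cand) / lam_max
    return cand[keep]

# Toy intensity lambda(t) = c * exp(sin t), bounded by c * e.
c = 5.0
events = sample_nhpp(lambda t: c * np.exp(np.sin(t)), c * np.e, T=50.0)
```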

Fig. 3. Logarithm of the Bayes factor as the scale hyperparameter of \(p(\theta)\) is changed, for EURUSD in the FX data set.

Fig. 4. A draw from a nonhomogeneous Poisson process with periodic intensity.

4 Conclusion

We combined techniques from Bayesian machine learning and evolutionary point processes for modeling high-frequency financial data. We cast HPs in a Bayesian setting, and discussed the computations underlying a Bayesian model comparison scheme for testing “self-excitation” behavior in financial events, as well as posterior inference. Early experiments confirm basic intuition regarding high-frequency financial events.

Our method can be used to capture self-excitation effects in financial discrete event data, much in the same way conditional heteroskedasticity models capture volatility clustering. However, the test assumes that background intensities are stationary, which can lead to pitfalls in financial analysis. Addressing this issue constitutes the next step of this study.