Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile

Yeh, Chin-Chia Michael; Zhu, Yan; Ulanova, Liudmila; Begum, Nurjahan; Ding, Yifei; Dau, Hoang Anh; Zimmerman, Zachary; Silva, Diego Furtado; Mueen, Abdullah; Keogh, Eamonn

doi:10.1007/s10618-017-0519-9

Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile

Published: 24 June 2017

Volume 32, pages 83–123, (2018)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Chin-Chia Michael Yeh¹^na1,
Yan Zhu¹^na1,
Liudmila Ulanova¹,
Nurjahan Begum¹,
Yifei Ding¹,
Hoang Anh Dau¹,
Zachary Zimmerman¹,
Diego Furtado Silva²,
Abdullah Mueen³ &
…
Eamonn Keogh¹

3368 Accesses
83 Citations
4 Altmetric
Explore all metrics

Abstract

The last decade has seen a flurry of research on all-pairs-similarity-search (or similarity joins) for text, DNA and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. However, there has been surprisingly little progress made on similarity joins for time series subsequences. The lack of progress probably stems from the daunting nature of the problem. For even modest sized datasets the obvious nested-loop algorithm can take months, and the typical speed-up techniques in this domain (i.e., indexing, lower-bounding, triangular-inequality pruning and early abandoning) at best produce only one or two orders of magnitude speedup. In this work we introduce a novel scalable algorithm for time series subsequence all-pairs-similarity-search. For exceptionally large datasets, the algorithm can be trivially cast as an anytime algorithm and produce high-quality approximate solutions in reasonable time and/or be accelerated by a trivial porting to a GPU framework. The exact similarity join algorithm computes the answer to the time series motif and time series discord problem as a side-effect, and our algorithm incidentally provides the fastest known algorithm for both these extensively-studied problems. We demonstrate the utility of our ideas for many time series data mining problems, including motif discovery, novelty discovery, shapelet discovery, semantic segmentation, density estimation, and contrast set mining. Moreover, we demonstrate the utility of our ideas on domains as diverse as seismology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring and medicine.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

catch22: CAnonical Time-series CHaracteristics

Article Open access 09 August 2019

A review and evaluation of elastic distance functions for time series clustering

Article Open access 07 September 2023

Elastic similarity and distance measures for multivariate time series

Article Open access 14 February 2023

Notes

There are many such worse case scenarios, including high levels of noise blurring the distinction between closest and furthest neighbors and thus rendering triangular-inequality pruning and early abandoning worthless.

References

Agrawr R, Faloutsos C, Swami AN (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th international conference on foundations of data organization and algorithms (FODO’93), pp 69–84
Alibaba.com (2017) http://www.alibaba.com/showroom/seismograph.html
Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: Proceedings of the 17th international conference on database systems for advanced applications—volume part I (DASFAA’12), pp 228–242
Bavardo RJ, Ma Y, Srikant R (2007) Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, pp 131–140
Begum N, Keogh E (2014) Rare time series motif discovery from unbounded streams. In: Proceedings of the VLDB endowment (VLDB), vol 8(2), pp 149–160
Beroza G (2016) Personal correspondence. Jan 21, 2016
Bouezmarni T, Rombouts J (2010) Nonparametric density estimation for positive time series. Comput Stat Data Anal 54:245–261
Article MathSciNet MATH Google Scholar
Brown AEX, Yemini EI, Grundy LJ, Jucikas T, Schafer WR (2013) A dictionary of behavioral motifs reveals clusters of genes affecting caenorhabditis elegans locomotion. Proc Natl Acad Sci USA 110:791–796
Article Google Scholar
Chandola V, Cheboli D, Kumar V (2009) Detecting anomalies in a time series database. UMN TR09-004
Chen Y, Keogh E, Hu B, Begum N, Bagnall A, Mueen A, Batista G (2015) The UCR time series classification archive. http://www.cs.ucr.edu/~eamonn/time_series_data/
CMU Motion Capture Database (2017) http://mocap.cs.cmu.edu/
Convolution (2016) Wikipedia, The Free Encyclopedia https://en.wikipedia.org/wiki/Convolution
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: Proceedings of the VLDB endowment (VLDB), vol 1(2), pp 1542–1552
Dittmar C, Hildebrand K. F, Gaertner D, Winges M, Müller F, Aichroth P (2012) Audio forensics meets music information retrieval—a toolbox for inspection of music plagiarism. In: 2012 Proceedings of the 20th European signal processing conference (EUSIPCO), pp 126–131
Geller RJ, Mueller CS (1980) Four similar earthquakes in central California. Geophys Res Lett 7:821–824
Article Google Scholar
Gomez-Valero L, Rusniok C, Cazalet C, Buchrieser C (2011) Comparative and functional genomics of legionella identified eukaryotic like proteins as key players in host-pathogen interactions. Front Microbiol 2:208
Article Google Scholar
Hao MC, Marwah M, Janetzko H, Dayal U, Keim DA, Patnaik D, Ramakrishnan N, Sharma RK (2012) Visual exploration of frequent patterns in multivariate time series. Inf Vis 11:71–83
Article Google Scholar
Hassanieh H, Indyk P, Katabi D, Price E (2012) Nearly optimal sparse Fourier transform. In: Proceedings of the forty-fourth annual ACM symposium on theory of computing (STOC), pp 563–78
Hu B, Chen Y, Zakaria J, Ulanova L, Keogh EJ (2013) Classification of multi-dimensional streaming time series by weighting each classifier’s track record. In: 2013 IEEE 13th international conference on data mining (ICDM), pp 281–290
Huang T, Zhu Y, Mao Y, Li X, Liu M, Wu Y, Ha Y, Dobbie G (2016) Parallel discord discovery. In: Advances in knowledge discovery and data mining: 20th Pacific-Asia conference, PAKDD 2016, Auckland, New Zealand, April 19–22, 2016. Proceedings, Part II, pp 233–244
Hughes JF, Skaletsky H, Pyntikova T, Graves TA, van Daalen SK, Minx PJ, Fulton RS, McGrath SD, Locke DP, Friedman C, Trask BJ, Mardis ER, Warren WC, Repping S, Rozen S, Wilson RK, Page DC (2010) Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463:536–539
Article Google Scholar
Lee H, Ng R, Shim K (2011) Similarity join size estimation using locality sensitive hashing. In: Proceedings of the VLDB endowment (VLDB), vol 4(6), pp 338–349
Li Y, Yiu ML, Gong Z (2015) Quick-motif: an efficient and scalable framework for exact motif discovery. In: 2015 IEEE 31st international conference on data engineering (ICDE), pp 579–590
Lian X, Chen L (2009) Efficient join processing on uncertain data streams. In: Proceedings of the ACM conference on information and knowledge management (CIKM), pp 857–866
Luo W, Tan H, Mao H, Ni, LM (2012) Efficient similarity joins on massive high-dimensional datasets using MapReduce. In: 2012 IEEE 13th international conference on mobile data management (MDM), pp 1–10
Ma Y, Meng X, Wang S (2016) Parallel similarity joins on massive high-dimensional data using MapReduce. Concurr Comput 28(1):166–183
Article Google Scholar
Makonin SV (2013) AMPds: a public dataset for load disaggregation and eco-feedback research. In: 2013 IEEE electrical power & energy conference (EPEC)
Morales GDF, Gionis A (2016) Streaming similarity self-join. In: Proceedings of the VLDB endowment (VLDB), vol 9(10), pp 792–803
Motamedi-Fakhr S, Moshrefi-Torbati M, Hill M, Hill CM, White PR (2014) Signal processing techniques applied to human sleep EEG signals—a review. Biomed Signal Process Control 10(2014):21–33
Article Google Scholar
Mueen A, Hamooni H, Estrada T (2014) Time series join on subsequence correlation. In: Proceedings of the 2014 IEEE international conference on data mining (ICDM), pp 450–459
Mueen A, Keogh E, Young N (2011) Logical-shapelets: an expressive primitive for time series classification. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1154–1162
Mueen A, Keogh E, Zhu Q, Cash S, Westover B (2009) Exact discovery of time series motif. In: Proceedings of the 2009 SIAM international conference on data mining (SDM), pp 473–484
Mueen A, Nath S, Liu J (2010) Fast approximate correlation for massive time-series data. In; Proceedings of the 2010 ACM SIGMOD international conference on management of data, pp 171–182
Murray D, Liao J, Stankovic L, Stankovic V, Hauxwell-Baldwin R, Wilson C, Coleman M, Kane T, Firth S (2015) A Data management platform for personalised real-time energy feedback. In: Proceedings of the 8th international conference on energy efficiency in domestic appliances and lighting (EEDAL), pp 1293–1307
Niennattrakul V, Keogh, EJ, Ratanamahatana, CA (2010) Data editing techniques to allow the application of distance-based outlier detection to streams. In: 2010 IEEE 10th international conference on data mining (ICDM), pp 947–952
Patnaik D, Manish M, Sharma RK, Ramakrishnan N (2009) Sustainable operation and management of data center chillers using temporal data mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1305–1314
Quick Motif (2015) http://degroup.cis.umac.mo/quickmotifs/
Rakthanmanon T, Champana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 262–270
Rakthanmanon T, Champana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2013a) Addressing big data time series: mining trillions of time series subsequences under dynamic time warping. ACM Trans Knowl Discov Data 7(3):10
Article Google Scholar
Rakthanmanon T, Keogh E (2013b) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining (SDM), pp 668–676
Reiss A, Weber M, Stricker D (2011) Exploring and extending the boundaries of physical activity recognition. In: IEEE International conference on systems, man, and cybernetics (SMC), pp 46–50
Seidl T, Assent I, Kranen K, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology (EDBT), pp 311–322
Shao H, Marwah M, Ramakrishnan N (2013) A temporal motif mining approach to unsupervised energy disaggregation: applications to residential and commercial buildings. In: Proceedings of the twenty-seventh AAAI conference on artificial intelligence (AAAI), pp 1327–1333
Supporting Page (2017) http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
Truong CD, Anh DT (2015) An efficient method for motif and anomaly detection in time series based on clustering. Int J Bus Intell Data Min 10(4):356–377
Article Google Scholar
Tucker A, Liu X (2004) A Bayesian network approach to explaining time series with changing structure. Intell Data Anal 8(5):469–480
Google Scholar
Ueno K, Xi X, Keogh EJ, Lee D-J (2006) Anytime classification using the nearest neighbor algorithm with applications to stream mining. In: Sixth international conference on data mining (ICDM), 2006
Vlachos M, Meek C, Vagena Z, Gunopulos D (2004) Identifying similarities, periodicities and bursts for online search queries. In: Proceedings of the 2004 ACM SIGMOD international conference on management of data, pp 131–142
Vlachos M, Vagena Z, Yu PS, Athitsos V (2005) Rotation invariant indexing of shapes and line drawings. In: Proceedings of the 14th ACM international conference on information and knowledge management (CIKM), pp 131–138
Wintner A (1934) On analytic convolutions of Bernoulli distributions. Am J Math 56(1/4):659–663
Article MathSciNet MATH Google Scholar
Ye L and Keogh E (2009) Time series shapelets: a new primitive for data mining. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 947–956
Yeh C-C. M, van Herle H, Keogh E (2016a) Matrix profile III: the matrix profile allows visualization of salient subsequences in massive time series. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 579–588
Yeh C-C M, Zhu Y, Ulanova L, Begum N, Ding Y, Dau H A, Silva D F, Mueen A, Keogh E (2016b) Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 1317–1322
Yoon C, O’Reilly O, Bergen K, Beroza G (2015) Earthquake detection through computationally efficient similarity search. Sci Adv 1(11):e1501057
Article Google Scholar
Zhou F, Torre F, Hodgins J (2008) Aligned cluster analysis for temporal segmentation of human motion. In: 2008 8th IEEE international conference on automatic face & gesture recognition, pp 1–7
Zhu Y, Zimmerman Z, Senobari N S, Yeh C-C M, Funning G, Mueen A, Brisk P, Keogh E (2016) Matrix profile II: exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 739–748
Zilberstein S, Russell S (1995) Approximate reasoning using anytime algorithms. In: Imprecise and approximate computation, pp 43–62

Download references

Acknowledgements

We gratefully acknowledge funding from NSF IIS-1161997 II, MERL Labs and Samsung, and all the data donors.

Author information

Chin-Chia Michael Yeh and Yan Zhu contributed equally to this work, and should be considered joint first authors.

Authors and Affiliations

University of California, Riverside, Riverside, CA, USA
Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Zachary Zimmerman & Eamonn Keogh
Universidade de São Paulo, São Carlos, Brazil
Diego Furtado Silva
University of New Mexico, Albuquerque, NM, USA
Abdullah Mueen

Authors

Chin-Chia Michael Yeh
View author publications
You can also search for this author in PubMed Google Scholar
Yan Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Liudmila Ulanova
View author publications
You can also search for this author in PubMed Google Scholar
Nurjahan Begum
View author publications
You can also search for this author in PubMed Google Scholar
Yifei Ding
View author publications
You can also search for this author in PubMed Google Scholar
Hoang Anh Dau
View author publications
You can also search for this author in PubMed Google Scholar
Zachary Zimmerman
View author publications
You can also search for this author in PubMed Google Scholar
Diego Furtado Silva
View author publications
You can also search for this author in PubMed Google Scholar
Abdullah Mueen
View author publications
You can also search for this author in PubMed Google Scholar
Eamonn Keogh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chin-Chia Michael Yeh.

Additional information

Responsible editor: Jian Pei.

Appendix: On the unpredictable time needed for state-of-the-art algorithms

In Section 4.8 we made some unintuitive observations about all known rival motif discovery/time series join algorithms. In essence, by making the problem apparently slightly easier, by either reducing the dimensionality or time series length, the time needed can get actually much worse (and vice versa). Here we sketch out an explanation for this fact.

The key observation is that these algorithms all use some form of pruning. The utility of pruning depends on two things; the distance between the discovered motifs (relative to all pairwise distances), and how quickly the algorithm can find these best motifs (or some good best-so-far motif) to enable the pruning strategy to extract the most benefit. Note that the former factor is a property of the data, not the algorithm.

Imagine we construct a dataset of length 100,000, and search for motifs of length 100. If our data is just random numbers, this is the worst case for both Li et al. (2015) and Mueen et al. (2009), as the intrinsic dimensionality is the same as the actual dimensionality. In MATLAB, we could create such a dataset with:

$$\begin{aligned}>> \mathtt{data1} = \mathtt{[rand(100000,1)]}; \end{aligned}$$

As this is the worst case for Li et al. (2015) and Mueen et al. (2009), both degenerate to brute force search and will take several hours to finish. Naturally the “motif” they discover will only be slightly closer than any randomly chosen pair of subsequences.

Now let us create a near identical dataset, but one which has a critical difference, this dataset has a perfect motif embedded at the beginning and at the end of the time series:

$$\begin{aligned}>> \mathtt{pattern}= & {} \mathtt{rand(100,1)};\\>> \mathtt{data2}= & {} \mathtt{[pattern; rand(99800,1); pattern]}; \end{aligned}$$

If we run motif discovery on this dataset, both Li et al. (2015) and Mueen et al. (2009) terminate much faster, in just seconds. This is because both will find the embedded motif early on, and this will allow very aggressive pruning.

Finally, suppose we consider a new dataset, which is simply data2 with the last point truncated:

$$\begin{aligned}>> \mathtt{data3} = \mathtt{data2(1:end-1)}; \end{aligned}$$

It is clear that although this dataset is very slightly smaller than data2, the time needed by either Li et al. (2015) or Mueen et al. (2009) will return to the many hours needed for than data1. This is because the best motif in data3 will once again be a time series pair that is only be slightly closer than any randomly chosen pair of subsequences, and the pruning will thus be ineffective. By similar reasoning we can construct the two other cases noted in Section 4.8.

Finally, we note that although the examples above are contrived and “worst case”, in practice both Li et al. (2015) and Mueen et al. (2009) do vary greatly in the time require to terminate, on real datasets that appear essentially identical to human inspection.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yeh, CC.M., Zhu, Y., Ulanova, L. et al. Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Min Knowl Disc 32, 83–123 (2018). https://doi.org/10.1007/s10618-017-0519-9

Download citation

Received: 07 October 2016
Accepted: 12 June 2017
Published: 24 June 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s10618-017-0519-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile

Abstract

Access this article

Similar content being viewed by others

catch22: CAnonical Time-series CHaracteristics

A review and evaluation of elastic distance functions for time series clustering

Elastic similarity and distance measures for multivariate time series

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: On the unpredictable time needed for state-of-the-art algorithms

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile

Abstract

Access this article

Similar content being viewed by others

catch22: CAnonical Time-series CHaracteristics

A review and evaluation of elastic distance functions for time series clustering

Elastic similarity and distance measures for multivariate time series

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix: On the unpredictable time needed for state-of-the-art algorithms

Appendix: On the unpredictable time needed for state-of-the-art algorithms

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation