Skip to main content

Online Anomaly Detection over Big Data Streams

  • Chapter
  • First Online:
Applied Data Science

Abstract

In many domains, high-quality data are used as a foundation for decision-making. An essential component to assess data quality lies in anomaly detection. We describe and empirically evaluate the design and implementation of a framework for data quality testing over real-world streams in a large-scale telecommunication network. This approach is both general—by using general-purpose measures borrowed from information theory and statistics—and scalable—through anomaly detection pipelines that are executed in a distributed setting over state-of-the-art big data streaming and batch processing infrastructures. We empirically evaluate our system and discuss its merits and limitations by comparing it to existing anomaly detection techniques, showing its high accuracy, efficiency, as well as its scalability in parallelizing operations across a large number of nodes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 139.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Hardcover Book
USD 179.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  • Dasu, T., Krishnan, S., Venkatasubramanian, S., & Yi, K. (2006). An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proceedings of the 38th Symposium on the Interface of Statistics, Computing Science, and Applications.

    Google Scholar 

  • Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31, 1794–1813.

    Article  MathSciNet  Google Scholar 

  • Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation.

    Google Scholar 

  • Flajolet, P., Fusy, É., Gandouet, O., & Meunier, F. (2007). HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Conference on Analysis of Algorithms, AofA.

    Google Scholar 

  • Gupta, M., Gao, J., Aggarwal, C. C., & Han, J. (2014). Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering.

    Google Scholar 

  • Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.

    Article  MathSciNet  Google Scholar 

  • Lee, W., & Xiang, D. (2001). Information-theoretic measures for anomaly detection. In IEEE Symposium on Security and Privacy (pp. 130–143).

    Google Scholar 

  • Ma, Q., Muthukrishnan, S., & Sandler, M. (2013). Frugal streaming for estimating quantiles. In Space-Efficient Data Structures, Streams, and Algorithms (Vol. 8066, pp. 77–96). Berlin: Springer.

    Google Scholar 

  • Marz, N., & Warren, J. (2013). Big Data: Principles and best practices of scalable realtime data systems. Greenwich, CT: Manning Publications Co.

    Google Scholar 

  • Münz, G., Li, S., & Carle, G. (2007). Traffic anomaly detection using k-means clustering. In GI/ITG Workshop MMBnet.

    Google Scholar 

  • Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science (Vol. 1).

    Google Scholar 

  • Papapetrou, O., Garofalakis, M., & Deligiannakis, A. (2012). Sketch-based querying of distributed sliding-window data streams. In Proceedings of the VLDB Endowment (Vol. 5, pp. 992–1003).

    Google Scholar 

  • The Apache Software Foundation. (2015). Spark Streaming programming guide. Retrieved from http://spark.apache.org/docs/1.0.0/streaming-programming-guide.html

  • Young, W. C., Blumenstock, J. E., Fox, E. B., & Mccormick, T. H. (2014). Detecting and classifying anomalous behavior in spatiotemporal network data. In The 20th ACM Conference on Knowledge Discovery and Mining (KDD ’14), Workshop on Data Science for Social Good.

    Google Scholar 

  • Zaharia, M., Chowdhury, M., Das, T., Dave, A., McCauley, M., Franklin, M. J., et al. (2012b). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.

    Google Scholar 

  • Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing.

    Google Scholar 

  • Zaharia, M., Das, T., Li, H., Shenker, S., & Stoica, I. (2012a). Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing.

    Google Scholar 

  • Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., & Stoica, I. (2013). Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.

    Google Scholar 

  • Zhang, J., Lou, M., Ling, T. W., & Wang, H. (2004). HOS-Miner: A system for detecting outlying subspaces in high-dimensional data. In Proceedings of the 30th International Conference on Very Large Databases.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Philippe Cudré-Mauroux .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Check for updates. Verify currency and authenticity via CrossMark

Cite this chapter

Rettig, L., Khayati, M., Cudré-Mauroux, P., Piorkówski, M. (2019). Online Anomaly Detection over Big Data Streams. In: Braschler, M., Stadelmann, T., Stockinger, K. (eds) Applied Data Science. Springer, Cham. https://doi.org/10.1007/978-3-030-11821-1_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-11821-1_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11820-4

  • Online ISBN: 978-3-030-11821-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics