Optimizing Geo-Distributed Streaming Analytics

Chandra, Abhishek; Heintz, Benjamin; Sitaraman, Ramesh

doi:10.1007/978-3-319-63962-8_155-1

Abhishek Chandra³,
Benjamin Heintz³ &
Ramesh Sitaraman⁴

195 Accesses

Abstract

Rapid data streams are generated continuously from diverse sources including users, devices, and sensors located around the globe. Modern analytics services require the analysis of large quantities of such data streams derived from disparate geo-distributed sources. Further, the analytics requirements can be complex, resulting in complex trade-offs between cost, performance, and accuracy. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide area network (WAN) to a central data warehouse, which leads to the question of how much computation should be performed at the edges versus the center. While the traditional approach to analytics processing is to send all the data to a dedicated centralized location, an alternative approach would be to push all computing to the edge for in situ processing. However, neither approach is optimal for modern analytics requirements. Instead, the optimal solution often entails carefully orchestrating the analytics processing at both the center and the edges and is driven by factors such as application, data, and resource characteristics.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

References

Agarwal S et al (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of EuroSys, pp 29–42
Google Scholar
Akidau T et al (2013) MillWheel: fault-tolerant stream processing at Internet scale. Proc VLDB Endow 6(11):1033–1044
Article Google Scholar
Akidau T et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8:1792–1803
Article Google Scholar
Amur H et al (2013) Memory-efficient groupby-aggregate using compressed buffer trees. In: Proceedings of the symposium on cloud computing (SoCC)
Google Scholar
Apache Flink (2016) Scalable batch and stream data processing. http://flink.apache.org/
Apache Storm (2015) Storm, distributed and fault-tolerant realtime computation. http://storm.apache.org/
Beam (2016) Apache Beam (incubating). http://beam.incubator.apache.org/
Boykin O, Ritchie S, O’Connel I, Lin J (2014) Summingbird: a framework for integrating batch and online mapreduce computations. In: Proceedings of VLDB, vol 7, pp 1441–1451
Google Scholar
Chandrasekaran S et al (2003) TelegraphCQ: continuous dataflow processing for an uncertain world. In: Proceedings of the conference on innovative data systems research
Book Google Scholar
Chen GJ, Wiener JL, Iyer S, Jaiswal A, Lei R, Simha N, Wang W, Wilfong K, Williamson T, Yilmaz S (2016) Realtime data processing at facebook. In: Proceedings of SIGMOD, pp 1087–1098
Google Scholar
Das T, Zhong Y, Stoica I, Shenker S (2014) Adaptive stream processing using dynamic batch sizing. In: Proceedings of the ACM symposium on cloud computing, pp 16:1–16:13
Google Scholar
Flajolet P, Fusy É, Gandouet O et al (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of the international conference on analysis of algorithms
MATH Google Scholar
Gray J et al (1997) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min Knowl Discov 1(1):29–53
Article Google Scholar
Heintz B, Chandra A, Sitaraman RK (2015) Optimizing grouped aggregation in geo-distributed streaming analytics. In: Proceedings of the ACM symposium on high-performance parallel and distributed computing, pp 133–144
Google Scholar
Heintz B, Chandra A, Sitaraman RK (2016a) Trading timeliness and accuracy in geo-distributed streaming analytics. In: Proceedings of the ACM symposium on cloud computing
Book Google Scholar
Heintz B, Chandra A, Sitaraman RK, Weissman J (2016b) End-to-end optimization for geo-distributed mapreduce. IEEE Trans Cloud Comput 4(3):293–306
Article Google Scholar
Heintz B, Chandra A, Sitaraman RK (2017) Optimizing timeliness and cost in geo-distributed streaming analytics. IEEE Trans Cloud Comput. http://ieeexplore.ieee.org/document/8031021/
Hwang JH, Cetintemel U, Zdonik S (2008) Fast and highly-available stream processing over wide area networks. In: Proceedings of ICDE, pp 804–813
Google Scholar
Kulkarni S et al (2015) Twitter heron: stream processing at scale. In: Proceedings of SIGMOD, pp 239–250
Google Scholar
Larson PA (2002) Data reduction by partial preaggregation. In: Proceedings of ICDE, pp 706–715
Google Scholar
Madden S, Franklin MJ, Hellerstein JM, Hong W (2002) TAG: a Tiny AGgregation service for ad-hoc sensor networks. In: Proceedings of OSDI, pp 131–146
Google Scholar
Nygren E, Sitaraman RK, Sun J (2010) The Akamai network: a platform for high-performance internet applications. SIGOPS Oper Syst Rev 44(3):2–19
Article Google Scholar
Peterson L, Anderson T, Culler D, Roscoe T (2003) A blueprint for introducing disruptive technology into the Internet. SIGCOMM Comput Commun Rev 33(1): 59–64
Article Google Scholar
Pietzuch P et al (2006) Network-aware operator placement for stream-processing systems. In: Proceedings of ICDE
Book Google Scholar
PlanetLab (2015) http://planet-lab.org/
Podlipnig S, Böszörmenyi L (2003) A survey of web cache replacement strategies. ACM Comput Surv 35(4):374–398
Article Google Scholar
Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. In: Proceedings of SIGCOMM, pp 421–434
Google Scholar
Qian Z et al (2013) TimeStream: reliable stream computation in the cloud. In: Proceedings of EuroSys, pp 1–14
Google Scholar
Rabkin A, Arye M, Sen S, Pai VS, Freedman MJ (2014) Aggregation and degradation in JetStream: streaming analytics in the wide area. In: Proceedings of NSDI, pp. 275–288
Google Scholar
Rajagopalan R, Varshney P (2006) Data-aggregation techniques in sensor networks: a survey. IEEE Commun Surv Tutor 8(4):48–63
Article Google Scholar
Vulimiri A, Curino C, Godfrey B, Karanasos K, Varghese G (2015a) WANalytics: analytics for a geo-distributed data-intensive world. In: Proceedings of CIDR
Book Google Scholar
Vulimiri A, Curino C, Godfrey PB, Jungblut T, Padhye J, Varghese G (2015b) Global analytics in the face of bandwidth and regulatory constraints. In: Proceedings of NSDI, pp 323–336
Google Scholar
Yu Y, Gunda PK, Isard M (2009) Distributed aggregation for data-parallel computing: interfaces and implementations. In: Proceedings of SOSP, pp 247–260
Google Scholar
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of SOSP, pp 423–438
Google Scholar

Download references

Author information

Authors and Affiliations

University of Minnesota, Minneapolis, MN, USA
Abhishek Chandra & Benjamin Heintz
University of Massachusetts, Amherst, MA, USA
Ramesh Sitaraman

Authors

Abhishek Chandra
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Heintz
View author publications
You can also search for this author in PubMed Google Scholar
Ramesh Sitaraman
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Abhishek Chandra .

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney Sch of Info Techno, Building J12, Sydney, Australia
Albert Zomaya

Section Editor information

Delft University of Technology, Delft, Netherlands
Asterios Katsifodimos
School of Informatics, University of Edinburgh, Edinburgh, UK
Pramod Bhatotia

Rights and permissions

Reprints and permissions

Copyright information

About this entry

Cite this entry

Chandra, A., Heintz, B., Sitaraman, R. (2018). Optimizing Geo-Distributed Streaming Analytics. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_155-1

Download citation

DOI: https://doi.org/10.1007/978-3-319-63962-8_155-1
Published: 01 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference MathematicsReference Module Computer Science and Engineering

Publish with us

Policies and ethics