Abstract
We consider the problem of estimating the number of distinct values in a data stream with repeated values. Distinct-values estimation was one of the first data stream problems studied: in the mid-1980s, Flajolet and Martin gave an effective algorithm that uses only logarithmic space. Recent work has built upon their technique, improving the accuracy guarantees on the estimation, proving lower bounds, and considering other settings such as sliding windows, distributed streams, and sensor networks.
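To make the logarithmic-space idea concrete, the following is a minimal Python sketch in the spirit of Flajolet and Martin's probabilistic counting with stochastic averaging. It is an illustration under stated assumptions, not the chapter's own presentation: the function names (fm_estimate, _hash64, _rho), the use of hashlib.blake2b as the hash function, and the default of 64 bitmaps are all choices made here for the example. The constant 0.77351 is the correction factor from Flajolet and Martin's analysis.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def _hash64(value):
    """Map a value to a 64-bit integer (blake2b is an assumption of this sketch)."""
    data = str(value).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def _rho(x, width=64):
    """0-based position of the least-significant 1 bit of x; width if x == 0."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def fm_estimate(stream, num_maps=64):
    """Estimate the number of distinct values seen in `stream`.

    Keeps `num_maps` small bitmaps; each item sets one bit in one bitmap,
    so repeated items never change the state.
    """
    bitmaps = [0] * num_maps
    for item in stream:
        h = _hash64(item)
        j = h % num_maps              # pick a bitmap (stochastic averaging)
        r = _rho(h // num_maps)       # bit r is set with probability 2^-(r+1)
        bitmaps[j] |= 1 << r

    # For each bitmap, find the index of its lowest unset bit.
    total = 0
    for b in bitmaps:
        r = 0
        while b & (1 << r):
            r += 1
        total += r

    mean_r = total / num_maps
    return num_maps / PHI * 2 ** mean_r
```

Because each bitmap only needs about log2(n) bits for n distinct values, the state stays logarithmic in the stream's cardinality. The estimate is approximate: with 64 bitmaps the expected relative error is on the order of 10 percent, and it is known to be biased when the number of distinct values is small relative to the number of bitmaps.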
References
N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999)
Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, Counting distinct elements in a data stream, in Proc. 6th International Workshop on Randomization and Approximation Techniques (2002), pp. 1–10
Z. Bar-Yossef, R. Kumar, D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, in Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA) (2002)
J. Bunge, M. Fitzpatrick, Estimating the number of species: a review. J. Am. Stat. Assoc. 88, 364–373 (1993)
M. Charikar, S. Chaudhuri, R. Motwani, V. Narasayya, Towards estimation error guarantees for distinct values, in Proc. 19th ACM Symp. on Principles of Database Systems (2000), pp. 268–279
S. Chaudhuri, R. Motwani, V. Narasayya, Random sampling for histogram construction: how much is enough? in Proc. ACM SIGMOD International Conf. on Management of Data (1998), pp. 436–447
E. Cohen, Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55, 441–453 (1997)
J. Considine, F. Li, G. Kollios, J. Byers, Approximate aggregation techniques for sensor databases, in Proc. 20th International Conf. on Data Engineering (2004), pp. 449–460
G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), in Proc. 28th International Conf. on Very Large Data Bases (2002), pp. 335–345
M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows. SIAM J. Comput. 31, 1794–1813 (2002)
M. Durand, P. Flajolet, Loglog counting of large cardinalities, in Proc. 11th European Symp. on Algorithms (2003), pp. 605–617
C. Estan, G. Varghese, M. Fisk, Bitmap algorithms for counting active flows on high speed links, in Proc. 3rd ACM SIGCOMM Conf. on Internet Measurement (2003), pp. 153–166
P. Flajolet, G.N. Martin, Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 182–209 (1985)
S. Ganguly, Counting distinct items over update streams, in Proc. 16th International Symp. on Algorithms and Computation (2005), pp. 505–514
S. Ganguly, M. Garofalakis, R. Rastogi, Tracking set-expression cardinalities over continuous update streams. VLDB J. 13, 354–369 (2004)
P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in Proc. 27th International Conf. on Very Large Data Bases (2001), pp. 541–550
P.B. Gibbons, S. Tirthapura, Estimating simple functions on the union of data streams, in Proc. 13th ACM Symp. on Parallel Algorithms and Architectures (2001), pp. 281–291
P.B. Gibbons, S. Tirthapura, Distributed streams algorithms for sliding windows, in Proc. 14th ACM Symp. on Parallel Algorithms and Architectures (2002), pp. 63–72
P.J. Haas, J.F. Naughton, S. Seshadri, L. Stokes, Sampling-based estimation of the number of distinct values of an attribute, in Proc. 21st International Conf. on Very Large Data Bases (1995), pp. 311–322
P.J. Haas, L. Stokes, Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93, 1475–1487 (1998)
W.C. Hou, G. Özsoyoǧlu, B.K. Taneja, Statistical estimators for relational algebra expressions, in Proc. 7th ACM Symp. on Principles of Database Systems (1988), pp. 276–287
W.C. Hou, G. Özsoyoǧlu, B.K. Taneja, Processing aggregate relational queries with hard time constraints, in Proc. ACM SIGMOD International Conf. on Management of Data (1989), pp. 68–77
A. Kumar, J. Xu, J. Wang, O. Spatscheck, L. Li, Space-code bloom filter for efficient per-flow traffic measurement, in Proc. IEEE INFOCOM (2004)
S. Nath, P.B. Gibbons, S. Seshan, Z. Anderson, Synopsis diffusion for robust aggregation in sensor networks, in Proc. 2nd ACM International Conf. on Embedded Networked Sensor Systems (2004), pp. 250–262
J.F. Naughton, S. Seshadri, On estimating the size of projections, in Proc. 3rd International Conf. on Database Theory (1990), pp. 499–513
F. Olken, Random sampling from databases. PhD thesis, Computer Science, UC Berkeley (1993)
C.R. Palmer, P.B. Gibbons, C. Faloutsos, ANF: a fast and scalable tool for data mining in massive graphs, in Proc. 8th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining (2002), pp. 81–90
A. Pavan, S. Tirthapura, Range-efficient computation of \({F_{0}}\) over massive data streams, in Proc. 21st IEEE International Conf. on Data Engineering (2005), pp. 32–43
V. Poosala, Histogram-based estimation techniques in databases. PhD thesis, Univ. of Wisconsin-Madison (1997)
V. Poosala, Y.E. Ioannidis, P.J. Haas, E.J. Shekita, Improved histograms for selectivity estimation of range predicates, in Proc. ACM SIGMOD International Conf. on Management of Data (1996), pp. 294–305
Y. Tao, G. Kollios, J. Considine, F. Li, D. Papadias, Spatio-temporal aggregation using sketches, in Proc. 20th International Conf. on Data Engineering (2004), pp. 214–225
S. Venkataraman, D. Song, P.B. Gibbons, A. Blum, New streaming algorithms for high speed network monitoring and Internet attacks detection, in Proc. 12th ISOC Network and Distributed Security Symp. (2005)
K.Y. Whang, B.T. Vander-Zanden, H.M. Taylor, A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst. 15, 208–229 (1990)
D. Woodruff, Optimal space lower bounds for all frequency moments, in Proc. 15th ACM-SIAM Symp. on Discrete Algorithms (2004), pp. 167–175
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gibbons, P.B. (2016). Distinct-Values Estimation over Data Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_6
DOI: https://doi.org/10.1007/978-3-540-28608-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer Science, Computer Science (R0)