Abstract
A Bloom filter is a compact, memory-efficient probabilistic data structure that supports membership testing, i.e., checking whether an element is in a given set. However, because a Bloom filter maps each element with random hash functions, it offers little flexibility even when information about negative keys (elements that are not in the set) is available, especially when misidentifying different negative keys incurs different costs. The problem worsens when the hash functions are non-uniform, i.e., when they map elements into the Bloom filter non-uniformly. To address this problem, we propose a new hash adaptive Bloom filter (HABF) that supports customizing hash functions for keys. In addition, we propose a filter family comprising f-HABF (fast hashing version), c-HABF (cache-friendly version), and s-HABF (stacked version). We show that the HABF family is Pareto optimal among all compared filters in terms of accuracy and query latency. We conduct extensive experiments on representative datasets, and the results show that the HABF family overall outperforms the standard Bloom filter and its cutting-edge variants in terms of accuracy, construction/query time, and memory consumption. All source code is publicly available (https://github.com/njulands/HashAdaptiveBF).
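For readers unfamiliar with the baseline, the following is a minimal sketch of a standard Bloom filter and of the cost-weighted false positive measure that the abstract alludes to. It is not the HABF construction and is not taken from the authors' repository. The class name `BloomFilter`, the helper `weighted_fp_cost`, the SHA-256 based double hashing, the filter size, and the per-key costs are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: num_bits bits, num_hashes positions per key."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive all positions from two base hashes (double hashing).
        # This hashing scheme is an illustrative assumption.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def weighted_fp_cost(bf: BloomFilter, negatives: dict) -> float:
    """Sum the cost of every negative key the filter wrongly reports as present."""
    return sum(cost for key, cost in negatives.items() if key in bf)


# Hypothetical workload: 1000 positive keys, 1000 negative keys with made-up costs.
positives = [f"pos-{i}" for i in range(1000)]
negatives = {f"neg-{i}": 1.0 + (i % 10) for i in range(1000)}

bf = BloomFilter(num_bits=8000, num_hashes=4)
for key in positives:
    bf.add(key)

false_positives = [key for key in negatives if key in bf]
print(len(false_positives), weighted_fp_cost(bf, negatives))
```

In this baseline every key uses the same hash functions, so a costly negative key is as likely to be misidentified as a cheap one. That uniform treatment is the inflexibility the abstract describes.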
Notes
The cache line size is usually 512 bits (64 bytes) in modern x86-64 processors.
References
Our source codes. https://github.com/njulands/HashAdaptiveBF
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Sears, R., Ramakrishnan, R.: BLSM: a general purpose log structured merge tree. In: Proceedings of the International Conference on Management of Data. ACM (2012)
O’Neil, P., Cheng, E., Gawlick, D., O’Neil, E.: The log-structured merge-tree (lsm-tree). Acta Informatica. Springer, pp. 351–385 (1996)
LevelDB: a fast and lightweight key/value database library (2011). http://code.google.com/p/leveldb/
A facebook fork of leveldb which is optimized for flash and big memory machines (2013). https://rocksdb.org/
Mackert, L.F., Lohman, G.M.: R* optimizer validation and performance evaluation for distributed queries. In: Proceedings of International Conference on Very Large Data Bases. VLDB Endowment (1986)
Xiao, B., Chen, W., He, Y.: A novel approach to detecting ddos attacks at an early stage. J. Supercomput. 1, 235–248 (2006)
Bruck, J., Gao, J., Jiang, A.: Weighted Bloom filter. In: Proceedings of International Symposium on Information Theory. IEEE (2006)
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the International Conference on Management of Data. ACM (2018)
Dai, Z., Shrivastava, A.: Adaptive learned Bloom filter (Ada-BF): efficient utilization of the classifier. arXiv preprint (2019)
Mitzenmacher, M.: A model for learned Bloom filters and optimizing by sandwiching. In: Advances in Neural Information Processing Systems. Curran (2018)
Deeds, K., Hentschel, B., Idreos, S.: Stacked filters: learning to filter by structure. In: Proceedings of International Conference on Very Large Data Bases. VLDB Endowment (2021)
Dai, H.P., Zhong, Y.K., Liu, A.X., Wang, W., Li, M.: Noisy Bloom filters for multi-set membership testing. In: Proceedings of the International Conference on Measurement and Modeling of Computer Science. ACM (2016)
Graf, T.M., Lemire, D.: Xor filters: Faster and smaller than bloom and cuckoo filters. J. Experim. Algorithm. 1, 1–16 (2020)
Cohen, S., Matias, Y.: Spectral Bloom filters. In: Proceedings of the International Conference on Management of Data. ACM (2003)
Guo, D., Wu, J., Chen, H., Yuan, Y., Luo, X.: The dynamic Bloom filters. Trans. Knowl. Data Eng. 22, 120–133 (2009)
Kirsch, A., Mitzenmacher, M.: Less hashing, same performance: building a better bloom filter. In: Proceedings of European Symposium on Algorithms. Springer (2006)
Hao, F., Kodialam, M., Lakshman, T.: Building high accuracy Bloom filters using partitioned hashing. In: International Conference on Measurement and Modeling of Computer Systems. ACM (2007)
Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable Bloom filters. In: Proceedings of the international conference on Management of data. ACM (2006)
Mitzenmacher, M.: Compressed bloom filters. Trans. Netw. 10, 604–612 (2002)
Henke, C., Schmoll, C., Zseby, T.: Empirical evaluation of hash functions for multipoint measurements. ACM SIGCOMM Comput. Commun. Rev. 38(3), 39–50 (2008)
Estébanez, C., Saez, Y., Recio, G., Isasi, P.: Performance of the most common non-cryptographic hash functions. Softw.: Pract. Experience 44(6), 681–698 (2014)
Lovett, K.: Miscellaneous hash functions. http://www.call-with-current-continuation.org/eggs/hashes.html
Rae, J.W., Bartunov, S., Lillicrap, T.P.: Meta-Learning Neural Bloom Filters. In: Proceedings of International Conference on Machine Learning. ACM (2019)
Bhattacharya, A., Bedathur, S., Bagchi, A.: Adaptive learned bloom filters under incremental workloads. In: Proceedings of India Joint International Conference on Data Science and Management of Data. ACM (2020)
Realtime URI Blacklist. http://uribl.com/
Babcock, B., Olston, C.: Distributed top-k monitoring. In: Proceedings of the International Conference on Management of Data. ACM (2003)
Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most frequent items dynamically. Transactions on Database Systems. ACM, pp. 249–278 (2005)
Wu, F., Yang, M.H., Zhang, B., Du, D.H.: Ac-key: Adaptive caching for lsm-based key-value stores. In: Proceedings of Annual Technical Conference. USENIX (2020)
Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and zipf-like distributions: Evidence and implications. In: Proceedings of International Conference on Computer Communications. IEEE (1999)
Li, Y., Tian, C., Guo, F., Li, C., Xu, Y.: Elasticbf: elastic bloom filter with hotness awareness for boosting read performance in large key-value stores. In: Proceedings of Annual Technical Conference. USENIX (2019)
Xie, R.B., Li, M., Miao, Z.Y., Gu, R., Huang, H., Dai, H.P., Chen, G.H.: Hash Adaptive Bloom Filter. In: Proceedings of International Conference on Data Engineering. IEEE (2021)
Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. 8(3), 281–293 (2000)
Gosselin-Lavigne, M.A., Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: A performance evaluation of hash functions for ip reputation lookup using Bloom filters. In: Proceedings of International Conference on Availability, Reliability and Security. IEEE (2015)
Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: a survey. Internet Mathematics, pp. 485–509 (2004)
Dillinger, P.C., Hübschle-Schneider, L., Sanders, P., Walzer, S.: Fast succinct retrieval and approximate membership using ribbon. arXiv preprint arXiv:2109.01892 (2021)
Zhong, M., Lu, P., Shen, K., Seiferas, J.: Optimizing data popularity conscious Bloom filters. In: Proceedings of symposium on Principles of distributed computing. ACM (2008)
Dayan, N., Athanassoulis, M., Idreos, S.: Monkey: Optimal navigable key-value store. In: International Conference on Management of Data. ACM, pp. 79–94 (2017)
Byun, H., Lim, H.: Learned FBF: learning-based functional bloom filter for key-value storage. Trans. Comput 1, 1 (2021)
Dillinger, P.C.: Adaptive approximate state storage. Ph.D. thesis, Northeastern University (2010)
Hardy, G.H., Littlewood, J.E., Pólya, G., Littlewood, D.: Inequalities. Cambridge University Press (1952)
Appendix. https://njulimn.github.io/assets/pdf/VLDBJ_Appendix.pdf
Putze, F., Sanders, P., Singler, J.: Cache-, hash- and space-efficient bloom filters. In: Proceedings of International conference on Experimental algorithms. Springer (2007)
Lang, H., Neumann, T., Kemper, A., Boncz, P.: Performance-optimal filtering: Bloom overtakes cuckoo at high throughput (2019)
Fastfilter. https://github.com/FastFilter/fastfilter_cpp
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the national academy of sciences. National Acad Sciences (1982)
Keras. https://keras.io/
Cityhash. https://github.com/google/cityhash
Murmurhash. https://sites.google.com/site/murmurhash/
Smhasher. https://github.com/rurban/smhasher
R. Jenkins. http://www.burtleburtle.net/bob/hash/doobs.html
Shalla’s blacklists. http://www.shallalist.de/index.html
Singhal, K., Weiss, P.: DeepBloom. https://github.com/karan1149/DeepBloom
Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking cloud serving systems with YCSB. In: Proceedings of Symposium on Cloud Computing. ACM (2010)
Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of Association for Computational Linguistics. ACL (1998)
Acknowledgements
We thank all the reviewers for their constructive comments and Zheyu Miao for help along the way. This work was supported in part by the National Natural Science Foundation of China under Grant 61872178, in part by the National Natural Science Foundation of China (Nos. 61832005, 62072230, U1811461), in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, in part by the Jiangsu High-level Innovation and Entrepreneurship (Shuangchuang) Program, and in part by the Alibaba Innovative Research Project.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, M., Xie, R., Chen, D. et al. A Pareto optimal Bloom filter family with hash adaptivity. The VLDB Journal 32, 525–548 (2023). https://doi.org/10.1007/s00778-022-00755-z
DOI: https://doi.org/10.1007/s00778-022-00755-z