
DXML: Distributed Extreme Multilabel Classification

  • Conference paper
Big Data Analytics (BDA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 13147)


Abstract

As a big data application, extreme multilabel classification has emerged as an important research topic, with applications in ranking and recommendation of products and items. We propose a scalable hybrid distributed- and shared-memory implementation of extreme classification for large-scale ranking and recommendation. In particular, the implementation combines message passing with MPI across nodes and multithreading with OpenMP within each node. Expressions for communication latency and communication volume are derived, and the available shared-memory parallelism is analyzed using the work-span model. This analysis sheds light on the expected scalability of similar extreme classification methods. Experiments show that the implementation is comparatively fast to train and test on several large datasets, and in some cases the resulting model is comparatively small.
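The paper's specific cost expressions are not reproduced on this page. For orientation, analyses of this kind are typically built on two standard textbook models: the latency-bandwidth (alpha-beta) model for inter-node communication and the work-span (Brent) bound for shared-memory parallelism. The following is a sketch of these generic forms, not the paper's own derivation:

```latex
% Latency-bandwidth model: cost of sending one message of n words,
% where \alpha is the per-message latency and \beta the per-word transfer time.
T_{\mathrm{msg}}(n) = \alpha + \beta n

% Work-span model: with total work T_1, critical-path length (span) T_\infty,
% and p threads, a greedy scheduler satisfies Brent's bound
T_p \le \frac{T_1}{p} + T_\infty,
\qquad
\text{speedup} \;\le\; \min\!\left(p,\; \frac{T_1}{T_\infty}\right).
```

Under these models, an algorithm scales well when its communication volume grows slowly with the number of nodes and its span $T_\infty$ is small relative to its work $T_1$.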

Code: https://github.com/misterpawan/DXML



Acknowledgement

This work was done at IIIT Hyderabad using an IIIT seed grant. The author acknowledges all the support provided by the institute. This project was partially supported by the RIPPLE center of excellence at IIIT Hyderabad.

Author information

Corresponding author: Pawan Kumar.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kumar, P. (2021). DXML: Distributed Extreme Multilabel Classification. In: Srirama, S.N., Lin, J.C.W., Bhatnagar, R., Agarwal, S., Reddy, P.K. (eds) Big Data Analytics. BDA 2021. Lecture Notes in Computer Science, vol. 13147. Springer, Cham. https://doi.org/10.1007/978-3-030-93620-4_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-93620-4_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93619-8

  • Online ISBN: 978-3-030-93620-4

  • eBook Packages: Computer Science (R0)
