A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging

Dabral, Tanmaya Shekhar; Deshmukh, Amala Sanjay; Malapati, Aruna

doi:10.1007/978-981-13-1592-3_60

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 816)

Abstract

Deep neural networks, particularly convolutional neural networks, have recently been gaining traction in music auto-tagging. These networks relieve engineers of the burden of handcrafting domain-specific features. However, musical features often show great temporal diversity, which traditional deep networks are unable to capture. With this in mind, we propose a convolutional neural network architecture that attempts to learn features over multiple timescales. The architecture runs multiple convolutions over various subsampled versions of the original audio spectrogram. The outputs of these convolution streams are then concatenated to make the tag predictions. We evaluate the architecture on the MagnaTagATune dataset and show that it yields results close to the state of the art and comprehensively beats shallow classifiers trained on handcrafted features.
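The multi-timescale idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the subsampling method, filter shapes, pooling, or output layer, so everything here is an assumption chosen for illustration. That includes the average-pooling subsampler, the factors (1, 2, 4), the random untrained kernels, the global max pooling, the sigmoid output, and the helper names `subsample_time`, `conv_stream`, and `multiscale_tag_scores`.

```python
import numpy as np

def subsample_time(spec, factor):
    """Average-pool the time axis (axis 1) by an integer factor (assumed subsampler)."""
    n_bins, n_frames = spec.shape
    usable = (n_frames // factor) * factor  # drop trailing frames that do not fill a window
    return spec[:, :usable].reshape(n_bins, usable // factor, factor).mean(axis=2)

def conv_stream(spec, kernels):
    """One convolution stream: 'valid' cross-correlation along time, ReLU,
    then global max pooling over time. kernels has shape (n_filters, n_bins, width)."""
    width = kernels.shape[2]
    # windows has shape (n_bins, n_frames - width + 1, width)
    windows = np.lib.stride_tricks.sliding_window_view(spec, width, axis=1)
    acts = np.einsum("fmw,mtw->ft", kernels, windows)
    acts = np.maximum(acts, 0.0)  # ReLU
    return acts.max(axis=1)       # one feature per filter

def multiscale_tag_scores(spec, stream_kernels, factors, w_out, b_out):
    """Run one stream per subsampled spectrogram, concatenate the stream
    features, and map them to per-tag scores with a sigmoid (multi-label output)."""
    feats = [conv_stream(subsample_time(spec, f), k)
             for f, k in zip(factors, stream_kernels)]
    joint = np.concatenate(feats)
    logits = w_out @ joint + b_out
    return 1.0 / (1.0 + np.exp(-logits))

# Illustrative setup with random (untrained) weights; all sizes are hypothetical.
rng = np.random.default_rng(0)
n_bins, n_frames, n_filters, width, n_tags = 96, 256, 8, 4, 50
factors = (1, 2, 4)
spec = rng.random((n_bins, n_frames))  # stand-in for a real spectrogram
stream_kernels = [0.1 * rng.standard_normal((n_filters, n_bins, width)) for _ in factors]
w_out = 0.1 * rng.standard_normal((n_tags, n_filters * len(factors)))
b_out = np.zeros(n_tags)

scores = multiscale_tag_scores(spec, stream_kernels, factors, w_out, b_out)
```

Each stream sees the same filter width but a different time resolution, so a width-4 filter on the factor-4 stream covers 16 original frames. A trained model would learn the kernels and output weights instead of drawing them at random.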

Author information

Authors and Affiliations

Birla Institute of Technology and Science, Pilani, Hyderabad Campus, Hyderabad, India
Tanmaya Shekhar Dabral, Amala Sanjay Deshmukh & Aruna Malapati

Corresponding author

Correspondence to Tanmaya Shekhar Dabral.

Editor information

Editors and Affiliations

Department of Mathematics, South Asian University, New Delhi, India
Jagdish Chand Bansal
Department of Mathematics, National Institute of Technology Silchar, Silchar, Assam, India
Kedar Nath Das
Department of Mathematics and Computer Science, Faculty of Science, Liverpool Hope University, Liverpool, UK
Atulya Nagar
Department of Mathematics, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
Kusum Deep
School of Basic Sciences, Indian Institute of Technology Bhubaneswar, Bhubaneswar, Odisha, India
Akshay Kumar Ojha

About this paper

Cite this paper

Dabral, T.S., Deshmukh, A.S., Malapati, A. (2019). A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging. In: Bansal, J., Das, K., Nagar, A., Deep, K., Ojha, A. (eds) Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing, vol 816. Springer, Singapore. https://doi.org/10.1007/978-981-13-1592-3_60

DOI: https://doi.org/10.1007/978-981-13-1592-3_60
Published: 14 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1591-6
Online ISBN: 978-981-13-1592-3
eBook Packages: Intelligent Technologies and Robotics (R0)
