
An evaluation of deep neural network models for music classification using spectrograms

Multimedia Tools and Applications, special issue 1193: Intelligent Processing of Multimedia Signals

Abstract

Deep Neural Network (DNN) models have recently received considerable attention because their architectures can extract deep features that improve classification accuracy, achieving excellent results in image recognition. However, because music and images differ in content form, transferring deep learning to music classification remains an open problem. To address this issue, we transfer state-of-the-art DNN models to music classification and evaluate their performance using spectrograms. First, we convert music audio files into spectrograms through modal transformation, and then classify the music with deep learning. To alleviate overfitting during training, we propose a balanced trusted loss function and build a balanced trusted model, ResNet50_trust. We then compare the performance of different DNN models on music classification. In addition, this work adds music sentiment analysis based on a newly constructed music emotion dataset. Extensive experimental evaluations on three music datasets show that our proposed ResNet50_trust consistently outperforms the other DNN models.
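
As a concrete illustration of the first step the abstract describes, the sketch below converts a music audio file into a mel-spectrogram image. The paper's exact tooling and parameters are not given on this page; librosa, matplotlib, the 22050 Hz sample rate, and the 128 mel bands are assumptions made purely for illustration.

    # A minimal sketch of the "modal transformation" step: audio file in,
    # spectrogram image out. Tooling and parameters are assumptions.
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    def audio_to_spectrogram(audio_path: str, image_path: str) -> None:
        # Load the waveform; 22050 Hz is librosa's default sample rate.
        y, sr = librosa.load(audio_path, sr=22050)
        # Mel-scaled power spectrogram, converted to decibels, which is the
        # usual input representation for spectrogram-based CNNs.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Render without axes so the saved image contains only the spectrogram.
        fig, ax = plt.subplots()
        librosa.display.specshow(mel_db, sr=sr, ax=ax)
        ax.set_axis_off()
        fig.savefig(image_path, bbox_inches="tight", pad_inches=0)
        plt.close(fig)

    audio_to_spectrogram("track.wav", "track.png")

Each resulting image can then be fed to an image-classification network exactly as a photograph would be.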
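The balanced trusted loss itself is not reproduced on this page, so the following PyTorch sketch shows only the general transfer setup the abstract outlines: an ImageNet-pretrained ResNet50 whose classification head is replaced for music classes. The label-smoothed cross-entropy is a hypothetical stand-in for the proposed loss (label smoothing is one common way to temper over-confident predictions and curb overfitting), and NUM_CLASSES = 10 is likewise an assumption.

    # Sketch of transferring a pretrained image model to spectrogram-based
    # music classification. The loss below is NOT the paper's balanced
    # trusted loss, only an illustrative substitute.
    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 10  # assumed number of genre classes

    # Start from ImageNet weights and replace the classification head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

    # Hypothetical stand-in for the balanced trusted loss.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
        # One optimization step over a batch of spectrogram images
        # shaped (N, 3, H, W), e.g. (N, 3, 224, 224).
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()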



Acknowledgments

This work was supported in part by the Natural Science Foundation of the Colleges and Universities in Anhui Province of China under Grant No. KJ2020A0035, and in part by the Scientific Research Project of the Hebei Education Department of China under Grant No. QN2020198.

Author information

Corresponding author

Correspondence to Lixin Han.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, J., Han, L., Li, X. et al. An evaluation of deep neural network models for music classification using spectrograms. Multimed Tools Appl 81, 4621–4647 (2022). https://doi.org/10.1007/s11042-020-10465-9

