Basic filters for convolutional neural networks applied to music: Training or design?

  • S.I.: Deep learning for music and audio
  • Published in Neural Computing and Applications

Abstract

When convolutional neural networks are used to tackle learning problems based on music or other time series, raw one-dimensional data are commonly preprocessed to obtain spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this preprocessing step on the network’s performance and ask whether replacing it with adaptive or learned filters applied directly to the raw data can improve learning success. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequent time-averaging on the squared amplitudes is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. The results of these experiments show that for classification based on convolutional neural networks, the features obtained from adaptive filter banks followed by time-averaging the squared modulus of the filters’ output perform better than the canonical Fourier transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
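To make the contrast concrete, the following minimal numpy sketch implements both feature pipelines in simplified form. The window g, hop size alpha, frequency sub-sampling beta, pooling weights and filters are illustrative placeholders, not the configuration used in the paper's experiments.

```python
import numpy as np

def mel_spectrogram(f, g, alpha, beta, mel_weights):
    """Canonical route: squared STFT magnitudes pooled across frequency.
    g: analysis window, alpha: hop size, beta: frequency sub-sampling;
    mel_weights: (n_bands, n_kept_bins) array of triangular mel weights."""
    w = len(g)
    frames = np.stack([f[t:t + w] * g for t in range(0, len(f) - w + 1, alpha)])
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, ::beta] ** 2  # |F(f·T_{αl}g)(βk)|²
    return spec @ mel_weights.T                                 # pool with Λ_ν(βk)

def filterbank_features(f, filters, win, alpha):
    """Adaptive route: convolve with each filter h_ν, square the modulus,
    then time-average with a window and sub-sample with hop alpha."""
    rows = [np.convolve(np.abs(np.convolve(f, h, mode="same")) ** 2,
                        win, mode="same")[::alpha] for h in filters]
    return np.stack(rows, axis=1)
```

In the adaptive variants studied in the paper, quantities such as the filters' center frequencies or the time-averaging lengths are learned from training data rather than fixed by the mel scale.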


Notes

  1. This observation seems to have served as one motivation for introducing the so-called scattering transform, which consists of the repeated composition of convolution, a nonlinearity in the form of the absolute value, and time-averaging. In that framework, mel-spectrogram coefficients are interpreted as first-order scattering coefficients.
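As a purely illustrative sketch (not the authors' code), one first-order scattering coefficient in this sense can be written as a convolution, a modulus and a time-average, with hypothetical band-pass psi and low-pass phi:

```python
import numpy as np

def scattering_order1(f, psi, phi):
    """One scattering layer: convolve with a band-pass psi, take the modulus,
    then time-average with a low-pass phi, i.e., the same pattern that
    produces mel-spectrogram-like coefficients."""
    return np.convolve(np.abs(np.convolve(f, psi, mode="same")), phi, mode="same")
```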

References

  1. Abreu LD, Romero JL (2017) MSE estimates for multitaper spectral estimation and off-grid compressive sensing. IEEE Trans Inf Theory 63(12):7770–7776


  2. Andén J, Mallat S (2014) Deep scattering spectrum. IEEE Trans Signal Process 62(16):4114–4128


  3. Anselmi F, Leibo JZ, Rosasco L, Mutch J, Tacchetti A, Poggio TA (2013) Unsupervised learning of invariant representations in hierarchical architectures. CoRR arxiv:1311.4158

  4. Balazs P, Dörfler M, Jaillet F, Holighaus N, Velasco G (2011) Theory, implementation and applications of nonstationary Gabor frames. J Comput Appl Math 236(6):1481–1496


  5. Balazs P, Dörfler M, Kowalski M, Torrésani B (2013) Adapted and adaptive linear time-frequency representations: a synthesis point of view. IEEE Signal Process Mag 30(6):20–31


  6. Bammer R, Dörfler M (2017) Invariance and stability of Gabor scattering for music signals. In: Sampling theory and applications (SampTA), 2017 international conference on. IEEE, pp 299–302

  7. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392

  8. Choi K, Fazekas G, Sandler M, Cho K (2018) The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Trans Emerg Top Comput Intell 2(2):139–149


  9. Choi K, Fazekas G, Sandler M (2016) Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th international society for music information retrieval conference

  10. Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic music through deep networks with a novel image-based representation. In: Thirty-second AAAI conference on artificial intelligence

  11. Dieleman S, Brakel P, Schrauwen B (2011) Audio-based music classification with a pretrained convolutional network. In: 12th international society for music information retrieval conference (ISMIR-2011). University of Miami, pp 669–674

  12. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: Acoustics, speech and signal processing (ICASSP), 2014 IEEE international conference on, pp 6964–6968. https://doi.org/10.1109/ICASSP.2014.6854950

  13. Dörfler M (2001) Time-frequency analysis for music signals: a mathematical approach. J New Music Res 30(1):3–12


  14. Dörfler M, Bammer R, Grill T (2017) Inside the spectrogram: convolutional neural networks in audio processing. In: International conference on sampling theory and applications (SampTA). IEEE, pp 152–155

  15. Dörfler M, Torrésani B (2010) Representation of operators in the time-frequency domain and generalized Gabor multipliers. J Fourier Anal Appl 16(2):261–293


  16. Feichtinger HG, Kozek W (1998) Quantization of TF lattice-invariant operators on elementary LCA groups. In: Feichtinger HG, Strohmer T (eds) Gabor analysis and algorithms, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 233–266


  17. Feichtinger HG, Nowak K (2003) A first survey of Gabor multipliers. In: Feichtinger HG, Strohmer T (eds) Advances in Gabor analysis, applied and numerical harmonic analysis. Birkhäuser, Boston, pp 99–128


  18. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge


  19. Grill T, Schlüter J (2015) Music boundary detection using neural networks on combined features and two-level annotations. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015). Malaga, Spain, pp 531–537

  20. Grohs P, Wiatowski T, Bölcskei H (2016) Deep convolutional neural networks on cartoon functions. In: Information theory (ISIT), 2016 IEEE international symposium on. IEEE, pp 1163–1167

  21. Holighaus N, Dörfler M, Velasco GA, Grill T (2013) A framework for invertible, real-time constant-Q transforms. IEEE Trans Audio Speech Lang Process 21(4):775–785


  22. Humphrey EJ, Bello JP (2012) Rethinking automatic chord recognition with convolutional neural networks. In: Machine learning and applications (ICMLA), 2012 11th international conference on. IEEE, vol 2, pp 357–362

  23. Humphrey EJ, Montecchio N, Bittner R, Jansson A, Jehan T (2017) Mining labeled data from web-scale collections for vocal activity detection in music. In: Proceedings of the 18th international society for music information retrieval conference (ISMIR), Suzhou, China

  24. Kingma D, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 6th international conference on learning representations (ICLR). San Diego, USA

  25. Korzeniowski F, Widmer G (2016) A fully convolutional deep auditory model for musical chord recognition. In: Machine learning for signal processing (MLSP), 2016 IEEE 26th international workshop on. IEEE, pp 1–6

  26. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324


  27. Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems, pp 1096–1104

  28. Leglaive S, Hennequin R, Badeau R (2015) Singing voice detection with deep recurrent neural networks. In: Acoustics, speech and signal processing (ICASSP), 2015 IEEE international conference on. IEEE, pp 121–125

  29. Lehner B, Schlüter J, Widmer G (2018) Online, loudness-invariant vocal detection in mixed music signals. IEEE/ACM Trans Audio Speech Lang Process 26(8):1369–1380


  30. Malik M, Adavanne S, Drossos K, Virtanen T, Ticha D, Jarina R (2017) Stacked convolutional and recurrent neural networks for music emotion recognition. arXiv preprint arXiv:1706.02292

  31. Mallat S (2012) Group invariant scattering. Commun Pure Appl Math 65(10):1331–1398


  32. Mallat S (2016) Understanding deep convolutional networks. Philos Trans R Soc Lond A Math Phys Eng Sci 374(2065). https://doi.org/10.1098/rsta.2015.0203


  33. Schlüter J, Böck S (2013) Musical onset detection with convolutional neural networks. In: 6th international workshop on machine learning and music (MML), Prague, Czech Republic

  34. Schlüter J, Böck S (2014) Improved musical onset detection with convolutional neural networks. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2014). Florence, Italy

  35. Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: Proceedings of the 16th international society for music information retrieval conference (ISMIR 2015). Malaga, Spain

  36. Ullrich K, Schlüter J, Grill T (2014) Boundary detection in music structure analysis using convolutional neural networks. In: Proceedings of the 15th international society for music information retrieval conference (ISMIR 2014). Taipei, Taiwan

  37. Waldspurger I (2015) Wavelet transform modulus: phase retrieval and scattering. Ph.D. thesis, École normale supérieure, Paris

  38. Waldspurger I (2017) Exponential decay of scattering coefficients. In: 2017 international conference on sampling theory and applications (SampTA), pp 143–146. https://doi.org/10.1109/SAMPTA.2017.8024473

  39. Wiatowski T, Grohs P, Bölcskei H (2017) Energy propagation in deep convolutional neural networks. arXiv preprint arXiv:1704.03636

  40. Wiatowski T, Tschannen M, Stanic A, Grohs P, Bölcskei H (2016) Discrete deep feature extraction: a theory and new architectures. In: Proceedings of the international conference on machine learning, pp 2149–2158


Acknowledgements

This research has been supported by the Vienna Science and Technology Fund (WWTF) through Project MA14-018.

Author information

Corresponding author

Correspondence to Thomas Grill.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix A: Proof of Theorem 1

To cover the situation described in Theorem 1, we assume that the original spectrogram is sub-sampled; in other words, for a signal f we start the computations from

$$\begin{aligned} S_0 ( \alpha l, \beta k)= |{\mathcal {V}} _g f (\alpha l, \beta k) |^2 = |{\mathcal {F}} (f\cdot T_{\alpha l} g)(\beta k)|^2. \end{aligned}$$

The proof is based on the observation that the mel-spectrogram can be written, in the sense of a bilinear form, via the action of so-called STFT or Gabor multipliers, cf. [17], on any given function. Before deriving this correspondence, we thus introduce this important class of operators.

Given a window function g, time- and frequency-sub-sampling parameters \(\alpha , \beta\), respectively, and a function \(\mathbf{{m}}: {\mathbb {Z}} \times {\mathbb {Z}} \mapsto {\mathbb {C}}\), the corresponding Gabor multiplier \(G^{\alpha ,\beta }_{g, \mathbf{{m}}}\) is defined as

$$\begin{aligned} G^{\alpha ,\beta }_{g, \mathbf{{m}}} f = \sum _k \sum _l \mathbf{m} (k,l) \langle f, M_{\beta k} T_{\alpha l} g\rangle M_{\beta k} T_{\alpha l} g . \end{aligned}$$
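A direct, unoptimized transcription of this definition for cyclic signals in \(\mathbb {C}^N\) may help to fix ideas. The FFT-style modulation convention, the requirement that \(\alpha\) and \(\beta\) divide N, and the length-N window are our simplifying assumptions, not the paper's exact setting.

```python
import numpy as np

def gabor_multiplier(f, g, m, alpha, beta):
    """Apply G^{alpha,beta}_{g,m} to f: analyze with the Gabor atoms
    M_{beta k} T_{alpha l} g, weight by the mask m(k,l), resynthesize.
    f, g: length-N arrays; m: (N//beta, N//alpha) mask."""
    N = len(f)
    n = np.arange(N)
    out = np.zeros(N, dtype=complex)
    for l in range(N // alpha):
        tg = np.roll(g, alpha * l)                             # T_{αl} g
        for k in range(N // beta):
            atom = np.exp(2j * np.pi * beta * k * n / N) * tg  # M_{βk} T_{αl} g
            out += m[k, l] * np.vdot(atom, f) * atom           # m(k,l)·⟨f,atom⟩·atom
    return out
```

Note that np.vdot conjugates its first argument, so np.vdot(atom, f) is the analysis coefficient \(\langle f, M_{\beta k} T_{\alpha l} g\rangle\); the triple loop costs \(O(N^3/(\alpha \beta ))\) and is meant for illustration only.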

We next derive the expression of a mel-spectrogram by an appropriately chosen Gabor multiplier. Using sub-sampling factors \(\alpha\) in time and \(\beta\) in frequency as before, we start from (4) and reformulate as follows:

$$\begin{aligned} {{\text {MS}}}_{g}(f) (b,\nu )=&\sum _k |{\mathcal {F}} (f\cdot T_b g)(\beta k)|^2 \cdot \varLambda _\nu (\beta k)\\ =&\sum _k \langle f, M_{\beta k} T_b g\rangle \overline{\langle f, M_{\beta k} T_b g\rangle } \varLambda _\nu (\beta k)\\ =&\left\langle \sum _k \varLambda _\nu ( \beta k) \langle f, M_{\beta k} T_b g\rangle M_{\beta k} T_b g , f\right\rangle \\ =&\left\langle \sum _k \sum _l \mathbf{m} (k,l) \langle f, M_{\beta k} T_{\alpha l} g\rangle M_{\beta k} T_{\alpha l} g , f\right\rangle \end{aligned}$$

with \(\mathbf{m} (k,l) = \delta (\alpha l-b)\varLambda _\nu (\beta k)\). We see that the mel-coefficients can thus be interpreted via a Gabor multiplier: \({{\text {MS}}}_{g}(f) (b,\nu ) = \langle G^{\alpha ,\beta }_{g, \mathbf{{m}}}f, f \rangle\).
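This identity can be checked numerically with the sketch above. The following hypothetical test uses arbitrary parameters and a placeholder Gaussian window, and requires b to lie on the time grid, \(b = \alpha l_0\):

```python
N, alpha, beta = 128, 8, 4
rng = np.random.default_rng(0)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = np.exp(-0.5 * ((np.arange(N) - N // 2) / 6.0) ** 2)       # placeholder window
lam = rng.random(N // beta)                                   # placeholder Λ_ν(βk)
l0 = 5                                                        # b = α·l0
m = np.zeros((N // beta, N // alpha))
m[:, l0] = lam                                                # m(k,l) = δ(αl−b)Λ_ν(βk)

n = np.arange(N)
tbg = np.roll(g, alpha * l0)                                  # T_b g
coef = np.array([np.vdot(np.exp(2j * np.pi * beta * k * n / N) * tbg, f)
                 for k in range(N // beta)])                  # ⟨f, M_{βk} T_b g⟩
ms = np.sum(np.abs(coef) ** 2 * lam)                          # MS_g(f)(b, ν)
qf = np.vdot(f, gabor_multiplier(f, g, m, alpha, beta)).real  # ⟨Gf, f⟩
assert np.isclose(ms, qf)
```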

The next step is to switch to an alternative operator representation. Indeed, as shown in [16], every operator H can equivalently be written by means of its spreading function \(\eta _H\) as

$$\begin{aligned} Hf (t) = \int _x \int _\xi \eta _H (x,\xi ) f (t-x) e^{2\pi i t \xi }{\mathrm{d}}\xi {\mathrm{d}}x. \end{aligned}$$
(12)

We note that two operators \(H_1\), \(H_2\) are equal if and only if their spreading functions coincide, see [15, 16] for details.
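In discrete form, (12) can be transcribed directly; the sketch below assumes a cyclic signal model on \(\mathbb {C}^N\) and our own normalization conventions.

```python
import numpy as np

def apply_spreading(eta, f):
    """Discrete analogue of (12): (Hf)(t) = Σ_{x,ξ} η(x,ξ) f(t−x) e^{2πi tξ/N}.
    eta: (N, N) array of weights over time shifts x and frequency shifts ξ;
    the O(N³) loop is for illustration, not efficiency."""
    N = len(f)
    t = np.arange(N)
    out = np.zeros(N, dtype=complex)
    for x in range(N):
        shifted = np.roll(f, x)                               # circular f(t − x)
        for xi in range(N):
            out += eta[x, xi] * shifted * np.exp(2j * np.pi * t * xi / N)
    return out
```

Read this way, \(\eta _H(x,\xi )\) literally weights how strongly H shifts its input by x in time and \(\xi\) in frequency, which is the interpretation taken up in Remark 5 below.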

As shown in [15], a Gabor multiplier’s spreading function \(\eta ^{\alpha ,\beta }_{{g, \mathbf{m}} }\) is given by

$$\begin{aligned} \eta ^{\alpha ,\beta }_{{g, \mathbf{m}} } (x,\xi ) = {\mathcal {M}} (x,\xi ) {\mathcal {V}} _g g(x,\xi ), \end{aligned}$$
(13)

where \({\mathcal {M}} (x,\xi )\) denotes the \((\beta ^{-1}, \alpha ^{-1})\)-periodic symplectic Fourier transform of \(\mathbf{m}\), i.e.,

$$\begin{aligned} {\mathcal {M}} (x,\xi ) = \mathcal {F}_s ( \mathbf{m} )(x,\xi ) = \sum _k\sum _l \mathbf{m} (k,l) e^{-2\pi i (\alpha l \xi - \beta kx )}. \end{aligned}$$
(14)
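For completeness, here is (14) evaluated pointwise, under our assumption of a finite mask of shape (K, L); the stated periodicity is immediate from the exponents.

```python
import numpy as np

def symplectic_ft(m, alpha, beta, x, xi):
    """Evaluate (14) at one point: M(x,ξ) = Σ_k Σ_l m(k,l) e^{−2πi(αlξ − βkx)}.
    The result is 1/β-periodic in x and 1/α-periodic in ξ."""
    K, L = m.shape
    k = np.arange(K)[:, None]
    l = np.arange(L)[None, :]
    return np.sum(m * np.exp(-2j * np.pi * (alpha * l * xi - beta * k * x)))
```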

We now likewise rewrite the time-averaging operation applied to a filtered signal, as defined in (6), as a Gabor multiplier. As before, we set \(\check{h}_\nu (t) = \overline{h_\nu (-t)}\) and have

$$\begin{aligned} {{\text {FB}}}_{h_\nu }(f) (b,\nu )&=\sum _l |(f*h_\nu )(\alpha l)|^2 \cdot \varpi _\nu (\alpha l-b) = \sum _l \Big |\sum _n f(n) \check{h}_\nu (n-\alpha l)\Big |^2 \cdot \varpi _\nu (\alpha l-b)\\&=\sum _k \sum _l |\langle f, M_{\beta k} T_{\alpha l} \check{h}_\nu \rangle |^2 \cdot \varpi _\nu (\alpha l-b)\delta (\beta k)= \langle G^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F} f, f\rangle , \end{aligned}$$

with \(\mathbf{m}_F (k,l) = T_b \varpi _\nu (\alpha l) \delta (\beta k)\). To obtain the error estimate in Corollary 1, first note that, by a straightforward computation using the operators' representation via their spreading functions as in (12),

$$\begin{aligned}&|{{\text {MS}}}_{g}(f) (b,\nu )-{{\text {FB}}}_{h_\nu }(f)(b,\nu )| = \left| \left\langle \left( G^{\alpha ,\beta }_{g, \mathbf{m}} - G^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F}\right) f, f\right\rangle \right| \nonumber \\&\quad = \left| \left\langle \eta ^{\alpha ,\beta }_{g, \mathbf{m}} - \eta ^{\alpha ,\beta }_{\check{h}_{\nu }, \mathbf{m}_F}, {\mathcal {V}}_f f\right\rangle \right| \le \left\| \eta ^{\alpha , \beta }_{g, \mathbf{m}} - \eta ^{\alpha , \beta }_{\check{h}_{\nu }, \mathbf{m}_F}\right\| \cdot \Vert f\Vert _2^2 \end{aligned}$$

and we can estimate the error by the difference of the spreading functions. We write the sampled version of \(\varLambda _\nu\) using the Dirac comb \(Ш_\beta\): \((Ш_\beta \varLambda _\nu ) (t) = \sum _k \varLambda _\nu (t) \delta (t-\beta k)\), and analogously for \(\varpi _\nu\) using \(Ш_\alpha\), to obtain \(\mathbf{m} = T_b \delta (\alpha l) \cdot Ш_\beta \varLambda _\nu\) and \(\mathbf{m}_F = Ш_\alpha T_b \varpi _\nu \cdot \delta (\beta k)\). Applying the symplectic Fourier transform (14) to \(\mathbf{m}\) then gives:

$$\begin{aligned} {\mathcal {M}}^\nu (x,\xi ) = {\mathcal {F}}_s (\mathbf{m})(x,\xi ) = \sum _k \varLambda _\nu (\beta k)\, e^{2\pi i \beta k x} \cdot e^{-2\pi i b \xi } = {\mathcal {F}}^{-1}(Ш_\beta \varLambda _\nu )(x) \cdot e^{-2\pi i b \xi }. \end{aligned}$$

Now it is a well-known fact that the Fourier transform turns sampling with sampling interval \(\beta\) into periodization by \(1/\beta\), in other words, into a convolution with \(Ш_{\frac{1}{\beta }}\):

$$\begin{aligned} {\mathcal {F}}^{-1}(Ш_\beta \varLambda _\nu )(x) = \left( {\mathcal {F}}^{-1}(\varLambda _\nu ) * Ш_{\frac{1}{\beta }}\right) (x) = \sum _l T_{\frac{l}{\beta }} {\mathcal {F}}^{-1}(\varLambda _\nu )(x), \end{aligned}$$

hence

$$\begin{aligned} {\mathcal {M}}^\nu (x,\xi ) = \sum _l T_{\frac{l}{\beta } }{\mathcal {F}}^{-1} (\varLambda _\nu ) (x) \cdot e^{-2\pi i b \xi }. \end{aligned}$$

Completely analogous considerations for \(\varpi _\nu\) and Ш\(_\alpha\) lead to the periodization of \(\mathcal {F}(\varpi _\nu )\) and thus the following expression for the symplectic Fourier transform of \(\mathbf{m}_F\):

$$\begin{aligned} {\mathcal {M}}^\nu _F(x,\xi ) = \sum _l T_{\frac{l}{\alpha } }{\mathcal {F}}(\varpi _\nu ) (\xi ) \cdot e^{-2\pi i b \xi }. \end{aligned}$$

Plugging these expressions into (13) gives the bound (8).

Remark 5

It is interesting to interpret the action of an operator in terms of its spreading function. In view of (12), the spreading function determines the amount of shift in time and frequency which the operator imposes on a function. For Gabor multipliers with well-concentrated window functions, the amount of shifting is moderate and governed by the window's eccentricity. At the same time, the aliasing effects introduced by coarse sub-sampling are reflected in the periodic nature of \({\mathcal {M}}\). Since the amount of aliasing is determined by the frequency sub-sampling density \(\beta\) for \(\mathcal {F}^{-1} (\varLambda _\nu )\) and by the time sub-sampling density \(\alpha\) for \(\mathcal {F}(\varpi _\nu )\), the overall approximation quality deteriorates with increasing sub-sampling factors.
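The sampling-to-periodization mechanism behind this remark is easy to demonstrate numerically. In the discrete sketch below (our construction, with a placeholder Gaussian standing in for \(\varLambda _\nu\)), sampling with step beta makes the inverse Fourier transform (N/beta)-periodic, so coarser sampling pushes the aliased copies closer together.

```python
import numpy as np

N = 1024
lam = np.exp(-0.5 * ((np.arange(N) - N / 2) / 40.0) ** 2)    # placeholder Λ_ν
for beta in (2, 4, 8):                # coarser sampling, shorter aliasing period
    sampled = np.zeros(N)
    sampled[::beta] = lam[::beta]                            # Ш_β · Λ_ν
    lhs = np.fft.ifft(sampled)
    # ifft of the sampled function equals the N/β-periodization of ifft(lam):
    rhs = sum(np.roll(np.fft.ifft(lam), -s * (N // beta)) for s in range(beta)) / beta
    assert np.allclose(lhs, rhs)
```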

About this article

Cite this article

Dörfler, M., Grill, T., Bammer, R. et al. Basic filters for convolutional neural networks applied to music: Training or design?. Neural Comput & Applic 32, 941–954 (2020). https://doi.org/10.1007/s00521-018-3704-x
