1 Introduction

Convolutional neural networks, while immensely powerful, are often resource-intensive [24, 26, 35, 50, 54]. Popular CNN models such as AlexNet [35] and VGG-16 [50], for instance, have 61 and 138 million parameters and consume in excess of 200 MB and 500 MB of memory, respectively. This characteristic of deep CNN architectures reduces their portability and poses a severe bottleneck for deployment in resource-constrained environments [17]. Additionally, design choices for CNN architectures, such as network depth, filter sizes, and number of filters, often appear arbitrary and motivated purely by empirical performance on a particular task, leaving little room for interpretability. Moreover, the chosen architecture is not necessarily the most compact one capable of yielding a given level of accuracy, making these models highly resource-inefficient.

Several prior approaches have thus sought to reduce the computational complexity of these models. Work aimed at designing efficient CNN architectures, such as Residual Networks (ResNets) [25] and DenseNets [28], has shown promise at alleviating the challenge of model complexity. These CNNs provide higher classification performance at only a fraction of the parameters of their more resource-intensive counterparts. However, despite being more compact, redundancies remain in such networks, leaving room for further compression.

In this work, we propose a novel method that exploits inter-filter dependencies in the convolutional filter banks of CNNs to compress pre-trained, computationally intensive neural networks. Additionally, we leverage neuronal activation patterns across samples to prune irrelevant filters. Our compression pipeline consists of finely pruning the filters of every layer of the CNN based on sample activation patterns, followed by the construction of efficient filter coreset representations that exploit the inter-filter dependencies. Our method does not require retraining, is applicable to both fully-connected and convolution layers, and maintains classification performance similar to the uncompressed network. We demonstrate state-of-the-art compression rates on several popular CNN models, including multiple ResNets, with improvements in compression rate ranging from 9.2\(\times \) to 16.2\(\times \) over prior state-of-the-art techniques. Coupled with Deep Compression, we are additionally able to compress other popular CNN models such as VGGNet-16 [50] and AlexNet [35] by 238\(\times \) and 55\(\times \) respectively. Moreover, we demonstrate the presence of filter redundancies even in highly efficient models such as SqueezeNet [29], reducing its parameters by 50% with almost no loss in classification performance; coupled with Deep Compression, this yields AlexNet-level accuracy at an 832\( \times \) smaller model size than the original AlexNet model. Finally, we empirically validate the generalizability of these compressed CNNs to newer domains.

In the next section, we discuss relevant prior work in this area. In Sect. 3, we present the details of our algorithm. This is followed by Sect. 4, where a discussion on the empirical evaluation of our method vis-à-vis other competing compression techniques is presented. We finally conclude in Sect. 5, laying out some avenues for future research in this area.

2 Related Work

Network Compression: Compressing neural networks has been a topic of active research interest lately. Prior work in this area can be grouped into three distinct categories. The first category of methods focuses on the construction of parameter-efficient neural network architectures. For instance, Iandola et al. [29] propose SqueezeNets, a neural architecture class built from parameter-efficient, fully convolutional ‘fire’ modules. Other examples of such architectures include Residual Networks (ResNets) [25] and Densely Connected Neural Networks (DenseNets) [28], which provide higher classification performance with models much smaller than the previous state-of-the-art, using ‘skip-connections’ between layers of the network. More recent approaches have sought to adapt CNN architectures so as to make them robust to common transformations (e.g. rotation) within the data, by modifying the filter banks of a CNN [10, 61] or by enforcing sparsity while training [2]. While these approaches hold promise, they fail to fully exploit the inter-filter dependencies, leaving room for further compression of such networks. Meta-learning approaches attempt to discover the optimal CNN architecture by searching over an enormous space of candidate architectures. However, these techniques are prohibitively resource intensive (requiring well in excess of 400 GPUs to run), and often yield only a locally optimal architecture [46].

A second broad category of compression methods attempts to prune the unimportant network parameters. Han et al. [20] demonstrate an efficient pruning-retraining method, based on pruning weights by their \(\ell ^p\) norms. Srinivas and Babu [53] remove individual neurons instead of weights, with impressive results. The importance of ordering filters for the purpose of pruning has also been highlighted by Yu et al., He et al., and Molchanov et al. [27, 42, 60]. These approaches have been modified in the works of Polyak et al. [45] and Luo et al. [41], which focus on removing weights grouped by characteristics of the filters (such as the norm of the filter weights). Li et al. [40] extend the idea of filter pruning by removing filters from a network according to an ‘importance’ criterion. However, the re-training step in these algorithms is time intensive.

Finally, the third theme of compression techniques is to employ weight-approximation and information-theoretic principles for the compression of neural network parameters. An early example of such work is the approach by Denton et al. [9], which uses low-rank approximations to compress fully-connected layers of neural networks. However, this technique does not apply to convolution layers. Lebedev et al. [36] address this problem by applying a low-rank decomposition to the full CNN to construct more efficient representations, but their technique's principal bottleneck is re-training, which we avoid. Rosenfield et al. consider an efficient utilization of CNN filters by representing them as linear combinations of a basis set [48]. Our algorithm, in contrast to this line of work, is additionally capable of introducing structure, such as sparsity, in the approximated weights resulting from the decomposition, which further aids compression. Han et al. [20] introduce Deep Compression, which uses several steps such as weight-pruning, weight-sharing, and Huffman coding to reduce neural network size. However, their algorithm requires special hardware for inference in the compressed state, making it hard to deploy the compressed networks across platforms.

Our method addresses the shortcomings of each of these individual themes of CNN compression. In contrast to the first category of architecture search, our method is applicable to a wide variety of models, is less resource intensive, and does not require any retraining. While we do prune filters inspired by the work of Polyak et al. [45] (following the second theme of compression), our criterion for filter pruning is motivated by the accurate reconstruction of sample activations, rather than the magnitude of filter weights. Finally, our compression technique does not require special hardware for running inference, unlike Han et al. [19], and scales to both fully-connected and convolutional layers, unlike the low-rank (SVD) approach of Denton et al. [9].

Coresets for Point Selection: Coresets have been widely studied in computational geometry. They were introduced first by Agarwal et al. [1] for approximating a set of points with a smaller set, while preserving some desired criteria, on k-means and k-median problems. Badoiu et al. [5] propose a coreset formulation to cluster points using a subset of the total set of points to generate the optimal solution. Har-Peled and Mazumdar [22] give an alternate solution for coresets that include points not in the original set. Feldman et al. [13] demonstrate that weak coreset representations can be generated with the number of points independent of the underlying data distribution. These formulations have recently been applied to several problems within computer vision and machine learning [12, 14, 15], and are primarily used to approximate a set of n points in d dimensions, originating from a domain \(\mathbf {S}\), with a smaller set of \(\tilde{n} \ll n\) points, while preserving some criterion such as similar pairwise distances. However, coresets have remained unexplored in the context of CNN compression, which constitutes a major novelty of our work.

3 Method

We begin with a fully-trained CNN and compress it without retraining, first by pruning out unimportant filters, followed by extraction of efficient coreset representations of these filters. The major advantages of our method include: (i) the lack of retraining, and therefore a major reduction in processing time; (ii) the capacity to significantly compress both convolutional and fully connected layers; and (iii) the ability of the compressed CNN to generalize to newer tasks.

3.1 Background and Notation

An n-layered neural network can be described as a union of the parameter tensors of every layer, \(\mathcal W = \bigcup _{k=1}^{n} \mathbf W_k\). The parameters \(\mathbf W_k\) of layer k have the shape \(N_k \times C_k \times h_k \times w_k\), where \(N_k\) denotes the number of filters, \(C_k\) denotes the number of input channels of the filter (since this is typically equal to the number of filters in the previous layer, \(C_k = N_{k-1}\)), and \(h_k\) and \(w_k\) denote the height and width of a filter. We can rewrite the parameter tensor as a 2D matrix of the shape \(N_k \times (C_k h_k w_k)\). Next, we append the biases of the filters to \(\mathbf W_k\), to make it a matrix of dimensions \(N_k \times (C_k h_k w_k + 1)\). It is well known that using this representation of the weights and biases of a layer, \(\mathbf W_k\), we can represent the output activation of any fully connected layer as a matrix product of \(\mathbf W_k\) with the incoming activation tensor \(\mathbf a_{k-1}\). This notion can be extended to convolution layers by re-casting the matrices in an appropriate Toeplitz form [56].
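As a concrete illustration of this reshaping, below is a minimal PyTorch sketch (PyTorch being one of the frameworks used in Sect. 4); the helper name layer_to_matrix is ours, and it assumes the convolution layer carries a bias term.

```python
import torch

def layer_to_matrix(conv_layer: torch.nn.Conv2d) -> torch.Tensor:
    """Flatten a convolution layer into the N_k x (C_k*h_k*w_k + 1) matrix W_k."""
    w = conv_layer.weight.detach()                  # shape: (N_k, C_k, h_k, w_k)
    n_k = w.shape[0]
    w_flat = w.reshape(n_k, -1)                     # shape: (N_k, C_k*h_k*w_k)
    b = conv_layer.bias.detach().reshape(n_k, 1)    # assumes bias=True; shape: (N_k, 1)
    return torch.cat([w_flat, b], dim=1)            # shape: (N_k, C_k*h_k*w_k + 1)
```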

The goal of compression is to obtain a compressed representation of the parameters \(\mathbf W_k\) of each layer that is smaller and computationally efficient, and preserves the final classification accuracy. Our approach is to construct compressed filter ‘coresets’ \(\hat{\mathbf W}_k \in \mathbb R^{\hat{N}_k \times (C_k h_k w_k + 1)}\) of the parameters of each layer (where \(\hat{N}_k < N_k\)), such that the output activations (obtained after the Toeplitz matrix multiplication) are well approximated. Ensuring that the output activations remain largely the same at every layer, post compression, ensures that the final classification performance remains largely unchanged. Since the elements of these coresets are typically linear functions of the original parameters, we additionally require a decompression matrix to obtain an approximation to the initial set of parameters, starting from the coreset representation.

Coresets are an effective technique for approximating a large set of points with a smaller set, which need not necessarily be a part of the original set, while preserving some desirable property such as mean pairwise distances, the diameter of the point set, etc. We seek to obtain a reduced matrix (coreset) representation of the original filter weights of every layer, which we do via three different approaches, as described below.

3.2 k-Means Coresets

A first approach to constructing such a coreset is to obtain a reduced representation of the parameter matrix \(\mathbf W_k\) that approximates the sum of distances between the filters in the space of neuronal activations of an arbitrary sample. Feldman et al. [14] demonstrate that this problem is equivalent to finding a low-rank approximation of the filter matrix. This is representable as follows:

\[ \min _{\mathbf Z\,:\;\mathrm {rank}(\mathbf Z) \le \hat{N}_k} \big \Vert \mathbf W_k - \mathbf Z \big \Vert _F^2 \qquad (1) \]

Their formulation for constructing a compact set of \(\hat{N}_k \ll N_k\) points using the sum-of-distances criterion leads to the k-Means Coreset, where the coreset representation is given by the solution to the above optimization problem:

\[ \hat{\mathbf W}_k = \mathbf D_{\hat{N}_k} \mathbf V_{\hat{N}_k}^{\top } \qquad (2) \]

Here, the matrices \(\mathbf D_{\hat{N}_k}\) and \(\mathbf V_{\hat{N}_k}\) are the \(\hat{N}_k\)-truncated versions of the matrices \(\mathbf D\) and \(\mathbf V\), which satisfy the property:

\[ \mathbf W_k = \mathbf U \mathbf D \mathbf V^{\top } \qquad (3) \]

\(\mathbf U\) and \(\mathbf V\) are unitary matrices, while \(\mathbf D\) is a diagonal matrix. Such a decomposition can be obtained using Singular Value Decomposition (SVD), where the extent of truncation is specified as an input to the algorithm. The truncation determines the amount of compression we obtain.
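The sketch below shows one way to realize this truncated-SVD factorization in NumPy; the function name, and the particular split of the factors into a coreset \(\mathbf D_{\hat{N}_k}\mathbf V_{\hat{N}_k}^{\top }\) and a decompression matrix \(\mathbf U_{\hat{N}_k}\), are our own illustrative choices.

```python
import numpy as np

def kmeans_coreset(W_k: np.ndarray, n_hat: int):
    """Truncated SVD of the flattened filter matrix W_k, shape (N_k, C_k*h_k*w_k + 1).

    Returns the coreset (n_hat 'virtual filters' in the original filter space)
    and the decompression matrix mapping it back to an approximation of W_k.
    """
    U, d, Vt = np.linalg.svd(W_k, full_matrices=False)   # W_k = U @ diag(d) @ Vt
    coreset = np.diag(d[:n_hat]) @ Vt[:n_hat]            # shape: (n_hat, C_k*h_k*w_k + 1)
    decompress = U[:, :n_hat]                            # shape: (N_k, n_hat)
    return coreset, decompress

# decompress @ coreset is the best rank-n_hat approximation of W_k in the Frobenius norm.
```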

Intuitively, a more severe truncation, while yielding greater compression, leads to a weaker approximation of the filter weights. This also results in a weaker approximation of the output activations, manifesting itself as a drop in classification accuracy. We seek the optimum compression, across all layers, such that the classification accuracy does not deviate by more than 0.5%.

SVD for compressing neural network weights has been investigated previously in [9], however, with two key differences - (i) the naive SVD approach has been applied only to fully-connected layers of neural networks, with limited success, whereas our coreset-based formulation scales to both convolution and fully connected layers, and (ii) our method for selecting the number of components to be retained, \(\hat{N}_k \), is data-dependent, based on the training error obtained on random subsets of the training data, instead of an arbitrary initialization followed by retraining. However, since this decomposition neither explicitly encodes any structure, such as sparsity, on the approximated weights, nor considers the impact of activations, we build upon this formulation to create stronger coreset representations. This sets us apart from prior work that employs simple low-rank decompositions for constructing efficient CNNs [36] (Fig. 1).

Fig. 1.
figure 1

Visual representation of compression pipeline for layer k of a neural network. Our algorithm proceeds in two steps - (i) filter pruning, and (ii) filter compression, as illustrated

3.3 Structured Sparse Coresets

If we consider the previous coreset decomposition, the optimization problem can be rewritten as (subject to constraints on each of the variables \(\mathbf U\) and \(\mathbf V\)):

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf W_k - \mathbf U \mathbf V^{\top } \big \Vert _F^2 \qquad (4) \]

To induce sparsity in the obtained decomposition, Jenatton et al. [30] introduce a technique known as Structured Sparse PCA, which optimizes the following:

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf W_k - \mathbf U \mathbf V^{\top } \big \Vert _F^2 + \lambda \sum _{i=1}^{\hat{N}_k} \Omega (\mathbf v_i) \qquad (5) \]

where \(\mathbf v_i\) denotes the \(i^{th}\) column of \(\mathbf V\) and \(\Omega (\cdot )\) is a structured sparsity-inducing norm.

This problem can be solved by a cyclic optimization of two convex problems [30], and provides us with a decomposition that possesses structured sparsity. The motivation behind using such a formulation is to obtain a decomposition that is sparse in the number of components used, while minimizing the reconstruction error. While techniques such as SPCA [32] or NMF [39] also construct representations that are sparse in the projected space, this formulation returns a decomposition that makes the approximation in the original space sparse as well; hence, both \(\mathbf U\) and \(\mathbf V\) are sparse. Moreover, this formulation allows us to discard those filters for which the corresponding row vector in \(\mathbf U\) is a null vector, leading to further compression.
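As an illustrative stand-in for this step, the sketch below uses scikit-learn's SparsePCA, which imposes a plain \(\ell _1\) penalty rather than the structured norm of Jenatton et al. [30]; the function name and the zero-row check are our own, and a faithful implementation would substitute the cyclic solver described above.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def sparse_coreset(W_k: np.ndarray, n_hat: int, alpha: float = 1.0):
    """Sparsity-inducing factorization W_k ~= codes @ components."""
    model = SparsePCA(n_components=n_hat, alpha=alpha, random_state=0)
    codes = model.fit_transform(W_k)          # shape: (N_k, n_hat)
    components = model.components_            # shape: (n_hat, C_k*h_k*w_k + 1)
    # A filter whose code vector is (numerically) zero is reconstructed as zero,
    # so it can be dropped from the layer entirely.
    keep = ~np.all(np.isclose(codes, 0.0), axis=1)
    return codes, components, keep
```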

The hyper-parameters \(\hat{N}_k\) and \(\lambda \) are chosen jointly so as to obtain the maximum compression while restricting the deviation in classification performance to within 0.5% of the uncompressed network, post the compression of all layers. We observe that this technique provides much more compression than the k-Means Coreset; however, it does not take into account the relative importance of the filters during reconstruction, which leads us to our final coreset formulation.

3.4 Activation-Weighted Coresets

Our final coreset formulation is obtained by introducing a relative importance score to every filter (based on their activation magnitudes over the training set), while inducing sparsity. However, if we attempt to directly learn a coreset representation by minimizing the reconstruction error over all the training set activations, the resulting optimization problem will be difficult to solve, owing to the large size of the activation matrix and its degenerate nature. We thus employ an alternate formulation: for each filter f in a layer, we compute its ‘importance’ \(i^{(f)}_k\) as the mean value of its activation over all training set points, normalized over all filters, in the \(k^{th}\) layer. This is given by the following:

\[ i^{(f)}_k = \frac{\frac{1}{T}\sum _{j=1}^{T} a^{(f)}_{k,j}}{\sum _{f'=1}^{N_k} \frac{1}{T}\sum _{j=1}^{T} a^{(f')}_{k,j}} \qquad (6) \]

Here, \(a^{(f)}_{k,j}\) is the activation of the \(f^{th}\) filter of layer k for training sample j, and T denotes the total number of training samples. We then construct the Importance Matrix for layer k by tiling the column vector \((i^{(f)}_k)_{f=1}^{N_k}\) \((C_k \times h_k \times w_k + 1)\) times, creating an Importance Matrix \(\mathbf I_k \in \mathbb R^{N_k \times (C_k h_k w_k + 1)}\), where each row denotes the ‘importance’ of the corresponding filter, normalized over all filters of the current layer.

We create this form of the importance matrix with every element of a row containing identical values, since we do not want to weigh each component of a particular filter differently. Note, additionally, that we can compute this matrix in only one forward pass of the entire training set. This leads us to the following optimization problem:

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf I_k \odot \big (\mathbf W_k - \mathbf U \mathbf V^{\top }\big ) \big \Vert _F^2 \qquad (7) \]

Here \(\odot \) denotes the Hadamard (elementwise) product. This problem is essentially a weighted low-rank decomposition, studied previously by Srebro and Jaakkola [52] and Delchambre [8], and is solved using an efficient Expectation-Maximization (EM) algorithm [52].
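A minimal NumPy sketch of this stage is given below: filter_importance implements the normalization of Eq. (6) on activations that have already been max-pooled per filter, and weighted_lowrank follows the EM update of Srebro and Jaakkola [52] (impute each entry with the current estimate in proportion to one minus its weight, then take a rank-\(\hat{N}_k\) SVD); the function names and the fixed iteration count are our own assumptions.

```python
import numpy as np

def filter_importance(acts: np.ndarray) -> np.ndarray:
    """acts: (T, N_k) max-pooled activations; returns the importances of Eq. (6)."""
    mean_act = acts.mean(axis=0)               # mean activation per filter
    return mean_act / mean_act.sum()           # normalize over the filters of the layer

def weighted_lowrank(W_k: np.ndarray, imp: np.ndarray, n_hat: int, n_iter: int = 50):
    """EM algorithm for the weighted low-rank problem of Eq. (7)."""
    I_k = np.tile(imp[:, None], (1, W_k.shape[1]))    # importance matrix, same shape as W_k
    W_hat = W_k.copy()
    for _ in range(n_iter):
        filled = I_k * W_k + (1.0 - I_k) * W_hat      # E-step: impute low-weight entries
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        W_hat = U[:, :n_hat] @ np.diag(d[:n_hat]) @ Vt[:n_hat]   # M-step: rank-n_hat SVD
    decompress = U[:, :n_hat]
    coreset = np.diag(d[:n_hat]) @ Vt[:n_hat]
    return decompress, coreset
```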

The intuition behind this weighted formulation is to ascribe a relative importance to the filters that contribute most to the activations over the training set (on average, in the Frobenius-norm sense), instead of attempting to reconstruct all activations with equal priority. Molchanov et al. [42] also use the notion of an importance criterion for compression, but rather than using it as a weighting scheme in the optimization objective, as we do, they use it directly to prune the ‘less important’ filters. In this case as well, we select the smallest number of components to be kept such that the classification accuracy remains within 0.5% of the original network, once the entire network has been compressed.

3.5 Activation-Based Filter Pruning

In related work, Li et al. observe that not all filters are equally important in the context of classification [40]. This motivates us to perform a pre-processing step before coreset compression, to first eliminate unimportant filters pre-emptively, based on the mean of their activation norms over the training set. This step is essential to remove unimportant weights, since pruning out a filter in a layer also removes the weights corresponding to that filter in the next layer, inducing greater sparsity. Using the notation from earlier, we can write the size of the filter bank of layer k as:

\[ |\mathbf W_k| = N_k \times (C_k h_k w_k + 1) \; \propto \; N_{k-1}, \quad \text {since } C_k = N_{k-1}. \]
By layer-wise pruning of complete filters, we can hence set the number of post-pruning filters at layer \(k-1\) to be \(N^*_{k-1} < N_{k-1}\), permitting further compression. In networks with skip-connections (e.g. ResNets), \(C_k \ne N_{k-1}\), but it is a positive linear combination of the number of filters of the “source” layers of the (skip) connections, hence the proportionality still holds.

Starting from the first layer of the network, we evaluate the activation values for the entire training set, layer-by-layer. Inspired by the standard “Max-Pool” sub-sampling techniques prevalent in modern CNNs [35, 50], we approximate the response of each filter in the convolution layers (a 2D matrix) by its maximum value (a scalar). Once we have this set of pooled filter-wise activations for all samples, we compute the mean squared norm of each filter over all training samples, and sort the filters by this value. This technique of ordering filters differentiates us from prior pruning-based techniques. We maximize the number of pruned filters, ensuring that the deviation in classification accuracy is at most 0.5% after the pruning has been carried out across all layers. Once we obtain a reduced set of filters with the crucial filters preserved, we compress this set of filters using coresets, as discussed earlier.
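The sketch below shows how such a per-filter score could be computed with a forward hook in PyTorch; filter_scores, the data loader, and the device handling are illustrative assumptions rather than a description of our released implementation.

```python
import torch

@torch.no_grad()
def filter_scores(model, layer: torch.nn.Conv2d, loader) -> torch.Tensor:
    """Mean squared max-pooled activation per filter, over the training set."""
    feats, scores, count = {}, None, 0
    hook = layer.register_forward_hook(lambda m, inp, out: feats.update(out=out))
    for x, _ in loader:
        model(x)
        a = feats["out"]                        # shape: (batch, N_k, H, W)
        pooled = a.flatten(2).amax(dim=2)       # max-pool each filter map to a scalar
        batch_score = (pooled ** 2).sum(dim=0)  # squared pooled response, summed over batch
        scores = batch_score if scores is None else scores + batch_score
        count += a.shape[0]
    hook.remove()
    return scores / count                       # sort filters by this value, descending
```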

3.6 Compression Pipeline and Computational Complexity Analysis

The entire pipeline for compression can be summarized in two stages - (i) activation-based pruning, followed by (ii) coreset-based compression. The pruning procedure can be summarized in the following steps:

  1. 1.

    Sort the layers of the network in order of descending parameter size.

  2. 2.

    For each layer of the sorted network, repeat the following steps:

    1. (a)

      Compute activations for every input in the training set, and store the maximum value for each filter activation (max-pool over spatial dimensions).

    2. (b)

      Sort the filters in descending order of the mean value of the max-pooled activations, over the entire training set.

    3. (c)

      Find the smallest number of filters \(N^*_k\) that can be retained while the performance deviation, post the pruning of all layers, is within 0.5% of the original performance - using binary search.

The complexities of the individual steps are \(\mathcal O(n \log n)\) for the initial sorting step, and \(\mathcal O(n \cdot (A + N_k \log N_k + A \log N_k))\) for the layer-wise activation computation, filter sorting, and binary search, where A denotes the cost of one feed-forward pass over the entire training set. Since \(A \gg N_k \gg 1\), the total complexity of the filter pruning is \(\mathcal O(n \cdot A \cdot \log \max _k N_k)\), requiring at most \(n \log \max _k N_k\) epochs of feed-forward operations, which, for most neural network architectures, we find to be much smaller than the cost of fine-tuning.
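For concreteness, step 2(c) above admits the following binary-search sketch, under the assumption that accuracy is monotonically non-decreasing in the number of retained top-ranked filters; eval_acc is a hypothetical helper that evaluates the network with only the k highest-ranked filters of the current layer active. Each call to eval_acc costs one feed-forward pass, which is the source of the \(A \log N_k\) term above.

```python
def smallest_filters_to_keep(eval_acc, n_filters: int, base_acc: float,
                             tol: float = 0.005) -> int:
    """Smallest k such that accuracy stays within `tol` of `base_acc`
    when only the k top-ranked filters of the current layer are kept."""
    lo, hi = 1, n_filters
    while lo < hi:
        mid = (lo + hi) // 2
        if eval_acc(mid) >= base_acc - tol:
            hi = mid          # mid filters suffice; try keeping fewer
        else:
            lo = mid + 1      # accuracy dropped too far; keep more filters
    return lo                 # this is N*_k for the layer
```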

After filter pruning, we proceed to the coreset-based compression stage. This procedure for compression can be summarized in the following steps:

For each layer in the network, starting from the shallowest, do:

  1. 1.

    Compute the complete decomposition according to the coreset formulation used.

  2. 2.

    Find the minimum number of coreset filters \({\hat{N}_k}\) that can be retained while performance is within 0.5% of the network prior to coreset compression, post the compression of all layers - by searching over a random subset of the training data using binary search.

The complexity of the coreset-construction stage is \(\mathcal O(n \cdot (B + sA \log N^*_k))\), where B is the complexity of the matrix decomposition, and \(0 \le s \le 1\) is the fraction of random training points used. For our experiments, we set \(s = 0.005\). We find that for most networks, \(sA\log N^*_k > B\), and hence the total complexity of the compression pipeline is \(\mathcal {O}(n \cdot A \cdot \log N_k \cdot (1+s))\). Note that after cascading activation-based pruning with coreset compression across all layers, the total deviation allowed in classification performance is 1% (0.5% for pruning and 0.5% for coreset compression).

4 Experimental Evaluation

We implement our method in PyTorch [44] and Caffe [31], and evaluate on a cluster of NVIDIA TITAN Xp and Tesla GPUs. Our implementation and other details are available online (Footnote 1). For all experiments, we evaluate all three coreset construction techniques, as well as the impact of activation-based pruning coupled with each, and report all results together with the baseline and comparable recent work. The Activation-Based Pruning pipeline is reported as AP, while the coreset techniques are reported as (1) k-Means Coreset (Coreset-K), (2) Structured Sparse Coreset (Coreset-S), and (3) Activation-Weighted Coreset (Coreset-A). We compare our compression performance with recent compression benchmarks, such as Fast-Food [59], SVD [9], Weight-Based Pruning [21], Deep Compression [20], memory-bounded CNNs [6], and Compression-Aware Training [2], on a wide array of CNN architectures, including the highly efficient SqueezeNet [29].

4.1 LeNet-5 on MNIST

The first architecture we evaluate is the LeNet-5 network [38] on the MNIST dataset [37]. This is a popular benchmark for network compression, and high values of compression are reported by various recent work, which makes it a very competitive setup. The results for this experiment are summarized in Table 2. We can see that the coreset-based methods outperform the recent work comfortably, with a relative improvement of 18% over the existing state-of-the-art.

Table 1. Compression (Comp.) results for both AlexNet [35] and VGGNet-16 [50] trained on the ImageNet dataset, along with variation in performance with Deep Compression

4.2 Large-Scale ImageNet Models

The next set of experiments is performed on large-scale ImageNet-trained models - very deep networks such as Residual Networks [24], AlexNet [35], and VGGNet-16 [50]. These architectures are ubiquitous in countless applied computer vision tasks [7, 23], and several recent compression techniques demonstrate remarkable compression on these models, which makes them an appropriate benchmark for evaluating compression performance. For these networks, we also demonstrate the impact of coupling Deep Compression (which involves quantization, pruning, and re-training, applied iteratively) with our method.

Table 3 summarizes the empirical evaluation on Residual Networks. All three coreset methods achieve state-of-the-art performance, with a substantial improvement over previous baselines. Even in 101-layer deep networks such as ResNet-101, we obtain compression consistent with that of the shallower ResNets. Note that this improvement is entirely on convolutional layers, which typically have far fewer redundancies than fully-connected layers. We additionally observe that activation-based pruning buys us significant compression, providing in essence a cascading additive effect.

Table 1 summarizes the empirical evaluation on the AlexNet and VGGNet-16 networks, two of the largest image classification networks in use today. We demonstrate substantial improvements over the state-of-the-art, compressing AlexNet by 19\(\times \) and VGGNet-16 by 17\(\times \) from their baseline sizes. When combined with Deep Compression, these ratios increase to 55\(\times \) and 238\(\times \) respectively, yielding models with a memory footprint of less than 4 MB. The results additionally highlight the improvement that activation-based pruning (AP) provides, which is most prominent in the Coreset-K and Coreset-S models.

Table 2. Compression (Comp.) results on LeNet-5
Table 3. Compression results on Residual Networks. Columns Acc. and Comp. represent the Top-1 accuracy and compression factor respectively
Table 4. Comparison with SqueezeNet [29] trained on the ImageNet dataset. We can compress SqueezeNet to create a model that is 832\(\times \) smaller than AlexNet [35] with the same performance.

4.3 SqueezeNet

We evaluate our method on the highly parameter-efficient SqueezeNet architecture to examine whether further redundancies persist after such compression in the architecture space, and whether they can be eliminated via efficient filter-bank representations. We find that despite beginning with 50\(\times \) fewer parameters than AlexNet (while providing the same performance), SqueezeNet can be compressed further (results in Table 4). Using our method, we are able to compress SqueezeNet to half its parameters, providing accuracy similar to AlexNet at 100\(\times \) compression. By coupling with Deep Compression, we obtain a net compression in model size of 16.64\(\times \) over the original model (or 832\(\times \) from AlexNet) while maintaining classification performance.

Table 5. LeNet-5 layer-wise compression of our method (denoted by identifiers) vis-à-vis prior work. The entries represent the fraction of parameters retained post compression

4.4 Additional Observations

Further, we observe that the Coreset-S and Coreset-A formulations consistently outperform Coreset-K. We surmise that large extant model redundancies tend to benefit the Coreset-A and Coreset-S formulations, where sparsity is explicitly enforced in the objective. Moreover, we observe that for deeper models, Coreset-S tends to achieve the most compression. Table 5 shows the superior layer-wise compression achieved by our algorithm vis-à-vis state-of-the-art compression techniques on LeNet-5. The results clearly bring out the efficacy of our compression technique, especially for convolution layers. For layer-wise compression results on other CNNs, please refer to the supplementary.

Runtime Analysis: We also analyze runtime for both training and inference. Since we do not undertake retraining, our method is considerably faster - on our hardware, one forward and backward pass of AlexNet (batch size 256) takes 16 ms naively, which corresponds to a total epoch training time (on ImageNet) of 2.5 min. We use this as a base measurement to compare the total training time (inclusive of the coreset operations). Table 1 describes the comparison of training times across methods. The previous state-of-the-art method, Dynamic Network Surgery [18], requires 140 epochs (in time units), whereas our method takes at most 28 epochs (in time units), a significant reduction of 80%. We observe a reduction in inference time as well, which can be optimized further by using efficient tensor multiplication [51]. On ResNet-50, VGGNet-16 and AlexNet, the naive (uncompressed) runtimes per forward pass are 36 ms, 45 ms and 8 ms respectively. Our best runtimes for these networks (with Coreset-S) are 19 ms, 21 ms and 3.5 ms on average, an average improvement of around 50% (Fig. 4).

Fig. 2.
figure 2

The comparison of Activation-Based Pruning (AP) with weight-based filter pruning (without re-training), on AlexNet

Fig. 3.
figure 3

Variation of classification performance with compression for all coreset compression techniques evaluated on AlexNet [35]

Fig. 4.
figure 4

The comparison of Activation-Based Pruning (AP) with the pruning techniques of Han et al. and Li et al. [21, 40], on AlexNet

4.5 Ablation Analysis

To demonstrate the effect of individual components in our method, we perform some ablation studies as well. We first compare the effect of activation-based pruning (AP) on all coreset compression techniques on three models - AlexNet [35], VGGNet-16 [50] and SqueezeNet [29], and observe that pruning benefits all methods of coreset compression, as described in Tables 1 and 4.

Next, we compare activation-based pruning with weight-based pruning (without re-training) and the pruning technique of Li et al. [40] on AlexNet. The results of these comparisons are summarized in Figs. 2 and 4. We obtain consistently better performance at all compression ratios, substantiating the merit of data-dependent filter pruning approaches over those based on the magnitudes of filter weights.

Finally, we analyze the variation of performance with compression factor for all coreset compression techniques on the AlexNet classification model, as described in Fig. 3. We observe that Coreset-K (with and without AP), while stronger than the SVD and pruning approaches, degrades much more rapidly than the other coreset techniques. This observation is consistent across all models. For additional results, more layer-wise compression analysis, and filter visualizations, we refer the reader to the supplementary material.

Table 6. Performance of coreset-based compression on domain-adaptation tasks.

4.6 Domain Adaptability

To measure the generalizability of our compressed models to newer tasks, we evaluate compressed models on domain adaptation benchmarks, following the experimental pipeline proposed in [41]. We evaluate the performance of the compressed VGGNet-16 [50] model on target domain-adaptation datasets - CUB-2011 [57] and Stanford-Dogs [34], two popular datasets for fine-grained image classification. These results are summarized in Table 6. We observe that our compressed coreset models provide classification performance close to the uncompressed networks, while surpassing networks compressed by other techniques. This exhibits the versatility of coreset-compressed models on domain adaptation tasks as well.

5 Conclusions and Future Work

In this paper we introduce a novel technique that exploits redundancies in the space of convolutional filter weights and sample activations to reduce neural network size, using the well-established concept of coresets coupled with an activation-based pruning technique. The lack of a re-training step in our algorithmic pipeline makes the implementation simple. Empirical evaluation reveals that our algorithm outperforms all other competing methods at compressing a wide array of popular CNN architectures. Our findings uncover the existence of redundancies even in the most compressed CNNs, such as SqueezeNets, which can be further exploited to improve efficiency.

Our method does not require any retraining, scales to both convolution and fully connected layers, and generalizes readily to different neural network models without being computationally intensive. We thus hope that our algorithm will serve as a valuable tool for obtaining leaner and more efficient CNNs. As future work, we hope to apply our algorithm to compress other types of deep neural networks, such as Recurrent Neural Networks (RNNs), which operate on time-varying sequential inputs.