1 Introduction

Convolutional neural networks, while immensely powerful, are often resource-intensive [24, 26, 35, 50, 54]. Popular CNN models such as AlexNet [35] and VGG-16 [50], for instance, have 61 and 138 million parameters and consume in excess of 200 MB and 500 MB of memory, respectively. This characteristic of deep CNN architectures reduces their portability and poses a severe bottleneck for deployment in resource-constrained environments [17]. Additionally, design choices for CNN architectures, such as network depth, filter sizes, and number of filters, often appear arbitrary and motivated purely by empirical performance on a particular task, leaving little room for interpretability. Moreover, the chosen architecture is not necessarily the most compact one capable of yielding a given level of accuracy, making these models highly resource-inefficient.

Several prior approaches have thus sought to reduce the computational complexity of these models. Work aimed at designing efficient CNN architectures, such as Residual Networks (ResNets) [25] and DenseNets [28], has shown promise at alleviating the challenge of model complexity. These CNNs provide higher classification performance at only a fraction of the parameters of their more resource-intensive counterparts. However, despite being more compact, redundancies remain in such networks, leaving room for further compression.

In this work, we propose a novel method that exploits inter-filter dependencies in the convolutional filter banks of CNNs to compress pre-trained, computationally intensive neural networks. Additionally, we leverage neuronal activation patterns across samples to prune irrelevant filters. Our compression pipeline consists of finely pruning the filters of every layer of the CNN based on sample activation patterns, followed by the construction of efficient filter coreset representations that exploit the inter-filter dependencies. Our method does not require retraining, is applicable to both fully-connected and convolution layers, and maintains classification performance similar to the uncompressed network. We demonstrate state-of-the-art compression rates on several popular CNN models, including multiple ResNets, with improvements in compression rate ranging from 9.2\(\times \) to 16.2\(\times \) over prior state-of-the-art techniques. Coupled with Deep Compression, we are additionally able to compress other popular CNN models such as VGGNet-16 [50] and AlexNet [35] by 238\(\times \) and 55\(\times \) respectively. Moreover, we demonstrate the presence of filter redundancies even in highly efficient models such as SqueezeNet [29], reducing its parameters by 50% with almost no loss in classification performance; coupled with Deep Compression, this yields AlexNet-level accuracy at an 832\( \times \) smaller model size than the original AlexNet model. Finally, we empirically validate the generalizability of these compressed CNNs to newer domains.

In the next section, we discuss relevant prior work in this area. In Sect. 3, we present the details of our algorithm. This is followed by Sect. 4, where a discussion on the empirical evaluation of our method vis-à-vis other competing compression techniques is presented. We finally conclude in Sect. 5, laying out some avenues for future research in this area.

2 Related Work

Network Compression: Compressing neural networks has been a topic of active research interest lately. Prior work in this area can be grouped into three distinct categories. The first category of methods focuses on the construction of parameter-efficient neural network architectures. For instance, Iandola et al. [29] propose SqueezeNets, a neural architecture class built from parameter-efficient, fully convolutional ‘fire’ modules. Other examples of such architectures include Residual Networks (ResNets) [25] and Densely Connected Neural Networks (DenseNets) [28], which provide higher classification performance with models much smaller than the previous state-of-the-art, using ‘skip-connections’ between layers of the network. More recent approaches have sought to adapt CNN architectures so as to make them robust to common transformations (e.g. rotation) within the data, by modifying the filter banks of a CNN [10, 61] or by enforcing sparsity while training [2]. While these approaches hold promise, they fail to fully exploit the inter-filter dependencies, leaving room for further compression of such networks. Meta-learning approaches attempt to discover the optimal CNN architecture by searching over an enormous space of candidate architectures. However, these techniques are prohibitively resource intensive (requiring well in excess of 400 GPUs to run), and often yield only a locally optimal architecture [46].

A second broad category of compression methods attempts to prune the unimportant network parameters. Han et al. [20] demonstrate an efficient pruning-retraining method, based on pruning weights by their \(\ell ^p\) norms. Srinivas and Babu [53] remove individual neurons instead of weights, with impressive results. The importance of ordering filters for the purpose of pruning has also been highlighted by Yu et al., He et al., and Molchanov et al. [27, 42, 60]. These approaches have been modified in the works of Polyak et al. [45] and Luo et al. [41], which focus on removing weights grouped by characteristics of the filters (such as the norm of the filter weights). Li et al. [40] extend the idea of filter pruning by removing filters from a network according to an ‘importance’ criterion. However, the re-training step in these algorithms is time intensive.

Finally, the third theme of compression techniques is to employ weight-approximation and information-theoretic principles for the compression of neural network parameters. An early example of such work is the approach by Denton et al. [9], which uses low-rank approximations to compress fully-connected layers of neural networks. However, this technique does not apply to convolution layers. Lebedev et al. [36] address this problem by applying a low-rank decomposition to the full CNN to construct more efficient representations, but their technique's principal bottleneck is re-training, which we avoid. Rosenfield et al. consider an efficient utilization of CNN filters by representing them as linear combinations of a basis set [48]. Our algorithm, in contrast to this line of work, is additionally capable of introducing structure, such as sparsity, in the approximated weights resulting from the decomposition, which further aids compression. Han et al. [20] introduce Deep Compression, which uses several steps such as weight-pruning, weight-sharing, and Huffman coding to reduce neural network size. However, their algorithm requires special hardware for inference in the compressed state, making it hard to deploy the compressed networks across platforms.

Our method addresses the shortcomings of each of these individual themes of CNN compression. In contrast to the first category of architecture search, our method is applicable to a wide variety of models, is less resource intensive, and does not require any retraining. While we do prune filters inspired by the work of Polyak et al. [45] (following the second theme of compression), our criterion for filter pruning is motivated by the accurate reconstruction of sample activations, rather than the magnitude of filter weights. Finally, our compression technique does not require special hardware for running inference, unlike Han et al. [19], and scales to both fully-connected and convolutional layers, unlike the low-rank (SVD) approach of Denton et al. [9].

Coresets for Point Selection: Coresets have been widely studied in computational geometry. They were introduced first by Agarwal et al. [1] for approximating a set of points with a smaller set, while preserving some desired criteria, on k-means and k-median problems. Badoiu et al. [5] propose a coreset formulation to cluster points using a subset of the total set of points to generate the optimal solution. Har-Peled and Mazumdar [22] give an alternate solution for coresets that include points not in the original set. Feldman et al. [13] demonstrate that weak coreset representations can be generated with the number of points independent of the underlying data distribution. These formulations have recently been applied to several problems within computer vision and machine learning [12, 14, 15], and are primarily used to approximate a set of n points in d dimensions, originating from a domain \(\mathbf {S}\), with a smaller set of \(\tilde{n} \ll n\) points, while preserving some criterion such as similar pairwise distances. However, coresets have remained unexplored in the context of CNN compression, which constitutes a major novelty of our work.

3 Method

We begin with a fully-trained CNN and compress it without retraining, first by pruning out unimportant filters, followed by extraction of efficient coreset representations of these filters. The major advantages of our method include: (i) the lack of retraining, and therefore a major reduction in processing time; (ii) the capacity to significantly compress both convolutional and fully connected layers; and (iii) the ability of the compressed CNN to generalize to newer tasks.

3.1 Background and Notation

An n-layered neural network can be described as a union of the parameter tensors of every layer, \(\mathcal W = \bigcup _{k=1}^{n} \mathbf W_k\). The parameters \(\mathbf W_k\) of layer k have the shape \(N_k \times C_k \times h_k \times w_k\), where \(N_k\) denotes the number of filters, \(C_k\) denotes the number of input channels of the filter (since this is typically equal to the number of filters in the previous layer, \(C_k = N_{k-1}\)), and \(h_k\) and \(w_k\) denote the height and width of a filter. We can rewrite the parameter tensor as a 2D matrix of the shape \(N_k \times (C_k h_k w_k)\). Next, we append the biases of the filters to \(\mathbf W_k\), to make it a matrix of dimensions \(N_k \times (C_k h_k w_k + 1)\). It is well known that using this representation of the weights and biases of a layer, \(\mathbf W_k\), we can represent the output activation of any fully connected layer as a matrix product of \(\mathbf W_k\) with the incoming activation tensor \(\mathbf a_{k-1}\). This notion can be extended to convolution layers by re-casting the matrices in an appropriate Toeplitz form [56].
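As a concrete illustration of this reshaping, below is a minimal PyTorch sketch (PyTorch being one of the frameworks used in Sect. 4); the helper name layer_to_matrix is ours, and it assumes the convolution layer carries a bias term.

```python
import torch

def layer_to_matrix(conv_layer: torch.nn.Conv2d) -> torch.Tensor:
    """Flatten a convolution layer into the N_k x (C_k*h_k*w_k + 1) matrix W_k."""
    w = conv_layer.weight.detach()                  # shape: (N_k, C_k, h_k, w_k)
    n_k = w.shape[0]
    w_flat = w.reshape(n_k, -1)                     # shape: (N_k, C_k*h_k*w_k)
    b = conv_layer.bias.detach().reshape(n_k, 1)    # assumes bias=True; shape: (N_k, 1)
    return torch.cat([w_flat, b], dim=1)            # shape: (N_k, C_k*h_k*w_k + 1)
```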

The goal of compression is to obtain a compressed representation of the parameters \(\mathbf W_k\) of each layer that is smaller and computationally efficient, and preserves the final classification accuracy. Our approach is to construct compressed filter ‘coresets’ \(\hat{\mathbf W}_k \in \mathbb R^{\hat{N}_k \times (C_k h_k w_k + 1)}\) of the parameters of each layer (where \(\hat{N}_k < N_k\)), such that the output activations (obtained after the Toeplitz matrix multiplication) are well approximated. Ensuring that the output activations remain largely the same at every layer, post compression, ensures that the final classification performance remains largely unchanged. Since the elements of these coresets are typically linear functions of the original parameters, we additionally require a decompression matrix to obtain an approximation to the initial set of parameters, starting from the coreset representation.

Coresets are an effective technique for approximating a large set of points with a smaller set, which need not necessarily be a part of the original set, while preserving some desirable property such as mean pairwise distances, the diameter of the point set, etc. We seek to obtain a reduced matrix (coreset) representation of the original filter weights of every layer, which we do via three different approaches, as described below.

3.2 k-Means Coresets

A first approach to constructing such a coreset is to obtain a reduced representation of the parameter matrix \(\mathbf W_k\) that approximates the sum of distances between the filters in the space of neuronal activations of an arbitrary sample. Feldman et al. [14] demonstrate that this problem is equivalent to finding a low-rank approximation of the filter matrix. This is representable as follows:

\[ \min _{\mathbf Z\,:\;\mathrm {rank}(\mathbf Z) \le \hat{N}_k} \big \Vert \mathbf W_k - \mathbf Z \big \Vert _F^2 \qquad (1) \]

Their formulation for constructing a compact set of \(\hat{N}_k \ll N_k\) points using the sum-of-distances criterion leads to the k-Means Coreset, where the coreset representation is given by the solution to the above optimization problem:

\[ \hat{\mathbf W}_k = \mathbf D_{\hat{N}_k} \mathbf V_{\hat{N}_k}^{\top } \qquad (2) \]

Here, the matrices \(\mathbf D_{\hat{N}_k}\) and \(\mathbf V_{\hat{N}_k}\) are the \(\hat{N}_k\)-truncated versions of the matrices \(\mathbf D\) and \(\mathbf V\), which satisfy the property:

\[ \mathbf W_k = \mathbf U \mathbf D \mathbf V^{\top } \qquad (3) \]

\(\mathbf U\) and \(\mathbf V\) are unitary matrices, while \(\mathbf D\) is a diagonal matrix. Such a decomposition can be obtained using Singular Value Decomposition (SVD), where the extent of truncation is specified as an input to the algorithm. The truncation determines the amount of compression we obtain.
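The sketch below shows one way to realize this truncated-SVD factorization in NumPy; the function name, and the particular split of the factors into a coreset \(\mathbf D_{\hat{N}_k}\mathbf V_{\hat{N}_k}^{\top }\) and a decompression matrix \(\mathbf U_{\hat{N}_k}\), are our own illustrative choices.

```python
import numpy as np

def kmeans_coreset(W_k: np.ndarray, n_hat: int):
    """Truncated SVD of the flattened filter matrix W_k, shape (N_k, C_k*h_k*w_k + 1).

    Returns the coreset (n_hat 'virtual filters' in the original filter space)
    and the decompression matrix mapping it back to an approximation of W_k.
    """
    U, d, Vt = np.linalg.svd(W_k, full_matrices=False)   # W_k = U @ diag(d) @ Vt
    coreset = np.diag(d[:n_hat]) @ Vt[:n_hat]            # shape: (n_hat, C_k*h_k*w_k + 1)
    decompress = U[:, :n_hat]                            # shape: (N_k, n_hat)
    return coreset, decompress

# decompress @ coreset is the best rank-n_hat approximation of W_k in the Frobenius norm.
```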

Intuitively, a more severe truncation, while yielding greater compression, leads to a weaker approximation of the filter weights. This also results in a weaker approximation of the output activations, manifesting itself as a drop in classification accuracy. We seek the optimum compression, across all layers, such that the classification accuracy does not deviate by more than 0.5%.

SVD for compressing neural network weights has been investigated previously in [9], however, with two key differences - (i) the naive SVD approach has been applied only to fully-connected layers of neural networks, with limited success, whereas our coreset-based formulation scales to both convolution and fully connected layers, and (ii) our method for selecting the number of components to be retained, \(\hat{N}_k \), is data-dependent, based on the training error obtained on random subsets of the training data, instead of an arbitrary initialization followed by retraining. However, since this decomposition neither explicitly encodes any structure, such as sparsity, on the approximated weights, nor considers the impact of activations, we build upon this formulation to create stronger coreset representations. This sets us apart from prior work that employs simple low-rank decompositions for constructing efficient CNNs [36] (Fig. 1).

Fig. 1.
figure 1

Visual representation of compression pipeline for layer k of a neural network. Our algorithm proceeds in two steps - (i) filter pruning, and (ii) filter compression, as illustrated

3.3 Structured Sparse Coresets

If we consider the previous coreset decomposition, the optimization problem can be rewritten as (subject to constraints on each of the variables \(\mathbf U\) and \(\mathbf V\)):

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf W_k - \mathbf U \mathbf V^{\top } \big \Vert _F^2 \qquad (4) \]

To induce sparsity in the obtained decomposition, Jenatton et al. [30] introduce a technique known as Structured Sparse PCA, which optimizes the following:

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf W_k - \mathbf U \mathbf V^{\top } \big \Vert _F^2 + \lambda \sum _{i=1}^{\hat{N}_k} \Omega (\mathbf v_i) \qquad (5) \]

where \(\mathbf v_i\) denotes the \(i^{th}\) column of \(\mathbf V\) and \(\Omega (\cdot )\) is a structured sparsity-inducing norm.

This problem can be solved by a cyclic optimization of two convex problems [30], and provides us with a decomposition that possesses structured sparsity. The motivation behind using such a formulation is to obtain a decomposition that is sparse in the number of components used, while minimizing the reconstruction error. While techniques such as SPCA [32] or NMF [39] also construct representations that are sparse in the projected space, this formulation returns a decomposition that makes the approximation in the original space sparse as well; hence, both \(\mathbf U\) and \(\mathbf V\) are sparse. Moreover, this formulation allows us to discard those filters for which the corresponding row vector in \(\mathbf U\) is a null vector, leading to further compression.
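As an illustrative stand-in for this step, the sketch below uses scikit-learn's SparsePCA, which imposes a plain \(\ell _1\) penalty rather than the structured norm of Jenatton et al. [30]; the function name and the zero-row check are our own, and a faithful implementation would substitute the cyclic solver described above.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def sparse_coreset(W_k: np.ndarray, n_hat: int, alpha: float = 1.0):
    """Sparsity-inducing factorization W_k ~= codes @ components."""
    model = SparsePCA(n_components=n_hat, alpha=alpha, random_state=0)
    codes = model.fit_transform(W_k)          # shape: (N_k, n_hat)
    components = model.components_            # shape: (n_hat, C_k*h_k*w_k + 1)
    # A filter whose code vector is (numerically) zero is reconstructed as zero,
    # so it can be dropped from the layer entirely.
    keep = ~np.all(np.isclose(codes, 0.0), axis=1)
    return codes, components, keep
```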

The hyper-parameters \(\hat{N}_k\) and \(\lambda \) are chosen jointly so as to obtain the maximum compression while restricting the deviation in classification performance to within 0.5% of the uncompressed network, post the compression of all layers. We observe that this technique provides much more compression than the k-Means Coreset; however, it does not take into account the relative importance of the filters during reconstruction, which leads us to our final coreset formulation.

3.4 Activation-Weighted Coresets

Our final coreset formulation is obtained by introducing a relative importance score to every filter (based on their activation magnitudes over the training set), while inducing sparsity. However, if we attempt to directly learn a coreset representation by minimizing the reconstruction error over all the training set activations, the resulting optimization problem will be difficult to solve, owing to the large size of the activation matrix and its degenerate nature. We thus employ an alternate formulation: for each filter f in a layer, we compute its ‘importance’ \(i^{(f)}_k\) as the mean value of its activation over all training set points, normalized over all filters, in the \(k^{th}\) layer. This is given by the following:

\[ i^{(f)}_k = \frac{\frac{1}{T}\sum _{j=1}^{T} a^{(f)}_{k,j}}{\sum _{f'=1}^{N_k} \frac{1}{T}\sum _{j=1}^{T} a^{(f')}_{k,j}} \qquad (6) \]

Here, \(a^{(f)}_{k,j}\) is the activation of the \(f^{th}\) filter of layer k for training sample j, and T denotes the total number of training samples. We then construct the Importance Matrix for layer k by tiling the column vector \((i^{(f)}_k)_{f=1}^{N_k}\) \((C_k \times h_k \times w_k + 1)\) times, creating an Importance Matrix \(\mathbf I_k \in \mathbb R^{N_k \times (C_k h_k w_k + 1)}\), where each row denotes the ‘importance’ of the corresponding filter, normalized over all filters of the current layer.

We create this form of the importance matrix with every element of a row containing identical values, since we do not want to weigh each component of a particular filter differently. Note, additionally, that we can compute this matrix in only one forward pass of the entire training set. This leads us to the following optimization problem:

\[ \min _{\mathbf U, \mathbf V} \big \Vert \mathbf I_k \odot \big (\mathbf W_k - \mathbf U \mathbf V^{\top }\big ) \big \Vert _F^2 \qquad (7) \]

Here \(\odot \) denotes the Hadamard (elementwise) product. This problem is essentially a weighted low-rank decomposition, studied previously by Srebro and Jaakkola [52] and Delchambre [8], and is solved using an efficient Expectation-Maximization (EM) algorithm [52].
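A minimal NumPy sketch of this stage is given below: filter_importance implements the normalization of Eq. (6) on activations that have already been max-pooled per filter, and weighted_lowrank follows the EM update of Srebro and Jaakkola [52] (impute each entry with the current estimate in proportion to one minus its weight, then take a rank-\(\hat{N}_k\) SVD); the function names and the fixed iteration count are our own assumptions.

```python
import numpy as np

def filter_importance(acts: np.ndarray) -> np.ndarray:
    """acts: (T, N_k) max-pooled activations; returns the importances of Eq. (6)."""
    mean_act = acts.mean(axis=0)               # mean activation per filter
    return mean_act / mean_act.sum()           # normalize over the filters of the layer

def weighted_lowrank(W_k: np.ndarray, imp: np.ndarray, n_hat: int, n_iter: int = 50):
    """EM algorithm for the weighted low-rank problem of Eq. (7)."""
    I_k = np.tile(imp[:, None], (1, W_k.shape[1]))    # importance matrix, same shape as W_k
    W_hat = W_k.copy()
    for _ in range(n_iter):
        filled = I_k * W_k + (1.0 - I_k) * W_hat      # E-step: impute low-weight entries
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        W_hat = U[:, :n_hat] @ np.diag(d[:n_hat]) @ Vt[:n_hat]   # M-step: rank-n_hat SVD
    decompress = U[:, :n_hat]
    coreset = np.diag(d[:n_hat]) @ Vt[:n_hat]
    return decompress, coreset
```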

The intuition behind this weighted formulation is to ascribe a relative importance to the filters that contribute most to the activations over the training set (on average, in the Frobenius-norm sense), instead of attempting to reconstruct all activations with equal priority. Molchanov et al. [42] also use the notion of an importance criterion for compression, but rather than using it as a weighting scheme in the optimization objective, as we do, they use it directly to prune the ‘less important’ filters. In this case as well, we select the smallest number of components to be kept such that the classification accuracy remains within 0.5% of the original network, once the entire network has been compressed.

3.5 Activation-Based Filter Pruning

In related work, Li et al. observe that not all filters are equally important in the context of classification [40]. This motivates us to perform a pre-processing step before coreset compression, to first eliminate unimportant filters pre-emptively, based on the mean of their activation norms over the training set. This step is essential to remove unimportant weights, since pruning out a filter in a layer also removes the weights corresponding to that filter in the next layer, inducing greater sparsity. Using the notation from earlier, we can write the size of the filter bank of layer k as:

\[ |\mathbf W_k| = N_k \times (C_k h_k w_k + 1) \; \propto \; N_{k-1}, \quad \text {since } C_k = N_{k-1}. \]
By layer-wise pruning of complete filters, we can hence set the number of post-pruning filters at layer \(k-1\) to be \(N^*_{k-1} < N_{k-1}\), permitting further compression. In networks with skip-connections (e.g. ResNets), \(C_k \ne N_{k-1}\), but it is a positive linear combination of the number of filters of the “source” layers of the (skip) connections, hence the proportionality still holds.

Starting from the first layer of the network, we evaluate the activation values for the entire training set, layer-by-layer. Inspired by the standard “Max-Pool” sub-sampling techniques prevalent in modern CNNs [35, 50], we approximate the response of each filter in the convolution layers (a 2D matrix) by its maximum value (a scalar). Once we have this set of pooled filter-wise activations for all samples, we compute the mean squared norm of each filter over all training samples, and sort the filters by this value. This technique of ordering filters differentiates us from prior pruning-based techniques. We maximize the number of pruned filters, ensuring that the deviation in classification accuracy is at most 0.5% after the pruning has been carried out across all layers. Once we obtain a reduced set of filters with the crucial filters preserved, we compress this set of filters using coresets, as discussed earlier.
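The sketch below shows how such a per-filter score could be computed with a forward hook in PyTorch; filter_scores, the data loader, and the device handling are illustrative assumptions rather than a description of our released implementation.

```python
import torch

@torch.no_grad()
def filter_scores(model, layer: torch.nn.Conv2d, loader) -> torch.Tensor:
    """Mean squared max-pooled activation per filter, over the training set."""
    feats, scores, count = {}, None, 0
    hook = layer.register_forward_hook(lambda m, inp, out: feats.update(out=out))
    for x, _ in loader:
        model(x)
        a = feats["out"]                        # shape: (batch, N_k, H, W)
        pooled = a.flatten(2).amax(dim=2)       # max-pool each filter map to a scalar
        batch_score = (pooled ** 2).sum(dim=0)  # squared pooled response, summed over batch
        scores = batch_score if scores is None else scores + batch_score
        count += a.shape[0]
    hook.remove()
    return scores / count                       # sort filters by this value, descending
```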

3.6 Compression Pipeline and Computational Complexity Analysis

The entire pipeline for compression can be summarized in two stages - (i) activation-based pruning, followed by (ii) coreset-based compression. The pruning procedure can be summarized in the following steps:

  1. 1.

    Sort the layers of the network in order of descending parameter size.

  2. 2.

    For each layer of the sorted network, repeat the following steps:

    1. (a)

      Compute activations for every input in the training set, and store the maximum value for each filter activation (max-pool over spatial dimensions).

    2. (b)

      Sort the filters in descending order of the mean value of the max-pooled activations, over the entire training set.

    3. (c)

      Find the smallest number of filters \(N^*_k\) that can be retained while the performance deviation, post the pruning of all layers, is within 0.5% of the original performance - using binary search.

The complexities of the individual steps are \(\mathcal O(n \log n)\) for the initial sorting step, and \(\mathcal O(n \cdot (A + N_k \log N_k + A \log N_k))\) for the layer-wise activation computation, filter sorting, and binary search, where A denotes the cost of one feed-forward pass over the entire training set. Since \(A \gg N_k \gg 1\), the total complexity of the filter pruning is \(\mathcal O(n \cdot A \cdot \log \max _k N_k)\), requiring at most \(n \log \max _k N_k\) epochs of feed-forward operations, which, for most neural network architectures, we find to be much smaller than the cost of fine-tuning.
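For concreteness, step 2(c) above admits the following binary-search sketch, under the assumption that accuracy is monotonically non-decreasing in the number of retained top-ranked filters; eval_acc is a hypothetical helper that evaluates the network with only the k highest-ranked filters of the current layer active. Each call to eval_acc costs one feed-forward pass, which is the source of the \(A \log N_k\) term above.

```python
def smallest_filters_to_keep(eval_acc, n_filters: int, base_acc: float,
                             tol: float = 0.005) -> int:
    """Smallest k such that accuracy stays within `tol` of `base_acc`
    when only the k top-ranked filters of the current layer are kept."""
    lo, hi = 1, n_filters
    while lo < hi:
        mid = (lo + hi) // 2
        if eval_acc(mid) >= base_acc - tol:
            hi = mid          # mid filters suffice; try keeping fewer
        else:
            lo = mid + 1      # accuracy dropped too far; keep more filters
    return lo                 # this is N*_k for the layer
```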

After filter pruning, we proceed to the coreset-based compression stage. This procedure for compression can be summarized in the following steps:

For each layer in the network, starting from the shallowest, do:

  1. 1.

    Compute the complete decomposition according to the coreset formulation used.

  2. 2.

    Find the minimum number of coreset filters \({\hat{N}_k}\) that can be retained while performance is within 0.5% of the network prior to coreset compression, post the compression of all layers - by searching over a random subset of the training data using binary search.

The complexity of the coreset-construction stage is \(\mathcal O(n \cdot (B + sA \log N^*_k))\), where B is the complexity of the matrix decomposition, and \(0 \le s \le 1\) is the fraction of random training points used. For our experiments, we set \(s = 0.005\). We find that for most networks, \(sA\log N^*_k > B\), and hence the total complexity of the compression pipeline is \(\mathcal {O}(n \cdot A \cdot \log N_k \cdot (1+s))\). Note that after cascading activation-based pruning with coreset compression across all layers, the total deviation allowed in classification performance is 1% (0.5% for pruning and 0.5% for coreset compression).

4 Experimental Evaluation

We implement our method in PyTorch [44] and Caffe [31], and evaluate on a cluster of NVIDIA TITAN Xp and Tesla GPUs. Our implementation and other details are available online (Footnote 1). For all experiments, we evaluate all three coreset construction techniques, as well as the impact of activation-based pruning coupled with each, and report all results together with the baseline and comparable recent work. The Activation-Based Pruning pipeline is reported as AP, while the coreset techniques are reported as (1) k-Means Coreset (Coreset-K), (2) Structured Sparse Coreset (Coreset-S), and (3) Activation-Weighted Coreset (Coreset-A). We compare our compression performance with recent compression benchmarks, such as Fast-Food [59], SVD [9], Weight-Based Pruning [21], Deep Compression [20], memory-bounded CNNs [6], and Compression-Aware Training [2], on a wide array of CNN architectures, including the highly efficient SqueezeNet [29].

4.1 LeNet-5 on MNIST

The first architecture we evaluate is the LeNet-5 network [38] on the MNIST dataset [37]. This is a popular benchmark for network compression, and high values of compression are reported by various recent work, which makes it a very competitive setup. The results for this experiment are summarized in Table 2. We can see that the coreset-based methods outperform the recent work comfortably, with a relative improvement of 18% over the existing state-of-the-art.

Table 1. Compression (Comp.) results for both AlexNet [35] and VGGNet-16 [50] trained on the ImageNet dataset, along with variation in performance with Deep Compression

4.2 Large-Scale ImageNet Models

The next set of experiments is performed on large-scale ImageNet-trained models - very deep networks such as Residual Networks [24], AlexNet [35], and VGGNet-16 [50]. These architectures are ubiquitous in countless applied computer vision tasks [7, 23], and several recent compression techniques demonstrate remarkable compression on these models, which makes them an appropriate benchmark for evaluating compression performance. For these networks, we also demonstrate the impact of coupling Deep Compression (which involves quantization, pruning, and re-training, applied iteratively) with our method.

Table 3 summarizes the empirical evaluation on Residual Networks. All three coreset methods achieve state-of-the-art performance, with a substantial improvement over previous baselines. Even in 101-layer deep networks such as ResNet-101, we obtain compression consistent with that of the shallower ResNets. Note that this improvement is entirely on convolutional layers, which typically have far fewer redundancies than fully-connected layers. We additionally observe that activation-based pruning buys us significant compression, providing in essence a cascading additive effect.

Table 1 summarizes the empirical evaluation on the AlexNet and VGGNet-16 networks, two of the largest image classification networks in use today. We demonstrate substantial improvements over the state-of-the-art, compressing AlexNet by 19\(\times \) and VGGNet-16 by 17\(\times \) from their baseline sizes. When combined with Deep Compression, these ratios increase to 55\(\times \) and 238\(\times \) respectively, yielding models with a memory footprint of less than 4 MB. The results additionally highlight the improvement that activation-based pruning (AP) provides, which is most prominent in the Coreset-K and Coreset-S models.

Table 2. Compression (Comp.) results on LeNet-5
Table 3. Compression results on Residual Networks. Columns Acc. and Comp. represent the Top-1 accuracy and compression factor respectively
Table 4. Comparison with SqueezeNet [29] trained on the ImageNet dataset. We can compress SqueezeNet to create a model that is 832\(\times \) smaller than AlexNet [35] with the same performance.

4.3 SqueezeNet

We evaluate our method on the highly parameter-efficient SqueezeNet architecture to examine whether further redundancies persist after such compression in the architecture space, and whether they can be eliminated via efficient filter-bank representations. We find that despite beginning with 50\(\times \) fewer parameters than AlexNet (while providing the same performance), SqueezeNet can be compressed further (results in Table 4). Using our method, we are able to compress SqueezeNet to half its parameters, providing accuracy similar to AlexNet at 100\(\times \) compression. By coupling with Deep Compression, we obtain a net compression in model size of 16.64\(\times \) over the original model (or 832\(\times \) from AlexNet) while maintaining classification performance.

Table 5. LeNet-5 layer-wise compression of our method (denoted by identifiers) vis-à-vis prior work. The entries represent the fraction of parameters retained post compression

4.4 Additional Observations

Further, we observe that the Coreset-S and Coreset-A formulations consistently outperform Coreset-K. We surmise that large extant model redundancies tend to benefit the Coreset-A and Coreset-S formulations, where sparsity is explicitly enforced in the objective. Moreover, we observe that for deeper models, Coreset-S tends to achieve the most compression. Table 5 shows the superior layer-wise compression achieved by our algorithm vis-à-vis state-of-the-art compression techniques on LeNet-5. The results clearly bring out the efficacy of our compression technique, especially for convolution layers. For layer-wise compression results on other CNNs, please refer to the supplementary.

Runtime Analysis: We also analyze runtime for both training and inference. Since we do not undertake retraining, our method is considerably faster - on our hardware, one forward and backward pass of AlexNet (batch size 256) takes 16 ms naively, which corresponds to a total epoch training time (on ImageNet) of 2.5 min. We use this as a base measurement to compare the total training time (inclusive of the coreset operations). Table 1 describes the comparison of training times across methods. The previous state-of-the-art method, Dynamic Network Surgery [18], requires 140 epochs (in time units), whereas our method takes at most 28 epochs (in time units), a significant reduction of 80%. We observe a reduction in inference time as well, which can be optimized further by using efficient tensor multiplication [51]. On ResNet-50, VGGNet-16 and AlexNet, the naive (uncompressed) runtimes per forward pass are 36 ms, 45 ms and 8 ms respectively. Our best runtimes for these networks (with Coreset-S) are 19 ms, 21 ms and 3.5 ms on average, an average improvement of around 50% (Fig. 4).

Fig. 2.
figure 2

The comparison of Activation-Based Pruning (AP) with weight-based filter pruning (without re-training), on AlexNet

Fig. 3.
figure 3

Variation of classification performance with compression for all coreset compression techniques evaluated on AlexNet [35]

Fig. 4.
figure 4

The comparison of Activation-Based Pruning (AP) with the pruning techniques of Han et al. and Li et al. [21, 40], on AlexNet

4.5 Ablation Analysis

To demonstrate the effect of individual components in our method, we perform some ablation studies as well. We first compare the effect of activation-based pruning (AP) on all coreset compression techniques on three models - AlexNet [35], VGGNet-16 [50] and SqueezeNet [29], and observe that pruning benefits all methods of coreset compression, as described in Tables 1 and 4.

Next, we compare activation-based pruning with weight-based pruning (without re-training) and the pruning technique of Li et al. [40] on AlexNet. The results of these comparisons are summarized in Figs. 2 and 4. We obtain consistently better performance at all compression ratios, substantiating the merit of data-dependent filter pruning approaches over those based on the magnitudes of filter weights.

Finally, we analyze the variation of performance with compression factor for all coreset compression techniques on the AlexNet classification model, as described in Fig. 3. We observe that Coreset-K (with and without AP), while stronger than the SVD and pruning approaches, degrades much more rapidly than the other coreset techniques. This observation is consistent across all models. For additional results, more layer-wise compression analysis, and filter visualizations, we refer the reader to the supplementary material.

Table 6. Performance of coreset-based compression on domain-adaptation tasks.

4.6 Domain Adaptability

To measure the generalizability of our compressed models to newer tasks, we evaluate compressed models on domain adaptation benchmarks, following the experimental pipeline proposed in [41]. We evaluate the performance of the compressed VGGNet-16 [50] model on target domain-adaptation datasets - CUB-2011 [57] and Stanford-Dogs [34], two popular datasets for fine-grained image classification. These results are summarized in Table 6. We observe that our compressed coreset models provide classification performance close to the uncompressed networks, while surpassing networks compressed by other techniques. This exhibits the versatility of coreset-compressed models on domain adaptation tasks as well.

5 Conclusions and Future Work

In this paper we introduce a novel technique that exploits redundancies in the space of convolutional filter weights and sample activations to reduce neural network size, using the well-established concept of coresets coupled with an activation-based pruning technique. The lack of a re-training step in our algorithmic pipeline makes the implementation simple. Empirical evaluation reveals that our algorithm outperforms all other competing methods at compressing a wide array of popular CNN architectures. Our findings uncover the existence of redundancies even in the most compressed CNNs, such as SqueezeNets, which can be further exploited to improve efficiency.

Our method does not require any retraining, scales to both convolution and fully connected layers, and generalizes readily to different neural network models without being computationally intensive. We thus hope that our algorithm will serve as a valuable tool for obtaining leaner and more efficient CNNs. As future work, we hope to apply our algorithm to compress other types of deep neural networks, such as Recurrent Neural Networks (RNNs), which operate on time-varying sequential inputs.