
1 Introduction

In an intelligent wireless vision sensor network (iWVSN), the vision analysis task is performed on compressed images. Therefore, the reconstruction quality of the compressed image, as well as the encoder design and configuration, has a direct impact on the subsequent vision analysis performance. The latest standardization efforts in compression coding have led to the specification of high efficiency video coding (HEVC) [1]. Studies have been performed to analyze and model the complexity behavior of the HEVC encoder. In [2], the encoding complexity is incorporated into the rate-distortion analysis to reduce the encoder’s energy consumption, where the macroblock-level computational complexity of the H.264 encoder is modeled for each prediction mode. The authors of [3] proposed a rate-power allocation scheme for wireless video chat applications, where the transmission parameters are adaptively adjusted based on a power-rate-distortion model.

Recently, researchers have recognized the importance of joint design of image compression and vision analysis. For traffic surveillance, an unequal error protection scheme was developed in [4] to increase the vehicle tracking accuracy by allocating more resources to the image region of interest. By classifying macroblocks into different groups in video frames, a rate control method was also proposed for preserving the important local image features [5]. For moving object surveillance, a dynamic rate control scheme was developed in [6] to achieve higher image quality for the regions of interest. For lossy image compression of plant phenotyping, a λ-domain HEVC rate-distortion model was implemented to reduce the object segmentation errors at different bit rates [7].

In this work, we choose deep convolutional neural networks (DCNN) for object classification of target images at the server end. Deep neural networks are able to construct complex representations and automatically learn a compositional relationship between inputs and outputs, mapping input images to output labels [8]. Once a DCNN is trained using the back-propagation learning procedure, classification (or testing) is a purely feedforward process [9]. During the past several years, a significant amount of work has been done to push the performance limits of DCNN in vision analysis. However, the joint design of image compression, wireless transmission, and DCNN-based object classification has not been studied.

Within the context of iWVSN with DCNN-based target classification, this work identifies two important system control parameters, the image sampling ratio (\( S \)) and the quantization parameter (\( Q \)) of the HEVC intra encoder, that play a critical role in determining the encoder complexity, coding bit rate, energy consumption in encoding and wireless transmission, reconstructed image quality, and object classification precision. Following an operational approach with extensive experiments, we establish models to characterize the behaviors of coding bit rate, encoding energy, wireless transmission energy, and DCNN classification precision with respect to these two control parameters. Based on these models, we then develop optimal resource allocation schemes to minimize the sensor-node energy consumption while achieving the required object classification precision.

2 Energy-Precision Control Framework

As discussed above, the task objective of the iWVSN is to identify targets. The target images are collected, encoded, transmitted, and analyzed for automated classification. As illustrated in Fig. 1, each iWVSN sensor node encodes the target image using the HEVC intra encoder. The compressed bit stream is transmitted over a wireless channel and then forwarded to the cloud server through the Internet. At the server side, the bit stream is decoded to reconstruct the image. The DCNN is then applied to classify this reconstructed image and determine the target class. The iWVSN system is controlled by two important parameters: (1) the sampling ratio \( S \) and (2) the quantization parameter \( Q \). Specifically, before encoding, we down-sample the target image X with a sampling ratio of \( S \). The sampling ratio S has a direct impact on: (1) the encoding complexity, which translates into encoder power consumption, (2) the coding bit rate, which translates into power consumption in wireless transmission, and (3) the complexity and precision of the DCNN classifier. The quantization parameter Q has a direct impact on (1) the coding bit rate, (2) the quality of reconstructed images, and (3) the precision of target classification.

Fig. 1. The module diagram of the energy-precision control framework.

HEVC image compression and wireless transmission are the two major tasks of each node, consuming most of its energy. With S and Q as the control parameters, \( \varvec{P}(S, Q) \) denotes the classification precision in percentage (%), \( \varvec{R}(S, Q) \) denotes the coding bit rate per image in Kbps, and \( \varvec{C}(S, Q) \) denotes the average complexity per image in milliseconds (ms). The node-end energy consumption includes two additive components: the encoding energy \( \varvec{E}_{\text{c}} \) for compressing images, and the transmission energy \( \varvec{E}_{\text{t}} \) for sending the bit stream to a cloud server. The encoding energy \( \varvec{E}_{\text{c}} \) is related to the computational complexity \( \varvec{C}(S, Q) \) of the encoder, which depends on the two control parameters S and Q. In other words, we have

$$ \varvec{E}_{\text{c}} = \Phi \left[ \varvec{C}\left( S, Q \right) \right] $$
(1)

where \( \Phi \left[ \cdot \right] \) is a task-specific mapping from the computational complexity (or processor cycles) to energy consumption. The transmission energy \( \varvec{E}_{\text{t}} \) is related to the bit rate \( \varvec{R}(S, Q) \) of the compressed image data stream, which also depends on \( (S, Q) \). Therefore, we write

$$ \varvec{E}_{\text{t}} = \Theta \left[ \varvec{R}\left( S, Q \right) \right] $$
(2)

where \( \Theta \left[ \cdot \right] \) is also a task-specific mapping, which depends on the wireless transmission scheme. In this work, we consider a concise mapping mechanism for \( \Phi \left[ \cdot \right] \) and \( \Theta \left[ \cdot \right] \). In iWVSN, the node-end processor power is stable and the wireless transmission is delay-tolerant. The encoding energy \( \varvec{E}_{\text{c}} \) therefore exhibits a linear relation with the computational complexity \( \varvec{C}(S, Q) \), and the wireless transmission energy \( \varvec{E}_{\text{t}} \) also exhibits a linear relation with the coding bit rate \( \varvec{R}(S, Q) \) [10]. In this way, the total energy consumption of the sensor node is given as follows:

$$ \varvec{E}(S, Q) = \varvec{E}_{\text{c}} + \varvec{E}_{\text{t}} = p_{\text{c}} \cdot \varvec{C}(S, Q) + e_{\text{t}} \cdot \varvec{R}(S, Q) $$
(3)

where the encoding power \( p_{\text{c}} \) is a constant in J/ms, and the wireless transmission power \( e_{\text{t}} \) is another constant in J/Kbps. At the server end, the HEVC decoder decodes the received bit stream and reconstructs the image. The reconstructed image is then used as input to the DCNN module for target classification. Note that the overall objective of the iWVSN is to determine the target classes. Therefore, we propose to use the classification precision \( \varvec{P}(S, Q) \) as the performance metric, which depends on the size and quality of the input image.
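For illustration, the additive energy model in (3) can be written as a short routine. The default constants below are the platform-specific values reported in Sect. 5 (\( p_{\text{c}} \) = 0.14 J/ms, \( e_{\text{t}} \) = 2.6 × 10−3 J/Kbps) and would need re-calibration for any other hardware or transmission scheme:

```python
def node_energy(C_ms, R_kbps, p_c=0.14, e_t=2.6e-3):
    """Total node-end energy E(S, Q) in joules for one image, Eq. (3).

    C_ms   : encoding complexity C(S, Q), measured as encoding time in ms
    R_kbps : coding bit rate R(S, Q), in Kbps
    p_c    : encoding power, in J/ms (platform-specific constant)
    e_t    : wireless transmission power, in J/Kbps (scheme-specific constant)
    """
    E_c = p_c * C_ms      # encoding energy, Eq. (1) under the linear mapping
    E_t = e_t * R_kbps    # transmission energy, Eq. (2) under the linear mapping
    return E_c + E_t
```

With the average complexity (258 ms) and average bit rate (1725 Kbps) observed in Sect. 3.2, this gives roughly 36.1 J for encoding plus 4.5 J for transmission per image, showing that encoding dominates the node-end energy budget under these constants.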

One major motivation of this work comes from the following observation: a vision sensor node may spend excessive computational and energy resources encoding and transmitting image samples whose quality is much higher than needed for accurate target classification. In other words, from the target classification perspective, the sensor nodes may waste a lot of energy. This leads to the following optimal resource allocation and control problem under a DCNN precision constraint:

$$ \min \, \varvec{E}(S, Q) \quad {\text{s.t.}} \quad \varvec{P}(S, Q) \ge P_{\min} $$
(4)

In this work, we aim to minimize the energy consumption of the iWVSN node while achieving the required precision \( P_{\min} \) for target classification. To solve this control problem, we need to establish the precision-rate-complexity models \( \varvec{P}(S, Q) \), \( \varvec{R}(S, Q) \), and \( \varvec{C}(S, Q) \), which will be presented in the following section.

3 Precision-Rate-Complexity Modeling

Through extensive experiments, we will establish models to characterize the behaviors of rate, complexity, and precision with respect to the two control parameters: S and Q.

3.1 Datasets and Experimental Setup

In this paper, we consider the application scenario of remote wildlife monitoring and protection. A network of vision sensors is deployed to monitor wildlife and human presence in the monitored region. Triggered by animal motion, a sensor node captures an image and transmits it to the cloud server for object classification: animal, human, or no object. For example, if a human is detected in the wildlife protection zone, an alarm will be generated. To test the DCNN classification module, we have assembled a dataset of 1001 images of size 640 × 480, with about 1/3 of the images in each class. The basic unit of HEVC is the coding tree block (CTB), whose minimum size is 16 × 16 pixels. Let \( (W, H) \) and \( (W_{\text{d}}, H_{\text{d}}) \) be the (width, height) of the original image X and its down-sampled image \( \varvec{X}_{\text{d}} \), respectively. With a given sampling ratio S, the (width, height) of the down-sampled image \( \varvec{X}_{\text{d}} \) is given by:

$$ (W_{\text{d}}, H_{\text{d}}) = \left( [W/\sqrt{S}], \, [H/\sqrt{S}] \right) $$
(5)

where [k] denotes the multiple of 16 closest to k; the width and height of the down-sampled image thus increase or decrease uniformly. For each target image, we use the HEVC intra encoder to compress the image with different sampling ratios \( S \) and quantization parameters \( Q \). The candidate values of S are \( \varOmega_{s} = \{1, 2, \cdots, 50\} \) and the candidate values of Q are \( \varOmega_{q} = \{0, 1, 2, \cdots, 51\} \). In total, we have 50 × 52 different (S, Q) configurations. In this paper, we assume that the compressed bit stream is correctly received at the server side for successful image decoding and reconstruction. The DCNN is then applied to classify the reconstructed image into one of three classes: Human, Animal, and Background. The DCNN model is trained beforehand with a large set of labeled images, which are uncompressed and have the original resolution of 640 × 480.
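The dimension rule in (5) can be sketched as follows. The helper name `nearest_16` and the lower bound of one CTB per dimension are our own naming and an added safeguard, not part of the original specification:

```python
import math

def downsampled_size(W, H, S):
    """Eq. (5): down-sampled (width, height) for sampling ratio S.

    Each dimension is scaled by 1/sqrt(S) and snapped to the nearest
    multiple of 16, the minimum CTB size in HEVC."""
    def nearest_16(k):
        # [k]: the multiple of 16 closest to k; keep at least one CTB
        return max(16, 16 * round(k / 16))
    return nearest_16(W / math.sqrt(S)), nearest_16(H / math.sqrt(S))
```

For the 640 × 480 images of our dataset, S = 4 yields a 320 × 240 input, while S = 2 yields 448 × 336 after snapping to CTB multiples.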

3.2 Precision-Rate-Complexity Analysis

Note that S and Q are two independent control parameters. We propose to first analyze the precision-rate-complexity behaviors with respect to each individual parameter. Once we have understood and established these one-dimensional models, we proceed to establish the joint model with the two control parameters. Figure 2(a) shows the actual \( \varvec{P}(S, Q) \) curves at different S and different \( Q \). We can see that for small values of \( Q \), for example from 0 to 30, the compressed image quality is high, and the precision does not change much. When \( Q \) is larger than a threshold (e.g., 30), the precision drops exponentially. This implies that the image quality does not affect the DCNN classification performance as long as it is above a certain threshold. It also suggests that the sensor node wastes bits and energy if the image quality is already above the threshold, since an even higher image quality does not help the DCNN classification. We can further see that the \( \varvec{P}(S) \) curves follow a decreasing, near-exponential behavior. For the actual coding bit rate, Fig. 2(b) plots the actual \( \varvec{R}(S, Q) \) curves at different \( S \) and different \( Q \), whose average bit rate is 1725 Kbps. These curves show an exponentially decreasing relationship with increasing \( S \) or \( Q \). For a given encoder, its computational complexity is directly related to its encoding time. Figure 2(c) plots the actual \( \varvec{C}(S, Q) \) curves at different \( S \) and different \( Q \), whose average complexity is 258 ms. We can see that the quantization parameter \( Q \) does not affect the complexity much. As expected, the complexity decreases for smaller input images, i.e., larger sampling ratios.

Fig. 2. The actual behaviors of classification precision, coding bit rate, and complexity: (a) \( \varvec{P} (S, Q ) \) curves, (b) \( \varvec{R} (S, Q ) \) curves, (c) \( \varvec{C} (S, Q ) \) curves.

3.3 Precision-Rate-Complexity Bivariate Models

A fundamental goal of the precision-rate-complexity modeling is to solve the node-end energy minimization problem under the server-end classification precision constraint. By feeding the actual data into the constrained minimization task in (4), the distribution of the optimal control parameters can be obtained by exhaustively testing all possible \( (S, Q) \) configurations. Over all cases, Fig. 3 shows the distribution of the actual optimal \( Q \) values at different precisions, where each dot denotes an optimal \( Q \) value at its precision. It can be seen that all optimal \( Q \) values are limited to the range from \( Q \) = 24 to \( Q \) = 51. The smaller \( Q \) values, from 0 to 23, and their resulting precision, bit rate, and complexity have no influence on the optimal solution of the energy-precision optimization task, which motivates us to neglect those (S, Q) configurations so as to produce more accurate precision-rate-complexity models.

Fig. 3. The distribution of actual optimal Q values.

Based on these experiments, our curve fitting only considers the larger Q values in the range [24, 51] and all 50 possible values of S. Thus, we have 50 × 28 possible (S, Q) configurations that need to be fitted. It can be seen that the curves of actual precision and complexity also exhibit a certain linear behavior, which a first-order polynomial term can approximate. We relax the maximum value constraint in this smaller fitting space. By comparing various exponential forms and their parameters, the precision-rate-complexity bivariate models are constructed as follows:

$$ \varvec{P}(S, Q) = \beta_{p1} - \beta_{p2} \cdot e^{{\beta_{p3} \cdot Q + \beta_{p4} \cdot S}} - \beta_{p5} \cdot Q - \beta_{p6} \cdot S $$
(6)
$$ \varvec{R}(S, Q) = \beta_{r1} \cdot e^{{\beta_{r2} \cdot Q + \beta_{r3} \cdot S}} + \beta_{r4} \cdot e^{{\beta_{r5} \cdot Q + \beta_{r6} \cdot S}} $$
(7)
$$ \varvec{C}(S, Q) = \beta_{c1} \cdot e^{{\beta_{c2} \cdot Q + \beta_{c3} \cdot S}} + \beta_{c4} \cdot e^{{\beta_{c5} \cdot Q + \beta_{c6} \cdot S}} + \beta_{c7} \cdot Q + \beta_{c8} \cdot S + \beta_{c9} $$
(8)

Through curve fitting, Table 1 reports the optimal parameter values of the precision-rate-complexity bivariate models. With these fitting results, the bivariate models can be used to search for the appropriate S and Q in the energy-precision optimization task.

Table 1. The parameter values of the precision-rate-complexity bivariate models.
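For reference, the bivariate model forms (6)-(8) translate directly into code. The β parameter tuples below are placeholders for the fitted values of Table 1, which are dataset-specific, so this is a sketch of the model forms rather than a calibrated predictor:

```python
import math

def precision(S, Q, b):
    """Eq. (6): P(S, Q); b holds (beta_p1, ..., beta_p6)."""
    b1, b2, b3, b4, b5, b6 = b
    return b1 - b2 * math.exp(b3 * Q + b4 * S) - b5 * Q - b6 * S

def bitrate(S, Q, b):
    """Eq. (7): R(S, Q); b holds (beta_r1, ..., beta_r6)."""
    b1, b2, b3, b4, b5, b6 = b
    return b1 * math.exp(b2 * Q + b3 * S) + b4 * math.exp(b5 * Q + b6 * S)

def complexity(S, Q, b):
    """Eq. (8): C(S, Q); b holds (beta_c1, ..., beta_c9)."""
    b1, b2, b3, b4, b5, b6, b7, b8, b9 = b
    return (b1 * math.exp(b2 * Q + b3 * S)
            + b4 * math.exp(b5 * Q + b6 * S)
            + b7 * Q + b8 * S + b9)
```

Note that (6) combines an exponential decay with the first-order polynomial terms discussed above, while (7) is a sum of two exponentials and (8) adds a polynomial offset to capture the weak dependence of complexity on Q.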

4 Resource Allocation and Energy Minimization

In the above section, we established models to predict the encoder computational complexity \( \varvec{C}(S, Q) \), the coding bit rate \( \varvec{R}(S, Q) \), and the DCNN precision \( \varvec{P}(S, Q) \). Based on these models, we are ready to study the resource allocation problem and answer the following important question: what is the minimum energy the iWVSN node needs to spend in order to achieve the desired DCNN object classification precision at the server end? As discussed above, the iWVSN resource allocation problem can be formulated as:

$$ \min \, \varvec{E}(S, Q) = p_{\text{c}} \cdot \varvec{C}(S, Q) + e_{\text{t}} \cdot \varvec{R}(S, Q) \quad {\text{s.t.}} \quad \varvec{P}(S, Q) \ge P_{\text{T}} $$
(9)

In the above section, we obtained analytical models for the encoder complexity \( \varvec{C}(S, Q) \), the coding bit rate \( \varvec{R}(S, Q) \), and the DCNN classification precision \( \varvec{P}(S, Q) \). Since the constrained problem in (9) does not admit a simple closed-form solution, we resort to a numerical one. Specifically, with the precision-rate-complexity bivariate models, we compute the values of \( \varvec{P}(S, Q) \), \( \varvec{R}(S, Q) \), and \( \varvec{C}(S, Q) \) on a dense grid of points \( (S, Q) \). We then find the set of grid points that satisfy the precision constraint. Finally, within this set, we find the optimal \( (S, Q) \) with the minimum energy \( \varvec{E}(S, Q) \). Figure 4 shows the optimal sampling ratio \( S^{*} \) and quantization parameter \( Q^{*} \) for a given target classification precision \( P_{\text{T}} \). Each dot represents an optimal look-up-table value of \( S^{*} \) or \( Q^{*} \) for a given target precision \( P_{\text{T}} \). The jig-saw effect is caused by the fact that the quantization parameter \( Q \) has to be an integer and the input image size has to be a multiple of 16. For easy implementation in an actual system controller, we propose to approximate the optimal sampling ratio \( S^{*}(P_{\text{T}}) \) by a piece-wise linear function, shown as solid lines in Fig. 4(a), and the optimal quantization parameter \( Q^{*}(P_{\text{T}}) \) by an exponential function, shown in Fig. 4(b):

Fig. 4. The look-up-table solution and analytic solution for the energy-precision optimization: (a) the \( S^{ *} (P_{\text{T}} ) \) function; (b) the \( Q^{ *} (P_{\text{T}} ) \) function.

$$ S^{*}\left( P_{\text{T}} \right) = \left\{ {\begin{array}{*{20}c} {{\text{Round}}\left( \omega_{1} \cdot P_{\text{T}} + \omega_{2} \right), \; P_{\text{T}} < P_{0} } \\ {{\text{Round}}\left( \omega_{3} \cdot P_{\text{T}} + \omega_{4} \right), \; P_{\text{T}} \ge P_{0} } \\ \end{array} } \right. $$
(10)
$$ Q^{*}\left( P_{\text{T}} \right) = {\text{Round}}\left( \tau_{1} \cdot e^{\tau_{2} \cdot P_{\text{T}}} + \tau_{3} \cdot e^{\tau_{4} \cdot P_{\text{T}}} \right) $$
(11)

where the values of \( S^{ *} (P_{\text{T}} ) \) belong to \( \{ 1, 2, \cdots , 49, 50\} \), and the values of \( Q^{ *} (P_{\text{T}} ) \) belong to \( \{ 24, 25, \cdots , 50, 51\} \). The model parameters are listed in Table 2.

Table 2. The coefficients of analytic functions.
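The numerical solution described above amounts to a simple grid search over the candidate configurations. A sketch, assuming the three fitted bivariate models are available as callables of (S, Q), is:

```python
def optimal_config(P, R, C, P_min, p_c=0.14, e_t=2.6e-3,
                   S_range=range(1, 51), Q_range=range(24, 52)):
    """Grid search for Eq. (9): among all (S, Q) configurations that meet
    the precision constraint P(S, Q) >= P_min, return the one with the
    minimum node energy E(S, Q) = p_c * C(S, Q) + e_t * R(S, Q).

    P, R, C are the fitted bivariate models of Eqs. (6)-(8), given as
    callables of (S, Q); the Q range reflects the restriction to [24, 51]."""
    best = None
    for S in S_range:
        for Q in Q_range:
            if P(S, Q) < P_min:
                continue  # precision constraint violated; skip this point
            E = p_c * C(S, Q) + e_t * R(S, Q)
            if best is None or E < best[0]:
                best = (E, S, Q)
    return best  # (E*, S*, Q*), or None if the constraint is infeasible
```

In practice the search is run once offline to build the look-up table of Fig. 4, which the analytic approximations (10) and (11) then replace for lightweight on-node control.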

5 Experimental Results

In this section, we conduct experiments to evaluate the proposed method. Our test dataset consists of 1001 uncompressed camera-trap images. The original input image size is 640 × 480 in RGB color format. The DCNN classifier is constructed and trained using Caffe and has 5 convolutional layers followed by 3 fully connected layers [11]. The classifier has been trained and tested on original target images, which are categorized into three object classes: Human, Animal, and Background. The image sampling ratio \( S \) and quantization parameter \( Q \) jointly affect the complexity, bit rate, and object classification precision. The candidate values of \( S \) are {1, 2, 3, …, 49, 50}, and the candidate values of \( Q \) are {0, 1, 2, …, 50, 51}. For image compression, we adopt HM-16.7 main profile HEVC intra coding [12]. To translate the computational complexity into computational energy, we set the thermal design power (TDP) of the microprocessor to \( p_{\text{c}} \) = 0.14 J/ms. The transmission power \( e_{\text{t}} \) is set to 2.6 × 10−3 J/Kbps [10].

Figures 5, 6 and 7 show the estimation results of the precision-rate-complexity bivariate models obtained in the above section. Specifically, Fig. 5(a) shows the estimation results for the \( \varvec{P}(Q) \) curves at different \( S \), and Fig. 5(b) for the \( \varvec{P}(S) \) curves at different \( Q \). We can see that the model accurately captures the behavior of the actual classification precision, with R-square = 0.9548 and RMSE = 3.057%. Figure 6(a) shows the estimation results for the \( \varvec{R}(Q) \) curves at different \( S \), and Fig. 6(b) for the \( \varvec{R}(S) \) curves at different \( Q \). This rate model is very accurate, with R-square = 0.991. Figure 7(a) shows the estimation results for the \( \varvec{C}(Q) \) curves at different \( S \), and Fig. 7(b) for the \( \varvec{C}(S) \) curves at different \( Q \). The complexity model is also very accurate, with R-square = 0.997.
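The two goodness-of-fit measures quoted above can be computed in the standard way; the helper below is our own illustration, not part of the paper's toolchain:

```python
import math

def fit_quality(actual, predicted):
    """R-square and RMSE between measured values and model predictions."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # total sum of squares
    r_square = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r_square, rmse
```

R-square close to 1 indicates that the model explains nearly all the variance of the measured curves, while RMSE reports the average prediction error in the quantity's own units (percentage points for precision, Kbps for rate, ms for complexity).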

Fig. 5. The fitting results for the precision model: (a) \( \varvec{P} (Q ) \) at different \( S \); (b) \( \varvec{P} (S ) \) at different \( Q \).

Fig. 6. The fitting results for the rate model: (a) \( \varvec{R} (Q ) \) at different \( S \); (b) \( \varvec{R} (S ) \) at different \( Q \).

Fig. 7. The fitting results for the complexity model: (a) \( \varvec{C} (Q ) \) at different \( S \); (b) \( \varvec{C} (S ) \) at different \( Q \).

Figure 8(a) shows the minimum energy consumption (lines with circles) of the iWVSN node needed to achieve the target DCNN classification precision at the server end, using the precision-rate-complexity bivariate models and the proposed resource allocation. For comparison, we also include the actual minimum energy consumption (lines with crosses), obtained by brute-force search over all possible combinations of control parameters \( (S, Q) \). We can see that our analysis and optimization approach the actual optimal values. Figure 8(b) shows the operating bit rate and complexity of the iWVSN node. We can see that, if we allow a very small performance drop, for example from 97% to 96% precision, we can reduce the total energy at the iWVSN node by up to a factor of two, which is very significant.

Fig. 8. The actual optimal energy consumption vs. the model-guided energy consumption.

6 Conclusion

In this paper, we have studied the resource modeling, allocation, and optimization problem for an intelligent wireless vision sensor network which collects image samples of the targets, encodes them, and transmits the data to a cloud server for DCNN-based object classification. We developed a new framework for energy-precision analysis and optimization. Specifically, we use the HEVC intra encoder for image compression, configured with two control parameters: the image sampling ratio and the quantization parameter. Through extensive experiments, we construct precision-rate-complexity bivariate models to understand the behaviors of the HEVC intra encoder and the DCNN, and to characterize the inherent relationship between bit rate, encoding complexity, classification precision, and the two control parameters. Based on these models, we study the optimal control of the wireless vision sensor node so that the node-end energy is minimized subject to the server-end object classification precision constraint. Our experimental results demonstrate that the proposed control method is able to effectively adjust the energy consumption of the sensor node while achieving the target classification performance.