
1 Introduction

Many morphological properties, such as the distribution of muscle fibers and their nuclei with respect to the perimysium, are important biomarkers for the early diagnosis of many muscle diseases [6]. To compute these spatial morphological parameters, accurate and efficient segmentation of the perimysium is an essential prerequisite. However, muscle perimysium often shares a similar appearance with other structures in the muscle, such as endomysium, epimysium, and blood vessels. Large variations in staining intensity, global structure, and morphology further complicate the automated segmentation task.

Fig. 1. One exemplar architecture. Spatial CW-RNN and dense layer represent the proposed spatial clockwork RNN and the fully connected layer, respectively. Sweepings in different directions are illustrated in the spatial CW-RNN layer using colored arrows. Activations from the four sweepings are concatenated together in the concatenation layer as the global context information for each local patch. The mapping between the output of the dense layer and the predicted mask (overlaid on the original image) for each local patch is illustrated using brown arrows.

Recently, deep learning based methods have achieved great success in object detection and segmentation; many of these are dominated by variants of the convolutional neural network (CNN) [10]. One popular strategy is to apply a sliding window to local image patches, either for classification [5] or regression [14]. This group of methods fails to exploit global semantic information. Recurrent neural networks (RNN) and their variants (the gated recurrent unit network (GRU) [3] and the long short-term memory network (LSTM) [8]) have recently witnessed great success in modeling global semantic information in chain-structured data. Both GRU [3] and LSTM [8] were proposed to handle the exploding or vanishing gradient issue of the plain RNN, at the cost of a heavier computational load. More recently, the clockwork RNN (CW-RNN), which contains even fewer parameters than a plain RNN, was proposed in [9] and proven effective in modeling long-term dependencies. The CW-RNN separates the hidden recurrent units into different groups, each of which runs its own computation at a specific, discrete clock period. Since only a portion of the modules are active at each time step, it is more efficient than a plain RNN. Furthermore, it has been shown to outperform the RNN and even the LSTM on various tasks [9].

Enormous efforts have been devoted to applying RNNs to computer vision tasks. The work in [12] applies a GRU [3] to sweep images as chain-structured data along four different directions to model context information. Some pioneering works [2, 7] that exploit the potential of multi-dimensional RNNs for semantic image segmentation have also achieved promising results. However, the 2D plain RNN [7] suffers from the exploding or vanishing gradient problem for large images, and the 2D LSTM [2] contains many more parameters than the 2D RNN, which makes it inefficient at runtime and prone to over-fitting, especially when the amount of training data is limited.

In this paper, we propose a 2D spatial clockwork RNN (spatial CW-RNN) that extends the applicability of the chain-structured CW-RNN [9] to the 2D image domain for efficient perimysium segmentation. Our model directly exploits the 2D structure of images and encodes global context information among local image patches. Different from [2, 7], our model contains a much smaller number of parameters, which makes it computationally efficient and suitable for medical image segmentation with limited training data. Instead of conducting inefficient patch-wise classification, our algorithm integrates structured regression [14], which allows us to use non-overlapping strides in both the training and testing stages. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed model. To the best of our knowledge, this is the first work to propose a 2D spatial CW-RNN, and it achieves promising results on biomedical image segmentation.

2 Methodology

2.1 Recurrent Neural Network Revisited

The recurrent neural network (RNN) is a type of neural network equipped with recurrent connections, which enable the network to memorize past input patterns. For the simple RNN (SRNN), at each time step, its current hidden state \(\varvec{h}^t\) is a non-linear transformation of the current input \(\varvec{x}^t\) and the hidden state \(\varvec{h}^{t-1}\) from the previous step. The output \(\varvec{o}^t\) is directly connected to \(\varvec{h}^t\). Mathematically, these relationships can be expressed by the following equations:

$$\begin{aligned} \varvec{h}^t = f(\varvec{Wx}^t + \varvec{Uh}^{t-1} + \varvec{b}_h), \end{aligned}$$
(1)
$$\begin{aligned} \varvec{o}^t = g(\varvec{Vh}^t + \varvec{b}_o), \end{aligned}$$
(2)

where f(.) and g(.) represent the nonlinear activation functions; \(\varvec{W}\) and \(\varvec{U}\) are the weight matrices connecting the input units to the hidden units, and the hidden units to themselves, respectively; \(\varvec{V}\) is the weight matrix connecting the hidden units to the output units; and \(\varvec{b}_h\) and \(\varvec{b}_o\) represent the bias terms for the hidden and output layers, respectively.
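For concreteness, here is a minimal NumPy sketch of one forward step of Eqs. (1)-(2); tanh for f and the logistic sigmoid for g are example choices of ours, since the activations are left unspecified above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srnn_step(x_t, h_prev, W, U, V, b_h, b_o):
    """One time step of a simple RNN, following Eqs. (1)-(2)."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b_h)  # Eq. (1), with f = tanh
    o_t = sigmoid(V @ h_t + b_o)               # Eq. (2), with g = sigmoid
    return h_t, o_t

# Unrolling over a toy sequence of T = 4 random inputs:
rng = np.random.default_rng(0)
D, hid, out = 3, 5, 2
W, U, V = rng.normal(size=(hid, D)), rng.normal(size=(hid, hid)), rng.normal(size=(out, hid))
h = np.zeros(hid)
for x_t in rng.normal(size=(4, D)):
    h, o = srnn_step(x_t, h, W, U, V, np.zeros(hid), np.zeros(out))
```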

The SRNN is usually trained with a discriminative objective function using the back-propagation through time (BPTT) algorithm [13]. However, the gradients of the SRNN either explode or vanish when the sequence length T becomes large, which hinders the SRNN from learning long-term temporal dependencies. Instead of introducing gated connections [3, 8] that complicate the model, the clockwork RNN (CW-RNN) [9] addresses the long-term dependency issue with a clever trick. Specifically, the hidden units \(\varvec{h}\) are partitioned into M modules (\(\varvec{h}_i\) for \(i=1,\ldots,M\)), each of size k and associated with a clock (or temporal) period \(T_i\in \{T_1,\ldots,T_M\}\). The total length of the hidden units is \(hid = M \times k\). At each time step t, the neurons in module i are updated only when t satisfies \((t \bmod T_i)=0\). Units corresponding to slower rates are thus capable of preserving long-term information. In addition, the connections between hidden units are restricted so that faster modules can only receive information from slower ones, and not vice versa; this mechanism further reduces the total number of active weights.
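The clock gating can be made concrete in a few lines of Python; the exponential periods \(T_i = 2^{i-1}\) mirror the setting we adopt in Sect. 3:

```python
def active_modules(t, periods):
    """Return the indices of CW-RNN modules whose clocks fire at step t,
    i.e. those satisfying (t mod T_i) == 0; all other modules simply
    keep their previous state."""
    return [i for i, T in enumerate(periods) if t % T == 0]

periods = [2 ** i for i in range(4)]  # T_i = 2^(i-1) with M = 4 modules
for t in range(1, 9):
    print(t, active_modules(t, periods))
# t=1 -> [0]           only the fastest module updates
# t=4 -> [0, 1, 2]     slower modules fire on multiples of their period
# t=8 -> [0, 1, 2, 3]  the slowest module preserves long-term information
```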

2.2 Spatial Clockwork RNN

Since static images contain no inherent sequence, the aforementioned CW-RNN is not directly applicable to our task. To address this problem, we extend the CW-RNN to the two-dimensional domain, in which the current state receives information from its predecessors in both the row and column directions. In order for the spatial CW-RNN to process an image, all image patches need to be sorted into an acyclic sequence.

Specifically, we maintain one sub-hidden state for each of the row and column dimensions, denoted as \(\varvec{\widehat{h}}\) and \(\varvec{\widetilde{h}}\), which are composed together as the hidden state \(\varvec{H}=[\varvec{\widehat{h}},~\varvec{\widetilde{h}}]\). Denote the weight matrices connecting the current hidden state to its row and column predecessors as \(\varvec{\widehat{U}}\) and \(\varvec{\widetilde{U}}\), respectively, each split into four \(hid \times hid\) block matrices; the matrix \(\varvec{W}\) connecting the input units to the hidden units is partitioned into two \(input\_dim \times hid\) block-columns; the bias \(\varvec{b}\) is also evenly separated into two groups:

$$\begin{aligned} \varvec{\widehat{U}}= \left( \begin{array}{ll} \varvec{\widehat{U}}^{(1,1)} &{} \varvec{\widehat{U}}^{(1,2)} \\ \varvec{\widehat{U}}^{(2,1)} &{} \varvec{\widehat{U}}^{(2,2)} \end{array} \right) , \quad \varvec{\widetilde{U}} = \left( \begin{array}{ll} \varvec{\widetilde{U}}^{(1,1)} &{} \varvec{\widetilde{U}}^{(1,2)}\\ \varvec{\widetilde{U}}^{(2,1)} &{} \varvec{\widetilde{U}}^{(2,2)} \end{array} \right) , \quad \varvec{W} = \left( \begin{array}{ll} \varvec{W}^1 &{} \varvec{W}^2 \end{array} \right) , \quad \varvec{b} = \left( \begin{array}{ll} \varvec{b}^1 &{} \varvec{b}^2 \end{array} \right) . \end{aligned}$$

Each block matrix \(\varvec{\widehat{U}}^{(m,n)}\) and \(\varvec{\widetilde{U}}^{(m,n)}\) is further partitioned into \(M\times M\) smaller block matrices of equal size \(k\times k\), and \(\varvec{W}^m\) and \(\varvec{b}^m\) are partitioned into M block-columns as well:

$$\begin{aligned} \varvec{\widehat{U}}^{(m,n)}= \left( \begin{array}{lll} \varvec{\widehat{U}}^{(m,n)}_{(1,1)} &{} \cdots &{} \varvec{\widehat{U}}^{(m,n)}_{(1,M)} \\ \vdots &{} \ddots &{} \vdots \\ \varvec{\widehat{U}}^{(m,n)}_{(M,1)} &{} \cdots &{} \varvec{\widehat{U}}^{(m,n)}_{(M,M)} \end{array} \right) , \quad \varvec{W}^m = \left( \begin{array}{lll} \varvec{W}^m_{1} &{} \cdots &{} \varvec{W}^m_{M} \end{array} \right) , \quad \varvec{b}^m = \left( \begin{array}{lll} \varvec{b}_{1}^m &{} \cdots &{} \varvec{b}_{M}^m \end{array} \right) , \end{aligned}$$

with \(\varvec{\widetilde{U}}^{(m,n)}\) partitioned in exactly the same manner. Recall that each sub-hidden state \(\varvec{\widehat{h}}\) and \(\varvec{\widetilde{h}}\) is partitioned into M modules, each running at a specific temporal rate. Denote \(i,j \in \{1,\ldots,M\}\) as the module indices, \(u \in \{1,2\}\) as the matrix identifier, and (r, c) as the time step. For brevity, we define the following general matrix placeholders: \(\varvec{\mathcal {H}}^{(r,c)}_i = \left( \varvec{\widehat{h}}^{(r, c)}_i ~~ \varvec{\widetilde{h}}^{(r, c)}_i\right) \), \( \varvec{\mathcal {\widehat{U}}}^{u}_{ij}= \left( \begin{array}{l} \varvec{\widehat{U}}^{(u,1)}_{ij} \\ \varvec{\widehat{U}}^{(u,2)}_{ij} \end{array} \right) \), and \( \varvec{\mathcal {\widetilde{U}}}^{u}_{ij}= \left( \begin{array}{l} \varvec{\widetilde{U}}^{(u,1)}_{ij} \\ \varvec{\widetilde{U}}^{(u,2)}_{ij} \end{array} \right) \). The update rule for the i-th module of \(\varvec{\widehat{h}}\) (the case of \(\varvec{\widetilde{h}}\) is analogous) at time step (r, c) is given as:

$$\begin{aligned} \varvec{\widehat{h}}^{(r,c)}_{i}= {\left\{ \begin{array}{ll} f\Big (\varvec{x}^{(r,c)}\varvec{W}_i^1 + \mathop \sum \limits _{j = i}^M\left( \varvec{\mathcal {H}}^{(r-1,c)}_j \varvec{\mathcal {\widehat{U}}}^{1}_{ij} + \varvec{\mathcal {H}}^{(r,c-1)}_j \varvec{\mathcal {\widetilde{U}}}^{1}_{ij} \right) + \varvec{b}^1_i \Big ) &{} \text{ if } (r \bmod T_i) = 0 , \\ \varvec{\widehat{h}}^{(r-1,c)}_{i} &{} \text{ otherwise }. \end{array}\right. } \end{aligned}$$
(3)

Note that the aforementioned method only considers the 4-connected neighborhood; that is, every patch only receives information from its left, right, upper, and lower adjacent patches. It is, however, straightforward to extend our method to the 8-connected neighborhood. Both cases are evaluated in the experiments, and a simplified sketch of one directional sweep is given below.
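To make the recursion of Eq. (3) concrete, the following NumPy sketch performs a single sweep from the upper-left corner. For readability, the block-partitioned weights of this section are folded into dense matrices and the slow-to-fast connectivity constraint is omitted, so this is a simplification under those assumptions, not the exact parameterization:

```python
import numpy as np

def sweep_upper_left(X, W, U_row, U_col, b, periods, k):
    """One directional sweep of the spatial CW-RNN, following Eq. (3).

    X:       (R, C, D) grid of patch features.
    W:       (D, hid) input weights; U_row / U_col: (hid, hid) weights
             to the row / column predecessors.
    periods: clock period T_i of each of the M modules; module i owns
             hidden units [i*k, (i+1)*k), so hid = M * k.
    """
    R, C, _ = X.shape
    hid = W.shape[1]
    H = np.zeros((R + 1, C + 1, hid))   # zero padding for the borders
    for r in range(1, R + 1):
        for c in range(1, C + 1):
            h_new = np.tanh(X[r - 1, c - 1] @ W
                            + H[r - 1, c] @ U_row   # row predecessor
                            + H[r, c - 1] @ U_col   # column predecessor
                            + b)
            h = H[r - 1, c].copy()      # default: carry the old state over
            for i, T in enumerate(periods):
                if r % T == 0:          # module i's clock fires at this row
                    h[i * k:(i + 1) * k] = h_new[i * k:(i + 1) * k]
            H[r, c] = h
    return H[1:, 1:]                    # (R, C, hid) activations

# Example: a 6 x 8 grid of 3-dimensional patch features with M = 4, k = 2.
rng = np.random.default_rng(0)
R, C, D, M, k = 6, 8, 3, 4, 2
X = rng.normal(size=(R, C, D))
W, U_row, U_col = (0.1 * rng.normal(size=s)
                   for s in [(D, M * k), (M * k, M * k), (M * k, M * k)])
H = sweep_upper_left(X, W, U_row, U_col, np.zeros(M * k),
                     [2 ** i for i in range(M)], k)
```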

2.3 Structured Prediction with Full Sweeping

Due to the temporal dependency property of the spatial CW-RNN, each local patch only receives context information from the region spanned by its predecessors. However, in 2D images, each local patch is surrounded by both its predecessors and its successors; we thus want the model to be aware of such bi-directional context information. To this end, we sweep the input image (or feature map) from four different corners (upper-left, lower-left, upper-right, lower-right) to the opposite corners. For each local image patch, the activations from the four directional sweepings are concatenated together as the full-context representation, which is fed to the successive layers to produce the final prediction. This process is illustrated in Fig. 1.

Now, we omit the module index i in \(\varvec{\mathcal {H}}^{(r,c)}\) and define \(\varvec{\mathcal {H}}^{(r,c)}_{\searrow },\varvec{\mathcal {H}}^{(r,c)}_{\swarrow }, \varvec{\mathcal {H}}^{(r,c)}_{\nearrow }\), and \(\varvec{\mathcal {H}}^{(r,c)}_{\nwarrow }\) as the total hidden activations (containing all the modules) of each directional sweeping at time step (r, c). The output \(\varvec{\mathcal {O}}^{(r,c)}\) after applying one dense layer to the concatenated features can be computed as:

$$\begin{aligned} \varvec{\mathcal{O}}^{(r,c)} = f\Big (\mathop \sum \limits _{d'} \varvec{\mathcal{H}}_{d'}^{(r,c)}{{\varvec{W}}_{d'}} + {\varvec{b}}\Big ), \end{aligned}$$
(4)

where \(d' \in \{{\searrow }, {\swarrow }, {\nearrow }, {\nwarrow }\}\) denotes the sweeping direction. Note that the dense layer is applied individually at every time step, and local patches corresponding to different time steps share the same weights \(\varvec{W}_{d'}\).
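A minimal NumPy sketch of this fusion step (Eq. (4)); tanh again stands in for the unspecified activation f:

```python
import numpy as np

def combine_sweeps(sweeps, W_dirs, b):
    """Eq. (4): fuse the four directional activations with one dense
    layer whose weights are shared across all time steps.

    sweeps: four (R, C, hid) arrays, one per sweeping direction;
    W_dirs: four (hid, out_dim) matrices, one per direction.
    """
    pre = sum(H @ W for H, W in zip(sweeps, W_dirs)) + b
    return np.tanh(pre)  # f = tanh (example choice); shape (R, C, out_dim)
```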

Consider a set of training data \(\{(\varvec{X}_i,\varvec{Y}_i)\}_{i=1}^{{N}}\), where N is the total number of training samples, \(\varvec{X}_i\) is the i-th training image, and \(\varvec{Y}_i\) is the corresponding mask label. Let \(R_i\) and \(C_i\) denote the total numbers of local patches in the row and column dimensions for the i-th pair of training data. Denote \(\varvec{\Theta }\) as the model's parameters and \(\psi \) as our model. The objective function defined on \(\{(\varvec{X}_i,\varvec{Y}_i)\}\) is given by:

$$\begin{aligned} \mathcal {L}(\psi (\varvec{X}_i; \varvec{\Theta }), {\varvec{Y}_i})&= \frac{1}{2}\mathop \sum \limits _{r=1}^{R_i}\mathop \sum \limits _{c=1}^{C_i}\left\| \varvec{Y}_i^{(r,c)} - \varvec{\mathcal {O}}_i^{(r,c)} \right\| _2^2 , \end{aligned}$$
(5)

where both \(\varvec{Y}_i^{(r,c)}\) and \(\varvec{\mathcal {O}}_i^{(r,c)}\) are reshaped into vectors to compute the loss.
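As a small illustration, assuming the ground-truth mask has been cut into the same non-overlapping \(p \times p\) patches as the input, the loss of Eq. (5) reduces to a few lines:

```python
import numpy as np

def structured_loss(Y, O):
    """Eq. (5): half the summed squared error over all patch positions.

    Y: (R, C, p, p) ground-truth mask patches;
    O: (R, C, p*p) predicted patch vectors.
    """
    R, C = O.shape[:2]
    Y_vec = Y.reshape(R, C, -1)  # reshape each patch into a vector
    return 0.5 * np.sum((Y_vec - O) ** 2)
```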

Our proposed spatial CW-RNN is inherently capable of capturing semantic information over the entire image. Meanwhile, it is fully end-to-end trainable and can be optimized using the standard BPTT algorithm [13]. It takes an input image of any size and produces a result mask of the same size as the input.

3 Experimental Results

Dataset and Implementation Details: The proposed spatial CW-RNN has been extensively evaluated on 348 H&E stained skeletal muscle microscopy images (each roughly \(300\times 600\) pixels). All images were manually annotated and double-checked by two neuromuscular pathologists. In total, 150 images are chosen for testing and the rest for training. Both qualitative and quantitative results are reported. The detailed architecture of our method is summarized in Table 1. The first layer is a dense layer, and the next four spatial CW-RNN layers sweep the input feature map in four different directions. The size of the non-overlapping patches is set to \(10\times 10 \times 3\). The model is trained using the RMSprop algorithm with a learning rate of 0.003. M and k in Sect. 2.2 are 4 and 48, respectively. The clock periods are set to an exponential series: \(T_i = 2^{i-1}\). Our model is implemented in Python with Theano [1] and Keras [4]. The experiments are performed on a PC equipped with an Intel Xeon E5-1650 CPU and an NVIDIA Quadro K4000 GPU.

Table 1. The network architecture. Dense represents the fully connected layer applied individually to every time step. SCR represents the spatial CW-RNN, where the arrow indicates the sweeping direction. The Inputs row specifies the layer ID of each layer's inputs. Layer 7 takes the concatenation of the outputs from layers 3, 4, 5, and 6 as input.

Evaluation metrics: Denote \(m_{ij}\) as the number of pixels of class i labeled as class j, and \(t_i = \sum _j{m_{ij}}\) as the total number of pixels of class i. The following metrics are computed (IU represents region intersection over union); a sketch computing them appears after the list:

1. Mean accuracy (MA): \((1/2)\sum _i{m_{ii}/t_i}\)

2. Average IU (AIU): \((1/2)\sum _i({m_{ii}/(t_i + \sum _j{m_{ji}} -m_{ii})})\)

3. Weighted IU (WIU): \((1/\sum _it_i)\sum _i(t_i{m_{ii}/(t_i + \sum _j{m_{ji}} -m_{ii})})\)

4. Precision (P), recall (R), and \(\text {F}_1\) score.
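As referenced above, these metrics can be computed directly from a confusion matrix. The following NumPy sketch handles the two-class setting, with the convention (ours, not fixed by the text) that class 1 is perimysium:

```python
import numpy as np

def segmentation_metrics(m):
    """Compute MA, AIU, WIU, precision, recall, and F1 from a 2 x 2
    confusion matrix m, where m[i, j] counts the pixels of class i
    labeled as class j (class 1 taken to be perimysium)."""
    t = m.sum(axis=1)                        # t_i: total pixels of class i
    iu = np.diag(m) / (t + m.sum(axis=0) - np.diag(m))
    ma = np.mean(np.diag(m) / t)             # mean accuracy
    aiu = np.mean(iu)                        # average IU
    wiu = np.sum(t * iu) / t.sum()           # weighted IU
    tp, fp, fn = m[1, 1], m[0, 1], m[1, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return ma, aiu, wiu, precision, recall, f1
```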

Comparison with Other Works: We compare our method with several variants of other deep learning based frameworks, e.g., the multi-layer perceptron (MLP) and the convolutional neural network (CNN). The detailed performance comparison is given in Table 2. SCW-RNN(4) and SCW-RNN(8) denote the proposed method with the 4- and 8-connected neighborhoods, respectively. CNN-nips is the well-known architecture used in [5] to segment neuronal membranes, which consists of 4 convolutional layers and 4 max-pooling layers followed by two fully connected layers. This network uses a large input window size \((95\times 95)\) to capture context information. We also compare our method with U-NET [11], an end-to-end CNN architecture. To demonstrate our method's capability of handling spatial context information, a plain MLP network that shares a similar architecture with our model, denoted as MLP-10, is considered for comparison as well. We also try a larger window size (\(48 \times 48\)) for the MLP network, denoted as MLP-48.

Fig. 2. Perimysium segmentation results on three challenging skeletal muscle images, which exhibit strong global structure and considerable appearance similarity between perimysium (true positive) and endo-/epimysium (false positive). Compared with other methods, our results show much better global consistency because our model captures global spatial configurations.

Table 2. Quantitative comparison of muscle perimysium segmentation results. T represents the average running time (measured in seconds).

As shown in Table 2, both versions of the proposed method, SCW-RNN(4) and SCW-RNN(8), achieve the best overall performance. The utilization of more spatial context information in SCW-RNN(8) leads to a performance improvement over SCW-RNN(4), especially in terms of recall and \(\text {F}_{1}\) score. MLP-10, which does not consider such spatial context information across local patches, produces many false positives, as evidenced by its low precision and \(\text {F}_{1}\) score. MLP-48, which has a larger receptive field, outperforms MLP-10 by a large margin. CNN-nips, which uses a very large window size (\(95 \times 95\)), achieves results comparable to ours, but it runs almost 100 times slower than our method. Although for certain architectures fast scanning can be utilized to remove redundant convolution computations, it is not applicable to our case, which conducts patch-wise normalization. U-NET [11], which does not rely on patch-based testing, is also very efficient, but it produces a much lower \(\text {F}_{1}\) score and AIU than ours; one possible reason is that we do not apply aggressive data augmentation in any of our experimental settings.

For qualitative comparison, some challenging images with segmentation results overlaid on the original images are shown in Fig. 2. It can be observed that our method produces the most accurate results with much better global consistency. This provides further evidence that our proposed spatial CW-RNN has a strong capability to learn global context information, which is the key to differentiating perimysium from endomysium, epimysium, and blood vessels.

4 Conclusion

In this paper, we propose a novel 2D spatial clockwork recurrent neural network and pave the way for utilizing RNN architectures to process 2D biomedical image data. Our spatial CW-RNN is fully end-to-end trainable and capable of encoding global context information into the features of each local image patch, which substantially improves performance. In addition, we utilize structured output prediction for each local image patch, making the model efficient in both training and testing.