
1 Introduction

Convolutional neural networks (CNNs) are the go-to model for most prediction-based computer vision problems. However, most popular CNNs are treated as black boxes, lacking interpretability and simple properties concerning the data domains they act on. For instance, in 3D object recognition, we know that object categories are invariant to object pose, but convolutional neural network filters are orientation, scale, reflection, and parity (point reflection) selective. This means that every activation in any intermediate layer is sensitive to local pose, and ultimately the global output of the network is too. A simple solution to obtain this sought-after invariance is to augment the input data with transformed copies, spanning all possible variations to which we seek to be invariant [2]. This method is simple and effective, but relies on an efficient and realistic data augmentation pipeline. There is also the argument: why should we bother learning these invariances if we can enforce them a priori? If successful, we would not need as much training data [8, 50]. Indeed, convolutional neural networks already have (i) filter locality and (ii) translational weight-tying built directly into their architectures, properties which could arguably be learned by a multilayer perceptron given a large enough computational budget and training set.

We introduce a CNN architecture that is linearly equivariant (a generalization of invariance defined in the next section) to 3D rotations about patch centers. To the best of our knowledge, this paper provides the first example of a group-CNN [8] with linear equivariance to 3D rotations and 3D translations of voxelized data. By exploiting the symmetries of the classification task, we are able to reduce the number of trainable parameters using judicious weight tying. We also need less training- and test-time data augmentation, since some aspects of 3D geometry are already ‘hard-baked’ into the network. We demonstrate state-of-the-art or comparable performance on (i) the ModelNet10 classification challenge, a standard 3D classification benchmark, and (ii) the ISBI 2012 connectome segmentation benchmark, a 3D anisotropic boundary segmentation problem. We have released our code at https://deworrall92.github.com.

2 Background

For completeness, we set out our terminology and definitions. We outline definitions of linear equivariance, invariance, groups, and convolution, and then combine these ideas into the group convolution, which is the workhorse of the paper. These definitions are not our contribution and can be found in textbooks such as [7], but we have tried to standardize them and simplify notation.

Definition 1

(Equivariance). Consider a set of transformations G, where individual transformations are indexed as \(g\in G\). Consider also a function or feature map \(\varvec{\varPhi }: \mathcal {X} \rightarrow \mathcal {Y}\) mapping inputs \(\mathbf x \in \mathcal {X}\) to outputs \(\mathbf y \in \mathcal {Y}\). Transformations can be applied to any \(\mathbf x \in \mathcal {X}\) using the operator \(\mathcal {T}_g^{\mathcal {X}}: \mathcal {X} \rightarrow \mathcal {X}\), so that \(\mathbf x \mapsto \mathcal {T}_g^{\mathcal {X}}[\mathbf x ]\). The same can be done for the outputs with \(\mathbf y \mapsto \mathcal {T}_g^{\mathcal {Y}}[\mathbf y ]\). We say that \(\varvec{\varPhi }\) is equivariant to G if

$$\begin{aligned} \varvec{\varPhi }(\mathcal {T}_g^{\mathcal {X}}[\mathbf x ]) = \mathcal {T}_g^{\mathcal {Y}}[\varvec{\varPhi }(\mathbf x )], \qquad \forall g \in G. \end{aligned}$$
(1)

Since \(\mathcal {T}_g^{\mathcal {X}}\) and \(\mathcal {T}_g^{\mathcal {Y}}\) are related via (1), they are essentially different representations of the same transformation. Due to this connection, it is customary to drop the \(\mathcal {T}_g^\bullet \) notation and write

$$\begin{aligned} \varvec{\varPhi }(g\mathbf x ) = g\varvec{\varPhi }(\mathbf x ). \end{aligned}$$
(2)

Equivariance is important because it highlights an explicit relationship between input transformations and feature-space transformations, which in the context of deep learning is not well understood. An example of an equivariant task is pose detection, where g represents the sought-after pose. The kind of equivariant feature maps we are interested in are those where \(\mathcal {T}^{\mathcal {X}}\) and \(\mathcal {T}^\mathcal {Y}\) are linear. Such feature maps are known as linearly equivariant. A special case of equivariance is invariance, where we have

$$\begin{aligned} \varvec{\varPhi }(\mathbf x ) = \varvec{\varPhi }(g\mathbf x ), \end{aligned}$$
(3)

that is, the feature-space transformation is just the identity. An example of an invariant task is object classification. Note that when we use the term equivariant in the rest of the paper, we will generally be referring to the non-invariant case.

Groups. Invertible transformations are members of a class of mathematical objects called groups. Groups are a mathematical abstraction used to describe the compositional structure of mathematical operators, such as transformations. Groups have four main properties: for group elements \(f,g,h\in G\)

  1. closure: chained transformations are transformations, e.g. \(fg \in G\)

  2. associativity: \(f(gh) = (fg)h = fgh\)

  3. identity: there exists a transformation \(e\in G\) (sometimes written \(\mathbf 0 \)) such that \(eg = ge = g, \forall g\in G\)

  4. invertibility: every transformation g has an inverse \(g^{-1}\), so \(gg^{-1} = g^{-1}g = e\).

Rotations and translations are both examples of groups.

Convolution. The fundamental operation in convolutional neural networks is the convolution \(\star \)—technically CNNs perform cross-correlation, but we stick with the term ‘convolution’ to remain in sync with the literature. In 3D, convolution is the inner product of a filter \(\mathbf W \in \mathbb {R}^{h\times w \times d}\) with patches extracted from an activation tensor or feature map \(\mathbf F \in \mathbb {R}^{H \times W \times D}\), where h, w, d and H, W, D are the height, width, and depth of the filter and activations, respectively. The method of patch extraction is usually a translationally sliding window. So given a filter \(\mathbf W \), the translated version is \(g\mathbf W \), such that

$$\begin{aligned}{}[ \mathbf F \star \mathbf W ]_g = \sum _\mathbf{x \in \mathbb {Z}^3} [g\mathbf W ]_\mathbf{x } \mathbf F _\mathbf{x } = \sum _\mathbf{x \in \mathbb {Z}^3} \mathbf W _{g^{-1}\mathbf x } \mathbf F _\mathbf{x }; \end{aligned}$$
(4)

where to index elements of the filters/activations we have used the multi-index notation \(\mathbf W _\mathbf x := \mathbf W _{x,y,z}\) for \(\mathbf x = [x,y,z]^\top \in \mathbb {Z}^3\), and so in this example \(\mathbf W _{g^{-1}\mathbf x } = \mathbf W _{x - g_x, y - g_y, z - g_z}\) for voxel-wise translation in 3D by \(g = [g_x, g_y, g_z]^\top \). This sliding-window interpretation of convolution can be viewed as applying the same filter to different local regions of the input. Note that in reality, since the feature map is zero outside of a certain neighborhood, we need not sum over all of \(\mathbb {Z}^3\). Note also how the output of the convolution is indexed by the transformation parameter g; that is, the gth activation corresponds to the response of a g-shifted filter \(g\mathbf W \). We have used the notation \([\mathbf F \star \mathbf W ]_g\) to emphasize that \([\mathbf F \star \mathbf W ]\) is an indexable object like \(\mathbf W \) or \(\mathbf F \), and it can be viewed as a vector (see Fig. 1). CNNs usually have multiple channels k per activation tensor, so in general we really have

$$\begin{aligned}{}[\mathbf F \star \mathbf W ]^k_g = \sum _{i=1}^I \sum _\mathbf{x \in \mathbb {Z}^3} [g\mathbf W ]_\mathbf{x }^{ik} \mathbf F _\mathbf{x }^i, \end{aligned}$$
(5)

where the dummy index i is over input channels with output channel k.

Fig. 1.

(Best viewed in color) On the left we show the standard 2D convolution of Eq. 4 between a sliding filter \(\mathbf W \) and an input patch \(\mathbf F \). On the right we show the 2D right-angle rotation convolution (called \(Z_4\)-convolution) acting on an input where \(G=\mathbb {Z}^2\).

One can show (c.f. [8, 11] and Eqs. 8 and 9) that the standard translational convolution is equivariant to translations; that is, translations of the input to the convolution result in translations in the feature space representation \([\mathbf F \star \mathbf W ]\). The extension of this translational equivariance to other groups of transformation is embodied in the group convolution [8], which we show next. This has been proven [29] to be the only operator which is equivariant to (compact) group-structured transformations.

Definition 2

(Group Convolution). A group convolution between a filter \(\mathbf W \) and a single-channel feature map \(\mathbf F \) over a group of transformations G is

$$\begin{aligned}{}[\mathbf F \star \mathbf W ]_g = \sum _{h \in G} [g\mathbf W ]_{h} \mathbf F _{h} = \sum _{h \in G} \mathbf W _{g^{-1}h} \mathbf F _{h}. \end{aligned}$$
(6)

The extension to multichannel activations parallels Eq. (5).

We see that the main difference between the standard convolution of Eq. 4 and the group convolution of Eq. 6 is that the domain of summation has changed from \(\mathbb {Z}^3\) to the group G. So the sliding inner product can generalize to a sliding-and-rotating inner product, a sliding-and-flipping inner product, or even a sliding-and-scaling inner product, depending on the choice of group G. A simple example is shown in Fig. 1, where we show a 2D translational convolution and a first-layer 2D right-angle rotational convolution (called \(Z_4\)-convolution). In this example, the domain of the \(Z_4\)-convolution is \(G=\mathbb {Z}^2\), the standard 2D image domain, but the output is over the group of four 2D rotations, \(Z_4\). This amounts to taking an inner product of the kernel \(\mathbf W \) rotated four times, with each individual response being stacked into a vector. If we were to then convolve a kernel over the response of this first \(Z_4\)-convolution, the domain of that convolution would be \(G=Z_4\). Stacks of group convolutions turn out to be equivariant as well.
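For concreteness, the following is a minimal NumPy/SciPy sketch of the first-layer \(Z_4\)-convolution of Fig. 1: the filter is rotated four times and each rotated copy is correlated with the input, with the four response maps stacked along a new rotation dimension. This is an illustrative sketch rather than our released implementation, and the helper names are ours.

```python
import numpy as np
from scipy.signal import correlate2d

def z4_conv_first_layer(F, W):
    """First-layer Z4 group convolution on a 2D input F (cf. Fig. 1, right).

    Each of the four 90-degree rotations of the filter W is correlated with F,
    and the responses are stacked so the output is indexed by (g, x, y).
    """
    responses = []
    for k in range(4):                          # the four rotations in Z_4
        W_rot = np.rot90(W, k)                  # rotate the filter, not the input
        responses.append(correlate2d(F, W_rot, mode='same'))
    return np.stack(responses)                  # shape (4, H, W)

F = np.random.randn(8, 8)
W = np.random.randn(3, 3)
print(z4_conv_first_layer(F, W).shape)          # (4, 8, 8)
```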

Note that the dimensionality of the convolutional response is linear in the number of elements of the group G. It is common to choose the size of the group to be the same at each layer, or smaller if we include pooling. To maintain a transformation-invariant output, we average over the group at the final layer of the network, which is an extension of global average pooling to groups.

In this paper, we are interested in the group of 3D roto-translations. The group convolution for this group will involve us convolving an activation tensor with rotated and shifted copies of a filter \([g\mathbf W ]_\mathbf{x } = \mathbf W _{g^{-1}\mathbf x } = \mathbf W _\mathbf{R _g^{-1}\mathbf x - \mathbf z _g}\), where \(\mathbf R _g\) is a 3D rotation matrix and \(\mathbf z _g\) is a translational offset.

3 Related Work

Recently there has been an explosion of interest into CNNs with predefined transformation equivariances, beyond translation [8, 9, 11, 14,15,16, 18, 19, 22, 25, 26, 28, 29, 31, 33, 36, 42, 48,49,50, 55]. However, with the exception of Cohen and Welling [9] (projections on sphere), Kondor [28] (point clouds), and Thomas et al. [48] (point clouds), these have mainly focused on the 2D scenario. There are also examples of CNNs, which have explicit regularization to learn equivariance [30, 40, 43, 51]. To the best of our knowledge, we are the first to develop a 3D rotation equivariant CNN architecture for voxelized data.

Handcrafted Equivariance. There are many computer vision models that exhibit equivariance properties. Perhaps the first notable instance is the scale-space [13], which specifically displays equivariance to isotropic scale, later extended to affine equivariance by Lindeberg [34]. In the presence of continuous transformations, Freeman and Adelson famously [17] (and less famously Lenz [32]) shored up the theory of steerable filters: sets of bandlimited linear filters \(\mathbf w _\theta \in \mathbb {R}^{H\times W}\) that can be synthesized exactly at any rotation \(\theta \) as a finite linear combination of basis filters

$$\begin{aligned} w_\theta (\mathbf x ) = \sum _{n=1}^N \alpha _n(\theta ) \phi _n(\mathbf x ). \end{aligned}$$
(7)

These are attractive because their expressiveness is controlled by the number of coefficients N, rather than the spatial size of the filter. These have been applied to scale-spaces/pyramids in Simoncelli et al. [44], and have been placed on firm theoretical ground by Teo [47] in his PhD thesis. It has also been shown that for certain transformations, such as scalings (or more generally non-compact groups), exact steering is only possible if \(N = \infty \). In this case, Perona [37] showed that he could approximate Eq. 7 using an SVD formulation. Like our method, all these works display handcrafted linear equivariance to a predefined set of transformations.
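As a small worked example of Eq. 7, the first-order derivative-of-Gaussian filter is steerable with \(N=2\) basis filters, \(w_\theta = \cos (\theta )\, G_x + \sin (\theta )\, G_y\), the classic case from Freeman and Adelson [17]. The sketch below is purely illustrative; the filter size and scale are arbitrary choices.

```python
import numpy as np

def gaussian_derivative_basis(size=9, sigma=2.0):
    """Basis filters phi_1, phi_2 of Eq. 7: x- and y-derivatives of a Gaussian."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    G = np.exp(-(X ** 2 + Y ** 2) / (2 * sigma ** 2))
    return -X * G, -Y * G

def steer(theta, Gx, Gy):
    """Synthesize w_theta with coefficients alpha_1 = cos(theta), alpha_2 = sin(theta)."""
    return np.cos(theta) * Gx + np.sin(theta) * Gy

Gx, Gy = gaussian_derivative_basis()
w45 = steer(np.pi / 4, Gx, Gy)   # the filter synthesized exactly at 45 degrees
```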

2D Rotation Invariant Neural Networks. For CNNs, as mentioned, most works have focussed on 2D rotations. Fasel and Gatica-Perez [16], Laptev et al. [31], and Gonzalez et al. [19] average classifier predictions on multiple rotated copies of an input. Sifre and Mallat [42] and Oyallon and Mallat [36] use a scattering network [5] for roto-translation invariant classification. Every layer of these networks is locally (patch-wise) rotation invariant, performing a pre-determined wavelet transform and averaging responses over rotation. Cotter and Kingsbury [12] recently suggested, however, that these networks lack discriminative power, partially from the phase removal and partially because the wavelet transforms are not optimized per task, something our method can handle.

2D Rotation Equivariant Neural Networks. Henriques and Vedaldi [22] and Esteves et al. [15] perform a log-polar transform of the input, which converts scalings and rotations about a single point into a translation. Applying a standard translation equivariant CNN to this representation is then equivariant to rotations and scalings about the image center. This is only equivariant to global rotations, and does not generalize to 3D. Among locally equivariant methods, Dieleman et al. [14] maintain multiple rotated feature maps at every layer of a network, whereas Cohen and Welling [8] rotate the filters. In the same paper, Cohen and Welling also extended this method to finite groups and later generalized it to arbitrary compact groups in [11]. Worrall et al. [50] generalized the filter rotation method to continuous rotations, using circular Fourier transforms to compute continuous rotation responses with a finite number of filters. At the same time Zhou et al. [55] extended the filter rotation method to non-\(90^\circ \) rotations using bilinear interpolation. Gonzalez et al. [18] do similarly, but also pool over rotations and use a representation similar to [50]. Weiler et al. [49] have, so far, the best solution for rotating filters, using steerable filters to solve the interpolation problem. Our method can be seen as an instance of Cohen and Welling [8] adapted to 3D rotation and translation.

Deeply Learned Equivariance. There are many papers which also focus on learning equivariance. Tangent Prop by Simard et al. [43] is a classic example of an invariance-inducing regularizer. Hinton et al. [23] introduced the transforming autoencoder to build latent spaces with equivariant structure. More recently, Worrall et al. [51] extended this method by imposing explicit transformation rules on the latent space. Papers such as InfoGAN by Chen et al. [6] and the Deep Convolutional Inverse Graphics Network of Kulkarni et al. [30] seek to learn equivariant structure in an unsupervised fashion. Most recently Sabour et al. [40] and Hinton et al. [24] achieved highly impressive results on the MNIST dataset with capsule networks by learning approximations to affine equivariance. While these methods are very flexible, they require lots of training data.

3D Methods. For classification, the most straightforward CNNs operating on 3D voxel data use 3D convolutions as in Eq. 4, such as Maturana and Scherer [35], or a 3D Convolutional Deep Belief Network as in Wu et al. [53]. Brock et al. [4] take this to the extreme, designing an ensemble of six 45-layer deep inception- and resnet-style networks trained with a lot of data augmentation and rotation averaging. Sedaghat et al. [41] rely less on brute force, augmenting the prediction task with orientation estimation. For 3D rotation equivariant methods, Cohen and Welling introduce the Spherical CNN [10], which operates on images projected onto the sphere, while Kondor [28] and Thomas et al. [48] operate on point clouds. All three methods use variants of a 3D extension of Worrall et al. [50], which introduced continuous rotation equivariance into CNNs by use of the shifting property of Fourier transforms.

4 Method

We have introduced the concept of groups as a way to model transformations, and as a way to extend standard convolution to these transformations. Here, we chart out three different discrete 3D rotation groups; namely, Klein’s four-group, the tetrahedral group, and the cube group. We then show how to apply these groups in a group-equivariant CNN using Cayley tables, building three different 3D rotation equivariant CNNs. We do not consider equivariance to continuous 3D rotations in this paper, leaving it for future work.

Cube Group. The set of all right-angle rotations of a cubic filter \(\mathbf F _\mathbf x \in \mathbb {R}^{N\times N\times N}\) forms a group. There are 24 such rotations, which go by the name of the cube group \(S_4\). Each of the 24 rotations applied to a cube is shown in Fig. 2. The group is non-commutative, so \(\mathbf F _{(g_1g_7)^{-1}\mathbf x } \ne \mathbf F _{(g_7g_1)^{-1}\mathbf x }\) for rotations \(g_1\) and \(g_7\), for example.
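One way to enumerate these 24 rotated copies of a cubic filter is to orient the cube so that each of its six faces in turn points ‘up’, and then apply the four spins about the vertical axis. The NumPy sketch below is illustrative, not our released implementation.

```python
import numpy as np

def cube_rotations(F):
    """Enumerate the 24 right-angle rotations of a cubic array F (the cube group).

    First move each of the 6 faces of the cube to the 'top' (along axis 2),
    then apply the 4 spins about that axis: 6 x 4 = 24 rotations in total.
    """
    def spins(A):
        return [np.rot90(A, k, axes=(0, 1)) for k in range(4)]

    poses = [F,
             np.rot90(F, 2, axes=(0, 2)),
             np.rot90(F, 1, axes=(0, 2)),
             np.rot90(F, 3, axes=(0, 2)),
             np.rot90(F, 1, axes=(1, 2)),
             np.rot90(F, 3, axes=(1, 2))]
    return [R for P in poses for R in spins(P)]

F = np.arange(27).reshape(3, 3, 3)                     # a 'generic' cube: all entries differ
rotations = cube_rotations(F)
assert len({R.tobytes() for R in rotations}) == 24     # all 24 copies are distinct
```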

Fig. 2.

(Best viewed in color) Left: The 24 rotations of the cube group \(S_4\), applied to a cube \(\mathbf F _\mathbf x \), are shown. For instance, rotation \(g_{22}\) applied to the cube returns \(\mathbf F _{g_{22}^{-1}\mathbf x }\), shown by #22 in the bottom row. The 12 cubes wrapped in thin blue boxes are the rotational tetrahedral group \(T_4\). The 4 cubes wrapped in thick dashed red lines are the Klein four-group V. Right: The Cayley table of the cube group, representing how rotations are composed. For instance, on the bottom left, we have the example of composing rotation \(g_7\) with rotation \(g_1\). The composition is performed by (i) first applying \(g_7\) to the cube to yield \(\mathbf F _{g_7^{-1} \mathbf x }\), then (ii) applying \(g_1\) to \(\mathbf F _{g_7^{-1} \mathbf x }\), returning \(\mathbf F _{g_1^{-1}g_7^{-1} \mathbf x }\). The first transformation is easy to visualize: it is shown by cube \(\#7\) in the grid of cubes. The transformation \(g_1\) is a rotation by \(90^\circ \) counter-clockwise about the vertical axis, thus for the composition we rotate \(\mathbf F _{g_7^{-1}\mathbf x }\) by \(90^\circ \) counter-clockwise about the z-axis. This results in \(\mathbf F _{g_8^{-1}\mathbf x }\). This result is stored in the Cayley table by placing the first rotation down the left column and the second rotation along the top row. The intersection of row \(\mathbf 7 \) with column \(\mathbf 1 \) is the rotation \(\mathbf 8 \). On the bottom right, we show the composition \(g_7 g_1 = g_{17} \ne g_{8} = g_1 g_7\), demonstrating the non-commutativity of the cube group and 3D rotations in general.

Tetrahedral Group. Using 24 copies of the same filter increases the computational overhead 24-fold. A cheaper subsampling is the set of rotations of the tetrahedron. This has 12 elements, and goes by the name of the rotational tetrahedral group \(T_4\). \(T_4\) is formally a subgroup of the cube group, comprised of all even rotations (i.e. all rotations which can be made from two \(90^\circ \)-rotations). It is shown as the 12 cube rotations wrapped in thin blue boxes in Fig. 2.

Klein’s Four-Group. The smallest subsampling of rotations which can be seen as rotations about 3 independent axes is Klein’s Vierergruppe V, or four-group. It has four rotations, as can be seen in Fig. 3. This group is a subgroup of both the rotational tetrahedral group and the cube group. Interestingly, it is commutative and is also the smallest non-cyclic group. It is shown as the 4 rotations wrapped in thick dashed red lines in Fig. 2.

4.1 Cayley Tables

How a rotation of the input permutes the convolutional response can be worked out from the group’s Cayley table. This is a multiplication table enumerating every composition of transformations. For Klein’s four-group, we label the rotations as \(g_0\) (the identity), \(g_1\), \(g_2\), and \(g_3\). The Cayley table, with instructions on how to read it, is given in Table 1. The Cayley table is useful for determining how to perform the group convolution in deeper layers: looking at the expression for the group convolution, \(\sum _{h\in G} \mathbf W _{g^{-1}h}\mathbf F _h\), we see a product \(g^{-1}h\) in the indices of \(\mathbf W \), and we can use the Cayley table to find the single transformation that results from this product. Looking closely at a Cayley table, we see that all the rows/columns are permutations of one another; this will be important for understanding how input rotations affect the group-convolutional response.

Table 1. The Cayley table for Klein’s four-group. The product \(g_2g_3\) (a \(g_2\)-rotation followed by a \(g_3\)-rotation) can be found by looking down the left column for the first transformation \(g_2\), then finding the second transformation \(g_3\) in the top row. The cell at the intersection of row-\(g_2\) and column-\(g_3\) (shaded in yellow) is \(g_1\), so \(g_2g_3=g_1\).
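For illustration, the Cayley table of V can be generated mechanically by representing each rotation by its \(3\times 3\) matrix and composing. The sketch below assumes a particular assignment of \(g_1, g_2, g_3\) to the \(180^\circ \) rotations about the x-, y-, and z-axes; any labeling gives the same table structure.

```python
import numpy as np

# Klein's four-group V as 3x3 rotation matrices. The assignment of labels to
# axes is our own choice; the paper only labels the elements abstractly.
V = {
    'g0': np.diag([ 1,  1,  1]),   # identity
    'g1': np.diag([ 1, -1, -1]),   # 180 deg about x
    'g2': np.diag([-1,  1, -1]),   # 180 deg about y
    'g3': np.diag([-1, -1,  1]),   # 180 deg about z
}

def compose(a, b):
    """Label of the product of rotations a and b (V is commutative, so order is moot)."""
    M = V[a] @ V[b]
    return next(k for k, R in V.items() if np.array_equal(R, M))

# Print the Cayley table: rows = first rotation, columns = second rotation.
for a in V:
    print(a, [compose(a, b) for b in V])

assert compose('g2', 'g3') == 'g1'   # matches the worked example in Table 1
```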

4.2 Discrete Group Equivariance and Permutations

Rotating an input to a group convolution will lead to a transformation of its output. Specifically a rotation will lead to a permutation of the output, where we view the output as a vector of responses, with each dimension corresponding to a different group element/transformation \(g\in G\). An example of this vectorized output can be seen in Fig. 1. For translations the permutation is a voxel-wise shift, but for the aforementioned 3D rotations the permutations are much more complicated. If we apply a transformation p to the input features \(\mathbf F \), then

$$\begin{aligned}{}[[p\mathbf F ] \star \mathbf W ]_g&= \sum _{h \in G} [g\mathbf W ]_h [p\mathbf F ]_h = \sum _{h \in G} \mathbf W _{g^{-1}h} \mathbf F _{p^{-1}h} \end{aligned}$$
(8)
$$\begin{aligned}&= \sum _{h' \in G} \mathbf W _{g^{-1}ph'} \mathbf F _{h'} = [\mathbf F \star \mathbf W ]_{p^{-1}g} = [p[\mathbf F \star \mathbf W ]]_{g}. \end{aligned}$$
(9)

Here we have made the substitution \(h' = p^{-1}h\) and noted that \(p^{-1}G = G\) for \(p\in G\), where \(p^{-1}G := \{p^{-1}g \mid g \in G\}\). What Eqs. 8 and 9 say is that the output of the group convolution is permuted whenever the input \(\mathbf F \) is transformed by an element of the group G. The specific permutation depends on the specific transformation and transformation group. Thinking of \(\mathbf F \star \mathbf W \) and \([p\mathbf F ] \star \mathbf W \) as vectors separated by a permutation, we can write

$$\begin{aligned}{}[p\mathbf F ] \star \mathbf W = p[\mathbf F \star \mathbf W ] = \mathbf P _p[\mathbf F \star \mathbf W ], \end{aligned}$$
(10)

where the first equality is from Eqs. 8 and 9 and in the second equality we have rewritten the permutation as multiplication with the permutation matrix \(\mathbf P _p\). In fact \(\mathbf P _p\) is the permutation matrix corresponding to the pth column of the Cayley table. Thus we see that group convolutions are linearly equivariant to transformations \(p\in G\), as defined in Eq. 1. We see an example of this for Klein’s four-group in Fig. 3, where we have labeled the four rotations as \(g_0\) (the identity), \(g_1\), \(g_2\), & \(g_3\).
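The following toy NumPy check verifies Eqs. 8–10 numerically for Klein’s four-group acting on a cubic patch (pure rotations, no translation). The labeling of \(g_1, g_2, g_3\) as the \(180^\circ \) rotations about the three axes is our own assumption; this is an illustrative sketch, not our released code.

```python
import numpy as np

G = ['g0', 'g1', 'g2', 'g3']

def rotate(A, g):
    """Apply the rotation labeled g in V to a cubic array A."""
    ops = {'g0': lambda X: X,
           'g1': lambda X: np.rot90(X, 2, axes=(1, 2)),   # 180 deg about axis 0
           'g2': lambda X: np.rot90(X, 2, axes=(0, 2)),   # 180 deg about axis 1
           'g3': lambda X: np.rot90(X, 2, axes=(0, 1))}   # 180 deg about axis 2
    return ops[g](A)

def v_group_conv(F, W):
    """[F * W]_g = <gW, F> for each g in V (Eq. 6 restricted to rotations)."""
    return np.array([np.sum(rotate(W, g) * F) for g in G])

F = np.random.randn(4, 4, 4)
W = np.random.randn(4, 4, 4)

p = 'g2'
lhs = v_group_conv(rotate(F, p), W)   # [pF * W]
rhs = v_group_conv(F, W)              # [F * W]
# Eq. 10: the two vectors differ by the permutation P_p. Since V is commutative
# and self-inverse, p^{-1}g = pg, so the permutation is read off the g2 column
# of Table 1: g2 g0 = g2, g2 g1 = g3, g2 g2 = g0, g2 g3 = g1 -> indices [2, 3, 0, 1].
assert np.allclose(lhs, rhs[[2, 3, 0, 1]])
```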

Fig. 3.

Example of how the group convolution output permutes as a function of the input rotation. This example is for Klein’s four-group V. Each cube represents a rotation from V and a corresponding example feature vector is given with each cube.

4.3 Implementation: Roto-Translational Group-Convolution

Now we show how to implement a group convolution for 3D roto-translations. In this example, we focus on the four-group to model rotations. A roto-translation can be synthesized from a rotation followed by a translation. Roto-translations form a group, which can be seen as the product of V and \(\mathbb {Z}^3\). For our purposes, it is safe to assume that we can write the elements of this product group as tr for \(t\in \mathbb {Z}^3\) and \(r\in V\). So,

$$\begin{aligned}{}[\mathbf F \star \mathbf W ]_{tr}&= \sum _{\tau \in \mathbb {Z}^3}\sum _{\rho \in V} \left[ tr\mathbf W \right] _{\tau \rho } \mathbf F _{\tau \rho } = \sum _{\tau \in \mathbb {Z}^3}\sum _{\rho \in V} \left[ t \left[ r \mathbf W \right] _{\rho } \right] _{\tau } \mathbf F _{\tau \rho }. \end{aligned}$$
(11)

The interpretation of this equation is as follows. We start with a filter \(\mathbf W \). \(\mathbf W \) has a different value for each voxel in its receptive field, indexed by the translation variable \(\tau \), and also for every input rotation \(\rho \)—it may be easier just to think of four 3D filters, \(\mathbf W _{\rho _0}, \mathbf W _{\rho _1}, \mathbf W _{\rho _2}, \mathbf W _{\rho _3}\), one for each rotation in V. To convolve, we first rotate the kernel as \(r\mathbf W _{\rho _\bullet }\), then we perform a translational shift \(t[r\mathbf W _{\rho _\bullet }]\)—this second part ends up as the standard convolution of Eq. 4, which is efficient on GPUs. The initial rotation of the filter \(r\mathbf W _{\rho _\bullet }\) can be found by composing r and \(\rho _\bullet \) using our Cayley tables. When the input is a raw image, the input domain is just \(\mathbb {Z}^3\), so the rotation of \(\mathbf W \) is just r.
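The sketch below illustrates how such a deeper-layer V group convolution can be written with dense NumPy/SciPy operations: the output channel for rotation r is obtained by spatially rotating each filter slice by r, permuting the \(\rho \)-index with the Cayley table, and summing standard correlations. This is only a schematic of the computation, not our released GPU implementation; in practice the inner loops are folded into a single standard 3D convolution, and the rotation ordering below is our own assumption.

```python
import numpy as np
from scipy.ndimage import correlate

def rotate(A, r):
    """The four rotations of V applied to a 3D array (identity, 180 deg about each axis)."""
    return [A,
            np.rot90(A, 2, axes=(1, 2)),
            np.rot90(A, 2, axes=(0, 2)),
            np.rot90(A, 2, axes=(0, 1))][r]

# Cayley table of V: CAYLEY[r, rho] is the composition of r and rho.
CAYLEY = np.array([[0, 1, 2, 3],
                   [1, 0, 3, 2],
                   [2, 3, 0, 1],
                   [3, 2, 1, 0]])

def v_roto_translation_conv(F, W):
    """F: activations of shape (4, H, W, D); W: filter of shape (4, h, w, d).

    Output channel r is the correlation of F with the r-rotated filter rW,
    where rotating W spatially rotates each slice and permutes the rho-index
    via the Cayley table (r = r^{-1} in V).
    """
    out = np.zeros_like(F)
    for r in range(4):                       # output rotation
        for rho in range(4):                 # input rotation channel
            W_r = rotate(W[CAYLEY[r, rho]], r)
            out[r] += correlate(F[rho], W_r, mode='constant')
    return out

F = np.random.randn(4, 16, 16, 16)           # e.g. the output of a first V-layer
W = np.random.randn(4, 3, 3, 3)
print(v_roto_translation_conv(F, W).shape)   # (4, 16, 16, 16)
```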

To compute gradients for backpropagation we leverage automatic differentiation, which is available in most modern neural network libraries.

5 Experiments and Results

Here we describe two simple experiments we performed to demonstrate the effectiveness of group-convolutions on 3D voxelized data. We tested on the ModelNet10 classification challenge, which is a small 3D voxel dataset, and on the ISBI 2012 connectome segmentation challenge. In both examples, we found Klein’s four-group to be the most effective group for the rotation-equivariant group-convolutions.

5.1 ModelNet10

The ModelNet10 dataset [53] contains 4905 CAD models from 10 categories with a train:test split of 3991:914. Each model is aligned to a canonical frame and then rotated at 12 evenly-sampled orientations about the z-axis. These rotated models are then voxelized to a \(32\times 32\times 32\) grid. We use the voxelized version of Maturana and Scherer [35]. While the dataset consists of vertically aligned models, rotated only about the z-axis, we posit that local features occur at all 3D rotations, and so a CubeNet is well positioned to operate on such a dataset. We use the four-group of rotations and the rotational tetrahedral group \(T_4\), since we found the cube group too large and slow to train multiple times during a practical model search.

We use a simple VGG-like [45] network architecture, shown in Fig. 4. It consists of 10 group-convolutional layers followed by a 2-layer fully-connected network. Before every convolution, we apply multiplicative dropout with standard deviation 0.1 to the filter tensors, and after every convolution we add batch normalization. We use ReLU nonlinearities and global average pooling before the two fully-connected layers at the end of the network. The loss function is the multi-class cross-entropy. We initialize all weights using the He method [20] and train the network with ADAM [27], with a learning rate of 1e−3, which steps down by a factor of 1/5 every 5 epochs for 25 epochs.

The data augmentation is similar to the implementation found in Brock et al. [4], with 12 stratified rotations about the z-axis, reflections in the x- and y-axes with uniform probability, and uniformly random translations of up to \(\pm 4\) voxels along all three axes. We use this data augmentation to maintain a direct comparison with prior works. It should also be noted that rotational data augmentation cannot be avoided entirely, since our networks are only equivariant to subgroups of the full roto-translation group SE(3), so we still need to augment for all angles in the quotient SE(3)/G, where G is the subgroup of interest. We also rescale the voxel values to \(\{-1,5\}\) instead of \(\{0,1\}\), as in [4], who showed this helps with sparse voxel volumes. We show our results in Table 2. We compare the rotational tetrahedral group and the four-group models. For the four-group model, we report the single-view accuracy averaged across 5 models for robustness, alongside the rotation-averaged accuracy and single-view accuracy of the best model. The single-view accuracy is computed as the accuracy averaged over each of the 12 rotated test views, whereas the rotation-averaged accuracy is computed as the accuracy of the average of all 12 predictions.
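For reference, the following is a rough sketch of the augmentation pipeline just described (stratified z-rotations, x/y reflections, \(\pm 4\) voxel shifts, and the \(\{0,1\}\rightarrow \{-1,5\}\) rescaling). The interpolation settings are our own simplifications and the sketch is illustrative rather than our released code.

```python
import numpy as np
from scipy.ndimage import rotate as nd_rotate, shift as nd_shift

def augment_voxels(v, rng=np.random):
    """Sketch of the training augmentation: one of 12 stratified rotations in
    the x-y plane, random x/y reflections, up to +/-4 voxel shifts, and
    rescaling binary occupancies from {0, 1} to {-1, 5} as in Brock et al. [4].
    Nearest-neighbour resampling (order=0) is our own simplification."""
    angle = 30.0 * rng.randint(12)                        # stratified z-rotation
    v = nd_rotate(v, angle, axes=(0, 1), reshape=False, order=0)
    if rng.rand() < 0.5:
        v = v[::-1, :, :]                                 # reflection in x
    if rng.rand() < 0.5:
        v = v[:, ::-1, :]                                 # reflection in y
    v = nd_shift(v, rng.randint(-4, 5, size=3), order=0)  # +/-4 voxel translation
    return 6.0 * v - 1.0                                  # {0, 1} -> {-1, 5}

voxels = (np.random.rand(32, 32, 32) > 0.9).astype(np.float32)
augmented = augment_voxels(voxels)
```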

Table 2. Results for the ModelNet10 benchmark. We compare against other methods which operate on a voxel representation of the data. The only model to beat us is Brock et al.'s ensemble of 6 models. If we restrict to a single model, then we hold state-of-the-art accuracy.

For the single-model category, our four-group, rotation-averaged network attains state-of-the-art performance. Interestingly, the single-view result we obtain is very similar to ORION [41], which introduces an orientation estimation task alongside the classification. We posit that the \(T_4\)-model does not perform as well as the V-model because increasing the number of filter copies reduces the diversity of filters when the total number of filters (number of learnable filters times number of copies) is constrained. Essentially there is a tradeoff between filter diversity and the extent of equivariance due to weight-tying. The Klein group appears to perform best in this situation. It is also interesting to see that rotation averaging improves performance slightly compared to our single-view model. We suggest this is because we are averaging over rotations not covered by the four-group. Looking across the model sizes, we see that the group-convolutional models sit somewhere in the middle in terms of number of parameters. Speed-wise, we found during development that the four-group network trained only about 2\(\times \) slower than non-group CNNs.

Fig. 4.

(Best viewed in color) The architectures used in our experiments. We use a simple VGG-like architecture for the ModelNet10 classification challenge, and a UNet/FusionNet-like architecture for the ISBI2012 boundary segmentation benchmark.

5.2 ISBI 2012 Challenge: Connectome Segmentation

The ISBI 2012 Challenge is a volumetric boundary segmentation benchmark. The task is to segment Drosophila ventral nerve cords from a serial-section transmission electron microscopy (EM) image [1]. The training set is a single \(2\times 2 \times 1.5\) \(\upmu \)m\(^3\) volume of anisotropic imaging resolution (high x-y resolution, low z resolution). Each voxel is \(4 \times 4 \times 50\) nm\(^3\), so the full training image is \(512 \times 512 \times 30\) voxels in shape. The test image is also \(512\times 512\times 30\) voxels, with withheld labels. Scoring is performed using the metrics \(V_\text {rand}\) and \(V_\text {info}\) described in [1]; larger is better.

Here we are faced with two major issues: (a) a small dataset, and (b) high imaging anisotropy. We counter (a) with heavy data augmentation as per [38] and by noting that group convolutions reduce the number of trainable parameters through significant weight-tying. To counter the imaging anisotropy, we use Klein’s four-group, which is not affected by stretching along one of the axes (Fig. 5).

Fig. 5.

Examples of 2D slices from the training volume, the associated label mask, and the prediction made by our network. The original volume contains small amounts of noise and certain structures within the volume are ambiguous in nature.

Competing methods segment a single 2D high-resolution slice at a time, but as a proof of concept we treat segmentation as a 3D problem, feeding 3D image chunks into a 3D network. We use the architecture shown in Fig. 4, based on Weiler et al.'s steerable version [49] of the FusionNet [38]. It is a UNet [39] with added skip connections within the encoder and decoder paths to encourage better gradient flow. We place Gaussian multiplicative dropout [46] with standard deviation 0.1 before every convolution. By this we mean that if x is an activation and \(n \sim \text {Normal}(n;1,0.1^2)\), then the result of dropout is \(x\cdot n\). We also place batch normalization after every convolution and use ReLU nonlinearities directly before each convolution, except on the input.
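A minimal sketch of this Gaussian multiplicative dropout, directly following the description above (the test-time behavior of returning the input unchanged is our assumption):

```python
import numpy as np

def gaussian_multiplicative_dropout(x, sigma=0.1, training=True, rng=np.random):
    """Multiply each activation (or filter weight) x by n ~ Normal(1, sigma^2)
    at training time, as described in the text."""
    if not training:
        return x                      # noise-free at test time (our assumption)
    n = rng.normal(loc=1.0, scale=sigma, size=x.shape)
    return x * n
```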

For the training set, we extract random \(100\times 100\times 5\) voxel patches from the training volume and predict the center slice. We reflection-pad 10 voxels in the x-y plane, and constant-pad up to 5 voxels in the z-direction if we sample at the upper or lower image boundaries. We then apply a random elastic distortion in the x-y plane and pass the patches through our group-equivariant FusionNet. We keep our implementation close to the design of Weiler et al. to maintain a close comparison, and do not perform extensive model search. The results are shown in Table 3.
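A rough sketch of this patch extraction (padding amounts and patch size taken from the text; the elastic distortion step is omitted and the exact windowing around the center slice is our own guess):

```python
import numpy as np

def sample_training_patch(volume, rng=np.random):
    """Reflection-pad 10 voxels in x-y, constant-pad 5 voxels in z, then cut a
    random 100 x 100 x 5 patch whose center slice is the prediction target.
    volume: (512, 512, 30) array."""
    padded = np.pad(volume, ((10, 10), (10, 10), (0, 0)), mode='reflect')
    padded = np.pad(padded, ((0, 0), (0, 0), (5, 5)), mode='constant')
    x = rng.randint(padded.shape[0] - 100 + 1)
    y = rng.randint(padded.shape[1] - 100 + 1)
    z_centre = rng.randint(volume.shape[2])         # slice to predict (unpadded coords)
    z = z_centre + 5 - 2                            # window start in padded coords
    patch = padded[x:x + 100, y:y + 100, z:z + 5]
    return patch, z_centre

vol = np.random.rand(512, 512, 30).astype(np.float32)
patch, target_slice = sample_training_patch(vol)
print(patch.shape, target_slice)                    # (100, 100, 5)
```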

Table 3. Results for the ISBI 2012 challenge. We have tried to keep our implementation as close as possible to Weiler et al. Unlike other methods, we perform no post-processing at all: Weiler et al. use a lifting multi-cut [3] post-process, while UNet and Quan et al. use rotation averaging, and Quan also adds an optional median filtering step to boost scores. This shows that we can adapt state-of-the-art models to process 3D volumetric data with little change in the competitiveness of the results.

Our results are comparable with other leading methods. Our \(V_\text {rand}\) metric is slightly improved over UNet and Quan et al., but not as good as Weiler et al., who use a 2D group-convolutional neural network approach with 17 rotations about the z-axis and a lifting multicut post-process. The leading method uses the lifting multicut method too. Our \(V_\text {info}\) metric is not as good as the other methods, but we believe that with sufficient model search and extensive post-processing we could increase this number further. The main point of this experiment, as with the ModelNet10 experiment, was to demonstrate that we could get relatively good performance without the need for extensive test-time rotation averaging.

6 Conclusion

We have presented a 3D convolutional neural network architecture which is equivariant to right-angle rotations in three dimensions. This relies on an extension of the standard convolution to 3D rotations. On the ModelNet10 classification challenge, we have achieved state-of-the-art performance for a single model, beating some much larger models that rely on heavy data augmentation. Since our models are rotation invariant/equivariant by design, our CNNs need not learn to overcome rotations the way a standard CNN does. In 3D, this is an especially important gain. As a result, our model is positioned to get better generalization with less data, while avoiding the need to perform time-costly rotation averaging at test time.

Another perspective on our approach is to think of it as global average pooling over rotations, where we expose a new ‘rotation-dimension’. Without adhering to a defined group, it would be challenging to disentangle or orient a feature space (at any one layer, or across multiple layers) with respect to such a rotation dimension. The trade-off is that we commit to a group and its corresponding CubeNet architecture, to avoid the considerable effort of learning to disentangle pose.

We leave it to future work to examine whether these models can be generalized to continuous rotations and other challenging transformations, such as scale. There is also the untouched challenge of finding 3D rotation groups that are not aligned to the Cartesian voxel grid.