1 Introduction

In a neurosurgical procedure, the exposed brain tissue undergoes a time-dependent elastic deformation caused by various factors, such as cerebrospinal fluid leakage, gravity and tissue manipulation. Conventional image-guided navigation systems do not take this elastic brain deformation (brain shift) into account. Consequently, without an intraoperative image update, the neuroanatomical overlays produced prior to surgery do not correspond to the actual anatomy of the brain. Hence, real-time intraoperative brain shift compensation has a great impact on the accuracy of image-guided neurosurgery.

Ultrasound (US) is an inexpensive modality for updating the preoperative MRI. Its intraoperative repeatability offers a further benefit with respect to real-time visualization of intra-procedural anatomical information [1]. Both feature- and intensity-based deformable, multi-modal (MR-US) registration approaches have been proposed for brain shift compensation.

In general, brain shift compensation approaches are based on feature-driven deformable registration methods that update the preoperative images by establishing correspondences between selected homologous landmarks. The performance of Chamfer Distance Map [2], Iterative Closest Point (ICP) [3] and Coherent Point Drift [4] has been evaluated in phantom [2, 4] and clinical studies [3]. Inherently, the accuracy of feature-based methods is limited by the quality of the landmark segmentation and of the feature-matching algorithm.

Intensity-based algorithms overcome these intrinsic problems of feature-based methods. Similarity metrics such as the sum of squared differences [5] and normalized mutual information [6] were first proposed to register preoperative MR and iUS non-rigidly. However, intensity-based US-MR non-rigid registration poses a significant challenge due to the low signal-to-noise ratio (SNR) of ultrasound images and the different image characteristics and resolutions of US and MR images. To tackle this challenge, Arbel et al. [7] first generate a pseudo-US image from the preoperative MR data and then perform US-US non-rigid registration by optimizing the normalized cross-correlation metric. Recently, the local correlation ratio was proposed in a PaTch-based cOrrelation Ratio (RaPTOR) framework [8], in which preoperative MR was registered to post-resection US for the first time.

Recent advances in reinforcement learning (RL) and imitation learning (or behavior cloning) encourage a reformulation of the MR-US non-rigid registration problem. Krebs et al. [9] trained an artificial agent to estimate the Q-value for a set of pre-calculated actions. Since the Q-value of an action affects the current and future registration accuracy, a sequence of deformation fields for optimal registration can be estimated by maximizing the Q-value. In general, reinforcement learning presupposes a finite set of reasonable actions and learns the optimal policy to predict a combinatorial action sequence from this finite set. In a real-world problem such as intraoperative brain shift correction, however, the number of feasible actions is infinite. Consequently, reinforcement learning can hardly be adapted to resolve brain shift. In contrast, imitation learning learns the actions themselves: an agent is trained to mimic the action taken by a demonstrator in the associated environment, so there is no restriction on the number of actions. It has been used to solve tasks in robotics [10] and autonomous driving [11]. Our previous work reformulated the organ segmentation problem as imitation learning and showed good results [12].

Inspired by Turing's original formulation of the imitation game, in this work we reformulate the brain shift correction problem based on the theory of imitation learning. A multi-task neural network is trained to predict the movement of the landmarks directly by mimicking the ground-truth actions exhibited by the demonstrator.

2 Imitation Game

We consider the registration of a preoperative MRI volume to the intraoperative ultrasound (iUS) volume for brain-shift correction as an imitation game. The game is constructed by first defining the environment. The environment \(\mathbb{E}\) for brain-shift correction using registration is defined as the underlying iUS volume and MRI volume. The key points \(\varvec{P}^{\mathbb{E}} = [\varvec{p}_1^{\mathbb{E}}, \varvec{p}_2^{\mathbb{E}}, \cdots, \varvec{p}_N^{\mathbb{E}}]^T\) in the MRI volume are shifted non-rigidly in three-space to the target points \(\varvec{Q}^{\mathbb{E}} = [\varvec{q}_1^{\mathbb{E}}, \varvec{q}_2^{\mathbb{E}}, \cdots, \varvec{q}_N^{\mathbb{E}}]^T\) in the iUS volume. Subsequently, we define the demonstrator as a system able to estimate the ideal action, in the form of a piece-wise linear transformation \(\varvec{a}_i^{\mathbb{E},t}\), that moves the \(i^{\text{th}}\) key point \(\varvec{p}_i^{\mathbb{E},t}\) in the \(t^{\text{th}}\) observation \(\varvec{O}^{\mathbb{E},t}\) to the corresponding target point \(\varvec{q}_i^{\mathbb{E},t}\). The goal of the game is to find an agent \(\mathcal{M}(\cdot)\) that mimics the demonstrator and predicts the transformations of the key points given an observation. This is formulated as a least-squares problem (Eq. 1).

$$\begin{aligned} \mathop{\text{arg}\,\min}_{\mathcal{M}} \sum_{\mathbb{E}} \sum_t \Vert \mathcal{M}(\varvec{O}^{\mathbb{E},t}) - \varvec{A}^{\mathbb{E},t} \Vert^2_2 \end{aligned}$$
(1)

Here, \(\varvec{A}^{\mathbb{E},t} = [\varvec{a}_1^{\mathbb{E},t}, \varvec{a}_2^{\mathbb{E},t}, \cdots, \varvec{a}_N^{\mathbb{E},t}]^T\) denotes the actions of all N key points. In the context of brain shift correction, we use annotated landmarks in the MRI as key points \(\varvec{p}_i^t\) and landmarks in the iUS as target points \(\varvec{q}_i^t\). A neural network is employed as our agent \(\mathcal{M}\).

2.1 Observation Encoding

We encode the observation of the point cloud in the environment as a feature vector. For each point \(\varvec{p}_i^{\mathbb{E},t}\) in the point cloud, we extract a cubic sub-volume centered at this point in three-space. The cubic sub-volume has an isotropic dimension of \(C^3\) voxels with a voxel size of \(S^3\) mm\(^3\), and its orientation is identical to the world coordinate system. The value of each voxel in the sub-volume is obtained by sampling the underlying iUS volume at the corresponding position using trilinear interpolation. We denote the sub-volume encoding as a matrix \(\varvec{V}^{\mathbb{E},t} = [\varvec{v}_1^{\mathbb{E},t}, \varvec{v}_2^{\mathbb{E},t}, \cdots, \varvec{v}_N^{\mathbb{E},t}]^T\), where each sub-volume is flattened into a vector \(\varvec{v}_i^{\mathbb{E},t} \in \mathbb{R}^{C^3}\). Apart from the sub-volumes, we also encode the point-cloud geometry into the observation: we normalize the point cloud to a unit sphere and use the normalized coordinates \(\tilde{\varvec{P}}^{\mathbb{E},t} = [\tilde{\varvec{p}}_1^{\mathbb{E},t}, \tilde{\varvec{p}}_2^{\mathbb{E},t}, \cdots, \tilde{\varvec{p}}_N^{\mathbb{E},t}]^T\) as part of the encoding. The observation \(\varvec{O}^{\mathbb{E},t}\) is the concatenation of \(\varvec{V}^{\mathbb{E},t}\) and \(\tilde{\varvec{P}}^{\mathbb{E},t}\).
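To make the encoding concrete, the following Python sketch shows one possible realization with NumPy/SciPy, assuming the iUS volume is given as a 3-D array together with a world-to-voxel affine; the function and parameter names are illustrative, not taken from the authors' implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def encode_observation(us_volume, world_to_voxel, points, C=7, S=2.0):
    """Encode N key points as (flattened sub-volumes, normalized coordinates).

    us_volume      : 3-D array of iUS intensities (assumption)
    world_to_voxel : 4x4 affine mapping world coordinates (mm) to voxel indices
    points         : (N, 3) key-point coordinates in world space (mm)
    C, S           : sub-volume dimension (voxels) and isotropic voxel size (mm)
    """
    # Axis-aligned offsets of a C^3 grid centered at the origin, in mm.
    r = (np.arange(C) - (C - 1) / 2.0) * S
    gx, gy, gz = np.meshgrid(r, r, r, indexing="ij")
    offsets = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)        # (C^3, 3)

    subvolumes = np.empty((len(points), C ** 3), dtype=np.float32)
    for i, p in enumerate(points):
        world = p[None, :] + offsets                                 # (C^3, 3)
        homog = np.c_[world, np.ones(len(world))]                    # homogeneous
        vox = (world_to_voxel @ homog.T)[:3]                         # (3, C^3)
        # Trilinear interpolation (order=1) of the underlying iUS volume.
        subvolumes[i] = map_coordinates(us_volume, vox, order=1, mode="nearest")

    # Normalize the point cloud to a unit sphere.
    centered = points - points.mean(axis=0)
    normalized = centered / np.linalg.norm(centered, axis=1).max()

    # Observation: concatenation of texture and geometric encodings.
    return np.concatenate([subvolumes, normalized], axis=1)         # (N, C^3 + 3)
```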

2.2 Demonstrator

The demonstrator provides the actions \(\varvec{A}^{\mathbb{E},t} \in \mathbb{R}^{N \times 3}\) for the key points. We define the action for brain shift as the displacement vector that moves a key point to its respective target. As both the target points and the key points are known, an intuitive way to calculate the action for each key point is to compute the displacement directly as \(\varvec{a}_i^{\mathbb{E},t} = \varvec{q}_i^{\mathbb{E},t} - \varvec{p}_i^{\mathbb{E},t}\). Note, however, that this demonstrator estimates the displacement independently of the observation, which can make the learning difficult. Therefore, we also calculate the translation vector between the point-cloud centroids, \(\varvec{t}^{\mathbb{E},t} = \bar{\varvec{q}}^{\mathbb{E},t} - \bar{\varvec{p}}^{\mathbb{E},t} \in \mathbb{R}^{3 \times 1}\), as an auxiliary output of the demonstrator. Hence, the objective function is

$$\begin{aligned} \mathop{\text{arg}\,\min}_{\mathcal{M}, \mathcal{M}'} \sum_{\mathbb{E}} \sum_t \Vert \mathcal{M}(\varvec{O}^{\mathbb{E},t}) - \varvec{A}^{\mathbb{E},t} \Vert^2_2 + \lambda \Vert \mathcal{M}'(\varvec{O}^{\mathbb{E},t}) - \varvec{t}^{\mathbb{E},t} \Vert^2_2 \end{aligned}$$
(2)

where \(\mathcal{M}'\) denotes the agent estimating the auxiliary output and \(\lambda\) is the weighting of the auxiliary term. In practice, a single multi-task neural network realizes both \(\mathcal{M}\) and \(\mathcal{M}'\).
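For clarity, below is a minimal sketch of the demonstrator and the combined objective of Eq. 2, assuming the landmark sets are given as NumPy arrays; the predicted quantities are placeholders for the network outputs.

```python
import numpy as np

def demonstrate(P, Q):
    """Ground-truth supervision for one observation.

    P, Q : (N, 3) key points (MRI landmarks) and target points (iUS landmarks)
    Returns the per-point actions A (Eq. 1) and the auxiliary translation t (Eq. 2).
    """
    A = Q - P                               # displacement of each key point
    t = Q.mean(axis=0) - P.mean(axis=0)     # centroid shift (auxiliary output)
    return A, t

def multi_task_loss(pred_A, pred_t, A, t, lam=1.0):
    """Least-squares objective of Eq. 2 for a single environment/observation."""
    return np.sum((pred_A - A) ** 2) + lam * np.sum((pred_t - t) ** 2)
```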

2.3 Data Augmentation

To facilitate the learning process, we augment the training dataset to increase the number of samples and the overall variability. In the context of brain shift correction, data augmentation can be applied both to the environment \(\mathbb{E}\) and to the key points \(\varvec{P}^{\mathbb{E},t}\). To augment the environment \(\mathbb{E}\), the elastic deformation proposed by Simard et al. [13] is applied to the MRI and iUS volumes: a variety of brain shift deformations is simulated by warping the T1 and FLAIR MRI volumes and the iUS volume, together with their associated landmarks, independently, using two different deformation fields.
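A minimal 3-D variant of this elastic deformation, under the assumption that the field is built by Gaussian smoothing of uniform noise as in [13], could look as follows; the magnitude \(\alpha\) and smoothing scale \(\sigma\) are assumed hyper-parameters, and the same field must also be applied to the associated landmarks.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(volume, alpha=300.0, sigma=20.0, rng=None):
    """Warp a 3-D volume with a smooth random displacement field (cf. [13]).

    alpha : displacement magnitude (voxels); sigma : smoothing scale (voxels).
    Returns the warped volume and the displacement field, so that the
    associated landmarks can be transformed consistently with the same field.
    """
    if rng is None:
        rng = np.random.default_rng()
    shape = volume.shape
    # Smooth white noise -> one spatially coherent displacement map per axis.
    disp = [gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
            for _ in range(3)]
    grid = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    coords = [g + d for g, d in zip(grid, disp)]
    warped = map_coordinates(volume, coords, order=1, mode="nearest")
    return warped, np.stack(disp)
```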

In each of the augmented environments, we also augment the key points' (MRI landmark) coordinates in two different ways. For each key point, we add a random translation vector with a maximal magnitude of 1 mm in each direction; this synthetic non-rigid perturbation mimics the inter-rater differences that may be introduced during landmark annotation [14]. An additional translation vector with a maximal magnitude of 6 mm in each direction shifts all key points jointly; this simulates the residual rigid registration error introduced during the initial registration using fiducial markers. Of particular importance is how these augmentation steps are applied to the data. We assume the translation between the key points and target points in the training data to be a random registration error. Consequently, we initially align the key points to the center of gravity of the target points, where the center of gravity is defined as the mean of all associated points. The non-rigid and translational augmentation steps are applied subsequently to the key points.
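The key-point augmentation itself reduces to a few vector operations; a sketch, with the magnitudes taken from the text and the function name being illustrative:

```python
import numpy as np

def augment_key_points(P, Q, jitter_mm=1.0, shift_mm=6.0, rng=None):
    """Augment key points P against fixed targets Q (both (N, 3), in mm).

    1. Align P to the center of gravity of Q (the original offset is
       treated as a random registration error).
    2. Per-point jitter of at most `jitter_mm` per axis (inter-rater noise).
    3. A common random shift of at most `shift_mm` per axis (residual
       rigid error from the fiducial-based initial registration).
    """
    if rng is None:
        rng = np.random.default_rng()
    P = P - P.mean(axis=0) + Q.mean(axis=0)              # centroid alignment
    P = P + rng.uniform(-jitter_mm, jitter_mm, P.shape)  # non-rigid jitter
    P = P + rng.uniform(-shift_mm, shift_mm, (1, 3))     # global translation
    return P
```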

2.4 Imitation Network

As both the observation encoding and the demonstrator are based on a point cloud, the imitation network also operates on a point cloud. Inspired by PointNet [15], which efficiently processes point-cloud data without a neighborhood assumption, we propose a network architecture that exploits both the known neighborhood within each sub-volume \(\varvec{V}^{\mathbb{E},t}\) and the unknown permutation of the associated key points \(\tilde{\varvec{P}}^{\mathbb{E},t}\). The network is depicted in Fig. 1. It takes the sub-volumes and key points as two inputs and processes them independently. In the observation encoding, each row vector denotes a sub-volume \(\varvec{v}_i^{\mathbb{E},t} \in \mathbb{R}^{C^3}\) of the associated key point \(\varvec{p}_i^{\mathbb{E},t}\). Therefore, we use three consecutive \(C \times 1\) convolutions with a stride of \(C \times 1\) to approximate a 3D separable convolution and extract texture feature vectors. We also employ \(3 \times 1\) convolution kernels to extract features from the key points. These low-level features are concatenated for further processing. The main part of the network largely follows the PointNet architecture: a multilayer perceptron (MLP) extracts local features, and max pooling extracts a global feature. The local and global features are concatenated to propagate the gradient and facilitate the training process. The multi-task formulation of the network also helps improve overall robustness. We use batch normalization in each layer and ReLU as the activation function. One property of the network is that if a copy of a key point and its associated sub-volume is added as an additional input, the outputs for the original key points remain unchanged. This is especially useful in the context of brain-shift correction, where the number of key points usually varies before and after resection. We therefore use the maximum number of landmarks in the training data as the number of input key points of our network; for a training sample with fewer key points, we arbitrarily copy one of the key points. Finally, after predicting the deformation of the key points, the dense deformation field between them is interpolated using B-splines.
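The following PyTorch sketch illustrates the main ingredients named above (shared per-point layers, a max-pooled global feature, and two task heads), assuming \(C=7\); the layer widths and the exact realization of the low-level convolutions are our assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class ImitationNet(nn.Module):
    """Sketch of the multi-task imitation network (cf. Fig. 1).

    Inputs : subvol (B, N, C**3) -- flattened sub-volumes per key point
             points (B, N, 3)    -- normalized key-point coordinates
    Outputs: per-point actions (B, N, 3) and auxiliary translation (B, 3).
    """

    def __init__(self, C=7, feat=64):
        super().__init__()
        # Texture branch: three consecutive convolutions with kernel and
        # stride C along the flattened C^3 axis (343 -> 49 -> 7 -> 1),
        # approximating a separable 3-D convolution.
        self.texture = nn.Sequential(
            nn.Conv1d(1, feat, C, stride=C), nn.BatchNorm1d(feat), nn.ReLU(),
            nn.Conv1d(feat, feat, C, stride=C), nn.BatchNorm1d(feat), nn.ReLU(),
            nn.Conv1d(feat, feat, C, stride=C), nn.BatchNorm1d(feat), nn.ReLU())
        # Geometry branch: a per-point transform shared by all key points.
        self.geometry = nn.Sequential(
            nn.Linear(3, feat), nn.BatchNorm1d(feat), nn.ReLU())
        # PointNet-style shared MLP over the concatenated low-level features.
        self.mlp = nn.Sequential(
            nn.Conv1d(2 * feat, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.BatchNorm1d(256), nn.ReLU())
        self.action_head = nn.Conv1d(512, 3, 1)  # local+global -> per-point action
        self.trans_head = nn.Linear(256, 3)      # global -> auxiliary translation

    def forward(self, subvol, points):
        B, N, V = subvol.shape
        tex = self.texture(subvol.reshape(B * N, 1, V)).squeeze(-1)  # (B*N, feat)
        geo = self.geometry(points.reshape(B * N, 3))                # (B*N, feat)
        local = torch.cat([tex, geo], dim=1).reshape(B, N, -1)
        local = self.mlp(local.transpose(1, 2))                      # (B, 256, N)
        glob = local.max(dim=2).values                               # (B, 256)
        fused = torch.cat([local, glob[:, :, None].expand(-1, -1, N)], dim=1)
        return self.action_head(fused).transpose(1, 2), self.trans_head(glob)
```

Because all key points pass through the same shared layers and the global feature is a channel-wise maximum over points, duplicating a key point, as done when padding to the fixed input size, leaves the outputs for the original points unchanged at inference time.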

Fig. 1. Illustration of the imitation network architecture.

Table 1. Evaluation of the mean distance between landmarks in MRI and ultrasound before and after correction.

3 Evaluation

We trained and tested our method on the data of the Correction of Brainshift with Intra-Operative Ultrasound (CuRIOUS) MICCAI challenge 2018, which uses the clinical dataset described in [14]. In the current phase of the challenge, 23 datasets are provided as training data, of which 22 comprise the required MRI and ultrasound landmark annotations before dura opening. The registration is evaluated using the mean target registration error (mTRE) in mm. We used leave-one-out cross-validation to train and evaluate our method: for each fold, 19 datasets were used for training, two for validation and one as the test set. Each training and validation dataset was augmented 32-fold for the environment, cascaded with a 32-fold key-point augmentation; in total, 19.4k samples were used for training and 2k for validation. We chose a sub-volume with isotropic dimension \(C=7\) and a voxel size of \(2\times 2\times 2\,\text{mm}^3\). Sixteen points were used as input key points and a batch size of 128 was used for training. The adapted Adam optimizer proposed by Reddi et al. [16] with a learning rate of 0.001 was used. The results are shown in Table 1. Using our method, the overall mean target registration error (mTRE) is reduced from 5.37 ± 4.27 mm to 1.21 ± 0.55 mm. In a similar setting, but applied to different datasets, the state-of-the-art registration method RaPTOR achieves an overall mTRE of 2.9 ± 0.8 mm [8].
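For reference, the mTRE used above is simply the mean Euclidean distance between corresponding landmarks after correction; a minimal sketch:

```python
import numpy as np

def mtre(warped_landmarks, target_landmarks):
    """Mean target registration error in mm for corresponding (N, 3) landmarks."""
    return np.linalg.norm(warped_landmarks - target_landmarks, axis=1).mean()
```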

The proposed imitation network has 0.22 M trainable parameters, requires 6.7 M floating-point operations (FLOPs), and converges within 7 epochs. For the computational complexity in the application phase, the pretrained network can be considered as \(\mathcal{O}(1)\). The observation encoding step has a complexity of \(\mathcal{O}(N \times C^3)\), where N denotes the number of key points and C the sub-volume dimension. The complexity of the proposed algorithm is therefore \(\mathcal{O}(N \times C^3)\), independent of the resolution of the underlying MRI or iUS volume. In the current implementation, the average runtime of the algorithm is 1.77 s, of which \(88\%\) is spent on observation encoding on the CPU.

4 Discussion

To the best of our knowledge, this is the first time an imitation learning based approach has been proposed in the context of brain shift correction. The presented method achieves encouraging results within 2 mm at real-time capability (<2 s). In 21 out of 22 datasets, the mTRE is reduced significantly. As the mTRE of the 22nd dataset is initially small, the inter-rater difference of 0.5 mm remains notable [14]. Hence, these results indicate the applicability of the proposed method in a clinical environment. However, the following aspects should be considered in further development. One important aspect is the number and variation of the training data used in the proposed imitation learning algorithm. Although the number of training datasets is increased effectively by the data augmentation methods described in Sect. 2.3, variations of the training data such as the location of the tumor and the orientation of the head cannot be augmented without further considerations. A common tool to simulate different image orientations is rotational augmentation; however, it implicitly alters the effect of gravity and therefore results in unrealistic training data. Thus, rotational augmentation is inappropriate for data augmentation in the context of brain shift compensation. The other aspect is the comprehensive validation of the proposed method: generalizability and robustness should be evaluated on a larger amount of data acquired with different intraoperative imaging modalities. In this challenge, we use landmarks as key points and predict the deformation of the landmarks directly. In future applications, we could also adapt our approach to the control points of a free-form deformation or to the contour points of a certain structure (e.g. a vessel). The associated target points could be either manually annotated or automatically estimated with point-matching algorithms.

5 Conclusion

In this study, we proposed a novel approach for intraoperative brain shift correction during tumor resection surgery using imitation learning. The presented method uses observation encoding to describe local texture and point-cloud information, and the trained imitation network estimates, based on this encoding, the movement of landmarks defined in the preoperative MR volume directly to their counterparts in the iUS volume. Our network substantially reduced the mean landmark distance between the pre- and intraoperative image pairs, from 5.37 ± 4.27 mm to 1.21 ± 0.55 mm, in real time, which is particularly compelling for its future use in a surgical setting. Additionally, the proposed approach is flexible, as it is not modality- or anatomy-specific, and could thus be employed in a variety of image-guided surgical interventions.