
1 Introduction and Relevant Work

Our novel contributions are (1) showing that a neural network trained to output two separate driving tasks (i.e., steering and motor throttle predictions) can yield different motion-sensitive neurons that contribute to different output behaviors, and (2) demonstrating that we can probe these hidden filters through controlled experiments inspired by psychology. The experimental results indicate that optical flow filters are used for steering decisions, whereas variance filters are used for motor throttle decisions.

Our self-driving network takes in video from left and right cameras to predict future steering and motor throttle values, so there are many possible spatiotemporal cues that our network could respond to.

We first tried reproducing receptive field visualizations [1, 7]. As shown in Fig. 1, we generated gradient ascent visualizations on the layers of an early CNN (2 convolutional layers and 2 dense layers) that takes in 2 frames at a time. Across frames and cameras for any given neuron filter, Layer 1 receptive fields appear sensitive to optical flow and natural stereoscopic disparity.
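For concreteness, a minimal sketch of such a gradient ascent procedure is shown below (PyTorch). The input shape and the packing of 2 cameras \(\,\times \,\) 2 frames into the channel axis are illustrative assumptions, not our network's exact configuration.

```python
import torch

def visualize_filter(conv_layer, filter_idx, shape=(1, 4, 94, 168),
                     steps=200, lr=0.1):
    """Gradient ascent on one filter's mean activation, starting from noise.

    shape is (batch, channels, H, W); the 4 channels stand for
    2 cameras x 2 frames stacked channel-wise (an assumption about
    the input packing, for illustration only).
    """
    x = torch.randn(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -conv_layer(x)[:, filter_idx].mean()  # negate to ascend
        loss.backward()
        optimizer.step()
    return x.detach()

# Hypothetical usage on a first-layer convolution:
# rf = visualize_filter(torch.nn.Conv2d(4, 8, kernel_size=5), filter_idx=0)
```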

Fig. 1.

Gradient ascent visualizations. Shown are four neurons’ receptive fields from Layer 1 of our first self-driving network. Each neuron filter is divided into sub-filters, with one sub-filter per camera, per input frame – hence the 2\(\,\times \,\)2 layout per neuron filter. These filters appear sensitive to optical flow and stereoscopic disparity

However, this sensitivity is hard to quantify, and later layers are even noisier. Furthermore, our current convolutional network is primarily the SqueezeNet architecture from Iandola et al. [2], and we did not want to interpret unstructured visualizations from 1\(\,\times \,\)1 and 3\(\,\times \,\)3 filters. Instead, though not a semantic analysis, we labeled and compared inputs by presumed relevant features, similar to Zhou et al. [8]. We then took inspiration from the general feature manipulation of predictive modeling experiments in psychophysics [6].

We studied optical flow because it provides cues about depth and future trajectories [5], and because our gradient ascent analysis offered early evidence of flow-sensitive filters.

2 Experimental Setup

We labeled input videos by their average steer and motor throttle combinations. We only used videos whose current and future driving commands had little variation, and whose future values were well predicted by the network. This allowed us to easily test on salient ego-motion videos containing one type of flow per video.
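A rough sketch of this selection step follows; the clip fields and thresholds are hypothetical, since the paper does not specify its exact criteria.

```python
import numpy as np

def select_clips(clips, var_thresh=0.01, err_thresh=0.05):
    """Keep clips with near-constant driving commands that the network
    already predicts well. Thresholds are illustrative placeholders.
    Each clip is assumed to carry per-frame ground truth ('steer',
    'throttle') and network predictions ('pred_steer', 'pred_throttle').
    """
    selected = []
    for c in clips:
        low_variation = (np.var(c["steer"]) < var_thresh and
                         np.var(c["throttle"]) < var_thresh)
        well_predicted = (
            abs(np.mean(c["pred_steer"]) - np.mean(c["steer"])) < err_thresh and
            abs(np.mean(c["pred_throttle"]) - np.mean(c["throttle"])) < err_thresh)
        if low_variation and well_predicted:
            selected.append(c)
    return selected
```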

Fig. 2.

Video speed manipulation. Natural videos are resampled for the optical flow experiment, to simulate optical flow changes while holding other natural features fixed. The network expects 10 frames of input video, so each manipulated video resamples the original frames to match that size. Sped-up versions can simply use future frames, but slowed-down versions need timepoints in between the normally captured frames, which are created using the interpolation method of Meyer et al. [4]

Fig. 3.

Driving predictions after input video speed manipulation. The output steer (left) and motor throttle (right) neurons’ activations with respect to video speed changes are plotted. The X coordinates are normal video predictions, and the Y coordinates are changed-speed video predictions. In both plots, zero means no output behavior. The fit lines indicate that speeding up the input video pushes steer predictions to become more extreme and increases throttle predictions; the opposite holds for slowed-down videos

As seen in Fig. 2, by speeding up and slowing down a given video, we created new videos with similar optical flow directions across the visual field but with larger or smaller magnitudes. We then compared how these affected output driving predictions to test the relevance of input video motion.
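A minimal resampling sketch is shown below. For slowed-down videos the paper uses the phase-based interpolation of Meyer et al. [4]; the linear blending here is only a stand-in for that method.

```python
import numpy as np

def resample_clip(frames, speed, n_out=10):
    """Resample a video to simulate a speed change.

    frames is a (T, H, W, C) float array. For speed > 1 we sample
    further into the future; for speed < 1 we need in-between
    timepoints, created in the paper via Meyer et al. [4] --
    linear blending below is a simplified substitute.
    """
    t = np.arange(n_out) * speed              # fractional source indices
    t = np.clip(t, 0, len(frames) - 1)
    lo = np.floor(t).astype(int)
    hi = np.ceil(t).astype(int)
    w = (t - lo)[:, None, None, None]         # blend weight per frame
    return (1 - w) * frames[lo] + w * frames[hi]

# Hypothetical usage: resample_clip(frames, 2.0) doubles the apparent
# speed; resample_clip(frames, 0.5) halves it by blending neighbors.
```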

We also controlled the frame order and stereoscopic disparity of the input videos after manipulating the video speed. If optical flow is a relevant feature for our driving predictions, then the response should depend on whether the time frames are properly ordered, similar to the network in Zhou et al. [9]. Furthermore, if the network is attempting to recover depth cues from motion, it could also be affected by stereoscopic disparity, another source of depth cues present with our network setup.
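The control transformations themselves are simple array operations; a sketch follows, where the assumed clip layout (time, cameras, height, width, channels) and the mode names are ours, not the paper's.

```python
import numpy as np

def temporal_control(clip, mode, seed=0):
    """Frame-order controls; clip has shape (T, cameras, H, W, C)."""
    if mode == "reverse":
        return clip[::-1]
    if mode == "shuffle":
        rng = np.random.default_rng(seed)
        return clip[rng.permutation(len(clip))]
    return clip  # "natural": leave the order untouched

def stereo_control(clip, mode):
    """Stereoscopic disparity controls on the camera axis."""
    if mode == "swap":                        # exchange left/right views
        return clip[:, ::-1]
    if mode == "mono":                        # remove disparity: one view twice
        return np.repeat(clip[:, :1], 2, axis=1)
    return clip  # "natural": keep both views as captured
```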

3 Results and Discussion

Theoretically, we expected lower frame rate sampling to push predictions toward zero, and for higher frame rate sampling to do the opposite.

As seen in Fig. 3, input video speed manipulation affects both steering and motor throttle predictions. This suggests potential optical flow sensitivity, which we explore further below.
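The effect can be summarized by the slope of the fit lines in Fig. 3; a least-squares version is sketched below, though the paper does not state its exact fitting method.

```python
import numpy as np

def speed_effect_slope(normal_preds, manipulated_preds):
    """Least-squares slope of manipulated vs. normal predictions,
    as in the fit lines of Fig. 3. A slope > 1 means the manipulation
    pushes predictions to be more extreme; < 1 means it damps them."""
    slope, _intercept = np.polyfit(normal_preds, manipulated_preds, 1)
    return slope
```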

3.1 Temporal Controls

In Fig. 4, steer and motor throttle predictions were plotted for input videos with different frame orders. Motor throttle predictions appear robust to frame order transformations, but the steering predictions are not.

Fig. 4.

Steer and motor throttle prediction changes from temporal frame ordering. Changes to the output steer (left) and motor throttle (right) neurons from input frame ordering are plotted. The X coordinates are naturally ordered video predictions, and the Y coordinates are predictions after temporal reordering. The fit lines for the steer plots indicate that randomizing the frame order nullifies any steering prediction, whereas reversing the order (not in the training set) reverses the steer prediction. The fit lines for the throttle plots indicate that randomizing and reversing the frame order had little impact on the throttle prediction

Fig. 5.

Steer prediction changes from temporal frame reordering after video speed manipulation. Here, input videos are sped up and slowed down as in Fig. 3, but also have their frame orders changed. We can see that reversing the frame order (left) maintains the natural steer changes correlated with video speed manipulation (as in Fig. 3), but randomizing the frame order (right) breaks the natural steer prediction changes after speeding up and slowing down the videos

As seen in Fig. 5, changing the frame order significantly impacts the video speed manipulation experiment for steer predictions. A smooth flow of time, either forward or reversed, is needed to reproduce results similar to those of the video speed experiment in Fig. 3. This implies that optical flow filters are used for steer decisions.

For motor throttle predictions, changing the frame order does not significantly impact the video speed manipulation experiment. Figure 6 shows that motor throttle predictions are sensitive to input motion independent of frame order, implying that variance filters are used: regardless of frame order, little motion yields little variance across the frames, whereas high motion yields the opposite.
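This reasoning can be checked directly: per-pixel variance across frames is invariant to any permutation of the frame order, yet grows with the amount of motion. A toy numpy demonstration (random frames standing in for real video):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.random((10, 94, 168))        # toy 10-frame, single-channel clip

# Per-pixel variance over time is invariant to frame permutation...
shuffled = frames[rng.permutation(len(frames))]
assert np.allclose(frames.var(axis=0), shuffled.var(axis=0))

# ...but it scales with the amount of motion: a static clip has zero
# temporal variance, while shifting content frame-to-frame raises it.
static = np.repeat(frames[:1], 10, axis=0)
moving = np.stack([np.roll(frames[0], 3 * t, axis=1) for t in range(10)])
print(static.var(axis=0).mean())          # 0.0
print(moving.var(axis=0).mean())          # > 0, grows with the shift size
```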

Fig. 6.

Motor throttle prediction changes from temporal frame reordering after video speed manipulation. Here, input videos are sped up and slowed down as in Fig. 3, but also have their frame orders changed. We can see that both randomizing the frame order (left) and reversing the frame order (right) maintain the natural throttle prediction changes after changing the video speed

Fig. 7.

Steer and motor prediction changes from stereo effects after video speed manipulation. Here, input videos are sped up and slowed down as in Fig. 3, but also have their stereoscopic disparity changed. We can see that both switching the stereo channels (left) and removing the stereo disparity (right) maintain the natural steer (top) and speed (bottom) prediction changes after speeding up and slowing down the videos, as in Fig. 3

3.2 Steer and Motor Speed Results Across Stereo Controls

Lastly, for steer and motor speed predictions, stereoscopic disparity changes do not significantly impact the video speed experiment. Figure 7 shows that the motion selective filters for steer and motor speed predictions are independent of stereo features.

4 Conclusion

We show that our network trained to predict steering and motor throttle from stereo video exhibits different motion-selective behavior for steering and throttle. Through a series of controlled psychophysical experiments, we demonstrated that both the steer and motor throttle predictions are affected as expected by varying the motion in the input video. However, even though both behaviors look similar on the surface, correct steer predictions depend on a smooth frame order, whereas motor throttle predictions do not.

We show that steer decisions are based on optical flow filters in the hidden layers, whereas motor throttle decisions are based on variance filters.

Though not presented in this paper, we ran the same video speed experiments on hidden layer neurons as on the output neurons. By plotting average neuron activation for changed-speed videos versus normal-speed videos, we can generate the same steer-like and motor-like profiles as in Fig. 3. We further mapped the distribution of steer-like and motor-like neurons across the layers, arguing that these ultimately contribute to the final steer and motor throttle predictions. Linear SVMs were used to find the motor-like neurons based on their activation profiles, with the middle layers of our network containing the most motor-like neurons.
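A sketch of that SVM step follows. The featureization (one row per neuron, one column per tested video speed) is our assumption; the paper states only that linear SVMs were applied to activation profiles.

```python
import numpy as np
from sklearn.svm import LinearSVC

def find_motor_like(profiles, labels):
    """Fit a linear SVM separating motor-like neurons from the rest.

    profiles: (n_neurons, n_conditions) array, e.g. each neuron's mean
    activation at every tested video speed. labels: 1 = motor-like,
    0 = otherwise. Feature construction here is an assumption.
    """
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(profiles, labels)
    return clf
```

Counting `clf.predict(layer_profiles).sum()` per layer would then give the layer-wise distribution of motor-like neurons described above.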

From a theoretical standpoint, motor throttle changes only affect radially-dependent optical flow, whereas steering creates optical flow that is consistent throughout the visual field. The latter is easier for convolutional filters to capture, which is what we see in our results.
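This intuition matches the standard pinhole-camera flow equations (a textbook result, not derived in this paper): for focal length \(f\), scene depth \(Z\), forward translation \(T_z\) (throttle), and yaw rate \(\omega_y\) (steering), the image flow at pixel \((x, y)\) is

\[
\dot{x} = \frac{x\,T_z}{Z} - \omega_y\!\left(f + \frac{x^2}{f}\right), \qquad
\dot{y} = \frac{y\,T_z}{Z} - \omega_y\,\frac{x\,y}{f}.
\]

The throttle terms are scaled by \(1/Z\) and point radially outward from the focus of expansion, while the steering term is approximately a uniform horizontal shift of \(-f\omega_y\) near the image center, a pattern a single convolutional filter can match anywhere in the visual field.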

Lastly, consistent with Lundquist et al. [3], depth-sensitive stereo features are more difficult for convolutional networks to learn than other features. Our results appear robust to changes in stereoscopic disparity; motion cues seem to have been more relevant than stereo cues in driving changes in steer or motor throttle predictions.