
1 Introduction

Safe road crossing requires conflict-free sharing of the road surface. Pedestrian crossing lights stop traffic to guarantee safe passage. Other right-of-way regulations, such as zebra crossings, are less reliable, since drivers do not consistently yield to pedestrians [19]. Thus, for safe crossing in the absence of crossing lights, pedestrians need to identify occasional traffic gaps [7].

The duration of a traffic gap allowing safe crossing is determined by the crossing time. US data [1] indicates a representative crossing speed of about 1.2 m/s and urban road widths of 5.5 m (one-way single lane) or 8.5 m (two-way or two-lane one-way) with shoulders. The corresponding crossing times are 4.5 s and 7 s respectively. The necessary traffic gap, equivalent to the needed advance warning of an approaching vehicle, must be at least that long.

Recognizing an approaching vehicle 7 s ahead of contact time is feasible for people with normal or corrected vision and intact cognitive function. For blind and visually impaired people, as well as for elderly persons, children [15] and others, sufficiently early recognition of an approaching vehicle may not be possible. Blind people are trained to rely on hearing and to follow the “cross when quiet” strategy [18]. Auditory-based performance is, however, far from ideal [4, 9], and further difficulties arise where background urban noise masks the low noise emitted by a modern car [6], or when the pedestrian has less than perfect hearing.

Image and video processing methods to assist visually handicapped people were reviewed in [13]. Many techniques convert visual data to auditory or haptic signals. These include systems to locate street crossings [11, 16]. Converting traffic light status to an auditory or other indication was considered in [3]. Sensor-rich “smart canes” have been developed [5] for close-range obstacle detection and path planning. However, the problem of road crossing at unsignalized crosswalks has remained open. A noteworthy exception is the work of [2], aimed at providing a robotic system with street-crossing capability.

We recently presented [17] a system positioned near a road crossing, alerting pedestrians to approaching traffic and detecting traffic gaps suitable for safe crossing. The experimental results in [17] were obtained using a powerful laptop computer connected to a good-quality external USB camera with an added narrow field of view lens. The feasibility of real-time implementation on a low-cost, power-frugal platform remained an open question. In this paper we meet this challenge using an off-the-shelf Android device, establishing the potential for widespread deployment and stand-alone solar-powered operation. To complete the feasibility proof, we demonstrate successful autonomous operation at actual road crossings.

2 Stationary Location-Specific Solution

Consider a vehicle approaching a crosswalk at 50 km/h. Detecting this vehicle 7 s in advance implies detection at about 100 m from the crosswalk. At that distance, the frontal viewing angle of a typical car is about 1\(^\circ \) horizontally and slightly less vertically. On many suburban and country roads, speeds of up to 100 km/h are not unusual, corresponding to a viewing angle of roughly 0.5\(^\circ \) at the required detection distance.
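As a sanity check, the detection distance and viewing angle follow directly from the small-angle approximation, assuming a representative car width of about 1.8 m (our figure, not specified above):

$$\begin{aligned} d = v\,t = \frac{50\ \text {km/h}}{3.6}\times 7\ \text {s} \approx 97\ \text {m}, \qquad \theta \approx \frac{1.8\ \text {m}}{97\ \text {m}} \approx 0.019\ \text {rad} \approx 1.1^\circ . \end{aligned}$$

At 100 km/h the detection distance roughly doubles to about 194 m, and the viewing angle halves accordingly.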

Assuming a straight road towards the crosswalk, the tiny-looking approaching vehicle appears near the vanishing point in the image. Reliable detection requires predetermination of the point of first appearance, and a narrow field of view (FOV) around that point. With some exaggeration, this can be illustrated as watching the scene through a straw. One might wish for a smartphone application, warning the holder of approaching vehicles. However, typical mobile phones have wide-FOV lenses. Worse, a hand-held device would require careful pointing towards the point of first appearance, a challenge for anyone, and a formidable undertaking for people with visual, cognitive or motor impairments.

We suggest a stationary and location-specific solution, where the system is attached, for example, to a traffic sign or a lighting pole, facing the approaching traffic. The absence of substantial ego-motion improves robustness, facilitates training and adaptation, and reduces the computational requirements, thus minimizing cost and power consumption. Beyond its technical advantages, the location-specific approach does not require a personal device, and is thus accessible to all members of society.

The essential system elements are a video camera with a narrow field of view lens, a processing unit, and a user interface. Optional elements include a compact solar panel and a rechargeable battery, enabling off-grid operation, and a communication unit (cellular, WiFi, etc.), allowing remote maintenance as well as cooperation with nearby instances of the system, e.g. in two-way multi-lane road crossings. As discussed in the sequel, we successfully implemented the system on the Samsung Galaxy K-Zoom smartphone, see Fig. 1.

Fig. 1. The system, implemented on a Samsung Galaxy K-zoom smartphone, attached to a lighting pole near a road crossing. The camera faces the approaching traffic. Note the extended position of the zoom lens.

Indication of incoming traffic, or lack thereof, can be delivered to the user as an audio, tactile or visual signal, or via a local networking interface (such as Bluetooth or WLAN) to a smartphone or a dedicated receiver.

3 Principles of Operation

Vehicles approaching a crosswalk emerge as tiny blobs at the point of first appearance. A key design choice has been to provide the pedestrian with the earliest possible warning of an approaching vehicle, rather than the latest warning that would allow crossing. This implies that the task at hand is early detection, rather than precise estimation of the time to contact (TTC) [10, 12].

We transform the incoming video to a scalar temporal signal, referred to as Activity, quantifying the overall risk-inducing motion in the scene. Vehicles approaching the camera lead to pulses in the Activity signal, so that a substantial Activity ascent reflects an approaching vehicle. Timely detection of incoming traffic, with sparse false alarms, calls for reliable separation between meaningful Activity ascent and random-like fluctuations.

3.1 Activity Estimation

Motion in the scene leads to optical flow. Parts of the optical flow field may correspond to vehicles approaching the camera. Other optical flow elements are due to harmless phenomena, such as traffic in other directions, crossing pedestrians, animals on the road, moving leaves and slight camera vibrations. The suggested Activity signal \(A_n\), where n is the temporal index, is computed as the projection of the optical flow field \(\varvec{u}_n(x,y)\) onto an influx map. The influx map \(\varvec{m}(x,y)\) is a vector field, representing the location-dependent typical course of traffic approaching the camera. The effective support of the influx map largely corresponds to incoming traffic lanes. Mathematically,

$$\begin{aligned} A_n=\sum _{x,y}\varvec{m}(x,y)\cdot \varvec{u}_n(x,y). \end{aligned}$$
(1)

The scalar Activity signal quantifies the overall hazardous traffic in the scene.
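For illustration, a minimal Python/OpenCV sketch of Eq. 1 (the deployed system uses OpenCV4Android, whose Python bindings expose the same functions; the function and variable names below are ours):

```python
import cv2
import numpy as np

def activity(prev_gray, gray, influx_map):
    """Scalar Activity A_n of Eq. (1): the dense optical flow field
    u_n(x, y) projected onto the influx map m(x, y).
    influx_map: (H, W, 2) array holding a (dx, dy) vector per pixel."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Pixel-wise dot product m(x, y) . u_n(x, y), summed over the frame.
    return float(np.sum(flow * influx_map))
```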

Fig. 2. Influx map, computed by averaging the optical flow field over a training period. Left: arrow-style vector display. Right: color-coded vector representation, corresponding to the color key shown top-right. The motion direction in each pixel is represented by hue, its norm by saturation. (Color figure online)

Assuming a one-way street, the influx map can be obtained by temporally averaging the optical flow field over a training period, as illustrated in Fig. 2. Temporal averaging cancels randomly oriented motion elements due to vibrations, wind and similar effects, while maintaining the salient hazardous traffic motion patterns in the road area.
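A corresponding training-stage sketch (illustrative Python; the Farnebäck parameters are generic defaults, not those of the actual system):

```python
def train_influx_map(frames):
    """Estimate the influx map m(x, y) by temporally averaging dense
    optical flow over a training sequence (one-way street assumed).
    Randomly oriented motion cancels; consistent traffic motion remains."""
    acc, count = None, 0
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        acc = flow if acc is None else acc + flow
        count += 1
        prev = gray
    return acc / count
```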

The influx-map vectors can be intensified to emphasize motion near the point of first appearance, thus increasing the advance warning time. Note that the point of first appearance can be identified during training, e.g. by back-tracking the influx-map vectors to their source. Minor adjustment of the influx map computation procedure is necessary for handling two-way roads, where consistent outbound traffic is also expected.

We have tacitly presented optical flow and influx map computation in dense form, meaning that their spatial resolution is uniform and commensurate with the resolution of the video frame. However, sparse computation of the optical flow and influx map, at spatially variant density, is beneficial with regard to both performance and computational cost. Foveated computation of Eq. 1, dense near the point of first appearance and sparse elsewhere, has the desirable effect of emphasizing motion in that critical region.
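One way to realize the foveated sampling is sketched below; the Gaussian point sampling mirrors the implementation described in Sect. 5, while the spread parameter sigma is a made-up placeholder:

```python
def foveated_points(poa_xy, frame_shape, n=2000, sigma=40.0, seed=0):
    """Sample sparse optical-flow evaluation points whose density peaks
    at the point of first appearance poa_xy, for foveated evaluation of
    Eq. (1): dense near that critical region, sparse elsewhere."""
    rng = np.random.default_rng(seed)
    h, w = frame_shape[:2]
    pts = rng.normal(loc=poa_xy, scale=sigma, size=(n, 2))
    pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)  # keep x inside the frame
    pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)  # keep y inside the frame
    return pts.astype(np.float32).reshape(-1, 1, 2)  # cv2 LK point format
```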

3.2 Detecting Approaching Vehicles

Based on the raw Activity signal, we wish to generate a binary indication, suggesting whether risk-inducing traffic is approaching (‘TRAFFIC’ state), or a sufficient traffic gap allows safe crossing (‘GAP’ state).

As a vehicle approaches the camera, the peak of the corresponding Activity pulse is usually evident. However, early warning requires detection at an early stage of ascent, where the signal to noise ratio is poor.

Reliable, timely detection cannot be achieved by direct thresholding of the Activity signal. The signal is therefore evaluated within a sliding temporal window of N samples. This improves the effective signal to noise ratio, consequently increasing the detection probability and reducing the false alarm rate. However, sliding window computation inherently introduces detection latency. The extent N of the sliding window must therefore be limited, to maintain the advance warning period required for safe crossing.

Suppose that the sliding window falls either on a traffic gap (‘GAP’ hypothesis \(\theta _0\)), modelling the Activity signal as \(A_n=\nu _n\), where \(\nu _n\) is zero-mean additive white Gaussian noise, or at the precise moment a vehicle emerges (‘TRAFFIC’ hypothesis \(\theta _1\)), such that \(A_n=s_n+\nu _n\), where \(s_n\) is the clean Activity pulse shape for the specific vehicle and viewing conditions.

We discriminate between the two hypotheses using the Likelihood Ratio Test

$$\begin{aligned} \text{ LRT }=\frac{\frac{1}{\left( 2\pi \right) ^{N/2}\sigma ^{N}}\prod \limits _{n=0}^{N-1}\exp [-\frac{\left( A_{n}-s_{n}\right) ^{2}}{2\sigma ^{2}}]}{\frac{1}{\left( 2\pi \right) ^{N/2}\sigma ^{N}}\prod \limits _{n=0}^{N-1}\exp [-\frac{A_{n}^{2}}{2\sigma ^{2}}]}\gtrless \lambda , \end{aligned}$$
(2)

where \(\lambda \) is a discriminative threshold. This leads to the detection rule:

$$\begin{aligned} \tilde{y} \equiv \sum \limits _{n=0}^{N-1}A_{n}s_{n}\gtrless \frac{1}{2}\left( 2\sigma ^{2}\ln \lambda +\sum \limits _{n=0}^{N-1}s_{n}^{2}\right) \equiv \gamma , \end{aligned}$$
(3)

where \(\tilde{y}\) denotes the correlation of the Activity signal with the pulse template. The pulse shape depends on the particular installation, and can be learned during training. For a given installation, specific vehicle characteristics lead to some variability. The right-hand side is an application-dependent threshold, denoted \(\gamma \). We determine the threshold using the Neyman-Pearson criterion, tuned to obtain a specified false-alarm probability.

Given either hypothesis, the random variable \(\tilde{y}\) is Gaussian with variance \(\textstyle {\sigma ^{2}\sum \limits _{n=0}^{N-1}s_{n}^{2}}\). The probability of false alarm is determined by the tail distribution of the ‘GAP’ hypothesis \(\theta _{0}\):

$$\begin{aligned} P_{FA} =Pr\left( \tilde{y}>\gamma |\theta _{0}\right) = 1-\Phi \left( \frac{\gamma }{\sigma \sqrt{\sum s_{n}^{2}}}\right) , \end{aligned}$$
(4)

where \(\Phi (x)\) is the cumulative distribution function of the standard normal distribution, and \(Q(x)=1-\Phi (x)\) denotes its tail probability. The threshold \(\gamma \) can thus be expressed in terms of the admissible false-alarm rate, the noise level and the Activity pulse template. The decision rule can be reformulated as

$$\begin{aligned} \frac{\sum \limits _{n=0}^{N-1}A_{n}s_{n}}{\sqrt{\sum \limits _{n=0}^{N-1}s_{n}^{2}}} \gtrless Q^{-1}\left( P_{FA}\right) \sigma , \end{aligned}$$
(5)

which is the familiar matched filter result, correlating the input sequence with the Activity pulse template we wish to detect.
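In code, the decision rule of Eq. 5 reduces to a few lines (an illustrative sketch; `scipy.stats.norm.isf` serves as the Gaussian tail inverse \(Q^{-1}\)):

```python
from scipy.stats import norm

def detect_traffic(window, template, sigma, p_fa=1e-3):
    """Matched-filter detection rule of Eq. (5): correlate the sliding
    Activity window with the pulse template and compare the normalized
    statistic against the Neyman-Pearson threshold Q^{-1}(P_FA) * sigma."""
    stat = np.dot(window, template) / np.sqrt(np.sum(template ** 2))
    gamma = norm.isf(p_fa) * sigma  # norm.isf(p) = Q^{-1}(p)
    return stat > gamma             # True -> 'TRAFFIC', False -> 'GAP'
```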

So far, we have discussed the most demanding detection condition, the earliest detection of an incoming vehicle. The Activity pulse increases as the vehicle approaches; thus, once the correlator output \(\tilde{y}\) crosses the threshold, it generally exceeds the threshold as long as the vehicle has not reached the crossing. When a vehicle appears late (e.g. leaving a parking space near the crossing), the Activity pulse rapidly ascends, and the correlator promptly crosses the threshold.

3.3 Setting Detector Parameters

The detector is tuned according to the noise variance and the Activity pulse template. During training, after the influx map has been computed, the system produces Activity measurements. These measurements are used to automatically estimate the Activity pulse template and the noise variance.

Fig. 3. Generating the Activity pulse template. Left: the Activity signal over a small part of the training period, with the salient local maxima detected. Right: the Activity pulse template, generated as the median of the Activity signal windows.

The system locates the salient local maxima of the Activity signal during training, and sets a temporal window containing each maximum, where most of the window precedes the maximum. We set the Activity pulse template as the median of the Activity signal windows, see Fig. 3.
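A sketch of the template estimation step (peak picking via `scipy.signal.find_peaks` is our choice; the window length, the fraction preceding the peak and the prominence threshold are illustrative placeholders):

```python
from scipy.signal import find_peaks

def pulse_template(activity_signal, win=40, pre_frac=0.8, prominence=1.0):
    """Estimate the Activity pulse template: locate salient local maxima
    of the training Activity signal, place a window around each with most
    of the window preceding the maximum, and take the pointwise median."""
    peaks, _ = find_peaks(activity_signal, prominence=prominence)
    lead = int(pre_frac * win)
    wins = [activity_signal[p - lead: p - lead + win]
            for p in peaks
            if p - lead >= 0 and p - lead + win <= len(activity_signal)]
    return np.median(np.stack(wins), axis=0)
```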

Fig. 4. The Activity signal computed over six minutes, including the events shown in Fig. 5. TRAFFIC and GAP indications are represented by red and blue respectively. (Color figure online)

Fig. 5. Video frames corresponding to two multiple-car events.

4 Examples and Evaluation

Figure 4 shows the Activity signal computed during six minutes at a test site. The color represents the prediction issued to the pedestrian, red signifying TRAFFIC and blue meaning GAP. TRAFFIC is typically indicated about 8–10 s before the Activity signal peak, i.e. before the car reaches the camera, allowing secure crossing with sufficient margin. Multimodal Activity pulses, e.g. at \(t \approx 160\) s and at \(t \approx 250\) s, are each due to several cars arriving in sequence. Corresponding snapshots are shown in Fig. 5. The false alarms at \(t \approx 80\) s and \(t \approx 205\) s are not due to random noise: the first was caused by a person on the road, the second by a cat. The random noise level is rather low, as seen around \(t \approx 300\) s in Fig. 4. Figure 6 (left) is the histogram of advance warning times provided by the system during two hours of testing.

In the archetypal case, an approaching vehicle emerges close to the point of first appearance, thus maximizing the achievable warning time. However, in various urban or suburban settings, a vehicle can exit a parking place close to the camera, or turn into the monitored road from an adjacent driveway. Thus, while the typical visibility period of an incoming vehicle is often greater than 10 s, in unusual cases it can be as short as 1 s, providing neither the system nor an alert human with sufficient warning. Figure 6 (right) is the histogram of the advance warning times obtained during one hour at the challenging installation shown in Fig. 7, characterized by substantial occlusion near the point of first appearance and a residential afternoon activity pattern, including parking, biking and strolling. Short warning times usually correspond to bicycles and very slow vehicles, which pose minimal hazard. Irregular events acquired by the system, leading to unusually short or long advance warnings, are exemplified in Fig. 7. Additional results, including ROC curves, can be found in [17].

Fig. 6. Left: histogram of advance warning times provided by the real-time implementation during two hours at the test site shown in Fig. 5. Right: histogram of advance warning times obtained during one hour at the test site shown in Fig. 7.

Fig. 7. Irregular events on the road. Left: a slow, distant bicyclist. Right: a person suddenly appearing on the road.

5 Real-Time Implementation

To show that low-cost realization and wide deployment are possible, we implemented the system on an off-the-shelf Android device. Since the solution requires a narrow field-of-view lens, we selected the Samsung Galaxy K-zoom smartphone, which has a built-in optical zoom lens. Note that variable zoom is not actually necessary, and the system can readily be built and operated using a fixed-zoom lens. Fixed-zoom lenses are widely available as low-cost smartphone accessories. Furthermore, unless remote management and distributed cooperative operation are sought, the cellular communication parts of the smartphone are redundant. In addition, although the K-zoom has a GPU capable of reducing computation time and increasing the frame rate, our implementation utilized only the smartphone’s CPU. Thus, the system can be implemented on extremely low-cost Android tablet-like devices. In many applications, the screen of the Android device is not in use, and can be shut down or removed altogether to further reduce power consumption and cost.

Other than its variable zoom lens, the Samsung Galaxy K-zoom, introduced in 2014, is similar to smartphones of its generation, with Android v4.4.2 (KitKat) as its native operating system. Figure 1 shows an installation of the system at a test site, attached to a lighting pole, with its zoom lens extended.

The target system software was developed using OpenCV4Android SDK version 2.4.10, FFmpeg and JavaCPP. The most demanding computational task is optical flow estimation. During training, dense optical flow estimation is required for generating the influx map. For that purpose, we used the OpenCV4Android implementation of Farnebäck’s algorithm [8], skipping frames to improve the estimation of the tiny optical flow vectors near the point of first appearance. Following the identification of that point, a subset of 2000 image points was sampled using a Gaussian density function centered at the point of first appearance. In the online phase, sparse optical flow is evaluated only at those points, using the OpenCV4Android implementation of the Lucas-Kanade [14] algorithm. The Activity signal is computed by projecting the sparse optical flow field onto the corresponding sparse subset of the influx map. A rate of 8 frames per second is easily obtained on the K-zoom smartphone. To reduce power consumption, the rate can be decreased to two frames per second without significant performance degradation. The template matching process (3) is carried out at 30 frames per second using temporal interpolation. The results shown in Figs. 4 and 6(left) were obtained using the real-time smartphone-based implementation.
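The online step, sketched in Python for readability (the deployed system uses the equivalent OpenCV4Android calls; variable names are ours):

```python
def sparse_activity(prev_gray, gray, pts, influx_sparse):
    """Online Activity estimate: sparse Lucas-Kanade optical flow at the
    foveated point set, projected onto the matching influx-map vectors.
    pts: (N, 1, 2) float32 points; influx_sparse: (N, 2) influx vectors."""
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    ok = status.ravel() == 1              # keep successfully tracked points
    flow = (nxt - pts).reshape(-1, 2)[ok]
    return float(np.sum(flow * influx_sparse[ok]))
```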

Two-way streets, as encountered for example at the test site shown in Fig. 1, raise two issues. First, warning about traffic approaching the crossing from the opposite direction can be obtained by installing an additional instance of the system with its camera pointing in that direction. Cellular, WiFi or Bluetooth communication between the two systems allows issuing a mutually agreed indication to the pedestrian. Second, on a two-way road, as the camera monitors incoming traffic, receding vehicles may also be seen in the field of view. To distinguish incoming from outgoing traffic based on optical flow, one might compute vector field measures such as divergence. In practice, assuming that the camera is installed higher than typical vehicles, the distinction can be based on the sign of the vertical component of the optical flow vector: incoming vehicles are associated with descending optical flow vectors. In the Android implementation, upwards-pointing vectors in the influx map are simply nullified. Indeed, the system works well at the test site shown in Fig. 1 using that method.
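This outbound-suppression rule amounts to a single operation on the influx map (a sketch assuming image coordinates with y growing downwards, so approaching vehicles viewed from above produce a positive vertical flow component):

```python
def suppress_outbound(influx_map):
    """Zero the influx-map vectors whose vertical component points upwards
    in the image (receding traffic), keeping only descending motion that
    corresponds to approaching vehicles."""
    m = influx_map.copy()
    m[m[..., 1] < 0] = 0.0  # dy < 0: upward image motion -> outbound vehicle
    return m
```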

6 Discussion

The number of visually impaired people worldwide has been estimated to be 285 million [20], and many more have age-related disabilities limiting their road-crossing skills. Reliable mobile personal assistive devices for road crossing without traffic lights are yet to be invented, but in any case are likely to be beyond the means of needy people worldwide. The goal of this research has been the development of an effective, easy to use, low-cost solution. It is facilitated by the stationary, location-specific approach, avoiding system ego-motion altogether. System installation is simple, as the location-dependent influx map and the Activity pulse template are automatically obtained by training.

Compared to our previous work [17], the breakthrough is real-time stand-alone implementation using an off-the-shelf Android device. The retail price of a suitable Android tablet, including all necessary components, is less than US$40 at the time of writing, and the marginal cost of a fixed narrow field-of-view lens is negligible. The mass-production manufacturing bill of materials, especially for a screen-less version of the system, is therefore likely to be widely affordable.

Interestingly, in their future work section, [2] envisioned a stationary solution to the unsignalized road crossing problem. Our approach is substantially more robust, computationally lighter and better exploits stationarity than the tracking-based method applied by [2] in their robotic system.

We evaluated the system at several test locations, obtaining promising results. Additional work is needed to ensure reliable operation at night or in the presence of rain or snow.