1 Introduction

With the advancement of video coding technologies and wireless networks, more and more video content can be delivered to mobile devices. However, because wireless networks are error-prone, transmission errors must be taken into consideration. Error concealment is a technique used at the decoder to minimize the distortion caused by transmission errors; the idea is to reconstruct the damaged data by exploiting spatial and/or temporal correlations.

Spatial error concealment reconstructs damaged blocks or pixels by interpolating from undamaged neighboring pixels; spatial and/or temporal correlations are exploited to reconstruct the damaged data [10]. Under the assumption of spatial consistency, pixels within a small area and in neighboring macroblocks (MBs) should be similar, which is the basis of many spatial error concealment approaches [11]. Temporal error concealment predicts the motion vectors (MVs) of the missing data and recovers the damaged blocks by filling in the lost data from the reference frame using the predicted motion vector. In [1, 8, 9], the corrupted MVs are recovered by the principle of block matching. Both spatial and temporal correlations are exploited in the boundary matching algorithm (BMA) [7], which recovers the MV of a damaged block from its neighboring blocks with the minimum boundary difference. In [4], the observation that MVs within the same object are similar is used in an object-based error concealment algorithm. In the above-mentioned literature, the neighbor or boundary matching algorithms (NMA/BMA) used in error concealment require recovered image information from neighboring blocks. However, no neighboring block information is available when a whole frame is lost.

Depth-image-based rendering (DIBR) has been proposed as a promising technology for 3DTV systems, in which a scene is represented by a traditional 2-D video and its corresponding depth information [3]. Liu et al. proposed a depth-image-based error concealment algorithm for 3-D video, which utilizes the correlation between the 2-D video and its corresponding depth map [5]. In [13], a depth-based BMA (DBMA) built on BMA is proposed to conceal corrupted blocks. In [2], Chen et al. proposed a method to conceal a lost frame, which determines the MVs from foreground and background MVs. In [12], Wu et al. proposed an error concealment algorithm for H.264/AVC that reconstructs an entire lost frame by block-level motion projection. The algorithms in [2, 12, 13] recover the entire missing frame. However, they use the coded motion vectors, which may not be the true motion vectors (TMVs), and this degrades their performance.

In [14], Yan et al. proposed the hybrid MV estimation (HMVE) algorithm, a pixel-based error concealment algorithm that recovers an entire missing frame. It extrapolates the motion vectors received from the encoder and classifies the pixels of the damaged frame into three types of regions; each type is recovered with the average of its extrapolated MVs. In [15], the authors extend [14] by incorporating the depth map to improve the concealment performance. They claim that their algorithm provides efficient error concealment for depth-image-based 3-D video transmission and accurately estimates the motion vectors of the lost frame with the help of the depth information. However, both [14] and [15] still use the conventional coded motion vectors rather than true motion vectors, which limits the performance of the error concealment algorithm.

To improve error concealment performance, an object-based full frame recovery (OFFR) algorithm for 3-D depth-based video using true motion estimation (TME) [6] is proposed. The refined true motion vectors (TMVs) from TME and the MVs from the encoder are used to extrapolate objects and blocks, respectively. The depth map is used to overcome the blocking artifact problem. Finally, the uncovered regions are compensated with neighboring recovered pixels. Experimental results demonstrate that the proposed algorithm efficiently improves the quality of corrupted video compared to [15].

The rest of the paper is organized as follows. Section 2 describes the background related to the proposed algorithm. Section 3 then presents the proposed algorithm for reconstructing the whole missing frame. Simulation results are shown in Section 4. Finally, conclusions are drawn in Section 5.

2 Background

Finding the correct MV of a lost block is a difficult problem in error concealment for video codecs. Generally, the temporal domain carries more information than the spatial domain, so temporal error concealment is usually used for inter-frame recovery. Details of the related techniques are given below.

2.1 Stereo video

2.1.1 Stereo video coding

A dedicated multi-view video coding specification has not yet been developed. To date, there are four common methods for compressing stereo video, which are shown in Fig. 1.

Fig. 1 Several ways to encode the stereo video

  (a) Two single video streams: the two views are encoded separately.

  (b) Inter-view coding: the two views are encoded together and reference each other.

  (c) Side-by-side: the two views are down-sampled to half size and combined into one video.

  (d) One view with depth map: one view and its grey-level depth map are encoded together.

2.1.2 Depth-map

In a 3-D image or video, a depth map is an image or video channel that contains information about the distance of scene objects from a viewpoint. In binocular vision, if a scene object is close to the eyes, its position shift between the two eyes is large. This position shift between the left and right views is called disparity. Given the original view of a 3-D video, the other views can be rendered from the disparity map; the depth map is equivalent to the disparity map.
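
For reference, under the common pinhole stereo model (a standard relation, not stated in this paper), the depth Z of a point is inversely proportional to its disparity d, where f is the focal length and B the baseline between the two cameras:

$$ Z = \frac{f \cdot B}{d} $$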

A depth map can be created in one of three ways:

  (1) From two connected cameras, by calculating the horizontal distance of each pixel between the two views.

  (2) With 3-D video moviemaker software, with which the depth map can easily be created accurately.

  (3) From an additional depth sensor attached to the camera, where the depth map is created from the sensor data.

In case (1), depth-map estimation is required to calculate the depth map, so the result usually contains estimation errors. In cases (2) and (3), the depth map is more accurate.

2.2 True motion estimation (TME)

Block-based motion estimation is widely used in most video compression standards, such as MPEG-1, MPEG-4, H.263, and H.264. In these techniques, image frames are divided into square or rectangular blocks and an MV is estimated for each block. To raise the accuracy of the MVs used to conceal the missing frame, the multi-pass TME of [6] is adopted. Figure 2 presents an overall view of the multi-pass MV search procedure and the associated MV propagation routine.

Fig. 2 Flow diagram of multi-pass TME

The search block size ranges from 32 × 32 down to 4 × 4. A block is marked as converged if its MV is similar to those of its neighboring blocks. To find true motion with low computational complexity, only a few update vectors are used. USi denotes the set of update vectors in the i-th pass and can be expressed by

$$ US^i = \begin{cases} US_{\mathrm{large}}, & i = 1 \\ US_{\mathrm{small}}, & 2 \le i \le 12 \end{cases} $$
(1)
$$ US_{\mathrm{small}} = \left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \begin{pmatrix} \pm 1 \\ 0 \end{pmatrix},\ \begin{pmatrix} 0 \\ \pm 1 \end{pmatrix} \right\} $$
(2)
$$ US_{\mathrm{large}} = US_{\mathrm{small}} \cup \left\{ \begin{pmatrix} \pm 2 \\ 0 \end{pmatrix},\ \begin{pmatrix} 0 \\ \pm 3 \end{pmatrix},\ \begin{pmatrix} \pm 4 \\ 0 \end{pmatrix},\ \begin{pmatrix} 0 \\ \pm 5 \end{pmatrix} \right\} $$
(3)

In the initial pass, the large update vector set, containing vectors such as (0, ±5), is used together with the large block size. Because large blocks are more appropriate for describing object features, large update vectors can be used in the early passes to find global motion. In later passes, the small update vector set is used to refine the initial non-true motion estimates into true motion estimates. For more details, readers can refer to [6].
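
The following Python sketch illustrates one refinement step driven by the update-vector sets in (1)–(3); the pass scheduling, block splitting from 32 × 32 down to 4 × 4, and MV propagation of [6] are omitted, and all function names are ours:

```python
import numpy as np

# Update-vector sets of Eqs. (2) and (3), as (x, y) displacements.
US_SMALL = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
US_LARGE = US_SMALL + [(2, 0), (-2, 0), (0, 3), (0, -3),
                       (4, 0), (-4, 0), (0, 5), (0, -5)]

def sad(cur, ref, bx, by, bs, mv):
    """SAD between a block of the current frame and the MV-shifted
    block of the reference frame; inf if the candidate leaves the frame."""
    dx, dy = mv
    h, w = ref.shape
    x0, y0 = bx + dx, by + dy
    if x0 < 0 or y0 < 0 or x0 + bs > w or y0 + bs > h:
        return np.inf
    cur_blk = cur[by:by + bs, bx:bx + bs].astype(np.int32)
    ref_blk = ref[y0:y0 + bs, x0:x0 + bs].astype(np.int32)
    return int(np.abs(cur_blk - ref_blk).sum())

def refine_block_mv(cur, ref, bx, by, bs, mv, pass_idx):
    """One refinement step of pass pass_idx (Eq. (1)): test the current
    MV plus every update vector and keep the lowest-SAD candidate."""
    updates = US_LARGE if pass_idx == 1 else US_SMALL
    candidates = [(mv[0] + ux, mv[1] + uy) for ux, uy in updates]
    return min(candidates, key=lambda c: sad(cur, ref, bx, by, bs, c))
```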

3 Proposed method

As mentioned in the previous section, compared with the motion vectors obtained at the decoder, true motion estimation (TME) describes the motion direction more accurately. However, the complexity of TME at the decoder must also be considered. Furthermore, human observers focus on only some regions of the full frame. As a result, refining the object MVs by TME only in the regions of interest is more efficient. The visual distinctive region decision block shown in Fig. 3 decides which regions are likely to attract human visual attention.

Fig. 3 Flow diagram of the proposed algorithm

In Fig. 3, the object-based TME block utilizes the depth map to provide more accurate pixel-based MVs for the objects in the reference frame. In this paper, blocks or pixels with similar depth values are assumed to belong to the same object; in this way, block-based motion estimation can be extended to a pixel-based one. Within the regions of interest, blocks and pixels sharing the same object MV are grouped into one object.

Finally, any uncovered regions are filled by a spatial-temporal concealment method.

3.1 The visual distinctive region decision

Usually, the camera focuses on only some regions of the scene while the others are out of focus. In blurred regions, whose depth differs greatly from that of the focused region, and in highly complex objects, motion estimation cannot detect true motion. Therefore, in this paper, a high-pass filter is used to detect complex regions, which are full of texture and dominate human visual quality. High-frequency regions are more sensitive to motion estimation errors than low- or mid-frequency regions; that is, in high-frequency regions a motion estimation error causes much more degradation of visual quality, because in low-frequency regions the pixel values differ little from one another. TME has been shown to yield better performance, but at high computational complexity. Considering the trade-off between visual quality and computational complexity, MVs from TME are used to extrapolate the textured regions, while MVs from the encoder are used for the blurred regions.

Figure 4 presents the flow diagram of the selection of visually attractive regions, together with a detailed example. The regions of interest are determined by applying an adaptive threshold to the output of a Gaussian high-pass filter: if the local average of a region exceeds the global average, the region is chosen as visually attractive.

Fig. 4 Flow diagram of the visual distinctive region decision

The filtering in the frequency domain can equivalently be implemented as a 2-D convolution in the spatial domain. In the example in Fig. 4, observers clearly focus only on the airplane, and the visual distinctive region decision easily identifies this region of interest.
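
As a minimal sketch of this decision, assuming the high-pass response is computed by subtracting a Gaussian-blurred copy of the luma plane (the filter width and window size below are illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def visual_distinctive_regions(luma, sigma=2.0, win=16):
    """Flag texture-rich regions: Gaussian high-pass response whose
    local average exceeds the global average over the whole frame."""
    luma = luma.astype(np.float32)
    highpass = np.abs(luma - gaussian_filter(luma, sigma))  # high-pass
    local_avg = uniform_filter(highpass, size=win)          # local mean
    return local_avg > highpass.mean()   # True = visually attractive
```

Pixels flagged True would then be refined by TME, while the remaining pixels keep the encoder MVs.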

3.2 Multi-pass true motion estimation and object-based MV refinement

When the MV estimated by multi-pass TME has converged, in order to obtain more accurate pixel-based MVs, it is necessary to check whether the block contains different objects. If the corresponding block in the depth map contains widely varying depth values, the block tends to cover different objects. In that case the block should be partitioned into two objects and the MV of each object estimated separately. The Sobel operator is used as an edge detector to compute the gradient of each pixel in the depth map, and a simple thresholding method is then used to separate the objects.

For each converged block BC in TME, if an edge is present in the depth map, the edge map G(x, y) is used to calculate an adaptive threshold Tr that divides the block into foreground and background parts. The object map O(x, y) indicates the two objects and is defined as follows:

$$ O(x,y) = \begin{cases} 1 & \text{for } depth(x,y) > T_r \\ 0 & \text{for } depth(x,y) \le T_r \end{cases} $$
(4)

To avoid inaccurate object segmentation, the threshold is determined by the gradient-weighted average of the depth map, defined as follows:

$$ T_r = \frac{1}{\sum_{(x,y) \in B_C} G(x,y)} \sum_{(x,y) \in B_C} G(x,y) \times depth(x,y) $$
(5)
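
The segmentation of Eqs. (4) and (5) can be sketched as follows, taking G(x, y) as the Sobel gradient magnitude of the depth block, as described above (the single-object shortcut for edge-free blocks is our addition):

```python
import numpy as np
from scipy.ndimage import sobel

def split_block_by_depth(depth_block):
    """Split a converged block into foreground (1) and background (0)
    with the gradient-weighted depth threshold of Eqs. (4)-(5)."""
    d = depth_block.astype(np.float32)
    # Edge map G(x, y): Sobel gradient magnitude of the depth block.
    g = np.hypot(sobel(d, axis=0), sobel(d, axis=1))
    if g.sum() == 0:
        # No depth edge: treat the whole block as a single object.
        return np.zeros(d.shape, dtype=np.uint8)
    t_r = (g * d).sum() / g.sum()        # Eq. (5)
    return (d > t_r).astype(np.uint8)    # Eq. (4)
```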

The MVs for the foreground and background are chosen from their candidate MV sets, which contain the MV of the converged block itself and the object-based MVs of the neighbors that converged in the previous TME pass. With the candidate MV sets for foreground and background denoted MVSf and MVSb, the object MVs are selected as follows:

$$ MV_f = \underset{MV \in MVS_f}{\arg\min}\ SAD(MV) $$
(6)
$$ MV_b = \underset{MV \in MVS_b}{\arg\min}\ SAD(MV) $$
(7)

As shown in Fig. 5, the neighboring blocks may be converged blocks BC and/or un-converged blocks BUC, and only the MVs of converged blocks BC can be included in the candidate MV sets.

Fig. 5 An example of the composition of the candidate MV sets

In Fig. 5, no TMV is estimated for the blank blocks; BC and BUC denote the converged and un-converged blocks in TME, respectively. In this example, MVSf and MVSb are set as:

$$ MVS_f = \left\{ MV_f, MV_{2f}, MV_{3f} \right\} $$
(8)
$$ MVS_b = \left\{ MV_b, MV_{2b}, MV_{3b} \right\} $$
(9)
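
A sketch of the selection rule in Eqs. (6) and (7), assuming each object is represented by a list of its pixel coordinates and the candidate sets of Eqs. (8) and (9) are given as lists of MVs (helper names are hypothetical):

```python
import numpy as np

def object_sad(cur, ref, coords, mv):
    """SAD over one object's pixel coordinates for a candidate MV."""
    dx, dy = mv
    h, w = ref.shape
    total = 0
    for x, y in coords:
        xr, yr = x + dx, y + dy
        if not (0 <= xr < w and 0 <= yr < h):
            return np.inf          # candidate leaves the frame
        total += abs(int(cur[y, x]) - int(ref[yr, xr]))
    return total

def pick_object_mv(cur, ref, coords, candidates):
    """Eqs. (6)-(7): the candidate MV with minimum SAD wins."""
    return min(candidates, key=lambda mv: object_sad(cur, ref, coords, mv))
```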

3.3 Object-based MV extrapolation and overlapped regions processing

Extrapolating the objects in the reference frame can reconstruct the missing frame, but overlapped and uncovered regions may occur. As shown in Fig. 6, a block of the reference frame (t-1) is projected along its MV to conceal a block region of the current frame (t).

Fig. 6 Motion vector extrapolation

Therefore, it must be decided which MV should be chosen for an overlapped region. In conventional methods such as HMVE, the MV of an overlapped region is determined by weighted averaging of the overlapping MVs. In depth-based 3-D video, however, the depth information can be utilized to resolve the overlap: the MV associated with the larger depth value, i.e., the foreground object, is chosen. Figure 7 shows blocks of the current frame covered by zero or more than one motion vector.
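
A minimal sketch of this depth-based overlap rule, assuming a per-pixel forward MV field for the reference frame and the convention of Eq. (4) that larger depth values mark the foreground:

```python
import numpy as np

def extrapolate_frame(ref, ref_depth, mv_field):
    """Project every reference pixel along its MV into the lost frame;
    on overlap, keep the pixel with the larger depth (foreground)."""
    h, w = ref.shape
    out = np.full((h, w), -1, dtype=np.int32)   # -1 marks "uncovered"
    best_depth = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            dx, dy = mv_field[y, x]
            xt, yt = x + dx, y + dy             # extrapolated position
            if 0 <= xt < w and 0 <= yt < h and ref_depth[y, x] > best_depth[yt, xt]:
                best_depth[yt, xt] = ref_depth[y, x]
                out[yt, xt] = ref[y, x]
    return out   # pixels still equal to -1 form the uncovered regions
```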

Fig. 7 Motion vector extrapolation situations

3.4 Motion compensation

Motion compensation predicts a frame of a video from previous frames by accounting for the motion of the camera or of the objects in the video. A common form of motion compensation is:

$$ P_t(x,y) = P_{t-1}(x + MV_x,\ y + MV_y) $$
(10)

where Pt(x,y) denotes the pixel value of frame t at position (x,y), and Pt-1(x + MVx, y + MVy) denotes the pixel value of frame t-1 at position (x + MVx, y + MVy). MVx and MVy are the displacements in the x and y directions. Each pixel of the current frame (t) can thus be recovered from the reference frame (t-1) according to its MV.
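
A direct sketch of Eq. (10) with a per-pixel MV field; the border clamping is our addition:

```python
import numpy as np

def motion_compensate(ref, mv_field):
    """Eq. (10): each pixel of frame t is read from frame t-1 at the
    position displaced by that pixel's MV (clamped to the borders)."""
    h, w = ref.shape
    out = np.empty_like(ref)
    for y in range(h):
        for x in range(w):
            mvx, mvy = mv_field[y, x]
            xs = min(max(x + mvx, 0), w - 1)
            ys = min(max(y + mvy, 0), h - 1)
            out[y, x] = ref[ys, xs]
    return out
```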

The missing frame can thus be reconstructed from the reference frame by extrapolating the MVs. However, the reconstructed frame still contains some uncovered regions that need processing, which is discussed in the next subsection.

3.5 Uncovered regions processing

A simple edge detection is used to find the edge direction. As shown in Fig. 8, each pixel pair marked with the same color and shape is subtracted, and the pair with the minimum difference is taken as the edge direction. The center pixel is then interpolated along the direction with the minimal difference value. If the uncovered region is larger than 10 × 10 pixels, the MVs at the same location in the reference frame are used for compensation instead.
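
For one uncovered pixel, the direction test can be sketched as follows, assuming four opposite pixel pairs at a small fixed radius (the exact pair geometry of Fig. 8 may differ):

```python
import numpy as np

def directional_interpolate(frame, covered, x, y, radius=2):
    """Fill one uncovered pixel: among opposite pixel pairs around
    (x, y), average the pair with the smallest absolute difference."""
    h, w = frame.shape
    directions = [(1, 0), (0, 1), (1, 1), (1, -1)]   # 0, 90, 45, 135 deg
    best_val, best_delta = None, np.inf
    for dx, dy in directions:
        xa, ya = x + radius * dx, y + radius * dy
        xb, yb = x - radius * dx, y - radius * dy
        if not (0 <= xa < w and 0 <= ya < h and 0 <= xb < w and 0 <= yb < h):
            continue
        if not (covered[ya, xa] and covered[yb, xb]):
            continue   # both pixels of the pair must be recovered already
        delta = abs(int(frame[ya, xa]) - int(frame[yb, xb]))
        if delta < best_delta:
            best_delta = delta
            best_val = (int(frame[ya, xa]) + int(frame[yb, xb])) // 2
    return best_val   # None if no valid pair exists
```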

Fig. 8 The proposed pixel-based spatial directional interpolation

4 Simulation results

The proposed method is simulated using the JM10.1 H.264 codec. Several 3-D test sequences, airshow, nature, watermolen, EcotoneEuropa, breakdancer, and ballet, are used to evaluate its performance. The sequences are in 4:2:0 YUV format and encoded with quantization parameters QP = 24 and QP = 30. The GOP size is 15 frames, with one I-frame and 14 P-frames that reference only one frame. In the simulation, one frame is assumed to be lost in each GOP. The frames are at two resolutions, 960 × 540 and 1024 × 768, and the frame rate is 30 frames/s. Each frame contains only one slice.

To evaluate the effectiveness of the proposed method, Figs. 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, and 19 compare the peak signal-to-noise ratio (PSNR) of the proposed method, the HMVE method described in the background, and frame copy (FC).

Fig. 9 “Airshow” sequence PSNR comparison versus frame number with QP = 24

Fig. 10 “Airshow” sequence PSNR comparison versus frame number with QP = 30

Fig. 11 “Nature” sequence PSNR comparison versus frame number with QP = 24

Fig. 12 “Nature” sequence PSNR comparison versus frame number with QP = 30

Fig. 13 “Watermolen” sequence PSNR comparison versus frame number with QP = 24

Fig. 14 “Watermolen” sequence PSNR comparison versus frame number with QP = 30

Fig. 15 “EcotoneEuropa” sequence PSNR comparison versus frame number with QP = 24

Fig. 16 “EcotoneEuropa” sequence PSNR comparison versus frame number with QP = 30

Fig. 17 “Breakdancer” sequence PSNR comparison versus frame number with QP = 30

Fig. 18 “Ballet” sequence PSNR comparison versus frame number with QP = 24

Fig. 19 “Ballet” sequence PSNR comparison versus frame number with QP = 30

The natural sequences such as “airshow”, “nature”, and “watermolen” contain many unfocused regions in which objects cannot be well defined; even so, the object-based method achieves better quality than the HMVE algorithm. The “EcotoneEuropa” sequence is colorful and its objects are distinct, so they are concealed well by the object-based algorithm. “Breakdancer” and “ballet” are well-known test sequences; although their depth maps are not estimated very accurately, the object-based method still achieves good PSNR and visual quality on them.

In Fig. 20, pixels with the same color are detected as belonging to the same object and are therefore assigned the same motion vector. The proposed algorithm preserves the object shape and produces fewer blocking artifacts.

Fig. 20 The object map generated by OFFR

A representative case, the 49th to 60th frames of the “ballet” sequence with QP = 30, shows that the proposed algorithm provides better error propagation control. Figure 21 compares our method and HMVE in detail. In this case, the object-based method has a large advantage over HMVE: thanks to the correct object detection, the ballet dancer is concealed with a single MV and without cracks.

Fig. 21 “Ballet” sequence PSNR comparison versus frame number with QP = 30

Table 1 presents the average PSNR of the different methods for each test sequence; the proposed method consistently achieves higher PSNR. The experimental results show that the proposed method significantly outperforms HMVE and FC, by up to 2.848 dB and 5.0134 dB at QP = 24, and by 1.5536 dB and 4.3068 dB on average, respectively. The table also indicates that the “EcotoneEuropa” sequence yields the largest PSNR gain of all sequences at QP = 24 and QP = 30. The reason is that the coded motion vectors used by HMVE are not accurate enough for this sequence, whereas the TMVs correct this problem.

Table 1 Comparison of the average PSNR performance (evaluated over the whole sequence)

The results in Table 1 demonstrate that the proposed algorithm preserves object shapes according to the depth map information. The average PSNR in Table 1 is evaluated over the whole sequence; that is, all error-free frames and the reconstructed frame of each test sequence are included. Such an evaluation makes it difficult to isolate the quality of the reconstructed frame for each algorithm. To concentrate on the quality of the reconstructed frame, Table 2 compares HMVE, [15], and the proposed algorithm with only the reconstructed frames included in the evaluation. In general, the proposed algorithm performs better than the other methods. Owing to the TME algorithm, the proposed algorithm achieves its largest gain on the “EcotoneEuropa” sequence.

Table 2 Comparison of the average PSNR performance (evaluated over the reconstructed frames only)

As shown in Fig. 22, the cockpit and wing of the aircraft appear rough at the edges with the HMVE algorithm and the method of [15]. In contrast, OFFR yields better visual quality along the same edges.

Fig. 22 Comparison of the visual quality of HMVE, [15], and the proposed method in detail

5 Conclusion

In this paper, we propose an efficient temporal-spatial full-frame reconstruction algorithm for 3-D video in H.264/AVC based on object-based TME. For the objects of interest, object-based MVs are computed and used to reconstruct the missing frame. First, considering the human visual system and computational complexity, TME is applied only to textured regions. For these distinctive regions, the depth map is utilized in TME to extend the block-based MVs to object-based MVs. The damaged frame is then reconstructed by extrapolating the object-based MVs from TME for the regions of interest and the block-based MVs from the encoder for the remaining parts. In case of overlap, the MV with the larger depth value is chosen instead of weighted averaging the overlapping MVs as in HMVE. Finally, spatial and temporal correlations are applied to small and large uncovered regions, respectively. The proposed method combines all of this information in a pixel-based algorithm that conceals the entire frame with true motion estimation, achieving higher quality.

The core of the proposed method is to identify objects from temporal and depth information. Performance comparisons in PSNR with other algorithms show that our approach achieves better visual quality and PSNR in most cases. TME plays an important role in the proposed algorithm, but it has high computational complexity; reducing the complexity of TME is therefore an important direction for future work.