1 Introduction

Face alignment is defined as determining the position of a set of known facial points across different subjects, illuminations, expressions and poses [4]. 3D face alignment in the wild is defined as determining the position of these landmarks in the 3D space given only a 2D image acquired in unconstrained environments. This information can be used for several computer vision applications, such as face recognition [11], pose estimation [7], face tracking [14, 15], 3D face reconstruction [1] and expression transfer [12, 13].

Recent face alignment work can be divided into 2D and 3D methods. Zhu and Ramanan [19] use mixtures of trees with a shared pool of parts to sparsely align faces even in profile head poses, successfully recovering the positions of 2D landmarks. Ren et al. [8] regress local binary features to perform 2D sparse face landmark estimation at 3000 frames per second. Jeni et al. [4] achieve state-of-the-art real-time 3D dense face alignment by fitting a 3D model to images acquired in controlled environments. Jourabloo and Liu [5] propose cascaded coupled-regressors that integrate a 3D point distribution model to estimate sparse 3D face landmarks in extreme poses.

In this paper, we present our entry for the 3D Facial Alignment in the Wild (3DFAW) challenge. Our approach is landmark-free in the sense that it does not need any subject-specific face information, only a detected nose region that is used to estimate the head pose. A generic face landmark model is rotated based on the head pose, then translated and scaled to fit the detected nose. We chose the nose as the basis of our work because it has been shown to be effective for head pose estimation, does not deform much under facial expressions, is rarely occluded by accessories and remains visible even in extreme profile head poses [17]. Our method does not make use of any facial trait specific to the subject and does not take facial expression into account, yet it achieves competitive results. Our approach works well as an initial estimate of the landmark positions, since it only needs the nose region, which can be easily obtained with existing detection methods even in challenging environments.

The 3DFAW challenge presents a set of images and annotations for evaluating the performance of in-the-wild 3D sparse face alignment methods. Part of the data comes from the MultiPIE dataset [2] or from images and videos collected on the internet, with depth information recovered through a dense 3D-from-2D video alignment method [4]. The rest of the data was synthetically generated by rendering the 3D models in the BU-4DFE [16] and BP4D-Spontaneous [18] databases onto different backgrounds. The training data includes the face bounding box and the 3D coordinates of 66 facial landmarks, while the testing data only includes the face bounding box.

The results obtained on the challenge’s dataset are evaluated using two metrics: the Ground Truth Error (GTE) (Eq. 1) and the Cross View Ground Truth Consistency Error (CVGTCE) (Eq. 2), where X is the prediction, Y is the ground truth, \(d_i\) is the Euclidean distance between the corners of the eyes in the i-th image [10], and P is obtained using Eq. 3.

$$E(X,Y) = \frac{1}{N} \sum_{k=1}^{N} \frac{\left\| x_k - y_k \right\|_2}{d_i}$$
(1)
$$E_{vc}(X,Y,P) = \frac{1}{N} \sum_{k=1}^{N} \frac{\left\| (sRx_k + t) - y_k \right\|_2}{d_i}$$
(2)
$$P = \{s,R,t\} = \mathop{\mathrm{argmin}}\limits_{s,R,t} \sum_{k=1}^{N} \left\| y_k - (sRx_k + t) \right\|_2^2$$
(3)
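
For concreteness, the following NumPy sketch shows how these metrics can be computed; `procrustes_params` solves Eq. 3 in closed form via the Kabsch/Umeyama method, and all function names here are ours, not part of the challenge toolkit.

```python
import numpy as np

def procrustes_params(X, Y):
    """Least-squares similarity transform {s, R, t} mapping X onto Y (Eq. 3).

    X, Y: (N, 3) arrays of predicted and ground-truth landmarks.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    # Optimal rotation via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
    R = Vt.T @ D @ U.T
    s = (S * np.diag(D)).sum() / (Xc ** 2).sum()  # optimal scale
    t = mu_y - s * R @ mu_x
    return s, R, t

def gte(X, Y, d):
    """Ground Truth Error (Eq. 1); d is the inter-ocular distance of the image."""
    return np.linalg.norm(X - Y, axis=1).mean() / d

def cvgtce(X, Y, d):
    """Cross View Ground Truth Consistency Error (Eq. 2)."""
    s, R, t = procrustes_params(X, Y)
    aligned = s * (R @ X.T).T + t
    return np.linalg.norm(aligned - Y, axis=1).mean() / d
```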

This paper is structured as follows: Sect. 2 explains our method in detail, with attention to each step; Sect. 3 presents and explains our results on the 3DFAW challenge; and Sect. 4 includes final remarks.

2 Our Approach

Our method is composed of seven steps, four offline and three online:

1. A nose detector is trained;
2. An average face model is generated to be used as a template and for calibrating the pose;
3. Ground-truth head poses are extracted;
4. A head pose estimator is trained;
5. Nose detection is performed;
6. The detected region is used for estimating the head pose;
7. The face model is adjusted to the pose and fitted using the nose for alignment.

The simplicity of the online steps is outlined in Fig. 1.

Fig. 1. Overview of our method

2.1 Landmark Model and Head Pose Ground-Truth Generation

A near-frontal image (0\(^\circ \) head yaw, pitch and roll) from the training subset was chosen for calibrating all other images (Fig. 2(a)); it is defined as having no rotation on any of the three axes.

For generating the ground-truth head pose, an affine transformation is computed between the landmarks of each image in the training subsets and the landmarks of the calibration image (Fig. 2(b)), producing a transformation matrix that includes translation, scale and rotation. The Euler angles are extracted from this matrix and defined as the ground-truth head yaw, pitch and roll.
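
As an illustration, the rotation component of such a transform can be decomposed into Euler angles as below. The paper does not state its angle convention, so the yaw-pitch-roll order assumed here (R = Ry·Rx·Rz) is only one plausible choice.

```python
import numpy as np

def euler_from_rotation(R):
    """Decompose a rotation matrix into (yaw, pitch, roll) in degrees,
    assuming R = Ry(yaw) @ Rx(pitch) @ Rz(roll)."""
    yaw = np.degrees(np.arctan2(R[0, 2], R[2, 2]))
    pitch = np.degrees(np.arcsin(-R[1, 2]))
    roll = np.degrees(np.arctan2(R[1, 0], R[1, 1]))
    return yaw, pitch, roll

# Usage sketch: align the calibration landmarks to an image's landmarks
# (e.g. with the procrustes_params helper above) and decompose the result.
# s, R, t = procrustes_params(calibration_landmarks, image_landmarks)
# yaw, pitch, roll = euler_from_rotation(R)
```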

The landmark model (Fig. 2(c)) is generated by applying the aforementioned transformation to the training data, normalizing the scale based on the distance between landmarks 32 and 36 at the base of the nose, and averaging the position of each landmark. The resulting model roughly represents the average face of the dataset with a neutral expression.
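
A minimal sketch of this model-building step, assuming 0-based landmark arrays (so landmarks 32 and 36 correspond to indices 31 and 35):

```python
import numpy as np

def build_landmark_model(all_landmarks):
    """Average face model from pose-normalized training landmarks.

    all_landmarks: list of (66, 3) arrays already rotated into the
    calibration pose (Sect. 2.1).
    """
    normalized = []
    for pts in all_landmarks:
        pts = pts - pts.mean(axis=0)              # remove translation
        nose_width = np.linalg.norm(pts[31] - pts[35])
        normalized.append(pts / nose_width)       # normalize scale
    return np.mean(normalized, axis=0)            # per-landmark average
```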

Fig. 2. (a) Calibration image; (b) Calibration landmarks; (c) Landmark model viewed with the calibration pose

2.2 Nose Detection

Ground-truth nose regions are extracted by cropping the training images around the nose landmarks. Faster R-CNN [9] is trained with all training images in the 3DFAW dataset and used for detection. Faster R-CNN introduces a Region Proposal Network, which generates both candidate bounding boxes and detection confidence scores.

When processing the testing images, the detected region with the highest confidence score is selected. If Faster R-CNN yields no candidates, a region at the center of the image is selected.
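
A sketch of this step using torchvision's Faster R-CNN implementation as a stand-in for the original one; the checkpoint path and the size of the central fallback region are illustrative assumptions, not values from the paper.

```python
import numpy as np
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Two classes: background and nose. The checkpoint name is hypothetical.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=2)
model.load_state_dict(torch.load("nose_detector.pth"))
model.eval()

def detect_nose(image):
    """Return the highest-scoring nose box (x1, y1, x2, y2) for a PIL image,
    falling back to a central region when no candidate is produced."""
    with torch.no_grad():
        pred = model([to_tensor(image)])[0]
    if len(pred["boxes"]) > 0:
        return pred["boxes"][pred["scores"].argmax()].numpy()
    w, h = image.width, image.height
    return np.array([0.4 * w, 0.4 * h, 0.6 * w, 0.6 * h])  # illustrative size
```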

2.3 Head Pose Estimation

For head pose estimation, the CNN variant of NosePose [17] is applied. It uses a network similar to those crafted for face recognition [3, 6], modified to be trainable with a smaller number of images and to support the smaller nose regions.

Training was performed using all images in all training subsets. The ground-truth head pose was discretized for both yaw and pitch in steps of 7.5\(^\circ \); the yaw ranges from −60\(^\circ \) to 60\(^\circ \) and the pitch from −52.5\(^\circ \) to 52.5\(^\circ \), relative to the calibration image. The roll is not estimated, as only small variations are present in the dataset. These values were all determined empirically, and a single network is trained to estimate both yaw and pitch simultaneously.
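
The discretization above yields 17 yaw and 15 pitch classes. A toy PyTorch sketch of a two-headed classifier in this spirit follows; the real NosePose architecture [17] is deeper and differs in its details.

```python
import torch
import torch.nn as nn

YAW_BINS = torch.arange(-60.0, 60.1, 7.5)      # 17 classes
PITCH_BINS = torch.arange(-52.5, 52.6, 7.5)    # 15 classes

class TwoHeadPoseNet(nn.Module):
    """Illustrative classifier predicting yaw and pitch bins jointly."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.yaw_head = nn.Linear(64 * 16, len(YAW_BINS))
        self.pitch_head = nn.Linear(64 * 16, len(PITCH_BINS))

    def forward(self, x):
        f = self.features(x)
        return self.yaw_head(f), self.pitch_head(f)

def pose_from_logits(yaw_logits, pitch_logits):
    """Map the winning bins back to angles in degrees."""
    return (YAW_BINS[yaw_logits.argmax(dim=1)],
            PITCH_BINS[pitch_logits.argmax(dim=1)])
```

Training such a network would minimize the sum of two cross-entropy losses, one per head.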

2.4 Model Fitting

Given the detected nose region, an optional face bounding box and the estimated head pose, the landmark model (Sect. 2.1) is fitted to the face according to Algorithm 1. While the face bounding box allows a slightly more precise scaling of the model, it is not required, as the size of the face can be inferred from the size of the nose.

Algorithm 1

Figure 3 contains examples of this process, including the detected nose region and face bounding box. If the face bounding box is given, three error-minimizing constants, one per axis, are used when scaling the model. The position used for aligning the model with the nose region and the corresponding scaling factor were also determined empirically.
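
A simplified sketch of the fitting, assuming 0-based landmark indices (nose tip at index 30, base-of-nose points at indices 31 and 35) and collapsing the paper's empirically determined per-axis constants into a single nose-width ratio:

```python
import numpy as np

def rot_y(deg):
    a = np.radians(deg)
    return np.array([[np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def rot_x(deg):
    a = np.radians(deg)
    return np.array([[1, 0, 0],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a), np.cos(a)]])

def fit_model(model_pts, yaw, pitch, nose_box):
    """Rotate, scale and translate the average model (sketch of Algorithm 1).

    model_pts: (66, 3) average model; nose_box: (x1, y1, x2, y2).
    """
    pts = (rot_y(yaw) @ rot_x(pitch) @ model_pts.T).T   # apply estimated pose
    # Scale so the model's nose width matches the detected nose width.
    scale = (nose_box[2] - nose_box[0]) / np.linalg.norm(pts[31] - pts[35])
    pts = pts * scale
    # Translate so the nose tip sits at the centre of the nose box
    # (z is shifted to be relative to the nose tip).
    centre = np.array([(nose_box[0] + nose_box[2]) / 2,
                       (nose_box[1] + nose_box[3]) / 2, 0.0])
    return pts + (centre - pts[30])
```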

Fig. 3. Example results of the model fitting stage, showing the face bounding box in red, the detected nose region in blue and the estimated positions of the landmarks in green. (a) Near-frontal good fit; (b) bad fit caused by poor head pitch estimation; (c) and (d) half-profile good fits; (e) good fit in an image sourced from the MultiPIE dataset; (f) modest fit in one of the most challenging images in the dataset (Color figure online)

3 Experimental Results

In order to assess the performance of our nose detection step, the results for all 4,912 images in the testing subset were verified manually. Only nine images failed to yield detections; however, the wrong region was detected in four images and the nose of the wrong subject in one. The detection was accurate in \(99.71\,\%\) of the images. This high rate is expected, as a state-of-the-art detection method was used and all images are high resolution, with very few of them exhibiting blur or lighting variations. Figure 4 illustrates three cases where the detector failed.

Fig. 4. Images where the nose detection (blue box) failed: (a) wrong nose detected; (b) false positive; (c) no detection (Color figure online)

Isolating and assessing the head pose estimation performance is not possible, as the ground-truth landmarks were not made available and visually verifying the correctness of a head pose is difficult for humans [17]. Its performance, however, is reflected in the final landmark estimation score, which was calculated by the challenge’s web system. Our method achieved a CVGTCE of 5.9093 and a GTE of 10.8001 when scaling the model according to the size of the face bounding box.

Fig. 5. Visual comparison between the two model scaling methods. Results obtained using the detected nose region are on the left; those obtained with the face bounding box are on the right. (a) and (b) the model fitted with the nose is noticeably larger; (c) and (d) using the nose yielded better results

Due to the limited number of allowed submissions to the web system and the inability to assess our method’s performance locally, we do not present a quantitative evaluation for the case where the size of the nose region is used to infer the size of the face. However, visual inspection of the results shows that they are consistent with those obtained using the face bounding box. Figure 5 presents examples of our results using both scaling approaches, for comparison.

4 Final Remarks

We presented a nose-region-based approach for in-the-wild 3D landmark estimation, our entry for the 3D Face Alignment in the Wild Challenge. A generic face landmark model is generated from the information in the challenge’s training subset. A high-performance, state-of-the-art nose detector and head pose estimator are trained using nose regions extracted from the landmark annotations. The detected nose is used to estimate the head pose and to project the rotated landmark model according to the size of the face, inferred either from the size of the nose or from the available face bounding box. Competitive results were achieved while taking neither facial expressions nor subject-specific facial traits into account. Because only the nose region is needed, our estimation has the potential to provide useful initial landmark positions that can then be refined by finer face alignment methods.