1 Introduction

Here at Human Solutions of North America, we have developed a novel multi-tiered approach to detecting humanoid subjects and estimating their poses from a monocular system, using state-of-the-art deep learning architectures and extensive domain knowledge gained through our commercial body scanners and proprietary Size North America data. Using a U-Net architecture, we segment an image to classify which pixels belong to a humanoid and which belong to the background. The U-Net architecture is well suited to this task and is widely regarded as the state of the art for image segmentation. It is an encoder-decoder architecture that introduced a technique called a skip step, which propagates feature locality through the network so that each pixel can be classified by the kind of subject it belongs to. We then clip each detected subject and pass the clipped image into a Convolutional Neural Network (CNN) to infer demographic information. This portion of the approach allows us to pick a good “initial guess” as to the structure of the subject. We extract information such as race, age, weight, and body morphology. We then choose a homologous mesh that has been statistically generated from our Size North America database for that particular demographic. The Size North America database consists of submillimeter-precision three-dimensional body scans of approximately 18,000 subjects distributed evenly across various demographics. This database allows us to produce statistically representative three-dimensional meshes of each demographic across multiple morphologies. Finally, we pass the homologous mesh into a deep neural network and produce a final mesh that represents the pose of the subject. This last step acts as a regressor and deforms the homologous mesh to fit the initial body pose of the subject.

This novel approach allows us to estimate the pose of multiple subjects within view of a monocular system and to infer a globally plausible body shape for occluded portions of a subject. It also opens the door to soft-body simulation of subjects within an image. Applications of this methodology are wide-ranging, from three-dimensional scene reconstruction and point-of-view visualization to high-fidelity motion capture from low-cost systems.

2 Materials and Methods

2.1 Image Segmentation

Training a deep encoder-decoder neural network is rather tricky because the requirements of the neural network conflict with the drawbacks of backpropagation. The U-Net architecture requires a maximization of information for semantic segmentation to be successful. This means that the standard methods of model regularization can no longer be utilized.

One major issue with deep neural networks is their tendency to overfit, which is due to their large parameter space. The standard way to combat this issue is dropout: during training, a stochastic process zeroes the outputs of randomly chosen neurons, which also stops gradients from propagating backwards through them. This effectively kills neurons and forces the neural network to perform at a deficit. Many have theorized that this method causes the neural network to generate strong sub-classifiers in earlier layers, which the late-stage layers then ensemble to produce a final prediction. Unfortunately, dropout would discard structural information that acts as input for later layers in the decoder network. Therefore, this method cannot be used.

To reduce computational cost, many neural networks employ a max pooling layer, whereby neurons of a previous layer are pooled together into a single neuron by taking the highest output signal from the group. This reduces computational complexity while preserving the gross structure of the information. Unfortunately, local adjacency information is not preserved with this technique, and fine image details that are important for classifying humans are lost.

We initially take a 512 × 256 three-channel image, referred to as the source image, and pass it through a specialized “encoder-decoder” convolutional neural network referred to as a U-Net architecture [1]. The U-Net architecture introduces a tensor concatenation operator that allows structural information about identified classes to propagate through the neural network, where it is used to reconstruct a pixel-wise classification tensor. This concatenation operation is referred to as a “skip-step”. Because we are dealing entirely with rank-three tensors, the concatenation operations occur along the third axis, or channels axis, and are computationally cheap (Fig. 1).

Fig. 1. Shows the general architecture of the U-Net convolutional neural network. The encoder network is on the left-hand side. The latent tensor representation of the source image is at the bottom center. The decoder side of the neural network is on the right-hand side. In the center are the concatenation operations that allow the structural information of the source image to propagate.
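To make the skip-step concrete, the following is a minimal sketch of concatenation along the channels axis, assuming feature maps stored as nested lists indexed as height, width, channel. The function name `concat_channels` and the toy values are hypothetical and are not taken from our implementation.

```python
def concat_channels(encoder_map, decoder_map):
    """Concatenate two H x W x C feature maps along the channel (third) axis."""
    same_height = len(encoder_map) == len(decoder_map)
    same_width = same_height and len(encoder_map[0]) == len(decoder_map[0])
    if not same_width:
        raise ValueError("spatial dimensions must match")
    # Each pixel keeps its position; only its channel list grows.
    return [
        [enc_px + dec_px for enc_px, dec_px in zip(enc_row, dec_row)]
        for enc_row, dec_row in zip(encoder_map, decoder_map)
    ]


# Toy 2 x 2 maps: 3 encoder channels and 2 decoder channels per pixel.
encoder_map = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
decoder_map = [[[0, 0], [0, 1]], [[1, 0], [1, 1]]]
merged = concat_channels(encoder_map, decoder_map)
```

No arithmetic is performed on the values, which is why the operation is cheap: the encoder features are simply carried across and placed next to the decoder features at the same spatial location.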

To reduce computational complexity, we employ a strided convolution that acts similarly to max pooling. The difference is that our kernel size is always larger than the stride. This allows us to include adjacent information that is outside the “pooling” region while reducing the number of computations by a power of two.
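The difference between the two downsampling operations can be sketched in one dimension. This is an illustrative example only: the kernel weights, stride, and signal are assumptions, no padding is applied, and in the network itself the kernel weights are learned.

```python
def max_pool_1d(signal, size):
    """Non-overlapping max pooling: each output sees only its own window."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def strided_conv_1d(signal, kernel, stride):
    """Strided convolution without padding; windows overlap when len(kernel) > stride."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(0, len(signal) - k + 1, stride)
    ]


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0, 2.0]
pooled = max_pool_1d(signal, size=2)
strided = strided_conv_1d(signal, kernel=[0.25, 0.5, 0.25], stride=2)
```

With a kernel of size three and a stride of two, neighbouring output positions share one input sample, so information adjacent to each "pooling" region still contributes to the output.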

Since we are unable to use dropout to regularize our neural network, we instead stream subsets of our original dataset during training, a method also referred to as incremental learning [1]. The Common Objects in Context (COCO) dataset [2] includes 330 thousand images that are semantically labeled by object class. The dataset is excellently curated and provides a large variety of examples to train on (Fig. 2).

Fig. 2. The iterative training process allows us to define a dynamic set of images to train on. This removes the issue of overfitting without having to perform dropout or other model regularization techniques.
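One possible way to stream a dynamic training set is sketched below. The sampling strategy (uniform sampling without replacement within each round), the subset size, and the name `stream_subsets` are assumptions made for illustration; they do not describe the exact procedure used in training.

```python
import random


def stream_subsets(dataset, subset_size, rounds, seed=0):
    """Yield a freshly sampled subset of the dataset for each training round."""
    rng = random.Random(seed)
    for _ in range(rounds):
        # No image repeats within a round, but rounds may overlap each other.
        yield rng.sample(dataset, subset_size)


image_ids = list(range(1000))  # stand-in for image identifiers
subsets = list(stream_subsets(image_ids, subset_size=100, rounds=5))
```

Because the network never sees the same fixed set of images for long, it has less opportunity to memorize individual examples.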

Our activation function, which provides the non-linear capacity of our neural network, was chosen specifically to remove the need for batch normalization [3]. SELU, or scaled exponential linear unit, belongs to a class of self-normalizing activation functions. This activation function allowed us to remove additional normalization layers without losing the benefit that normalization brings in mitigating the vanishing gradient problem.

SELU is defined as

$$ f\left( x, \alpha \right) = \lambda \begin{cases} \alpha \left( e^{x} - 1 \right), & x < 0 \\ x, & x \ge 0 \end{cases} $$

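where \( \lambda \) is a scaling factor that boosts gradient propagation. In the standard formulation of SELU, \( \lambda \approx 1.0507 \) and \( \alpha \approx 1.6733 \) are fixed constants rather than learned parameters.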
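As a minimal sketch, the definition above can be evaluated for a scalar input as follows. The defaults are the commonly published constants for \( \alpha \) and \( \lambda \); both are exposed as arguments in case different values are preferred.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def selu(x, alpha=SELU_ALPHA, lam=SELU_LAMBDA):
    """Scaled exponential linear unit for a scalar input."""
    if x >= 0:
        return lam * x
    return lam * alpha * (math.exp(x) - 1.0)
```

For large negative inputs the output saturates at \( -\lambda\alpha \), while positive inputs are scaled by \( \lambda \).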

2.2 Clipping

Once a class is identified within the source image we must clip the class object into a separate image to extract demographic information. This clipped form of the image isolates the subject from external sources of information that may add undue noise during the subsequent processes.

Clipping is performed by applying a mask to the output of a Canny edge filter that is preceded by a low-pass filter. We first convolve the source image with a Gaussian kernel to remove high-frequency information. This has the effect of reducing the number of possible edges, as shown in Fig. 3.

Fig. 3. Shows the Canny edge filter applied directly to the source image (top right) versus applied after a low-pass filter operation on the source image (bottom right).
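The effect of the low-pass step can be illustrated on a single row of pixel intensities. This sketch deliberately swaps the Canny filter for a simple gradient threshold so that it stays short. The kernel radius, sigma, threshold, and pixel values are all assumptions chosen for illustration.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(row, kernel):
    """Convolve a row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), len(row) - 1)
            acc += w * row[idx]
        out.append(acc)
    return out


def count_edges(row, threshold):
    """Count neighbouring pixel pairs whose intensity difference exceeds the threshold."""
    return sum(1 for a, b in zip(row, row[1:]) if abs(b - a) > threshold)


# A noisy dark region followed by a noisy bright region (one true boundary).
row = [10, 40, 12, 38, 11, 39, 200, 230, 205, 228, 202, 231]
kernel = gaussian_kernel(radius=2, sigma=1.0)
raw_edges = count_edges(row, threshold=20)
smooth_edges = count_edges(smooth(row, kernel), threshold=20)
```

On the raw row every neighbouring pair exceeds the threshold because of the high-frequency texture. After smoothing, only the transitions around the true boundary remain above it.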

Once we extract the edges, we multiply the edge image pixel-wise by our region proposal. The operation yields a very clean image that contains only the subject, which is passed on to later processes (Fig. 4).

Fig. 4. Edge masking allows us to focus on edges that we think belong to a human.
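The masking step amounts to an element-wise product between a binary edge map and a binary region proposal. A minimal sketch with toy data follows; the function name `mask_edges` and the example arrays are hypothetical.

```python
def mask_edges(edge_map, region_mask):
    """Keep only the edges that fall inside the proposed region (element-wise product)."""
    return [
        [e * m for e, m in zip(edge_row, mask_row)]
        for edge_row, mask_row in zip(edge_map, region_mask)
    ]


edge_map = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
]
region_mask = [  # 1 where the segmentation network proposes a person
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
subject_edges = mask_edges(edge_map, region_mask)
```

Edges in the first and last columns lie outside the proposed region and are set to zero, while edges inside the region pass through unchanged.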

2.3 Demographics Estimation

The demographics of a detected subject play an important role in selecting the right initial conditions for the mesh regression procedure. We extract the demographics of a subject using three convolutional neural networks, each responsible for predicting one of age, race, and gender. The CNNs use two principles to achieve better than human performance when classifying demographics: a decaying dropout rate and an expanding kernel size.

To regularize the neural networks and prevent overfitting, we employ a high dropout rate in the earlier stages of the neural network and a low dropout rate in the later stages. This encourages strong subnetworks for extracting low-level features to form in the earlier stages, while the later stages act as an ensembling mechanism. Secondly, we expand the kernel sizes with depth to capture local features within the image at earlier stages and global features in later stages.
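The two principles can be expressed as per-layer schedules. The sketch below assumes a linear decay of the dropout rate and a fixed increment of the kernel size. The number of layers and all numeric values are hypothetical and are not the hyperparameters of our networks.

```python
def dropout_schedule(num_layers, first_rate, last_rate):
    """Linearly decaying dropout rate from the first layer to the last."""
    if num_layers == 1:
        return [first_rate]
    step = (first_rate - last_rate) / (num_layers - 1)
    return [first_rate - i * step for i in range(num_layers)]


def kernel_schedule(num_layers, first_size=3, growth=2):
    """Kernel size that grows by a fixed amount with each layer."""
    return [first_size + growth * i for i in range(num_layers)]


rates = dropout_schedule(num_layers=5, first_rate=0.5, last_rate=0.1)
kernels = kernel_schedule(num_layers=5)
```

Each convolutional layer would then be built with its own entry from `rates` and `kernels`, so early layers are heavily regularized with small receptive fields and late layers are lightly regularized with large ones.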

The outputs of the convolutional neural networks are then concatenated to produce a final prediction vector to be used in subsequent steps (Fig. 5).

Fig. 5. Highlights the key architecture of the set of convolutional neural networks that are responsible for extracting demographic information from the subject after clipping.

2.4 Homologous Mesh Generation

During our product development we conducted a size survey called Size North America, which consisted of scanning eighteen thousand diverse subjects using millimeter-precision body scanners. Each subject takes a quick demographic survey and then changes into skin-tight undergarments. They then enter our body scanner, where multi-laser optical measurements are taken across the entire length of the body, producing High Density Point Cloud (HDPC) data. Using proprietary software, we aggregated our HDPC data into statistically representative, vertex-uniform meshes called homologous meshes (Fig. 6).

Fig. 6. Showcases the vertex uniformity of the homologous meshes within our dataset.

2.5 Homologous Mesh Estimation

Given a demographic prediction vector Pi for a particular subject, a reasonable estimate of the subject’s mesh Mi is given by an inner product of the prediction vector with the basis B of the space representing all possible human meshes. We approximate the basis of this space using our homologous meshes extracted from our Size North America survey.

$$ M_{i} = \frac{{P_{i} \cdot B}}{{P_{i} \cdot P_{i} }} $$

Where B is the basis set of meshes defined as

$$ \left\{ B_{g,r,a} \mid B_{g,r,a} \in \mathrm{M}^{n \times 3 \times 160{,}785},\ g \in \mathbb{Z},\ r \in \mathbb{Z},\ a \in \mathbb{Z} \right\} $$

and Pi is the prediction vector defined as

$$ \left\{ P_{i} \mid P_{i} \in \mathbb{R}^{n} \right\}, $$

for a subject i (Fig. 7).

Fig. 7. Shows a sample of our homologous meshes across the demographic range. Starting from the top, we show meshes for Female African Americans, Male African Americans, Female Asians, Male Asians, Female Others, Male Others, Female White, and Male White. Each mesh across a row is a statistically representative model of one of our age group classes. Starting from the left, we show meshes for ages 0–11, 12–17, 18–23, 24–29, 30–35, 36–41, 42–47, 48–53, 54–59, 60–65, 66–71, and 72+ respectively.

In essence this process is a weighted average operation of all the homologous meshes across our demographic classes. The weights are determined by the probabilities produced by the neural network.
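A minimal sketch of such a weighted average over vertex-aligned meshes is given below. It normalizes by the sum of the weights so that the result is a convex combination of the basis meshes; this normalization is an assumption of the sketch. For a probability vector it agrees with a normalization by \( P_{i} \cdot P_{i} \) exactly when the prediction is one-hot. The function name and the two-vertex toy meshes are hypothetical.

```python
def weighted_average_mesh(prediction, basis_meshes):
    """Blend vertex-aligned meshes, each a list of (x, y, z) tuples, by class weight."""
    total = sum(prediction)
    if total == 0:
        raise ValueError("prediction weights must not sum to zero")
    num_vertices = len(basis_meshes[0])
    mesh = []
    for v in range(num_vertices):
        # Vertex uniformity means vertex v is the same landmark in every mesh.
        vertex = tuple(
            sum(p * basis[v][axis] for p, basis in zip(prediction, basis_meshes)) / total
            for axis in range(3)
        )
        mesh.append(vertex)
    return mesh


basis_meshes = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # toy mesh for demographic class 0
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # toy mesh for demographic class 1
]
certain = weighted_average_mesh([1.0, 0.0], basis_meshes)
blended = weighted_average_mesh([0.5, 0.5], basis_meshes)
```

A confident prediction reproduces the mesh of the predicted class, while an uncertain prediction yields a mesh that lies between the candidate classes.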

2.6 Pose Estimation

Pose estimation was accomplished using a Convolutional Neural Network on clipped source images. Preprocessing the image to remove background information allowed us to reduce the complexity of our neural network. Since pre-clipping removes background information, our neural network did not need to learn what a person looks like.

We posit that the pose estimator works by simply regressing a central skeleton into the contour provided. Our neural network’s final layer simply had 22 degrees of freedom. We constructed a constrained skeleton layer based on pre-existing anatomical models which greatly reduced the regression times and improved overall accuracy when compared to a standard dense layer output. Our constraints are defined by medically accepted normal ranges of motion (Tables 1, 2, 3, 4, 5, 6 and 7).

Table 1. Describes the normal range of motion for the hip.
Table 2. Describes the normal range of motion for the knee.
Table 3. Describes the normal range of motion for the ankle.
Table 4. Describes the normal range of motion for the foot.
Table 5. Describes the normal range of motion for the shoulder.
Table 6. Describes the normal range of motion for the elbow.
Table 7. Describes the normal range of motion for the wrist.
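One way to realize a constrained output layer is to squash each unconstrained network output into the permitted interval of its joint. The sketch below assumes a sigmoid squashing function. The joint names and ranges are hypothetical placeholders for illustration and are not the values from Tables 1 to 7.

```python
import math

# Hypothetical ranges in degrees, for illustration only.
JOINT_LIMITS = {
    "knee_flexion": (0.0, 130.0),
    "elbow_flexion": (0.0, 150.0),
}


def constrain(raw_outputs, limits=JOINT_LIMITS):
    """Map unconstrained outputs to joint angles inside their allowed ranges."""
    angles = {}
    for joint, (low, high) in limits.items():
        # Sigmoid written with tanh, which stays stable for large inputs.
        squashed = 0.5 * (1.0 + math.tanh(0.5 * raw_outputs[joint]))
        angles[joint] = low + (high - low) * squashed
    return angles


angles = constrain({"knee_flexion": 0.0, "elbow_flexion": 4.0})
extreme = constrain({"knee_flexion": -1000.0, "elbow_flexion": 1000.0})
```

Whatever value the preceding layers produce, the resulting angle cannot leave the specified range, so the regression only searches over anatomically plausible poses.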

2.7 Rigging Homologous Meshes

Once pose estimation is complete, the pose is applied to the mesh by regressing the mesh skeleton. The skeleton applies a system of linear transformations to the mesh, which deforms it into the desired pose.

To simplify the rigging process of the homologous mesh, we used the software Unity. By defining the key points of a skeleton, we are able to apply transformations to the entire mesh through the 3D rendering software (Fig. 8).

Fig. 8. Showcases the control points of the skeleton defined in Unity. These will act to define a system of linear transformations that will be applied to each vertex on the mesh.
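The kind of per-vertex transformation that a skeleton control point induces can be sketched as a rotation about a joint. This example is independent of Unity and does not use its API. The rotation axis, joint position, and vertices are assumptions, and a full rig would additionally blend the transformations of several bones per vertex.

```python
import math


def rotate_about_joint(vertices, joint, angle_deg):
    """Rotate (x, y, z) vertices about an axis through `joint` parallel to the z axis."""
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    jx, jy, _ = joint
    rotated = []
    for x, y, z in vertices:
        dx, dy = x - jx, y - jy  # position relative to the joint
        rotated.append((jx + c * dx - s * dy, jy + s * dx + c * dy, z))
    return rotated


# Two toy "forearm" vertices, bent 90 degrees about an elbow located at x = 1.
forearm = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent = rotate_about_joint(forearm, joint=(1.0, 0.0, 0.0), angle_deg=90.0)
```

The vertex located at the joint stays in place, while the vertex further along the limb swings around it. Written in homogeneous coordinates, the translation to the joint, the rotation, and the translation back combine into a single matrix.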

3 Results

The image segmentation network was particularly difficult to train, as great care had to be taken when dealing with class weights. Code was developed to dynamically calculate the class weights for each batch. The class weights were calculated by counting pixels belonging to people versus pixels belonging to the background. This added procedure caused training times to be much higher, but yielded very good results (Fig. 9).

Fig. 9. Showcases very hard validation examples of the image segmentation process. Input images are shown in column 1, the ground truth labels in column 2, and the neural network results in column 3. Background pixels are represented in green, while pixels belonging to people are represented in blue. (Color figure online)
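A per-batch class weight computation based on pixel counts might look like the sketch below. The inverse-frequency formula, the fallback for batches containing a single class, and the function name are assumptions; the procedure described above only states that person and background pixels are counted for each batch.

```python
def batch_class_weights(masks):
    """Inverse-frequency weights from label masks (1 = person, 0 = background)."""
    person = sum(px for mask in masks for row in mask for px in row)
    total = sum(len(row) for mask in masks for row in mask)
    background = total - person
    if person == 0 or background == 0:
        # Only one class is present in this batch, so there is nothing to rebalance.
        return {"background": 1.0, "person": 1.0}
    return {
        "background": total / (2.0 * background),
        "person": total / (2.0 * person),
    }


batch = [
    [[0, 0, 0, 0], [0, 1, 1, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0]],
]
weights = batch_class_weights(batch)
```

In this toy batch only 2 of 16 pixels belong to a person, so errors on person pixels are weighted far more heavily than errors on the background.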

The clipping operation yielded the expected results: 83% of human subjects in the validation data were clipped from the source image. This is largely sufficient for images in the wild. We expect the use case for this algorithm to be mostly situated in controlled, well-lit environments (Fig. 10).

Fig. 10. Shows a sample of the clipping process in a non-trivial test case where the subject has intersecting edges with a background. The subject also has a wide variety of occluding features such as facial hair with no discernable variation from his shirt.

Training the demographic convolutional neural networks yielded a significantly greater than random accuracy for each network. We trained these networks using the UTKFace dataset [16], which provides a well-curated set of faces with race, age, and gender annotations (Figs. 11, 12 and 13).

Fig. 11. Outlines the training curve for extracting age estimations after 400 training epochs. The top-1 prediction accuracy for 13 classes plateaued after the 150th epoch. The jitter in accuracy is caused by the dropout rate in earlier layers as compared to the training step size (1e-3). The line in blue shows the benchmark accuracy if the neural network were to classify age at random. (Color figure online)

Fig. 12. Outlines the training curve for extracting gender estimations after 400 training epochs. The top-1 prediction accuracy for two classes plateaued after the 60th epoch. The line in blue shows the benchmark accuracy if the neural network were to classify gender at random. (Color figure online)

Fig. 13. Outlines the training curve for extracting race estimations after 400 training epochs. The top-1 prediction accuracy for four classes did not plateau and had significant trouble attaining greater than random accuracy. The line in blue shows the benchmark accuracy if the neural network were to classify race at random. (Color figure online)

4 Discussion

4.1 Improvements

One major drawback of this methodology is the multistage approach. Computationally speaking, it is not efficient and may suffer when implemented on lower-end hardware. We propose that the whole process be integrated into a single feed-forward neural network.

Our neural network size was also limited by the capabilities of our hardware. Source images were downsampled from their original sizes. Therefore, it is reasonable to expect a major loss of fine details that are crucial to the process. Expanding the number of filters and adding more layers may allow the neural network to perform better.

4.2 Applications

Mobile Sizing

Our methodology opens the door for robust sizing estimations of a subject without the need for expensive hardware. Given a proper reference point the algorithm can extract length measurements across any set of points defined along the mesh. This has direct applications for the fashion, automotive, aerospace, and ergonomic industries.
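As a sketch of how a reference measurement converts mesh units into physical lengths, the example below sums straight-line segment lengths along a path of mesh points and scales the result by a reference of known length. The use of straight segments (rather than a path that follows the body surface), the function names, and the toy coordinates are assumptions. `math.dist` requires Python 3.8 or later.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def measure_cm(points, reference_points, reference_length_cm):
    """Convert a path in mesh units to centimetres using a reference of known length."""
    scale = reference_length_cm / path_length(reference_points)
    return scale * path_length(points)


# Hypothetical reference: a span of 10 mesh units known to measure 50 cm.
reference = [(0.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
path = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
length_cm = measure_cm(path, reference, reference_length_cm=50.0)
```

Because the homologous meshes are vertex uniform, the same vertex indices can define the same measurement path on every subject.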

An example scenario for the fashion industry would be at the retail level. A boutique fashion store can set up new camera systems, or use existing ones, to build mesh estimates for all of its customers. When a customer selects a garment, they can instantly view a simulation of how the garment looks and moves on their body. This removes the risk of exposing expensive apparel to the customer and allows the store to reduce inventory while catering to a wider range of demographics.

In ergonomic research, a key capability that is lacking is the rapid prototyping of designs on computer systems. The ability to develop ergonomic products that can easily be tested on specific demographics plays an important role. Our technology is particularly useful when designers do not have access to an expensive demographics sizing database. They will be able to easily produce simulation-ready meshes from any images.

Social Networking and Information Pivoting

The ability to search information broker databases allows one to leverage limited knowledge about a subject to expand the information available on them. Unfortunately, traversing these databases becomes an intractable computational nightmare. Searching social media databases is nearly impossible when looking for a particular subject. The ability to narrow down the search space for a human subject greatly reduces search times.

When this methodology is paired with other information gathering techniques, such as natural language processing, one may be able to extract knowledge about a human subject just by having a simple conversation with the subject. This has direct applications in law enforcement. During an interrogation, the interviewer’s task is to extract information that might otherwise be hidden or obscured, and real-time information validation plays a crucial role. Our system can be used to search and validate a person’s identity in real time. Information such as age, gender, race, height, and body morphology can be used as filtering terms to search offender databases without the need to rely on the human subject to provide accurate information.

Motion Capture

The motion capture industry has a high barrier to entry in terms of cost of equipment and education. High-fidelity motion capture systems require dedicated studios with dedicated hardware and a very knowledgeable team to maintain them [18]. With our pose estimation and mesh regression, we are able to produce reasonably accurate motion capture that can later be fed into game development projects and movie studios. Our system’s ability to produce homologous meshes allows for easy integration with pre-existing animation and rendering pipelines. The vertex uniformity of the mesh lets studios perform soft-body and hard-body simulations to produce highly realistic scene renderings at a fraction of the cost.

4.3 Privacy Implications

The sensitive nature of extracting demographic data from images has grave privacy implications. The applications for this technology should be selected to align with the public good. Such a technology could be leveraged to uncover personal and private details. The applications discussed in this paper do not mark the limits of how this technology could be applied. Such methods can be used to estimate data protected by legislation, such as medical history. With the right combination of inputs, bad actors may use this technology to perform identity theft and other more malicious acts.

Age has particularly strong privacy implications if this technology is used in public-facing systems. Extracting identifying features from minors without parental approval can breach many local and federal regulations. Such a system must have filters in place to ignore subjects for whom there is reasonable evidence that they are below the age of majority.

Race plays an important role in the system’s ability to extract fine details with a high degree of accuracy. Initial structural features that reduce regression times are highly dependent on race. There are many downsides to a system that relies on accurately classifying race. If the convolutional neural network is trained on data that has a class imbalance between races, the network may misidentify a race, or a particular race may become under- or over-represented within the prediction vector. This will negatively impact the quality of the results. In terms of morality, threat analysis systems and the like that rely on race for identification and classification may compound racial inequalities. Therefore, the author proposes that systems that are used to predict human behavior should abstain from using race qualifiers.

Gender, like race, is a predictive qualifier for estimating body structure. The very trivial example is bust size. If an initial guess for a female subject was not statistically representative for a female, the regressor would likely need more iterations for a fixed step size to optimize the initial mesh to fit a female bust. Choosing the correct gender is crucial to an accurate representation of a subject. Unfortunately, it is very difficult to represent the subset of the population that is gender ambiguous. By the very definition a transgender subject crosses the boundaries between classes and can cause even the most perceptive humans to think twice. This poses a very difficult technological problem and may also exacerbate the political issues around transgender rights.

Many of the examples presented show the need for a good demographic classifier, but we must take particular care when these systems are applied to public applications. We must not give public institutions and regulatory bodies technological justifications to widen the gap of inequality. Nor must we employ these technologies prematurely when they have a direct impact on a person’s life and liberty.