1 Introduction

Recently, recognition of human actions from RGB-D data has attracted increasing attention in computer vision owing to its extensive applications ranging from surveillance to human-computer interaction. Compared with traditional RGB and depth data, skeleton data, which contains the positions of human joints, is a relatively high-level feature for motion recognition. It is robust to scale and illumination changes, and can be invariant to camera view, human body rotation and motion speed.

Skeleton sequences are commonly used to represent human motion in conventional research because of their intuitive nature. To date, many methods based on hand-crafted features [7, 12, 24, 25, 27, 32,33,34, 42, 43] have been reported for recognizing actions from skeleton data. Since the first work of [17] reported in 2010, deep learning has developed rapidly, and many methods based on Convolutional Neural Networks (ConvNets) [29, 35,36,37] or Recurrent Neural Networks (RNNs) [4, 19, 31, 49] have been proposed for skeleton-based human action recognition. However, how to effectively extract discriminative features from skeleton sequences and model their spatial and temporal evolution for robust recognition of large-scale and complex actions still remains an open problem.

For hand-designed skeleton features, the Bag-of-Words model is widely used but often ignores spatial information. Dynamic Time Warping (DTW), the Fourier Temporal Pyramid (FTP) or Hidden Markov Models (HMMs) are commonly used to capture temporal information. However, these hand-crafted features are usually shallow and dataset-dependent. Compared with hand-designed skeleton features, features learned by deep methods, including RNNs and ConvNets, are more effective. In RNN-based methods [4, 19, 31, 49], RNNs tend to overemphasize the temporal information, and it is difficult to construct deep RNNs to extract high-level features [28]. In ConvNets-based methods, the key and most commonly used approach is to convert a skeleton sequence into a texture image and encode the temporal information in that image. For example, Wang et al. [38] proposed to project skeletons onto three orthogonal Cartesian planes and encode the joint trajectories into Joint Trajectory Maps (JTM) using the HSV (hue, saturation, value) space. However, temporal information is lost due to trajectory overlapping, and some actions, such as “knock” and “catch”, cannot be distinguished well.

In this paper, we propose an effective hierarchical architecture that encodes skeleton sequences into color texture images, referred to as Temporal Pyramid Skeleton Motion Maps (TPSMMs), which serve as the input to ConvNets for recognition. The TPSMMs not only capture short-term temporal information, but also embed the long-term dynamics over the period of an action. Meanwhile, we propose a new encoding method to encode the spatio-temporal information more effectively. For instance, our TPSMMs can effectively discriminate the actions “knock” and “catch”, as shown in Fig. 1c and d, which cannot be distinguished well using JTM. The proposed method has been verified and achieves state-of-the-art results on the widely used UTD-MHAD [2], MSRC-12 Kinect Gesture [6] and SYSU-3D [9] datasets.

Fig. 1

a and b are the results of JTM [38] for the actions “knock” and “catch” in UTD-MHAD, respectively. c and d are the results of our proposed method for the same actions

2 Related work

In this section, we review the related literature on skeleton-based action recognition, covering hand-crafted feature based methods, ConvNets-based approaches and RNN-based approaches.

Hand-crafted features

Hussein et al. [13] encoded joint movement and time information by calculating multiple covariance matrices over sub-sequences at hierarchical temporal levels. Chaudhry et al. [1] encoded the spatial and temporal information of 3D skeleton data using Linear Dynamical Systems (LDSs), and used discriminative metric learning to perform action recognition. Vemulapalli et al. [32] used rotations and translations in 3D space to model the 3D geometric relationships between various skeleton joints. Ofli et al. [26] proposed the Sequence of Most Informative Joints (SMIJ) to automatically select a few skeletal joints that are most informative for the action.

ConvNets-based approach

Du et al. [5] represented a skeleton sequence as a matrix by concatenating the joint coordinates at each instant and arranging the vector representations in chronological order. The matrix is then quantified into an image and normalized to handle the variable-length problem. Hou et al. [8] encoded the joints over the sequence into 3D points with color, drew the skeleton joints with a specific pen onto three orthogonal canvases, and then encoded the dynamic information in the skeleton sequences with color encoding. Wang et al. [38] proposed Joint Trajectory Maps (JTM), which represent both the spatial configuration and the dynamics of joint trajectories as three texture images through color encoding. However, both methods in [8, 38] are view dependent and easily lose temporal information due to the direct trajectory encoding. To overcome these drawbacks, Li et al. [18] computed pair-wise distances between the skeleton joints of single or multiple subjects and encoded them into texture images as the input of ConvNets for action recognition. Liu et al. [20] arranged the skeleton joints in a 2D array to generate a set of RGB images, and selected the arrangement by maximizing a unique distance metric. All frames are then concatenated to construct a large image as the final representation. Ke et al. [14] presented the sequence as a clip with several gray images for each channel of the 3D coordinates, used a deep ConvNet to learn high-level features, and then concatenated the features of all three clips at the same time-step into a feature vector. A Multi-Task Learning Network is adopted for final classification. However, these methods break the spatial structure of human skeleton data. Liu et al. [23] proposed a sequence-based view-invariant transform to address the problem of view variation, enhanced the color images with visual and motion cues, and adopted a multi-stream ConvNets fusion method to conduct recognition. However, this method uses ten CNN models and hence has high complexity. Tang et al. [30] proposed a deep progressive reinforcement learning (DPRL) method that distils the most informative frames, discards ambiguous frames in sequences, and employs a graph-based convolutional neural network to capture the dependency between joints for action recognition. In contrast, this paper keeps the structure of the human skeleton and provides a new color encoding method to capture spatio-temporal information more effectively.

RNN-based approach

The basic idea of RNN-based methods is to extract a frame-based feature from a skeleton sequence and pass the feature sequence into an RNN model that maps it to an action class. These methods tend to overstress the temporal information and underestimate the spatial information. To overcome this shortcoming, different improvements have been proposed. The works in [4, 16, 19, 29, 45, 48] extended LSTM models to spatial domains. Specifically, Du et al. [4] divided the skeleton joints into five parts according to the body structure and fed them into five BRNNs. As the number of layers increases, the features are hierarchically fused and finally classified. Shahroudy et al. [29] proposed to model the long-term temporal correlation of the features for each body part, with all body parts concatenated through a shared output gate. Liu et al. [19] extended traditional LSTM-based learning to spatio-temporal domains (ST-LSTM) and used a trust gate to deal with unreliable input data. A skeleton tree traversal algorithm was used to discover stronger long-term spatial dependency patterns. The work in [45] designed a view-adaptive LSTM architecture that enables the network to automatically adjust itself to the most suitable observation viewpoints. Lee et al. [16] proposed ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition; the proposed network is composed of multiple parts containing short-term, medium-term and long-term TS-LSTM networks, respectively. Zhu et al. [48] proposed an end-to-end fully connected deep LSTM network and added a mixed-norm regularization term to the cost function in order to learn the co-occurrence features of the skeleton. Instead of extending LSTM models, Zhang et al. [46] provided a simple universal spatial modeling method and fed eight geometric relational features into a basic 3-layer LSTM model. However, concatenating the eight types of geometric features performs worse than using a single type of these features.

Unlike the above-mentioned works, the proposed TPSMMs can effectively distinguish similar actions by exploiting both the temporal and the spatial information of the human skeleton sequence, and the performance is further improved by decision fusion.

3 The proposed method

The proposed method is illustrated in Fig. 2. Given a skeleton sequence, six TPSMMs are created by projecting the 3D points onto the three orthogonal Cartesian planes and using a Temporal Pyramid to construct a hierarchical architecture. The six TPSMMs serve as inputs to six ConvNets for classification. The final classification of the given skeleton sequence is obtained through late fusion of the six ConvNets. Several strategies are adopted to exploit the spatio-temporal information in skeleton sequences. Firstly, different shapes are used to represent the skeleton joints and to visualize the physical structure of the skeleton. Secondly, the skeleton sequence is segmented, and the absolute differences between consecutive frames of each joint are accumulated over each segment. Thirdly, each TPSMM goes through a pseudo-color coding process to enhance different motion patterns in the pseudo-RGB channels.

Fig. 2

The proposed TPSMMs + 6 ConvNets architecture for skeleton-based action recognition

The TPSMMs processing pipeline is shown in Fig. 3. It consists of Skeleton Semi-Visualization, Temporal Pyramid (TP) processing, Segmented Skeleton Motion Maps (S-SMMs) processing, Pseudo-color Coding and Temporal Pooling. Each step is described in detail in the rest of this section.

Fig. 3

Illustration of TPSMMs processing in the xy view. n represents the number of frames in a skeleton sequence, w is the number of segments of the skeleton distribution map, and k denotes the k-th segment of the segmented skeleton distribution map

3.1 Semi-visualization

Given a skeleton sequence S, we assume that it contains n frames as follows:

$$ S=\{F_{1}, F_{2},...,F_{n}\} $$
(1)

where \(F_{i}=\{{p_{i}^{1}},{p_{i}^{2}},...,{p_{i}^{m}}\}\) indicates the skeleton of the i-th frame; m denotes the total number of joints; \({p_{i}^{j}}=({x_{i}^{j}}, {y_{i}^{j}}, {z_{i}^{j}})\) represents the 3D coordinates of the j-th joint in Fi. The skeleton joints are projected onto three orthogonal Cartesian planes (the xy, xz and yz planes). For each plane, the joints of each frame are mapped into an image. For example, for the xy plane, the joint coordinates (x,y) of a skeleton sequence are used directly as the image coordinates (x,y), the z value is encoded in the B channel, and the R and G channels are set to 0. The B channel is then normalized as follows:

$$ {Z_{i}^{j}}=\frac{Z_{max}-{z_{i}^{j}}}{Z_{max}} \times 255 $$
(2)

where \({Z_{i}^{j}}\) is the normalized value written to the B channel for the j-th joint of the i-th frame, \({z_{i}^{j}}\) is its original z value, and Zmax is the maximum z value over all joints in the skeleton sequence.
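For concreteness, the following is a minimal sketch of this xy-plane projection, assuming the joint coordinates have already been scaled to pixel positions and that the image resolution is chosen freely (both are assumptions, not the authors' exact settings):

```python
import numpy as np

def project_frame_xy(joints, z_max, img_size=(240, 240)):
    """Map one frame's joints onto the xy plane, with Eq. (2) for the B channel.

    `joints` is an (m, 3) array of (x, y, z) values already scaled to pixel
    coordinates; `z_max` is the maximum z over the whole sequence. The image
    size and the single-pixel drawing are illustrative only; in the paper each
    joint is drawn as a body-part-specific shape (see the grouping below).
    """
    h, w = img_size
    img = np.zeros((h, w, 3), dtype=np.uint8)      # R and G channels stay 0
    for x, y, z in joints:
        b = (z_max - z) / z_max * 255.0            # Eq. (2): normalized depth
        img[int(round(y)), int(round(x)), 2] = np.uint8(np.clip(b, 0, 255))
    return img
```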

In order to distinguish different body parts, the skeleton of the human body is divided into five parts: left upper body part (left_shoulder, left_elbow, left_wrist and left_hand), left lower body part (left_hip, left_knee, left_ankle and left_foot), right upper body part (right_shoulder, right_elbow, right_wrist and right_hand), right lower body part (right_hip, right_knee, right_ankle and right_foot) and middle body part (head, shoulder_center, spine and hip_center). Each part is represented using a different shape.
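This grouping can be written down directly as a lookup table. The joint names below follow the Kinect-style naming used above, while the concrete marker shapes are only an illustrative assumption, since the paper does not specify which shape is assigned to which part:

```python
# Five body parts and one (hypothetical) marker shape per part.
BODY_PARTS = {
    "left_upper":  ["left_shoulder", "left_elbow", "left_wrist", "left_hand"],
    "left_lower":  ["left_hip", "left_knee", "left_ankle", "left_foot"],
    "right_upper": ["right_shoulder", "right_elbow", "right_wrist", "right_hand"],
    "right_lower": ["right_hip", "right_knee", "right_ankle", "right_foot"],
    "middle":      ["head", "shoulder_center", "spine", "hip_center"],
}
PART_SHAPES = {  # which shape to draw for each part (illustrative choice)
    "left_upper": "circle", "left_lower": "square",
    "right_upper": "triangle", "right_lower": "diamond", "middle": "cross",
}
```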

After projecting the skeleton sequence onto the three coordinate planes, we obtain the skeleton distribution maps D for the three different views as follows:

$$ D_{v}=\left\{ma{p_{v}^{1}}, ma{p_{v}^{2}},...,ma{p_{v}^{k}},...,ma{p_{v}^{n}}\right\} $$
(3)

where v ∈ {xy,xz,yz} denotes the three projected views in the three orthogonal Cartesian planes, and \(ma{p_{v}^{k}}\) represents the k-th frame of skeleton data projected onto view v.

3.2 Temporal pyramid processing

The work in [33] proposed the Fourier Temporal Pyramid (FTP), a descriptive representation of the temporal structure of actions that explores temporal information by partitioning a sequence and finally combining the Fourier features of each segment. We apply the Temporal Pyramid to skeleton-based human action recognition.

A skeleton distribution map is split into w segments, with w = 1, 2, 4 being the most common choices. Considering the computational cost and effectiveness, w = 1 and w = 4 are adopted for skeleton segmentation to construct a two-layer architecture. The segmented skeleton distribution map is expressed as follows:

$$ Se{g_{v}^{k}}=\{map_{v}^{\frac{n}{w}*(k-1)+1}, map_{v}^{\frac{n}{w}*(k-1)+2},...,map_{v}^{\frac{n}{w}*k}\} \quad k \in\{1,2,...,w \} $$
(4)

where w is the number of segments of the skeleton distribution map, \(Se{g_{v}^{k}}\) is the k-th segment of the skeleton distribution map in view v, and n is the number of frames in a skeleton distribution map.
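As a sketch, the segmentation of Eq. (4) amounts to slicing the ordered list of per-frame maps into w equal parts (assuming, as in Eq. (4), that n is divisible by w):

```python
def split_into_segments(frame_maps, w):
    """Split the ordered per-frame maps {map_v^1, ..., map_v^n} into w
    equal temporal segments, as in Eq. (4)."""
    n = len(frame_maps)
    seg_len = n // w
    return [frame_maps[k * seg_len:(k + 1) * seg_len] for k in range(w)]

# Two pyramid levels are used in this paper:
# segments_w1 = split_into_segments(frame_maps, 1)   # whole sequence
# segments_w4 = split_into_segments(frame_maps, 4)   # four segments
```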

3.3 Segmented skeleton motion maps processing

The work in [43] proposed Depth Motion Maps (DMMs), which are computed by accumulating the absolute differences between consecutive frames of a depth map sequence. Extending this idea to skeleton sequences, we map each segmented skeleton distribution map into a segmented skeleton motion map (S-SMM). The motion energy is stacked within each skeleton distribution map segment as follows:

$$ S\textrm{-}SM{M_{v}^{k}} = \sum\limits_{j = 1}^{M} {\left| {map_{v}^{j + 1} - ma{p_{v}^{j}}} \right|} \quad ma{p_{v}^{j}}\in{Se{g_{v}^{k}}} $$
(5)

where \(M=\frac {n}{w}\) is the number of frames in a segmented skeleton distribution map.
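A minimal sketch of Eq. (5): within one segment, the absolute differences between consecutive per-frame maps are accumulated into a single motion map:

```python
import numpy as np

def segment_smm(segment_maps):
    """Accumulate the motion energy of one segment (Eq. 5) by summing the
    absolute differences between consecutive per-frame maps."""
    smm = np.zeros_like(segment_maps[0], dtype=np.float32)
    for prev, nxt in zip(segment_maps[:-1], segment_maps[1:]):
        smm += np.abs(nxt.astype(np.float32) - prev.astype(np.float32))
    return smm
```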

3.4 Pseudo-color coding

Color coding can exploit the perceptual capabilities of the human visual system to extract more information from grayscale images and effectively enhance the detailed texture patterns contained in an image. To better exploit the motion patterns, the temporal information and the depth values of the joints are used to encode the S-SMMs. Each S-SMM has its fixed color in the R and G channels, and the depth value of the joint is mapped into the B channel.

When a skeleton distribution map is divided into four segments (w = 4), we set the (R, G) values of the four colormaps to (0, 0.498), (0.498, 1), (1, 0.502) and (0.502, 0), respectively. The B channel of the four colormaps changes as follows:

$$ B= \left\{\begin{array}{ll} (131+4\times (I-1))\div 255& {0 \le \text{I} < 32}\\ 1 & {32 \le \text{I} < 96}\\ (255-4\times(I-96)+1)\div 255 & {96 \le \text{I} < 160}\\ 0 & {160 \le \text{I} \le 255} \end{array}\right. $$
(6)
$$ I=\frac{g-g_{min}}{g_{max}-g_{min}}\times 255 $$
(7)

where I is the normalized gray value, and gmax and gmin are the maximum and minimum values of the S-SMM, respectively.
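The piecewise mapping of Eqs. (6)-(7) can be transcribed directly; the sketch below assumes the S-SMM is a grayscale array and returns B values in [0, 1], alongside the fixed (R, G) pairs listed above:

```python
import numpy as np

# Fixed (R, G) values of the four hand-designed colormaps for w = 4.
RG_VALUES = [(0.0, 0.498), (0.498, 1.0), (1.0, 0.502), (0.502, 0.0)]

def b_channel(gray, g_min, g_max):
    """B-channel values of Eqs. (6)-(7) for a gray-valued S-SMM."""
    I = (gray - g_min) / (g_max - g_min) * 255.0     # Eq. (7)
    b = np.zeros_like(I, dtype=np.float32)           # I >= 160 stays 0
    m1 = (I >= 0) & (I < 32)
    m3 = (I >= 96) & (I < 160)
    b[m1] = (131 + 4 * (I[m1] - 1)) / 255.0
    b[(I >= 32) & (I < 96)] = 1.0
    b[m3] = (255 - 4 * (I[m3] - 96) + 1) / 255.0
    return b
```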

The effect of the four colormaps as the value of I changes is shown in Fig. 4. It can be seen from the figure that the four colormaps occupy four different color ranges, and there is no overlapping color between them.

Fig. 4

The four hand-designed colormaps; each colormap takes different colors as I changes

Finally, Temporal Pooling is applied to the colored S-SMMs to merge them into the Temporal Pyramid Skeleton Motion Maps (TPSMMs), which can then be used as the input of the convolutional neural networks.
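The paper does not spell out the pooling operator; a minimal sketch, assuming element-wise max pooling across the colored S-SMMs of one view, is:

```python
import numpy as np

def temporal_pooling(colored_smms):
    """Merge the w colored S-SMMs of one view into a single TPSMM.

    Element-wise max pooling is assumed here: at every pixel the strongest
    color response among the segments is kept.
    """
    return np.maximum.reduce([np.asarray(s, dtype=np.float32) for s in colored_smms])
```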

3.5 Network training

Six ConvNets are trained on the TPSMMs constructed from the skeleton sequences in the three Cartesian planes for the two layers of our architecture, respectively. The six ConvNets adopt the layer configuration of DenseNet-121 [11]. The deep DenseNet contains several dense blocks and transition layers, as shown in Table 1. Each dense block consists of two kinds of “conv” layers, and each “conv” layer corresponds to the sequence BN-ReLU-Conv. The numbers of filters of the first and second “conv” layers are 128 and 32, respectively. The transition layers used in our experiments consist of a batch normalization layer and a 1 × 1 convolutional layer followed by a 2 × 2 average pooling layer.

Table 1 DenseNet architectures for ImageNet

The implementation is based on the publicly available Keras toolbox [3] with a TensorFlow backend, running on one NVIDIA GeForce GTX TITAN X card. The models pre-trained on ImageNet [15] are used for initialization during training. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and the weight decay set to 0.0005. All hidden weight layers use the rectified linear unit (ReLU) activation function. The batch size is set to 32 at each iteration, and all images are resized to 224 × 224. The learning rate is set to \(10^{-3}\) for fine-tuning the pre-trained models and is then decreased according to a fixed schedule, which is kept the same for all training sets. For each ConvNet, the training undergoes 40 epochs.
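For illustration, a minimal Keras sketch with the stated settings (DenseNet-121 pre-trained on ImageNet, SGD with momentum 0.9, learning rate \(10^{-3}\), batch size 32, 224 × 224 inputs); the data pipeline, the learning rate schedule and the way weight decay is applied (here approximated by L2 regularization on the classification head) are assumptions, not the authors' exact code:

```python
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l2

def build_convnet(num_classes):
    # DenseNet-121 backbone initialized with ImageNet weights.
    base = DenseNet121(weights="imagenet", include_top=False,
                       input_shape=(224, 224, 3))
    x = GlobalAveragePooling2D()(base.output)
    out = Dense(num_classes, activation="softmax",
                kernel_regularizer=l2(5e-4))(x)   # rough stand-in for weight decay
    model = Model(base.input, out)
    model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_convnet(num_classes=27)            # e.g. 27 actions in UTD-MHAD
# model.fit(train_images, train_labels, batch_size=32, epochs=40)
```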

3.6 Score fusion for classification

Given a test skeleton sequence (sample), six TPSMMs are generated and fed into the six trained ConvNets. The score vectors of the six networks are fused, and the class with the maximum score in the resultant vector is assigned as the recognized class of the test sequence. Common fusion methods include max-score fusion, summation-score fusion and multiply-score fusion, in which the score vectors output by the six networks are maximized, added and multiplied element-wise, respectively. Compared with max-score fusion, summation-score fusion [40, 41] and multiply-score fusion perform better in our application and are expressed as follows:

$$ label=Fin(\alpha (v_{11} + v_{12} + v_{13})+ \beta (v_{21} + v_{22} + v_{23})) $$
(8)
$$ label=Fin(v_{11} \circ v_{12} \circ v_{13} \circ v_{21} \circ v_{22} \circ v_{23}) $$
(9)

where v11, v12 and v13 are the score vectors of the three TPSMMs with w = 1, and v21, v22 and v23 are the score vectors of the three TPSMMs with w = 4. α and β are weights with α + β = 1, ∘ refers to element-wise multiplication, and Fin() is a function that finds the index of the element with the maximum score.
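In code, Eqs. (8) and (9) reduce to a weighted sum and an element-wise product of the six score vectors, followed by an argmax (a sketch assuming NumPy arrays of per-class probabilities):

```python
import numpy as np

def sum_fusion(v11, v12, v13, v21, v22, v23, alpha=0.5, beta=0.5):
    """Weighted summation-score fusion, Eq. (8)."""
    scores = alpha * (v11 + v12 + v13) + beta * (v21 + v22 + v23)
    return int(np.argmax(scores))

def multiply_fusion(v11, v12, v13, v21, v22, v23):
    """Multiply-score fusion, Eq. (9): element-wise product of all six vectors."""
    scores = v11 * v12 * v13 * v21 * v22 * v23
    return int(np.argmax(scores))
```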

4 Experiments

The proposed method is evaluated on three widely used benchmark RGB-D datasets, namely the SYSU-3D [9], MSRC-12 [6] and UTD-MHAD [2] datasets. These three datasets cover a wide range of action types, including simple actions, daily activities, human-object interactions and fine-grained activities. The results on the three datasets and a detailed analysis are presented below.

4.1 Comparison of TPSMMs in different views

The effectiveness of different views and different skeleton segmentations was evaluated on the three datasets, and the recognition accuracies are listed in Table 2. From the table, we can see that the TPSMMs with w = 4 perform better than the TPSMMs with w = 1 because more colors are encoded into the TPSMMs.

Table 2 Comparison of different views and different skeleton segments on three datasets in terms of recognition accuracy

4.2 Comparison of three score fusion methods

Table 3 shows the recognition accuracy for different weights α and β on the three datasets. It can be seen from Table 3 that different weights α and β affect the recognition accuracy differently. For the SYSU-3D dataset, the best result is obtained when α and β are both 0.5, while for the UTD-MHAD and MSRC-12 datasets the best performance is obtained when α and β are 0.3 and 0.7, respectively. The comparison of the three score fusion methods on the three datasets for final recognition is listed in Table 4. From the table, we can see that on the three evaluated datasets, summation-score and multiply-score fusion consistently outperform max-score fusion. This verifies that the six TPSMMs are likely to be statistically independent and provide complementary information.

Table 3 Comparison of different weights α and β on three datasets in terms of recognition accuracy
Table 4 Comparison of three fusion methods on three datasets in terms of recognition accuracy

4.3 Comparison of different color encoding methods

We conduct extensive experiments with different color encodings to verify the effectiveness of the individual technical contributions, as follows:

  • Overlapping: We simply use overlapping instead of temporal pooling to combine the colored S-SMMs and generate the TPSMMs.

  • Jet colormap: We only utilize the jet colormap, which ranges from blue to red and passes through cyan, yellow and orange, and divide the jet map into four parts instead of using the original four hand-designed colormaps.

  • Temporal encoding: A skeleton sequence is split into w segments to obtain w S-SMMs, and the jet map is divided into w parts. The depth information of the skeleton is not used. For each S-SMM, the middle color of the corresponding colormap is selected, and temporal pooling is used to combine the colored S-SMMs.

We use the same w = 1 scores and multiply-score fusion to compare the different encoding methods. Table 5 shows the results of the different encoding methods on the UTD-MHAD dataset. From the table, it can be seen that the encoding method using temporal pooling improves the performance. We also obtain similar results when using other colormaps in the proposed encoding method. As w increases, the result of the temporal-encoding based method remains stable, but it is still worse than our encoding method.

Table 5 Comparison of the different color encoding on UTD-MHAD Dataset

4.4 SYSU-3D human-object interaction dataset

The SYSU-3D dataset [9], collected with the Microsoft Kinect, contains 12 human-object interaction actions performed by 40 subjects, for a total of 480 skeleton sequences. It is very challenging due to the similar motion patterns and the large viewpoint variations. We follow the standard protocol [9] to evaluate our method: half of the subjects are used for training and the rest for testing. Table 6 shows the results of the proposed method and its comparison to the state-of-the-art results achieved by both hand-crafted feature based methods and deep learning based methods. Compared with most methods [9, 21, 30, 45], our method performs better. From the table, we can see that the deep learning based methods [21, 30, 45] outperform the hand-crafted feature based methods [9, 10]. The method combining RNN and CNN [39] performs better than the methods using only RNNs [21, 22, 45], which tend to overemphasize the temporal information. However, the accuracy drops dramatically when deep networks are used, due to over-fitting on this small dataset. The graph-based method [30] is not better than some RNN-based methods [22, 45], since its spatial graph is built over clusters, each of which is assigned a single weight, and thus it may not capture the delicate pairwise spatial relationships among joints.

Table 6 Comparison of the proposed method with existing methods on the SYSU-3D Dataset

The confusion matrix is shown in Fig. 5. From the confusion matrix we can see that the proposed method cannot distinguish some actions well, such as “Drinking”, “calling phone”, “mopping” and “sweeping”. Skeleton data does not carry any appearance information; these actions are performed with similar motions, and the TPSMMs are not effective in these cases. The method also does not perform well for the action “Packing backpacks” due to viewpoint changes.

Fig. 5

Confusion matrix for TPSMMs on SYSU-3D Dataset

4.5 UTD-MHAD dataset

UTD-MHAD [2] contains 27 actions performed by 8 subjects (4 females and 4 males), with each subject performing each action 4 times. For this dataset, the cross-subject protocol of [2] is adopted, namely, the data from subjects 1, 3, 5 and 7 are used for training and those from subjects 2, 4, 6 and 8 for testing. The results are shown in Table 7. Note that the method in [2] uses both depth and inertial sensor data. It can be seen that the proposed method performs much better than the hand-crafted feature based methods and the ConvNets-based fusion methods. We can also see that the deep learning based methods [8, 38] outperform the hand-crafted feature based methods [2, 47]. The method in [36] is not better than the hand-crafted feature based methods because temporal information is lost in the WHDMMs. Our proposed TPSMMs perform better than the encoding methods that use only temporal information [8, 38], which cannot capture long-term temporal information due to trajectory overlapping.

Table 7 Comparison of the proposed method with previous approaches on UTD-MHAD Dataset

The confusion matrix is shown in Fig. 6. From the confusion matrix, we can see that the proposed method cannot distinguish some actions well, such as “Throw” and “Draw circle CCW”, since their TPSMM images are visually similar, probably because of the 3D-to-2D projection. In addition, the execution rate is not considered in our proposed method, which leads to poor recognition of the action “jog”. However, the proposed method can distinguish some similar actions, such as “knock” and “catch”, which the JTM and SOS methods cannot recognize well.

Fig. 6

Confusion matrix for TPSMMs on UTD-MHAD Dataset

4.6 MSRC-12 Kinect Gesture Dataset

MSRC-12 [6] is a relatively large dataset for gesture or action recognition from 3D skeleton data captured with a Kinect sensor. The dataset has 594 sequences containing 12 gestures performed by 30 subjects, with 6244 gesture instances in total. The 12 gestures are “lift outstretched arms”, “duck”, “push right”, “goggles”, “wind it up”, “shoot”, “bow”, “throw”, “had enough”, “beat both”, “change weapon” and “kick”. For this dataset, the cross-subject protocol is adopted, that is, odd-numbered subjects are used for training and even-numbered subjects for testing. Table 8 lists the performance of the proposed method.

Table 8 Comparison of the proposed method with previous approaches on MSRC-12 Dataset

The confusion matrix is shown in Fig. 7. From the confusion matrix, we can see that the proposed method distinguishes most actions very well. Our proposed method outperforms JTM in mining spatial information: JTM cannot effectively distinguish “goggles” from “had enough”, whereas these actions are distinguished quite well by our method.

Fig. 7

Confusion matrix for TPSMMs on MSRC-12 Dataset

4.7 Computational cost

Table 9 shows the total time and the time per sample of Cov3DJ [12], ELC-KSVD [47], SOS [8] and the proposed method on the UTD-MHAD dataset. 431 samples from UTD-MHAD were used for training and 430 samples for testing. The computing platform is Ubuntu 16.04 with an Intel E5-2640 v4 processor and an NVIDIA TITAN X GPU card. Both Cov3DJ [12] and ELC-KSVD [47] were implemented in Matlab and executed on the CPU. The SOS method was implemented in Matlab and Caffe, and executed on the CPU and GPU. The proposed method was implemented in Matlab and Keras, and executed on the CPU and GPU. The total time and time per sample for training and testing are reported in Table 9. The CPU time of the proposed method is mainly spent on generating the TPSMM images.

Table 9 Comparison of the computational costs of different methods

5 Conclusion

This paper addresses the problem of skeleton-based human action recognition with ConvNets, and an effective texture image representation called Temporal Pyramid Skeleton Motion Maps (TPSMMs) is proposed to extract the spatio-temporal information of skeleton sequences. The proposed method is evaluated on three datasets, namely the SYSU-3D, UTD-MHAD and MSRC-12 datasets. Experimental results show that the proposed method can effectively utilize the spatio-temporal information of skeleton data. However, some actions are still difficult to distinguish. In future work, appearance information and execution rate will be considered to achieve better performance.