1 Introduction

Recently, recognition of human actions from RGB-D data has attracted increasing attention in computer vision owing to its extensive applications ranging from surveillance to human-computer interaction. Compared with traditional RGB and depth data, skeleton data, which contains the positions of human joints, is a relatively high-level feature for motion recognition. It is robust to scale and illumination changes, and can be invariant to camera view, human body rotation and motion speed.

Skeleton sequences are commonly used to represent human motion in conventional research because of their intuitive nature. To date, many methods based on hand-crafted features [7, 12, 24, 25, 27, 32,33,34, 42, 43] have been reported for recognizing actions from skeleton data. Since the first work of [17] reported in 2010, deep learning has developed rapidly, and many methods based on Convolutional Neural Networks (ConvNets) [29, 35,36,37] or Recurrent Neural Networks (RNNs) [4, 19, 31, 49] have been proposed for skeleton-based human action recognition. However, how to effectively extract discriminative features from skeleton sequences and model their spatial and temporal evolution for robust recognition of large-scale and complex actions still remains an open problem.

For hand-designed skeleton features, the Bag-of-Words model is widely used but often ignores spatial information. Dynamic Time Warping (DTW), the Fourier Temporal Pyramid (FTP) or Hidden Markov Models (HMMs) are commonly used to capture temporal information. However, these hand-crafted features are usually shallow and dataset-dependent. Compared with hand-designed skeleton features, features learned by deep methods, including RNNs and ConvNets, are more effective. In RNN-based methods [4, 19, 31, 49], RNNs tend to overemphasize the temporal information, and it is difficult to construct deep RNNs to extract high-level features [28]. In ConvNets-based methods, the key and most commonly used approach is to convert a skeleton sequence into a texture image and encode the temporal information in that image. For example, Wang et al. [38] proposed to project skeletons onto three orthogonal Cartesian planes and encode the joint trajectories into Joint Trajectory Maps (JTM) using the HSV (hue, saturation, value) space. However, temporal information is lost due to trajectory overlapping, and some actions, such as “knock” and “catch”, cannot be distinguished well.

In this paper, we propose an effective hierarchical architecture that encodes skeleton sequences into color texture images, referred to as Temporal Pyramid Skeleton Motion Maps (TPSMMs), which serve as the input to ConvNets for recognition. The TPSMMs not only capture short-term temporal information, but also embed the long-term dynamics over the period of an action. Meanwhile, we propose a new encoding method to encode the spatio-temporal information more effectively. For instance, our TPSMMs can effectively discriminate the actions “knock” and “catch”, as shown in Fig. 1c and d, which cannot be distinguished well using JTM. The proposed method has been verified and achieves state-of-the-art results on the widely used UTD-MHAD [2], MSRC-12 Kinect Gesture [6] and SYSU-3D [9] datasets.

Fig. 1

a and b are the results of JTM [38] for the actions “knock” and “catch” in UTD-MHAD, respectively. c and d are the results of our proposed method for the same actions

2 Related work

In this section, we review the related literature on skeleton-based action recognition, covering hand-crafted feature based methods, ConvNets-based approaches and RNN-based approaches.

Hand-crafted features

Hussein et al. [13] encoded joint movement and time information by calculating multiple covariance matrices over sub-sequences at hierarchical temporal levels. Chaudhry et al. [1] encoded the spatial and temporal information of 3D skeleton data using Linear Dynamical Systems (LDSs), and used discriminative metric learning to perform action recognition. Vemulapalli et al. [32] used rotations and translations in 3D space to model the 3D geometric relationships between various skeleton joints. Ofli et al. [26] proposed the Sequence of Most Informative Joints (SMIJ) to automatically select a few skeletal joints that are most informative for the action.

ConvNets-based approach

Du et al. [5] represented a skeleton sequence as a matrix by concatenating the joint coordinates at each instant and arranging the vector representations in chronological order. The matrix is then quantified into an image and normalized to handle the variable-length problem. Hou et al. [8] encoded the joints over the sequence into 3D points with color, drew the skeleton joints with a specific pen onto three orthogonal canvases, and then encoded the dynamic information in the skeleton sequences with color encoding. Wang et al. [38] proposed Joint Trajectory Maps (JTM), which represent both the spatial configuration and the dynamics of joint trajectories as three texture images through color encoding. However, both methods in [8, 38] are view dependent and easily lose temporal information due to the direct trajectory encoding. To overcome these drawbacks, Li et al. [18] computed pair-wise distances between the skeleton joints of single or multiple subjects and encoded them into texture images as the input of ConvNets for action recognition. Liu et al. [20] arranged the skeleton joints in a 2D array to generate a set of RGB images, and selected the arrangement by maximizing a unique distance metric. All frames are then concatenated to construct a large image as the final representation. Ke et al. [14] presented the sequence as a clip with several gray images for each channel of the 3D coordinates, used a deep ConvNet to learn high-level features, and then concatenated the features of all three clips at the same time-step into a feature vector. A Multi-Task Learning Network is adopted for final classification. However, these methods break the spatial structure of human skeleton data. Liu et al. [23] proposed a sequence-based view-invariant transform to address the problem of view variation, enhanced the color images with visual and motion cues, and adopted a multi-stream ConvNets fusion method to conduct recognition. However, this method uses ten CNN models and hence has high complexity. Tang et al. [30] proposed a deep progressive reinforcement learning (DPRL) method that distils the most informative frames, discards ambiguous frames in sequences, and employs a graph-based convolutional neural network to capture the dependency between joints for action recognition. In contrast, this paper keeps the structure of the human skeleton and provides a new color encoding method to capture spatio-temporal information more effectively.

RNN-based approach

The basic idea of RNN-based methods is to extract a frame-based feature from a skeleton sequence and pass the feature sequence into an RNN model that maps it to an action class. These methods tend to overstress the temporal information and underestimate the spatial information. To overcome this shortcoming, different improvements have been proposed. The works in [4, 16, 19, 29, 45, 48] extended LSTM models to spatial domains. Specifically, Du et al. [4] divided the skeleton joints into five parts according to the body structure and fed them into five BRNNs. As the number of layers increases, the features are hierarchically fused and finally classified. Shahroudy et al. [29] proposed to model the long-term temporal correlation of the features for each body part, with all body parts concatenated through a shared output gate. Liu et al. [19] extended traditional LSTM-based learning to spatio-temporal domains (ST-LSTM) and used a trust gate to deal with unreliable input data. A skeleton tree traversal algorithm was used to discover stronger long-term spatial dependency patterns. The work in [45] designed a view-adaptive LSTM architecture that enables the network to automatically adjust itself to the most suitable observation viewpoints. Lee et al. [16] proposed ensemble Temporal Sliding LSTM (TS-LSTM) networks for skeleton-based action recognition; the proposed network is composed of multiple parts containing short-term, medium-term and long-term TS-LSTM networks, respectively. Zhu et al. [48] proposed an end-to-end fully connected deep LSTM network and added a mixed-norm regularization term to the cost function in order to learn the co-occurrence features of the skeleton. Instead of extending LSTM models, Zhang et al. [46] provided a simple universal spatial modeling method and fed eight geometric relational features into a basic 3-layer LSTM model. However, concatenating the eight types of geometric features performs worse than using a single type of these features.

Unlike the above-mentioned works, the proposed TPSMMs can effectively distinguish similar actions by exploiting both the temporal and the spatial information of the human skeleton sequence, and the performance is further improved by decision fusion.

3 The proposed method

The proposed method is illustrated in Fig. 2. Given a skeleton sequence, six TPSMMs are created by projecting the 3D points onto the three orthogonal Cartesian planes and using a Temporal Pyramid to construct a hierarchical architecture. The six TPSMMs serve as inputs to six ConvNets for classification. The final classification of the given skeleton sequence is obtained through late fusion of the six ConvNets. Several strategies are adopted to exploit the spatio-temporal information in skeleton sequences. Firstly, different shapes are used to represent the skeleton joints and to visualize the physical structure of the skeleton. Secondly, the skeleton sequence is segmented, and the absolute differences between consecutive frames of each joint are accumulated over each segment. Thirdly, each TPSMM goes through a pseudo-color coding process to enhance different motion patterns in the pseudo-RGB channels.

Fig. 2

The proposed TPSMMs + 6 ConvNets architecture for skeleton-based action recognition

The TPSMMs processing pipeline is shown in Fig. 3. It consists of Skeleton Semi-Visualization, Temporal Pyramid (TP) processing, Segmented Skeleton Motion Maps (S-SMMs) processing, Pseudo-color Coding and Temporal Pooling. Each step is described in detail in the rest of this section.

Fig. 3

Illustration of TPSMMs processing in the xy view. n represents the number of frames in a skeleton sequence, w is the number of segments of the skeleton distribution map, and k denotes the k-th segment of the segmented skeleton distribution map

3.1 Semi-visualization

Given a skeleton sequence S, we assume that it contains n frames as follows:

$$ S=\{F_{1}, F_{2},...,F_{n}\} $$
(1)

where \(F_{i}=\{{p_{i}^{1}},{p_{i}^{2}},...,{p_{i}^{m}}\}\) indicates the skeleton of the i-th frame; m denotes the total number of joints; \({p_{i}^{j}}=({x_{i}^{j}}, {y_{i}^{j}}, {z_{i}^{j}})\) represents the 3D coordinates of the j-th joint in Fi. The skeleton joints are projected onto three orthogonal Cartesian planes (the xy, xz and yz planes). For each plane, the joints of each frame are mapped into an image. For example, for the xy plane, the joint coordinates (x,y) of a skeleton sequence are used directly as the image coordinates (x,y), the z value is encoded in the B channel, and the R and G channels are set to 0. The B channel is then normalized as follows:

$$ {Z_{i}^{j}}=\frac{Z_{max}-{z_{i}^{j}}}{Z_{max}} \times 255 $$
(2)

where \({Z_{i}^{j}}\) is the normalized value written to the B channel for the j-th joint of the i-th frame, \({z_{i}^{j}}\) is its original z value, and Zmax is the maximum z value over all joints in the skeleton sequence.
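For concreteness, the following is a minimal sketch of this xy-plane projection, assuming the joint coordinates have already been scaled to pixel positions and that the image resolution is chosen freely (both are assumptions, not the authors' exact settings):

```python
import numpy as np

def project_frame_xy(joints, z_max, img_size=(240, 240)):
    """Map one frame's joints onto the xy plane, with Eq. (2) for the B channel.

    `joints` is an (m, 3) array of (x, y, z) values already scaled to pixel
    coordinates; `z_max` is the maximum z over the whole sequence. The image
    size and the single-pixel drawing are illustrative only; in the paper each
    joint is drawn as a body-part-specific shape (see the grouping below).
    """
    h, w = img_size
    img = np.zeros((h, w, 3), dtype=np.uint8)      # R and G channels stay 0
    for x, y, z in joints:
        b = (z_max - z) / z_max * 255.0            # Eq. (2): normalized depth
        img[int(round(y)), int(round(x)), 2] = np.uint8(np.clip(b, 0, 255))
    return img
```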

In order to distinguish different body parts, the skeleton of the human body is divided into five parts: left upper body part (left_shoulder, left_elbow, left_wrist and left_hand), left lower body part (left_hip, left_knee, left_ankle and left_foot), right upper body part (right_shoulder, right_elbow, right_wrist and right_hand), right lower body part (right_hip, right_knee, right_ankle and right_foot) and middle body part (head, shoulder_center, spine and hip_center). Each part is represented using a different shape.
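This grouping can be written down directly as a lookup table. The joint names below follow the Kinect-style naming used above, while the concrete marker shapes are only an illustrative assumption, since the paper does not specify which shape is assigned to which part:

```python
# Five body parts and one (hypothetical) marker shape per part.
BODY_PARTS = {
    "left_upper":  ["left_shoulder", "left_elbow", "left_wrist", "left_hand"],
    "left_lower":  ["left_hip", "left_knee", "left_ankle", "left_foot"],
    "right_upper": ["right_shoulder", "right_elbow", "right_wrist", "right_hand"],
    "right_lower": ["right_hip", "right_knee", "right_ankle", "right_foot"],
    "middle":      ["head", "shoulder_center", "spine", "hip_center"],
}
PART_SHAPES = {  # which shape to draw for each part (illustrative choice)
    "left_upper": "circle", "left_lower": "square",
    "right_upper": "triangle", "right_lower": "diamond", "middle": "cross",
}
```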

After projecting the skeleton sequence onto the three coordinate planes, we obtain the skeleton distribution maps D for the three different views as follows:

$$ D_{v}=\left\{ma{p_{v}^{1}}, ma{p_{v}^{2}},...,ma{p_{v}^{k}},...,ma{p_{v}^{n}}\right\} $$
(3)

where v ∈ {xy,xz,yz} denotes the three projected views in the three orthogonal Cartesian planes, and \(ma{p_{v}^{k}}\) represents the k-th frame of skeleton data projected onto view v.

3.2 Temporal pyramid processing

The work in [33] proposed the Fourier Temporal Pyramid (FTP), a descriptive representation of the temporal structure of actions that explores temporal information by partitioning a sequence and finally combining the Fourier features of each segment. We apply the Temporal Pyramid to skeleton-based human action recognition.

A skeleton distribution map is split into w segments, with w = 1, 2, 4 being the most common choices. Considering the computational cost and effectiveness, w = 1 and w = 4 are adopted for skeleton segmentation to construct a two-layer architecture. The segmented skeleton distribution map is expressed as follows:

$$ Se{g_{v}^{k}}=\{map_{v}^{\frac{n}{w}*(k-1)+1}, map_{v}^{\frac{n}{w}*(k-1)+2},...,map_{v}^{\frac{n}{w}*k}\} \quad k \in\{1,2,...,w \} $$
(4)

where w is the number of segments of the skeleton distribution map, \(Se{g_{v}^{k}}\) is the k-th segment of the skeleton distribution map in view v, and n is the number of frames in a skeleton distribution map.
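As a sketch, the segmentation of Eq. (4) amounts to slicing the ordered list of per-frame maps into w equal parts (assuming, as in Eq. (4), that n is divisible by w):

```python
def split_into_segments(frame_maps, w):
    """Split the ordered per-frame maps {map_v^1, ..., map_v^n} into w
    equal temporal segments, as in Eq. (4)."""
    n = len(frame_maps)
    seg_len = n // w
    return [frame_maps[k * seg_len:(k + 1) * seg_len] for k in range(w)]

# Two pyramid levels are used in this paper:
# segments_w1 = split_into_segments(frame_maps, 1)   # whole sequence
# segments_w4 = split_into_segments(frame_maps, 4)   # four segments
```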

3.3 Segmented skeleton motion maps processing

The work in [43] proposed Depth Motion Maps (DMMs), which are computed by accumulating the absolute differences between consecutive frames of a depth map sequence. Extending this idea to skeleton sequences, we map each segmented skeleton distribution map into a segmented skeleton motion map (S-SMM). The motion energy is stacked within each skeleton distribution map segment as follows:

$$ S\textrm{-}SM{M_{v}^{k}} = \sum\limits_{j = 1}^{M} {\left| {map_{v}^{j + 1} - ma{p_{v}^{j}}} \right|} \quad ma{p_{v}^{j}}\in{Se{g_{v}^{k}}} $$
(5)

where \(M=\frac {n}{w}\) is the number of frames in a segmented skeleton distribution map.
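A minimal sketch of Eq. (5): within one segment, the absolute differences between consecutive per-frame maps are accumulated into a single motion map:

```python
import numpy as np

def segment_smm(segment_maps):
    """Accumulate the motion energy of one segment (Eq. 5) by summing the
    absolute differences between consecutive per-frame maps."""
    smm = np.zeros_like(segment_maps[0], dtype=np.float32)
    for prev, nxt in zip(segment_maps[:-1], segment_maps[1:]):
        smm += np.abs(nxt.astype(np.float32) - prev.astype(np.float32))
    return smm
```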

3.4 Pseudo-color coding

Color coding can exploit the perceptual capabilities of the human visual system to extract more information from grayscale images and effectively enhance the detailed texture patterns contained in an image. To better exploit the motion patterns, the temporal information and the depth values of the joints are used to encode the S-SMMs. Each S-SMM has its fixed color in the R and G channels, and the depth value of the joint is mapped into the B channel.

When a skeleton distribution map is divided into four segments (w = 4), we set the (R, G) values of the four colormaps to (0, 0.498), (0.498, 1), (1, 0.502) and (0.502, 0), respectively. The B channel of the four colormaps changes as follows:

$$ B= \left\{\begin{array}{ll} (131+4\times (I-1))\div 255& {0 \le \text{I} < 32}\\ 1 & {32 \le \text{I} < 96}\\ (255-4\times(I-96)+1)\div 255 & {96 \le \text{I} < 160}\\ 0 & {160 \le \text{I} \le 255} \end{array}\right. $$
(6)
$$ I=\frac{g-g_{min}}{g_{max}-g_{min}}\times 255 $$
(7)

where I is the normalized gray value, and gmax and gmin are the maximum and minimum values of the S-SMM, respectively.
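The piecewise mapping of Eqs. (6)-(7) can be transcribed directly; the sketch below assumes the S-SMM is a grayscale array and returns B values in [0, 1], alongside the fixed (R, G) pairs listed above:

```python
import numpy as np

# Fixed (R, G) values of the four hand-designed colormaps for w = 4.
RG_VALUES = [(0.0, 0.498), (0.498, 1.0), (1.0, 0.502), (0.502, 0.0)]

def b_channel(gray, g_min, g_max):
    """B-channel values of Eqs. (6)-(7) for a gray-valued S-SMM."""
    I = (gray - g_min) / (g_max - g_min) * 255.0     # Eq. (7)
    b = np.zeros_like(I, dtype=np.float32)           # I >= 160 stays 0
    m1 = (I >= 0) & (I < 32)
    m3 = (I >= 96) & (I < 160)
    b[m1] = (131 + 4 * (I[m1] - 1)) / 255.0
    b[(I >= 32) & (I < 96)] = 1.0
    b[m3] = (255 - 4 * (I[m3] - 96) + 1) / 255.0
    return b
```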

The effect of the four colormaps as the value of I changes is shown in Fig. 4. It can be seen from the figure that the four colormaps occupy four different color ranges, and there is no overlapping color between them.

Fig. 4

The four hand-designed colormaps; each colormap takes different colors as I changes

Finally, Temporal Pooling is applied to the colored S-SMMs to merge them into the Temporal Pyramid Skeleton Motion Maps (TPSMMs), which can then be used as the input of the convolutional neural networks.
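The paper does not spell out the pooling operator; a minimal sketch, assuming element-wise max pooling across the colored S-SMMs of one view, is:

```python
import numpy as np

def temporal_pooling(colored_smms):
    """Merge the w colored S-SMMs of one view into a single TPSMM.

    Element-wise max pooling is assumed here: at every pixel the strongest
    color response among the segments is kept.
    """
    return np.maximum.reduce([np.asarray(s, dtype=np.float32) for s in colored_smms])
```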

3.5 Network training

Six ConvNets are trained on the TPSMMs constructed from the skeleton sequences in the three Cartesian planes for the two layers of our architecture, respectively. The six ConvNets adopt the layer configuration of DenseNet-121 [11]. The deep DenseNet contains several dense blocks and transition layers, as shown in Table 1. Each dense block consists of two kinds of “conv” layers, and each “conv” layer corresponds to the sequence BN-ReLU-Conv. The numbers of filters of the first and second “conv” layers are 128 and 32, respectively. The transition layers used in our experiments consist of a batch normalization layer and a 1 × 1 convolutional layer followed by a 2 × 2 average pooling layer.

Table 1 DenseNet architectures for ImageNet

The implementation is based on the publicly available Keras toolbox [3] with a TensorFlow backend, running on one NVIDIA GeForce GTX TITAN X card. The models pre-trained on ImageNet [15] are used for initialization during training. The network weights are learned using mini-batch stochastic gradient descent with the momentum set to 0.9 and the weight decay set to 0.0005. All hidden weight layers use the rectified linear unit (ReLU) activation function. The batch size is set to 32 at each iteration, and all images are resized to 224 × 224. The learning rate is set to \(10^{-3}\) for fine-tuning the pre-trained models and is then decreased according to a fixed schedule, which is kept the same for all training sets. For each ConvNet, the training undergoes 40 epochs.
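For illustration, a minimal Keras sketch with the stated settings (DenseNet-121 pre-trained on ImageNet, SGD with momentum 0.9, learning rate \(10^{-3}\), batch size 32, 224 × 224 inputs); the data pipeline, the learning rate schedule and the way weight decay is applied (here approximated by L2 regularization on the classification head) are assumptions, not the authors' exact code:

```python
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l2

def build_convnet(num_classes):
    # DenseNet-121 backbone initialized with ImageNet weights.
    base = DenseNet121(weights="imagenet", include_top=False,
                       input_shape=(224, 224, 3))
    x = GlobalAveragePooling2D()(base.output)
    out = Dense(num_classes, activation="softmax",
                kernel_regularizer=l2(5e-4))(x)   # rough stand-in for weight decay
    model = Model(base.input, out)
    model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_convnet(num_classes=27)            # e.g. 27 actions in UTD-MHAD
# model.fit(train_images, train_labels, batch_size=32, epochs=40)
```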

3.6 Score fusion for classification

Given a test skeleton sequence (sample), six TPSMMs are generated and fed into the six trained ConvNets. The score vectors of the six networks are fused, and the class with the maximum score in the resultant vector is assigned as the recognized class of the test sequence. Common fusion methods include max-score fusion, summation-score fusion and multiply-score fusion, in which the score vectors output by the six networks are maximized, added and multiplied element-wise, respectively. Compared with max-score fusion, summation-score fusion [40, 41] and multiply-score fusion perform better in our application and are expressed as follows:

$$ label=Fin(\alpha (v_{11} + v_{12} + v_{13})+ \beta (v_{21} + v_{22} + v_{23})) $$
(8)
$$ label=Fin(v_{11} \circ v_{12} \circ v_{13} \circ v_{21} \circ v_{22} \circ v_{23}) $$
(9)

where v11, v12 and v13 are the score vectors of the three TPSMMs with w = 1, and v21, v22 and v23 are the score vectors of the three TPSMMs with w = 4. α and β are weights with α + β = 1, ∘ refers to element-wise multiplication, and Fin() is a function that finds the index of the element with the maximum score.
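In code, Eqs. (8) and (9) reduce to a weighted sum and an element-wise product of the six score vectors, followed by an argmax (a sketch assuming NumPy arrays of per-class probabilities):

```python
import numpy as np

def sum_fusion(v11, v12, v13, v21, v22, v23, alpha=0.5, beta=0.5):
    """Weighted summation-score fusion, Eq. (8)."""
    scores = alpha * (v11 + v12 + v13) + beta * (v21 + v22 + v23)
    return int(np.argmax(scores))

def multiply_fusion(v11, v12, v13, v21, v22, v23):
    """Multiply-score fusion, Eq. (9): element-wise product of all six vectors."""
    scores = v11 * v12 * v13 * v21 * v22 * v23
    return int(np.argmax(scores))
```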

4 Experiments

The proposed method is evaluated on three widely used benchmark RGB-D datasets, namely the SYSU-3D [9], MSRC-12 [6] and UTD-MHAD [2] datasets. These three datasets cover a wide range of action types, including simple actions, daily activities, human-object interactions and fine-grained activities. The results on the three datasets and a detailed analysis are presented below.

4.1 Comparison of TPSMMs in different views

The effectiveness of different views and different skeleton segmentations was evaluated on the three datasets, and the recognition accuracies are listed in Table 2. From the table, we can see that the TPSMMs with w = 4 perform better than the TPSMMs with w = 1 because more colors are encoded into the TPSMMs.

Table 2 Comparison of different views and different skeleton segments on three datasets in terms of recognition accuracy

4.2 Comparison of three score fusion methods

Table 3 shows the recognition accuracy for different weights α and β on the three datasets. It can be seen from Table 3 that different weights α and β affect the recognition accuracy differently. For the SYSU-3D dataset, the best result is obtained when α and β are both 0.5, while for the UTD-MHAD and MSRC-12 datasets the best performance is obtained when α and β are 0.3 and 0.7, respectively. The comparison of the three score fusion methods on the three datasets for final recognition is listed in Table 4. From the table, we can see that on the three evaluated datasets, summation-score and multiply-score fusion consistently outperform max-score fusion. This verifies that the six TPSMMs are likely to be statistically independent and provide complementary information.

Table 3 Comparison of different weights α and β on three datasets in terms of recognition accuracy
Table 4 Comparison of three fusion methods on three datasets in terms of recognition accuracy

4.3 Comparison of different color encoding methods

We conduct extensive experiments with different color encodings to verify the effectiveness of the individual technical contributions, as follows:

  • Overlapping: We simply use overlapping instead of temporal pooling to combine the colored S-SMMs and generate the TPSMMs.

  • Jet colormap: We only utilize the jet colormap, which ranges from blue to red and passes through cyan, yellow and orange, and divide the jet map into four parts instead of using the original four hand-designed colormaps.

  • Temporal encoding: A skeleton sequence is split into w segments to obtain w S-SMMs, and the jet map is divided into w parts. The depth information of the skeleton is not used. For each S-SMM, the middle color of the corresponding colormap is selected, and temporal pooling is used to combine the colored S-SMMs.

We use the same w = 1 scores and multiply-score fusion to compare the different encoding methods. Table 5 shows the results of the different encoding methods on the UTD-MHAD dataset. From the table, it can be seen that the encoding method using temporal pooling improves the performance. We also obtain similar results when using other colormaps in the proposed encoding method. As w increases, the result of the temporal-encoding based method remains stable, but it is still worse than our encoding method.

Table 5 Comparison of the different color encoding on UTD-MHAD Dataset

4.4 SYSU-3D human-object interaction dataset

The SYSU-3D dataset [9], collected with the Microsoft Kinect, contains 12 human-object interaction actions performed by 40 subjects, for a total of 480 skeleton sequences. It is very challenging due to the similar motion patterns and the large viewpoint variations. We follow the standard protocol [9] to evaluate our method: half of the subjects are used for training and the rest for testing. Table 6 shows the results of the proposed method and its comparison to the state-of-the-art results achieved by both hand-crafted feature based methods and deep learning based methods. Compared with most methods [9, 21, 30, 45], our method performs better. From the table, we can see that the deep learning based methods [21, 30, 45] outperform the hand-crafted feature based methods [9, 10]. The method combining RNN and CNN [39] performs better than the methods using only RNNs [21, 22, 45], which tend to overemphasize the temporal information. However, the accuracy drops dramatically when deep networks are used, due to over-fitting on this small dataset. The graph-based method [30] is not better than some RNN-based methods [22, 45], since its spatial graph is built over clusters, each of which is assigned a single weight, and thus it may not capture the delicate pairwise spatial relationships among joints.

Table 6 Comparison of the proposed method with existing methods on the SYSU-3D Dataset

The confusion matrix is shown in Fig. 5. From the confusion matrix we can see that the proposed method cannot distinguish some actions well, such as “Drinking”, “calling phone”, “mopping” and “sweeping”. Skeleton data does not carry any appearance information; these actions are performed with similar motions, and the TPSMMs are not effective in these cases. The method also does not perform well for the action “Packing backpacks” due to viewpoint changes.

Fig. 5

Confusion matrix for TPSMMs on SYSU-3D Dataset

4.5 UTD-MHAD dataset

UTD-MHAD [2] contains 27 actions performed by 8 subjects (4 females and 4 males), with each subject performing each action 4 times. For this dataset, the cross-subject protocol of [2] is adopted, namely, the data from subjects 1, 3, 5 and 7 are used for training and those from subjects 2, 4, 6 and 8 for testing. The results are shown in Table 7. Note that the method in [2] uses both depth and inertial sensor data. It can be seen that the proposed method performs much better than the hand-crafted feature based methods and the ConvNets-based fusion methods. We can also see that the deep learning based methods [8, 38] outperform the hand-crafted feature based methods [2, 47]. The method in [36] is not better than the hand-crafted feature based methods because temporal information is lost in the WHDMMs. Our proposed TPSMMs perform better than the encoding methods that use only temporal information [8, 38], which cannot capture long-term temporal information due to trajectory overlapping.

Table 7 Comparison of the proposed method with previous approaches on UTD-MHAD Dataset

The confusion matrix is shown in Fig. 6. From the confusion matrix, we can see that the proposed method cannot distinguish some actions well, such as “Throw” and “Draw circle CCW”, since their TPSMM images are visually similar, probably because of the 3D-to-2D projection. In addition, the execution rate is not considered in our proposed method, which leads to poor recognition of the action “jog”. However, the proposed method can distinguish some similar actions, such as “knock” and “catch”, which the JTM and SOS methods cannot recognize well.

Fig. 6

Confusion matrix for TPSMMs on UTD-MHAD Dataset

4.6 MSRC-12 Kinect Gesture Dataset

MSRC-12 [6] is a relatively large dataset for gesture or action recognition from 3D skeleton data captured with a Kinect sensor. The dataset has 594 sequences containing 12 gestures performed by 30 subjects, with 6244 gesture instances in total. The 12 gestures are “lift outstretched arms”, “duck”, “push right”, “goggles”, “wind it up”, “shoot”, “bow”, “throw”, “had enough”, “beat both”, “change weapon” and “kick”. For this dataset, the cross-subject protocol is adopted, that is, odd-numbered subjects are used for training and even-numbered subjects for testing. Table 8 lists the performance of the proposed method.

Table 8 Comparison of the proposed method with previous approaches on MSRC-12 Dataset

The confusion matrix is shown in Fig. 7. From the confusion matrix, we can see that the proposed method distinguishes most actions very well. Our proposed method outperforms JTM in mining spatial information: JTM cannot effectively distinguish “goggles” from “had enough”, whereas these actions are distinguished quite well by our method.

Fig. 7

Confusion matrix for TPSMMs on MSRC-12 Dataset

4.7 Computational cost

Table 9 shows the total time and the time per sample of Cov3DJ [12], ELC-KSVD [47], SOS [8] and the proposed method on the UTD-MHAD dataset. 431 samples from UTD-MHAD were used for training and 430 samples for testing. The computing platform is Ubuntu 16.04 with an Intel E5-2640 v4 processor and an NVIDIA TITAN X GPU card. Both Cov3DJ [12] and ELC-KSVD [47] were implemented in Matlab and executed on the CPU. The SOS method was implemented in Matlab and Caffe, and executed on the CPU and GPU. The proposed method was implemented in Matlab and Keras, and executed on the CPU and GPU. The total time and time per sample for training and testing are reported in Table 9. The CPU time of the proposed method is mainly spent on generating the TPSMM images.

Table 9 Comparison of the computational costs of different methods

5 Conclusion

This paper addresses the problem of skeleton-based human action recognition with ConvNets, and an effective texture image representation called Temporal Pyramid Skeleton Motion Maps (TPSMMs) is proposed to extract the spatio-temporal information of skeleton sequences. The proposed method is evaluated on three datasets, namely the SYSU-3D, UTD-MHAD and MSRC-12 datasets. Experimental results show that the proposed method can effectively utilize the spatio-temporal information of skeleton data. However, some actions are still difficult to distinguish. In future work, appearance information and execution rate will be considered to achieve better performance.