Keywords

1 Introduction

Extracting robust visual feature is an essential step towards achieving high performance in real-world visual recognition tasks. Recently, large-scale feature learning has achieved great success in various visual recognition tasks, in which powerful machine learning algorithms are integrated closely with high performance parallel computing platforms, clusters or GPU, so as to support data intensive computing from millions of training images. In this paper, we propose a novel large-scale feature learning architecture based on Slow Feature Analysis (SFA) [15] and Apache Spark [2], where the slowness learning principle is implemented to learn invariant visual features from millions of local image patches.

SFA is an unsupervised feature learning method, which intends to learn the invariant and slowly varying features from input signals. It has been successfully used for action recognition [12, 21], trajectory clustering [20], handwriting digital recognition [7] and dynamic scenes classification [14]. Recently, Escalante-B and Wiskott [16] extend the SFA to supervised dimensionality reduction and the notion of slowness is generalized from temporal sequences to training graphs (a.k.a GSFA). Different with previous work, e.g., Sparse coding [9] and Autoencoder [6], most of which learn features to minimize the reconstruction error, we utilize the GSFA to explore the neighbor relationships in large-scale local patches from a view of manifold learning, where the learnt features will encode the intrinsic local geometric structures existing in a large number of training local patches.

Feature learning by the GSFA is to construct a feature space, in which the nearby points in original space will have similar locations. Thinking about a gray image with 16\(\times \)16 pixels, and the value of each pixel ranges from 0 to 255. It means that there may exits \(256^{256}\) possibilities [19] in the input space. Thus, the feature space learnt with limited data may have two issues: (1) Losing mush useful information makes visual samples belong to different patterns locate nearly in the feature space, i.e., inaccuracy. (2) Incomplete distribution of the training samples makes the learnt features are not effective in practical application, i.e., overfitting. On the other hand, it is the most important step of GSFA to construct the local neighbor relation graph in input space. Such graph is usually constructed through finding the top k nearest neighbors (k-NN). The hypothesis, near patches have similar patterns, is established only with large-scale data. While the inaccurate k-NN results will result in inaccurate slow feature functions.

As mentioned above, it is necessary to learn robust features with large-scale data. Meanwhile, relation graph construction in the input space is the key step of GSFA. However, it is a \(O\left( n^{2}\right) \) problem [13] to construct it with k-NN searching. So it is a computational challenge with such large-scale data (about 10 million in this paper). The available of high performance hardware and parallel computing platform make it possible to alleviate this problem. Now, Hadoop [1] and Spark [2] attract much attention in big data tasks. The advantages of Spark, i.e., in-memory computation, sufficient APIs and good compatibility, make it faster and more convenient than MapReduce in Hadoop. Recently, some projects on deep learning with Spark are released, such as CaffeOnSpark [17] and SparkNet [10].

Motivated by above reasons, in this paper we construct a parallel architecture for large-scale slow feature learning using Spark. In summary, the efforts of this paper includes:

  1. (1)

    We present a new large-scale feature learning architecture based on SFA and Spark, which is used to train invariant features with more than 10 million of local patches. The task can be completed efficiently within 140 h.

  2. (2)

    We study the effect of the quantity of training patches for the visual recognition task on pedestrian recognition based on this architecture. The experimental results demonstrate that the performance is indeed improved significantly with the growth of training patches consistently.

  3. (3)

    The learnt slow features can achieve superior performance than that of classical hand-crafted feature, i.e., Histogram of Oriented Gradients (HOG) [3] feature, which validate the effectiveness of large-scale slow feature learning.

2 Method

In this section, we first give a brief introduction on SFA, mainly about its essential problem and mathematical representation. Then, we explain the details about the parallel computing architecture for large-scale SFA with Spark. Finally, we will discuss how to use the learnt slow feature functions for an instance of visual recognition tasks, i.e., pedestrian recognition.

2.1 Slow Feature Analysis

Thinking about an object move through one’s visual field, the signals fall on his retina change rapidly. However, the relevant abstract information, e.g. identity, changes slowly in a timescale [16]. It is known as slowness principle which first formulated by Hinton in 1989 [5]. Based on the slowness principle, SFA is to learn a group of slow feature functions to transform the input signal to a new feature space, in which the transformed signal varies as slow as possible. It is first proposed by Wiskott [15]. GSFA [16] is the extension of SFA, which generalizes the slowness principle from sequences to a training graph. Because the lack of temporal information in the training local patches, we adopt the GSFA to learn slow features in this paper. Mathematically, GSFA is formulated as follows:

Given a training graph \(G=(V,E)\) with a set of nodes \(V=\lbrace \varvec{x}(1),..,\varvec{x}(N)\rbrace \) and a set of edges \(E:=(\varvec{x}(n), \varvec{x}(n^\prime ))\), where \(\langle n,n^\prime \rangle \) denotes a pair of samples with \(1\le {n,n^\prime }\le N\). GSFA aims at learning a group of input-output functions \(\lbrace g_j \rbrace ,1\le j\le J\) such that the J-dimensional output signals \(y_j(n)=g_j(\varvec{x}(n))\) have the minimum difference with their nearest neighbors,

$$\begin{aligned} minimize~\varDelta _j=\frac{1}{R}\sum _{n,n^\prime }\gamma _{n,n^\prime }\left( y_j\left( n^\prime \right) -y_j\left( n\right) \right) ^2 \end{aligned}$$
(1)

s.t.

$$\begin{aligned} \frac{1}{Q}\sum _{n}\nu _ny_j(n)=0~~weighted~zero~mean~, \end{aligned}$$
(2)
$$\begin{aligned} \frac{1}{Q}\sum _{n}\nu _n\left( y_j\left( n\right) \right) ^2=1~~weighted~unit~variance~, \end{aligned}$$
(3)
$$\begin{aligned} \frac{1}{Q}\sum _{n}\nu _ny_j(n)y_{j^\prime }(n)=0~~weighted~decorrelation~, \end{aligned}$$
(4)

with

$$\begin{aligned} R=\sum _{n,n^\prime }\gamma _{n,n^\prime }~and~Q=\sum _n\nu _n~, \end{aligned}$$
(5)

where \(\nu \) and \(\gamma \) denote the weight of node and the weight of edge respectively. And \(\gamma _{n,n^\prime }=\gamma _{n^\prime ,n}\) means that the edges are undirected. Constraint (2) is used to make constraint (3) and (4) with concise form. Constraint (3) avoids the trivial solution. And constraint (4) ensures that different functions \(g_j\) take different properties of the input signal. Meanwhile constraint (4) enforces the functions \(g_j\) are ordered based on their slowness, in which the first is the slowest one.

For the linear GSFA, \(g_j(\varvec{x})=\varvec{w}_j^T\varvec{x}\). This optimization problem equals to solve a generalized eigenvalue problem,

$$\begin{aligned} AW=BW\varLambda , \end{aligned}$$
(6)

with

$$\begin{aligned} A=\frac{1}{R}\sum _{n,n^\prime }\gamma _{n,n^\prime }(\varvec{x}(n^\prime )-\varvec{x}(n))(\varvec{x}(n^\prime )-\varvec{x}(n))^T \end{aligned}$$
(7)

and

$$\begin{aligned} B=\frac{1}{Q}\sum _n\nu _n(\varvec{x}(n)-\widehat{\varvec{x}})(\varvec{x}(n)-\widehat{\varvec{x}})^T \end{aligned}$$
(8)

where \(\widehat{\varvec{x}}={(1/Q)}\sum _n\nu _n\varvec{x}(n)\) is the weighted mean of all the samples. For the nonlinear GSFA, the inputs can be firstly mapped to a nonlinear expansion space and then solve the problem with the steps in the linear case. The nonlinear expansion function is defined by

$$\begin{aligned} \varvec{h}(\varvec{x}):=[h_1(\varvec{x}),...,h_M(\varvec{x})]~. \end{aligned}$$
(9)

So the slow feature functions of GSFA can be calculated with three steps:

  1. (1)

    Construct the undirected graph \(G=(V,E)\); and confirm the weights of nodes and edges. In this paper, \(\nu _n=1\) and \(\gamma _{n,n^\prime }=1/N_{edges}\).

  2. (2)

    Map the inputs to a nonlinear space with a nonlinear function and centralize them. In this paper, we use the quadratic expansion. For a I-dimensional input signal, \(\varvec{h}(\varvec{x})=[x_1,...,x_I,x_1 x_1,x_1 x_2,...,x_1 x_I]\).

  3. (3)

    Solve the generalized eigenvalue problem \(AW=BW\varLambda \). The eigenvectors corresponding to the first J smallest eigenvalues are selected as slow feature functions.

2.2 Large-Scale SFA in Spark

The proposed parallel computing architecture for large-scale SFA runs on a cluster consists of 10 machines with 96 CPU cores, as shown in Fig. 1, in which one for master node and the other nines for worker nodes. And the total memory is 510 GB. Since the advantages of Spark in big data processing, we choose it as our computational platform to implement SFA with large-scale data. For a Spark platform, the core concept is Resilient Distributed Datasets (RDD) [18], which is a distributed memory abstraction for in-memory computations in cluster. Furthermore, the application running on Spark consists of a driver program and many executors, in which driver program runs the main function and executors run different tasks. The driver program launches different tasks to workers with a variance of RDD transformations and actions (shown in the brace of Fig. 1). The launched tasks will be implemented by the executors in each worker, and the results will return to driver program or produce new RDDs. Our large-scale SFA parallel architecture based on above principle as well.

Fig. 1.
figure 1

An overview of the computing platform with Spark.

Fig. 2.
figure 2

Diagram of large-scale SFA in Spark for pedestrian classification. The upper part is the architecture for large-scale SFA in Spark. The lower part shows how to extract feature using slow feature functions in pedestrian classification. Two methods (with max pooling or not) are tested in experiment. (Color figure online)

An overview of the architecture for large-scale SFA is shown in Fig. 2. We may summarize the large-scale SFA with Spark as five tasks:

  1. (1)

    Data loading and patches extraction. Load the images data from hard disks to Spark RDD. Then extract patches from each image by dense sampling in parallel. The extracted patches will be stored as new Spark RDDs.

  2. (2)

    Pre-processing. Implement PCA whitening on extracted patches to remove the redundant information and make the patches with zero mean and unit variance. It will also bring convenience for the following operations. Its result (a transformation matrix) will act on the input of nonlinear expansion.

  3. (3)

    Graph construction. It is the core step in GSFA. We construct the graph with k-NN searching in parallel, in which the broadcast variables are used to improve the efficiency and the patches are divided into multiple RDDs (see Fig. 2) to satisfy the limitations of memory resources. FLANNFootnote 1 is used for k-NN searching locally (more details in Algorithm 1). The k-NN results and their source patches are used to compute the differences matrix after nonlinear expansion.

  4. (4)

    Nonlinear expansion (Eq. (9)). Implement a quadratic expansion to the pre-processed patches data in parallel.

  5. (5)

    Solve the generalized eigenvalue problem. Firstly, compute the covariance matrices (Eqs. (7) and (8)) in parallel. Then collect the matrices from Spark RDD to the memory of master node to calculate the eigenvalues and eigenvectors.

figure a

2.3 Pedestrian Recognition Using Slow Feature Functions

Pedestrian recognition is one of the essential tasks in visual object recognition. Instead of detecting pedestrian over whole images where the performance depends not only on the feature itself, but also on the post-processing step, e.g., Non-Maximum Suppression (NMS) [11], we only perform binary classifications over a set of candidate windows extracted from whole images, for a straight evaluation of the effectiveness of the slow features.

As shown in the lower part of Fig. 2, the learnt function can be seen as a filter (red block). For a candidate window (input), it is filtered using all the learnt functions with a specific stride that will generate a number of feature maps where each feature map corresponds to one slow feature function. Then, a nonlinear normalization function, e.g., the Sigmoid function used in this paper, is performed to clip the responses with large values. After that, we test two methods to generate the final feature vector: (1) concatenate the normalized feature maps directly; (2) execute max-pooling operation over a local region on the feature maps before the concatenation. To predict whether there is a person in the input window or not, a linear Support Vector Machine (SVM) is trained as the classifier.

3 Experimental Results

In this paper, we select the INRIA Person dataset [3] as the training and test dataset. It totally includes 2416 positive training windows and 1126 positive test windows, where each window corresponding to one person annotated in the dataset. To generate negative windows, we randomly extract ten windows per training negative image (12180 windows) and five windows per test negative image (2265 windows). The size of each window is 96\(\times \)160 pixels.

We perform the slow feature learning with two strategies, i.e., unsupervised SFA and supervised SFA. For the supervised SFA, three kinds of methods are implemented as shown in the lower part of Table 1. The key parameters in training SFA are chosen as follows: patch size is 16\(\times \)16 pixels, k in k-NN searching is 3 and dense sampling stride is 4. When generate final features, we set filtering stride to 8 and window size to 64\(\times \)128 pixels. For different numbers of selected functions (as shown in Table 1), the dimensions of final features are 2100, 2100, 13000 and 3150, respectively. Figure 3 shows the results of the visualization of the first 25 slow feature functions learnt by supervised and unsupervised learning strategies.

Fig. 3.
figure 3

Visualization of learnt slow feature functions.(a) The slow feature functions learnt by negative samples. (b) The slow feature functions learnt by positive samples. (c) The slow feature functions learnt by unsupervised SFA.

We study the relationship between the populations of training patches and the classification performance, under the case of unsupervised SFA. From Fig. 4, it clearly shows that the classification performance can be improved (precision is increased and miss rate is decreased) with the growth of training patches, which indicates that the large-scale training samples enhance the ability of the learnt slow features for pedestrian recognition. Meanwhile, compare the elapsed time of SFA learning between single-core machine and our proposed SFA+SPARK architecture, the speed-up ratio ranges from 3.56 to 14.67 when the number of training patches increases from 6k to 120k. It indicates that the efficiency has been improved significantly, especially when the data size is very large. And it can handle more than 10 million local patches within 140 h efficiently.

Fig. 4.
figure 4

Effect of the number of training samples on classification performance. The variations of precision and miss rate is shown on the left side, while the variation of the number of false positives located on the right.

Table 1. Comparison on classification performance with other features.

We compare both unsupervised and supervised SFA with the HOG feature and CNN feature. In experiment, we use the scikit-imageFootnote 2 library to extract HOG feature, in which the cells per block is \(2\times 2\), pixels per cell is \(8\times 8\), number of orientations is 9 and the stride is 8. So the dimension of HOG feature is 3780. While the CNN feature is trained on the INRIA dataset using CaffeFootnote 3. We select the output of fc7 as the final features which dimension is 4096.

The classification results of different methods are list in Table 1. The performance of our method trained with both the positive and negative samples under supervised learning strategy is superior to the performance of the HOG feature. However, it is worse than the CNN feature. It may because the CNN extracts features with multiple layers that can capture much high-level information. The hierarchical SFA [4] learning architecture may be introduced into our future work to improve the performance. From the table, we can also conclude that, the performance of slow functions trained with both the positive and negative samples is superior to that of the slow features trained only using the positive samples. Furthermore, along with the increasing of the number of slow feature functions, the max pooling may achieve more superior performance, compared to the feature representation obtained by directly concatenation. In summary, compared with the HOG feature and CNN feature, the comparable performance validates the effectiveness of our large-scale SFA+SPARK learning architecture.

4 Conclusion

In this paper, we present a large-scale feature learning architecture based on SFA and Apache Spark. It is used to learn invariant visual features from more than 10 million of local patches under supervised and unsupervised learning strategies. Then, we extract features using the learnt slow feature functions to do pedestrian classification on the INRIA person dataset. The experiment results indicate that the efficiency of the SFA learning using large-scale data can be improved greatly with our proposed parallel computing architecture. It also demonstrates that the classification performance can be improved along with the increasing of the number of training patches. Furthermore, the performance of learned slow feature is comparable to the classic HOG feature and CNN feature in pedestrian classification.