1 Introduction

In this paper, we consider the regularized risk minimization problem with L1 regularization, formulated as the minimization of the following convex function [12]:

$$\begin{aligned} P(\mathbf {w}) \triangleq \left\| \mathbf {w}\right\| _{1} + C\sum _{i=1}^{n} \ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_{i}) \right\rangle ,y_{i}). \end{aligned}$$
(1)

Here, \({{\mathbf {x}}}_i \in \mathcal {X}\) and \(y_i\in \mathcal {Y}\) denote the input and output of a datapoint, \(\mathbf {w}\in \mathbb {R}^p\) is the parameter vector, \(\ell :\mathbb {R}\times \mathcal {Y}\rightarrow \mathbb {R}\) is convex in its first argument for any given \(y\in \mathcal {Y}\), and \(\phi :\mathcal {X}\rightarrow \mathbb {R}^p\) is an explicitly given feature function. This formulation covers many important problems in machine learning. For example, \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ,y_i)= \left( \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle -y_i\right) ^2\) yields a problem equivalent to the LASSO introduced in [16]. The setting \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ,y_i)=\log \left( 1 + \exp \left( -y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \right) \right) \) corresponds to L1 logistic regression for binary classification, and the loss \(-\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \otimes y_i \right\rangle + \log \left( \sum _y\exp \left( \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \otimes y \right\rangle \right) \right) \) corresponds to multiclass classification.

With larger numbers of datapoints available owing to recent developments in information technology, it has become important to consider a larger number of features in order to avoid underfitting, fully utilize the data, and enhance the performance of a predictor. Recent intensive research has revealed that complex multilayer nonlinear models can outperform simple linear models without overfitting, despite a much larger hypothesis space [7]. This suggests that preparing a large class of feature functions and adaptively choosing informative ones can enhance the performance of linear models as well.

Conventionally, when solving (1) in practice, computing \(\varPhi \triangleq \{\phi _j({\mathbf {x}}_i)\}_{1\le i \le n, 1\le j\le p}\) and finding a solution are treated as separate steps. We develop the features \(\phi _j({\mathbf {x}}_i)\) for all i and j before optimization (typically storing them in a lower level of memory such as a hard disk), and then run an optimization algorithm to find the solution. However, when the number of features is extremely large, their extraction leads to huge memory costs, associated with storing all the values of \(\varPhi \), and substantial computational time. In particular, when \(\varPhi \) cannot fit into the main memory, running an optimization algorithm with frequent accesses to lower levels of memory is impractical. Moreover, finding the solution requires considerable computational time because of the increased size of the optimization problem. Furthermore, designing feature spaces containing an exponentially large number of features is straightforward in most cases, as discussed in Sect. 3.1. Therefore, the entire scheme, including the optimization algorithm and the method of developing the feature matrix \(\varPhi \), needs to be efficient.

The block minimization scheme presented by Yu et al. was the first scheme proposed for solving the regularized risk minimization problem when the data do not fit in memory [20]. In this scheme, the data are split into several blocks, each containing a relatively small number of datapoints, so that each block fits into a higher level of memory. By considering both primal and dual variables, they obtained a subproblem in which only datapoints in a single block are accessed at a time, and showed that the global solution can be attained by solving such subproblems successively and repeatedly. Matsushima et al. proposed the dual cached loops method, which uses multithreading to simultaneously run a reading thread that accesses the disk to read each datapoint and a training thread that updates the parameters [8]. These schemes all focus on L2 regularized risk minimization problems, for which it is preferable to solve the dual problem rather than the primal problem. For the L1 regularized problem, which has rarely been addressed in this context, it is preferable to solve the primal problem rather than the dual problem.

From an algorithmic perspective, a key insight used to scale up this risk minimization problem is that several optimization methods do not require all the information in \(\varPhi \) at once. Stochastic gradient methods [14] are widely used in large-scale optimization because they require only one instance at a time. Similarly, coordinate descent methods [9] can update part of the parameters, the j-th component of \(\mathbf {w}\) in this case, using only the j-th column of \(\varPhi \), as explained in Sect. 2. Moreover, the L1 regularization in (1) implies that most features are redundant and contribute no information to the estimated predictor. In particular, the following property holds [14].

Property 1

Let \(w^*_j\) be the j-th component of the solution of (1) based on \((\varPhi , \mathbf {y})\). Further, let \(J^*\) denote the set of columns \(\{j \mid w_j^* \ne 0\}\) and \(\varPhi _{J^*}\) be the matrix containing only the columns \(j\in J^*\). Then \(\hat{w}^*_j\), the component of the solution of (1) based on \((\varPhi _{J^*}, \mathbf {y})\) corresponding to j, coincides with \(w^*_j\).

This implies that the intrinsic size of the optimization problem (1) is much smaller than it appears; the problem could be reduced if we knew in advance which components of the parameter vector will vanish. Consequently, the computational effort spent not only on optimizing over such components but also on developing the corresponding features and loading them into memory is wasted.

The dual cached loops scheme, like the block minimization scheme, can easily be combined with the stochastic gradient descent (SGD) method, although this is not explicitly indicated in prior research [8]. Therefore, the L1 regularized problem can, in principle, be solved with SGD on the basis of these schemes even when the dataset does not fit into memory. However, because SGD accesses one row of the data matrix at a time, it cannot exploit the fact that the intrinsic size of the optimization problem is much smaller than it initially appears.

In this study, we develop a scheme that efficiently computes the solution of (1) when \(\varPhi \) is too large to fit in the main memory, by exploiting the structure of coordinate descent algorithms and of L1 regularization problems. We call our scheme feature cached loops (FCL). In this scheme, two threads run asynchronously and simultaneously: one extracts features and the other updates the solution of the optimization problem. We thus aim to efficiently extract features that are effective for the current values of the parameters. As discussed in Sect. 3.2, this algorithm operates on a principle similar to that of boosting algorithms [13].

The remainder of this paper is organized as follows. Section 2 reviews the coordinate descent method, which is the most important building block of our scheme. In Sect. 3, we explain our scheme in detail and discuss similarities with the boosting algorithm. Section 4 presents the experimental evaluation of our method. Finally, Sect. 5 concludes this paper.

2 Coordinate Descent Method

Coordinate descent methods are well-known optimization methods for minimizing convex functions. Recently, several studies have highlighted their computational efficiency, fast theoretical convergence rate, and suitability for large-scale learning [1, 9]. Coordinate descent methods find the solution by selecting one component of the parameters, \(w_j\) in the case of (1), and solving the resulting one-variable optimization problem to update \(w_j\). This procedure is repeated over choices of the parameter component; the coordinate can be chosen cyclically, uniformly at random, or sampled from an arbitrary distribution. In the LASSO formulation, that is, minimizing (1) with \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ,y_i)=\left( \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle -y_i\right) ^2\), the one-variable problem can be solved analytically as follows:

$$\begin{aligned} w^{t+1}_j&= \mathop {\mathrm {argmin}}_{w_j} P(\mathbf {w}^t + (w_j - w^t_j)\mathbf {e}_j )\\&= {\left\{ \begin{array}{ll} w^t_j - \frac{\sum _i \left( \left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle -y_i\right) \phi _{j}({\mathbf {x}}_i) +\frac{1}{2C}}{\sum _{i}{\phi ^2_j({\mathbf {x}}_i)}} &{} w^t_j > \frac{\sum _i \left( \left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle - y_i \right) \phi _j({\mathbf {x}}_i) + \frac{1}{2C}}{ \sum _i \phi ^2_j({\mathbf {x}}_i)} \\ w^t_j - \frac{\sum _i \left( \left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle -y_i\right) \phi _{j}({\mathbf {x}}_i) -\frac{1}{2C}}{\sum _{i}{\phi ^2_j({\mathbf {x}}_i)}} &{} w^t_j < \frac{\sum _i \left( \left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle - y_i \right) \phi _j({\mathbf {x}}_i) - \frac{1}{2C}}{ \sum _i \phi ^2_j({\mathbf {x}}_i)} \\ 0 &{} \mathrm{o.w.} \end{array}\right. } \end{aligned}$$
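The above update is a soft-thresholding step that needs only the column \(\phi _j\) together with the cached inner products \(u_i = \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \) (introduced formally later in this section). A minimal C++ sketch is given below; it is an illustration rather than the authors' code, and the dense column representation and the function name are assumptions made here.

```cpp
// A minimal sketch (not the authors' code) of the closed-form LASSO coordinate
// update above, assuming a dense column phi_j of Phi, targets y, and a cached
// vector u[i] = <w, phi(x_i)> that is kept consistent after every update.
#include <cstddef>
#include <vector>

void lasso_coordinate_update(double& w_j, std::vector<double>& u,
                             const std::vector<double>& phi_j,
                             const std::vector<double>& y, double C) {
    // g = sum_i (u_i - y_i) * phi_j(x_i),  h = sum_i phi_j(x_i)^2
    double g = 0.0, h = 0.0;
    for (std::size_t i = 0; i < phi_j.size(); ++i) {
        g += (u[i] - y[i]) * phi_j[i];
        h += phi_j[i] * phi_j[i];
    }
    if (h == 0.0) return;                        // empty column: nothing to do
    // Soft-thresholding: the three cases of the closed-form update.
    double z_plus  = (g + 0.5 / C) / h;
    double z_minus = (g - 0.5 / C) / h;
    double w_new;
    if      (w_j > z_plus)  w_new = w_j - z_plus;
    else if (w_j < z_minus) w_new = w_j - z_minus;
    else                    w_new = 0.0;
    // Keep the cache u_i = <w, phi(x_i)> consistent.
    for (std::size_t i = 0; i < phi_j.size(); ++i)
        u[i] += (w_new - w_j) * phi_j[i];
    w_j = w_new;
}
```

A sparse implementation would store only the nonzero entries of the column, so that the cost of one update is proportional to the number of nonzeros.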

In binary logistic regression, \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ,y_i)=\log \left( 1 + \exp \left( -y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \right) \right) \), and the one-variable subproblem can no longer be solved analytically. Instead, the quadratic approximation \( P(\mathbf {w}^t + \delta \mathbf {e}_j) \approx P_j^t(w^t_j+\delta )\) is minimized [21]:

$$\begin{aligned} P_j^t(w^t_j+\delta ) \triangleq |w^t_j+\delta | + \nabla _jL(\mathbf {w}^t)\delta + \frac{1}{2}\nabla _{jj}L(\mathbf {w}^t)\delta ^2, \end{aligned}$$

where \(L(\mathbf {w}) \triangleq C\sum _{i=1}^{n} \ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_{i}) \right\rangle ,y_{i})\) denotes the smooth loss term and

$$\begin{aligned} \nabla _j L(\mathbf {w})&= -C\sum _{i=1}^n \frac{y_i \phi _j({\mathbf {x}}_i) }{1+\exp \left( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \right) }, \\ \nabla _{jj} L(\mathbf {w})&= C\sum _{i=1}^n \frac{\phi ^2_j({\mathbf {x}}_i) \exp \left( y_i \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \right) }{\left( 1+\exp \left( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \right) \right) ^2 }. \end{aligned}$$

To stabilize the algorithm, a sufficient decrease condition is checked while the step \(\beta \delta \) is geometrically discounted [17]. This is a modified version of Armijo's rule and is written as

$$\begin{aligned} P(\mathbf {w}^t + \beta \delta \mathbf {e}_j) - P(\mathbf {w}^t) \le \sigma \beta \left( \nabla _jL(\mathbf {w}^t)\delta + |w^t_j+\delta | - |w^t_j| \right) , \end{aligned}$$
(2)

where \(0< \beta \le 1\) is the step scale and \(\sigma \in (0,1)\) is fixed throughout the optimization. First, condition (2) is checked with \(\beta =1\) and \(\delta = \mathop {\mathrm {argmin}}_{\delta } P _j^t(w^t_j+\delta )\). Then \(\beta \) is decreased geometrically until (2) is satisfied. The resulting update is written as

$$\begin{aligned} w_j ^{t+1} = w_j^t + \beta \delta . \end{aligned}$$
(3)

A remarkable property of coordinate descent methods when solving (1) is that each update requires only \(\phi _j\), a single column of \(\varPhi \). In LASSO, by maintaining \(u_i = \left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle \), the update can be written as

$$\begin{aligned} w^{t+1}_j&= {\left\{ \begin{array}{ll} w^t_j - \frac{\sum _i \left( u_i-y_i\right) \phi _{j}({\mathbf {x}}_i) +\frac{1}{2C}}{\sum _{i}{\phi ^2_j({\mathbf {x}}_i)}} &{} w^t_j > \frac{\sum _i \left( u_i - y_i \right) \phi _j({\mathbf {x}}_i) + \frac{1}{2C}}{ \sum _i \phi ^2_j({\mathbf {x}}_i)} \\ w^t_j - \frac{\sum _i \left( u_i-y_i\right) \phi _{j}({\mathbf {x}}_i) -\frac{1}{2C}}{\sum _{i}{\phi ^2_j({\mathbf {x}}_i)}} &{} w^t_j < \frac{\sum _i \left( u_i - y_i \right) \phi _j({\mathbf {x}}_i) - \frac{1}{2C}}{ \sum _i \phi ^2_j({\mathbf {x}}_i)} \\ 0 &{} \mathrm{o.w.} \end{array}\right. }, \end{aligned}$$

where \(\varOmega _j \triangleq \{i \mid \phi _{j}({\mathbf {x}}_i) \ne 0\}\) and \(u_i\) is updated as

$$\begin{aligned} u_i ^{t+1} = u_i^t + (w_j^{t+1} - w_j^{t}) \phi _{j}({\mathbf {x}}_i) \end{aligned}$$

in \(O(|\varOmega _j|)\) time. Similarly, in the case of logistic regression, we can compute \(\nabla _j L(\mathbf {w}^t)\) and \(\nabla _{jj} L(\mathbf {w}^t)\) by maintaining \(u_i^t = \exp \left( y_i\left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle \right) \),

$$\begin{aligned} \nabla _jL(\mathbf {w}^t) = -C\sum _{i\in \varOmega _j} \frac{y_i \phi _j({\mathbf {x}}_i) }{1+ u^t_i}, \end{aligned}$$
(4)
$$\begin{aligned} \nabla _{jj}L(\mathbf {w}^t) = C\sum _{i\in \varOmega _j} \frac{u^t_i \phi _j^2({\mathbf {x}}_i)}{\left( 1+ u^t_i\right) ^2 }, \end{aligned}$$
(5)

where \(u_i\) is updated after each step as

$$\begin{aligned} u_i ^{t+1} = u_i^t \exp (\beta \delta y_i \phi _{j}({\mathbf {x}}_i)). \end{aligned}$$

This technique applies to any loss \(\ell \) that depends on \(\mathbf {w}\) only through \(\left\langle \mathbf {w},\phi ({\mathbf {x}}) \right\rangle \). In logistic regression problems, we maintain \(u_i^t = \exp \left( y_i\left\langle \mathbf {w}^t,\phi ({\mathbf {x}}_i) \right\rangle \right) \) to reduce the number of exponential and logarithm computations.
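As an illustration of these column-wise updates, the following C++ sketch performs one coordinate-descent step for L1-regularized logistic regression; it is not the authors' implementation, and the sparse column type, the default values of \(\sigma \) and the backtracking factor, and the small constant added to the curvature are assumptions made here.

```cpp
// A minimal sketch (not the authors' implementation) of one coordinate-descent
// step for L1-regularized logistic regression. A sparse column of Phi is given
// as (row index, value) pairs, and u[i] caches exp(y_i * <w, phi(x_i)>) as in
// the text; sigma and the backtracking factor are illustrative defaults.
#include <cmath>
#include <cstddef>
#include <vector>

struct Entry { std::size_t i; double value; };   // one nonzero phi_j(x_i)

// Performs the update (3) on w_j, keeps u consistent, and returns the step.
double logreg_coordinate_step(double& w_j, std::vector<double>& u,
                              const std::vector<Entry>& phi_j,
                              const std::vector<int>& y, double C,
                              double sigma = 0.01, double shrink = 0.5) {
    // Gradient (4) and curvature (5), restricted to the nonzeros of column j.
    double g = 0.0, h = 1e-12;                   // tiny constant avoids 0 division
    for (const Entry& e : phi_j) {
        g += -C * y[e.i] * e.value / (1.0 + u[e.i]);
        h +=  C * e.value * e.value * u[e.i] / ((1.0 + u[e.i]) * (1.0 + u[e.i]));
    }
    // Minimizer of the quadratic model |w_j + delta| + g*delta + h*delta^2/2.
    double delta;
    if      (g + 1.0 <= h * w_j) delta = -(g + 1.0) / h;
    else if (g - 1.0 >= h * w_j) delta = -(g - 1.0) / h;
    else                         delta = -w_j;
    // Backtracking line search with the sufficient decrease condition (2).
    double D = g * delta + std::fabs(w_j + delta) - std::fabs(w_j);
    double beta = 1.0;
    for (int trial = 0; trial < 30; ++trial) {
        double diff = std::fabs(w_j + beta * delta) - std::fabs(w_j);
        for (const Entry& e : phi_j) {
            double u_new = u[e.i] * std::exp(beta * delta * y[e.i] * e.value);
            diff += C * (std::log((1.0 + u_new) / (1.0 + u[e.i]))
                         - beta * delta * y[e.i] * e.value);
        }
        if (diff <= sigma * beta * D) break;   // P(w + beta*delta*e_j) - P(w)
        beta *= shrink;                        // discount the step geometrically
    }
    // Apply the accepted step and keep the cache u consistent.
    for (const Entry& e : phi_j)
        u[e.i] *= std::exp(beta * delta * y[e.i] * e.value);
    w_j += beta * delta;
    return beta * delta;
}
```

A cyclic or random sweep over columns, calling this function once per column, reproduces the update (3) together with the maintenance of \(u_i\) described above.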

Another remarkable property of coordinate descent methods is that the optimization problem can be solved efficiently by concentrating updates on the parameters that are nonzero at the optimal solution; that is, for j such that \(w^*_j=0\), the parameter value stays at 0 after a certain point [21]. In other words, for sufficiently large t,

$$\begin{aligned} -1< \nabla _j L(\mathbf {w}^t) < 1 \end{aligned}$$
(6)

will hold for all j such that \(w^*_j=0\). This suggests that it is unlikely to observe a j such that

$$\begin{aligned} -1 +M^t< \nabla _j L(\mathbf {w}^t) < 1 -M^t \end{aligned}$$
(7)

and \(w^*_j \ne 0\) hold simultaneously for a given large t. Here, \(M^t\) is a quantity that expresses the level of suboptimality, or closeness to the optimal solution. The value of \(M^t\) used in the implementation of the L1 problem solver in [5] (liblinear) is formally written as

$$\begin{aligned} M^t \triangleq \frac{\max _{\tau = \left\lceil t/n \right\rceil n - n+1, \ldots , \left\lceil t/n \right\rceil n } v_j^{\tau }}{n}, \end{aligned}$$

where

$$\begin{aligned} v_j^t \triangleq {\left\{ \begin{array}{ll} \left| \nabla _j L (\mathbf {w}^{t}) - 1 \right| &{} w^{t}_j <0 \\ \left| \nabla _j L (\mathbf {w}^{t}) + 1 \right| &{} w^{t}_j >0 \\ \max \{\nabla _j L (\mathbf {w}^{t}) - 1 , -\nabla _j L (\mathbf {w}^{t}) - 1, 0 \} &{} w^{t}_j =0. \end{array}\right. } \end{aligned}$$
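In code, the violation \(v_j^t\) and the corresponding skip test can be written as follows; this is an illustrative sketch rather than the implementation in [5], and the function names are hypothetical.

```cpp
// Illustrative helpers for the shrinking test described above: grad_j stands
// for the partial derivative of L at w^t with respect to w_j, computed as in
// (4), and M is the threshold M^t carried over from the previous pass.
#include <algorithm>
#include <cmath>

// Violation v_j^t of the optimality conditions for coordinate j.
double violation(double w_j, double grad_j) {
    if (w_j < 0.0) return std::fabs(grad_j - 1.0);
    if (w_j > 0.0) return std::fabs(grad_j + 1.0);
    return std::max({grad_j - 1.0, -grad_j - 1.0, 0.0});
}

// Skip test based on (7); restricting it to coordinates currently at zero is
// an additional safeguard assumed here, not taken from the text.
bool can_skip(double w_j, double grad_j, double M) {
    return w_j == 0.0 && grad_j > -1.0 + M && grad_j < 1.0 - M;
}
```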

3 Proposed Scheme

In this section, we explain our proposed FCL scheme, which can handle datasets with very large feature spaces. The scheme is flexible in that the class of basis functions (the feature space) can be arbitrary; after introducing the FCL scheme, we therefore present two instantiations, using combinatorial features and random Fourier features. In addition, we discuss the relationship between our algorithm and boosting in Sect. 3.2.

In the FCL scheme, two types of threads run asynchronously: the writer thread and the trainer thread. The writer thread repeatedly reads the datapoints and extracts one column of the matrix \(\varPhi \) at a time, writing it into the main memory. If the extracted column does not improve the current solution, it is discarded without being shared with the trainer thread, to save space in the limited memory. The condition for discarding the column can be written as

$$\begin{aligned} -1< \nabla _j L(\mathbf {w}^t) < 1. \end{aligned}$$
(8)

Note that this quantity can be computed cheaply as long as the value of \(\mathbf {u}\) is shared between the threads. If the memory used by the cached columns of \(\varPhi \) exceeds a prespecified amount, the writer discards cached columns randomly and places the new column in the freed location. The pseudo-code of the writer thread is shown in Algorithm 1.

In contrast, the trainer thread selects one cached column uniformly at random and performs the standard coordinate descent update explained in Sect. 2. If the coordinate is not effective for learning, that is, if (8) holds after the update, the column is discarded from memory. Note that condition (8) discards columns more aggressively than the shrinking condition (7) used in [5] and other studies discussed in the previous section, and may therefore also discard columns whose parameters are nonzero at the optimal solution. This makes the coordinate descent updates in the trainer thread more efficient, while the entire scheme is still guaranteed to reach the optimal solution because the writer thread repeatedly re-examines condition (8) for every j. The pseudo-code of the trainer thread is shown in Algorithm 2.

Algorithm 1. Writer thread of the FCL scheme
Algorithm 2. Trainer thread of the FCL scheme
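To complement the pseudo-code of Algorithms 1 and 2, the following is a heavily simplified, self-contained C++ sketch of the FCL idea; it is not the authors' implementation. Squared loss is used so that the coordinate update stays short, the feature extractor is a stand-in that produces random \(\pm 1\) columns, and a single coarse-grained lock replaces the finer-grained asynchronous sharing described above; all names and constants are illustrative assumptions.

```cpp
#include <atomic>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <mutex>
#include <random>
#include <thread>
#include <unordered_map>
#include <vector>

// State shared by the writer and trainer threads.
struct Shared {
    std::vector<double> y;                                // targets
    std::vector<double> u;                                // u_i = <w, phi(x_i)>
    std::unordered_map<long, std::vector<double>> cache;  // cached columns of Phi
    std::unordered_map<long, double> w;                   // nonzero parameters
    std::mutex mtx;                                       // one coarse-grained lock
    std::atomic<bool> stop{false};
    double C = 1.0;
    std::size_t capacity = 100;                           // feature-cache size
};

// grad_j = C * sum_i (u_i - y_i) * phi_j(x_i); condition (8) is |grad_j| < 1.
double gradient(const Shared& s, const std::vector<double>& col) {
    double g = 0.0;
    for (std::size_t i = 0; i < col.size(); ++i) g += (s.u[i] - s.y[i]) * col[i];
    return s.C * g;
}

// Stand-in feature extractor: column j is a pseudo-random +/-1 pattern.
std::vector<double> make_column(long j, std::size_t n) {
    std::mt19937 gen(static_cast<unsigned>(j));
    std::vector<double> col(n);
    for (double& v : col) v = (gen() & 1u) ? 1.0 : -1.0;
    return col;
}

// Writer: extract columns, keep only those violating (8), evict randomly if full.
void writer(Shared& s, long p) {
    std::mt19937 gen(12345);
    while (!s.stop) {
        for (long j = 0; j < p && !s.stop; ++j) {
            std::vector<double> col = make_column(j, s.y.size());
            std::lock_guard<std::mutex> lk(s.mtx);
            if (s.cache.count(j) || std::fabs(gradient(s, col)) <= 1.0) continue;
            if (s.cache.size() >= s.capacity)     // cache full: random eviction
                s.cache.erase(std::next(s.cache.begin(),
                              static_cast<long>(gen() % s.cache.size())));
            s.cache.emplace(j, std::move(col));
        }
    }
}

// Trainer: pick a cached column uniformly at random, perform one coordinate
// update (squared loss for brevity), and drop the column if (8) now holds.
void trainer(Shared& s, int updates) {
    std::mt19937 gen(67890);
    for (int t = 0; t < updates; ++t) {
        std::lock_guard<std::mutex> lk(s.mtx);
        if (s.cache.empty()) continue;            // busy-waits; fine for a sketch
        auto it = std::next(s.cache.begin(),
                            static_cast<long>(gen() % s.cache.size()));
        const std::vector<double>& col = it->second;
        double g = gradient(s, col), h = 0.0;
        for (double v : col) h += v * v;
        double w_old = s.w.count(it->first) ? s.w[it->first] : 0.0;
        // Soft-thresholding step of Sect. 2 for the squared loss.
        double zp = (g + 0.5) / (s.C * h), zm = (g - 0.5) / (s.C * h);
        double w_new = w_old > zp ? w_old - zp : (w_old < zm ? w_old - zm : 0.0);
        for (std::size_t i = 0; i < col.size(); ++i)
            s.u[i] += (w_new - w_old) * col[i];
        s.w[it->first] = w_new;
        if (std::fabs(gradient(s, col)) <= 1.0) s.cache.erase(it);
    }
    s.stop = true;
}

int main() {
    Shared s;
    std::mt19937 gen(0);
    for (int i = 0; i < 500; ++i) s.y.push_back((gen() & 1u) ? 1.0 : -1.0);
    s.u.assign(s.y.size(), 0.0);
    std::thread tw(writer, std::ref(s), 10000L);
    std::thread tt(trainer, std::ref(s), 50000);
    tt.join();
    tw.join();
    return 0;
}
```

A full implementation would stream datapoints from secondary storage, store sparse columns, share \(\mathbf {u}\) with finer-grained synchronization, and use the logistic-regression update of Sect. 2 with the line search (2).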

3.1 Examples of Large Feature Spaces and Their Relation to Learning with Kernels

In this section, we show examples of feature spaces containing a large number of features.

Use of Combinatorial Features. When \({\mathbf {x}}_i\) is already embedded in a Euclidean space, that is, \(\mathcal {X}= \mathbb {R}^l\) for some natural number l, we can create new features by combining its components:

$$\begin{aligned} \phi _j({\mathbf {x}}_i) = c_j \prod _{k} \left\langle {\mathbf {x}}_i,\mathbf {e}_{j_k(j)} \right\rangle , \end{aligned}$$

where \(c_j\in \mathbb {R}\) and \(j_k(j) \in \{1,\ldots ,l\}\). For example, if a datapoint corresponds to a document and each component to the number of occurrences of a word, \(\phi _j\) expresses the co-occurrence of a certain combination of words. When \(c_j\) and \(j_k\) are chosen appropriately, the set of features is equivalent to that induced by the polynomial kernel

$$\begin{aligned} k({\mathbf {x}},{\mathbf {x}}') = ( \left\langle {\mathbf {x}},{\mathbf {x}}' \right\rangle +c_1 )^{c_2}. \end{aligned}$$

where \(c_1 \in \mathbb {R}\) and \(c_2 \in \mathbb {N}\) are given constants.
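As an illustration (with hypothetical type and function names), a combinatorial feature can be represented by its coefficient together with the list of input coordinates it multiplies:

```cpp
// An illustrative representation of a combinatorial feature
// phi_j(x) = c_j * prod_k x[j_k(j)].
#include <cstddef>
#include <vector>

struct CombinatorialFeature {
    double c;                        // coefficient c_j
    std::vector<std::size_t> index;  // selected coordinates j_1(j), ..., j_k(j)
};

double evaluate(const CombinatorialFeature& f, const std::vector<double>& x) {
    double value = f.c;
    for (std::size_t k : f.index) value *= x[k];  // product of selected components
    return value;
}
```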

Use of Random Fourier Features. Again, we consider the case of \(\mathcal {X}= \mathbb {R}^l\). By generating a random vector \(\omega _j \in \mathbb {R}^l\), we can produce a feature function as

$$\begin{aligned} \phi _j({\mathbf {x}}) = c_j \cos (\left\langle \omega _j,{\mathbf {x}} \right\rangle ). \end{aligned}$$

This class of feature functions can be said to be induced by the Gaussian kernel,

$$\begin{aligned} k({\mathbf {x}},{\mathbf {x}}') = \exp ( -\mu \left\| {\mathbf {x}}- {\mathbf {x}}'\right\| ^2 ), \end{aligned}$$

by appropriately setting \(c_j\) and sampling \(\omega _j\) from a specific distribution. Furthermore, an arbitrary shift-invariant kernel can be approximated in this way [10, 11].
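A sketch of this construction for the Gaussian kernel is given below; it is illustrative rather than the authors' code, and the random phase \(b_j\) and the scaling \(c_j=\sqrt{2/D}\) (for D sampled features) are assumptions that follow the standard recipe of [10, 11].

```cpp
// A sketch (illustrative) of sampling and evaluating random Fourier features
// for the Gaussian kernel exp(-mu * ||x - x'||^2): omega_j is drawn from
// N(0, 2*mu*I); the random phase b_j and the scaling c_j = sqrt(2/D) are
// assumptions following the standard construction.
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct FourierFeature {
    std::vector<double> omega;   // frequency vector omega_j
    double b;                    // random phase in [0, 2*pi)
    double c;                    // scaling c_j
};

FourierFeature sample_feature(std::size_t l, double mu, std::size_t D,
                              std::mt19937& gen) {
    std::normal_distribution<double> normal(0.0, std::sqrt(2.0 * mu));
    const double two_pi = 2.0 * std::acos(-1.0);
    std::uniform_real_distribution<double> uniform(0.0, two_pi);
    FourierFeature f;
    f.omega.resize(l);
    for (double& w : f.omega) w = normal(gen);
    f.b = uniform(gen);
    f.c = std::sqrt(2.0 / static_cast<double>(D));
    return f;
}

// phi_j(x) = c_j * cos(<omega_j, x> + b_j)
double evaluate(const FourierFeature& f, const std::vector<double>& x) {
    double dot = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) dot += f.omega[k] * x[k];
    return f.c * std::cos(dot + f.b);
}
```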

These examples imply that our scheme can handle several kernelized versions of L1 regularized risk minimization problems. Note that the representer theorem does not apply under L1 regularization; therefore, combining kernel functions with L1 regularization is usually difficult.

3.2 Relation to Boosting

A boosting algorithm consists of two alternating processes: first, an oracle returns a hypothesis \(h^{(t)}\in \mathcal {H}\) based on the current distribution \(d^{(t)} \in \mathbb {R}_{\ge 0}^n\) over the datapoints; then the next distribution \(d^{(t+1)}\), on which the past hypotheses perform poorly, is determined. Boosting algorithms differ in how they choose a new hypothesis and how they update the distribution over the datapoints. The abstract algorithm is summarized in Algorithm 3. It is well known that a greedy coordinate descent method for empirical risk minimization problems can be described as a boosting method [13]. Furthermore, a similar correspondence for problems of the form (1) is reported in [3, 4]. In this section, we explain that the writer thread of our scheme continuously generates new hypotheses, whereas the trainer thread continuously updates the adversarial distribution.

Algorithm 3. Abstract boosting algorithm

For simplicity, we focus on binary classification problems. In the context of boosting, the parameter \(\mathbf {w}\) in our scheme can be interpreted as defining an unnormalized distribution d such that

$$\begin{aligned} d_i (\mathbf {w}) = { -\nabla \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle )}, \end{aligned}$$
(9)

if we aim to minimize (1), where, with a slight abuse of notation, \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}) \right\rangle ,y_i)\) is written in the form \(\ell (y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}) \right\rangle )\). Furthermore, a hypothesis h corresponds to a feature \(\phi _j\) when the range of our feature functions is restricted to \(\{+1,-1\}\).

Generating Hypothesis h. In conventional boosting algorithms, such as AdaBoost introduced in [6], at the t-th iteration the oracle returns the hypothesis that maximizes the edge \( \gamma (h) \triangleq \sum _i d^{(t)}_i y_i h({\mathbf {x}}_i) \) among all possible hypotheses \(h\in \mathcal {H}\). That is,

$$\begin{aligned} h^{(t)} = \mathop {\mathrm {argmax}}_{h \in \mathcal {H}} \gamma (h). \end{aligned}$$
(10)

In contrast, the writer in our scheme repeatedly searches for \(\phi _j\) whose partial derivative is sufficiently large, that is, which satisfies the following condition:

$$\begin{aligned} \left| C \sum _i \nabla \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle )y_i\phi _j({\mathbf {x}}_i)\right| > 1, \end{aligned}$$
(11)

depending on the currently available parameter \(\mathbf {w}\). By defining and substituting

$$\begin{aligned} h(\cdot ) \triangleq \mathop {\mathrm {sign}}\left( \sum _i -\nabla \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle )y_i\phi _j({\mathbf {x}}_i)\right) \phi _j (\cdot ), \end{aligned}$$

into (11), the condition can be rewritten as

$$\begin{aligned} \sum _i d_i^{(t)} y_ih({\mathbf {x}}_i) > C^{-1}. \end{aligned}$$
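For completeness, this rewriting can be verified directly from (9) and the definition of h:

$$\begin{aligned} \sum _i d_i^{(t)} y_i h({\mathbf {x}}_i)&= \mathop {\mathrm {sign}}\left( \sum _{i'} -\nabla \ell ( y_{i'}\left\langle \mathbf {w},\phi ({\mathbf {x}}_{i'}) \right\rangle )y_{i'}\phi _j({\mathbf {x}}_{i'})\right) \sum _i \left( -\nabla \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle )\right) y_i\phi _j ({\mathbf {x}}_i) \\&= \left| \sum _i \nabla \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle )y_i\phi _j({\mathbf {x}}_i)\right| , \end{aligned}$$

which exceeds \(C^{-1}\) exactly when (11) holds.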

Therefore, the strategy of the writer is to accept every hypothesis whose edge exceeds a certain threshold. Note that this property is inherited from random coordinate descent methods, whereas the greedy coordinate descent method corresponds to the strategy (10).

The Distribution Update. As in [13], the updates of the distribution over datapoints are conventionally formulated as follows:

$$\begin{aligned} d^{(t+1)} =&\mathop {\mathrm {argmin}}_{d}\ \mathrm{RE}(d | d^{(t)} ) \end{aligned}$$
(12)
$$\begin{aligned}&\mathrm{subject\ to}\ \sum _i d_i y_i h^{(t)}({\mathbf {x}}_i) = 0, \end{aligned}$$
(13)

where \(\mathrm{RE}(d | d^{(t)})\) denotes a certain type of relative entropy between d and \(d^{(t)}\) in the case of (nonregularized) empirical risk minimization. Duchi and Singer provided an alternative formulation of the distribution update corresponding to L1 regularized risk minimization, in which (13) is replaced by

$$\begin{aligned} \sum _i d^{(t+1)}_i y_i h^{(t)}({\mathbf {x}}_i) \le \nu , \end{aligned}$$
(14)

while a relative entropy is minimized. These distribution-update strategies admit a unified viewpoint as coordinate descent methods for minimizing

$$\begin{aligned} \sum _{i=1}^n \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ) + \sum _{j=1}^p r(w_j), \end{aligned}$$

where \(r(w_j)=0\) or \(r(w_j)=C^{-1}|w_j|\) (so that the objective coincides with (1) up to the scaling \(C^{-1}\)). From this point of view, the exact form of the relative entropy varies depending on the underlying function \(\ell \). The one-variable subproblem that the coordinate descent method defines can be reformulated as follows:

$$\begin{aligned}&\mathop {\mathrm {argmin}}_{w_j} \sum _{i=1}^n \ell ( y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ) + r(w_j) \\&= \mathop {\mathrm {argmin}}_{w_j} \sum _{i=1}^n \max _{d_i} d_i y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle - \ell ^*_i(d_i) + r(w_j), \end{aligned}$$

for a fixed j, while all other components of \(\mathbf {w}\) are fixed. Here, \(\ell ^*\) denotes the Fenchel conjugate of \(\ell \). Therefore, the dual problem can be formulated as

$$\begin{aligned} d = \mathop {\mathrm {argmax}}_{d} \sum _{i=1}^n - \ell ^*(d_i) + \sum _{j'\ne j} d_iw_{j'} y_i \phi _{j'}({\mathbf {x}}_i) - r^*\left( \sum _{i=1}^n d_i y_i \phi _j({\mathbf {x}}_i)\right) . \end{aligned}$$

The primal-dual relationship can be written as in (9). Moreover, when \(r(w_j) = 0\), \(r^*\) is 0 if (13) holds and \(\infty \) otherwise. Furthermore, when \(r(w_j) = C^{-1}|w_j|\), \(r^*\) is 0 if (14) holds with \(\nu =C^{-1}\) and \(\infty \) otherwise. Therefore, the relative entropy term corresponds to \(\sum _{i=1}^n - \ell ^*(d_i) +\sum _{j'\ne j} d_iw_{j'} y_i \phi _{j'}({\mathbf {x}}_i)\), which coincides with various relative entropies, up to an additive constant, when \(\ell \) is appropriately set.
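For completeness, the conjugate of the scaled absolute value used above is the standard indicator function (a standard fact recalled here):

$$\begin{aligned} r^*(\theta ) \triangleq \sup _{w\in \mathbb {R}} \left( \theta w - C^{-1}|w| \right) = {\left\{ \begin{array}{ll} 0 &{} |\theta | \le C^{-1} \\ \infty &{} \mathrm{o.w.}, \end{array}\right. } \end{aligned}$$

so that \(r^*\left( \sum _{i=1}^n d_i y_i \phi _j({\mathbf {x}}_i)\right) \) is finite exactly when (14) holds with \(\nu =C^{-1}\).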

A similar consistency holds for the totally corrective boosting algorithm [18]. The distribution is updated by selecting a \(d^{(t+1)}\) that satisfies

$$\begin{aligned} \sum _i d^{(t+1)}_i y_i h^{(s)}({\mathbf {x}}_i) \le \nu , \end{aligned}$$

for all \(s=1,\ldots ,t\), while some relative entropy is minimized. By a similar argument, this corresponds to the dual of the problem (1) in which we additionally constrain \(w_j=0\) whenever \(w_j^t=0\) and set \(\nu =C^{-1}\). Therefore, the procedure of the trainer thread in our scheme, which solves a subproblem restricted to the feature cache, is similar to the strategy of updating the distribution over datapoints defined by totally corrective boosting algorithms.

4 Experimental Results

In this section, we verify the effectiveness of our scheme for large-scale optimization of L1 logistic regression, that is, (1) with \(\ell (\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle , y_i) = \log (1+\exp (-y_i\left\langle \mathbf {w},\phi ({\mathbf {x}}_i) \right\rangle ))\). We first demonstrate that the asynchronous feature extraction of the FCL scheme allows the optimization to be performed efficiently, and then show that the method can be used effectively on problems with a number of features that is too large for any other known scheme. We consider binary classification for spam e-mail recognition in the first set of experiments, and splice site recognition using over one million DNA sequences in the second. We implemented the proposed scheme and the dual cached loops (DCL) scheme in C++. All experiments were performed on an Intel Xeon X5690 3.47 GHz processor with 96 GB of memory.

Table 1. Dataset configuration

4.1 Spam Recognition

In the first experiment, we verify that our scheme finds the optimal solution efficiently even when the feature cache cannot hold the entire data matrix, using one of the largest public datasets for binary classification. We used the webspam dataset [19], in which trigram features are already developed. We limited the memory capacity for storing the columns of the data matrix to 20 GB, whereas the dataset occupies more than 20 GB in text format. We plotted the relative objective value as a function of elapsed time and compared it with that of the dual cached loops scheme with SGD; the relative objective value at time t refers to \( \frac{P(\mathbf {w}^t)-P^*}{P^*}\). For comparison, we implemented the modified version of the dual cached loops scheme with SGD discussed in Sect. 1. We omitted a comparison with the block minimization scheme because the dual cached loops scheme is reported to be consistently superior [8]. Following [2], we used the step-size schedule \( \eta _t \triangleq {\eta _0} \left( 1 +\frac{1}{Cn}\eta _0 t\right) ^{-1}.\) For the hyper-parameter, we used \(C=0.1,1,10\), and 100.

Fig. 1. Relative objective function value versus elapsed time (sec)

Fig. 2. Relative objective function value versus elapsed time (sec)

The results are shown in Figs. 1 and 2. FCL reaches the optimal solution more quickly than the dual cached loops scheme for every step size and every value of C. This is not only because the coordinate descent method enjoys a faster convergence rate for this specific form of optimization problem, but also because the selective deletion of columns helps focus the updates on parameters corresponding to important features.

4.2 Splice Site Recognition

In the second set of experiments, we examined the performance of the FCL scheme on binary splice site classification for DNA sequences. Each \({\mathbf {x}}_i\) is a DNA sequence of length 141, and each \(y_i\) is a label indicating whether the corresponding sequence contains a splice site [15]. The dataset can be obtained from http://sonnenburgs.de/soeren/item/coffin/. We used 50,000,000 sequences as the training data and the first 100,000 sequences of the test data to compute the test metrics. Because the dataset is extremely imbalanced (only 143,688 out of the 50,000,000 datapoints have positive labels), we evaluated the learning performance using the area under the precision-recall curve (AUPRC).

Fig. 3. Progress of feature cached loops (FCL) in splice site recognition

We consider simple combinatorial features constructed as follows. Let b denote a string of length d over \(\{\texttt {A,T,C,G,?}\}\) that does not begin with "?", and let k denote a natural number less than \(141-(d-1)\); we consider a one-to-one correspondence between pairs (b,k) and \(j=1,\ldots ,p\). The value of \(\phi _j({\mathbf {x}}_i)\) is defined as 1 if the length-d subsequence of \({\mathbf {x}}_i\) starting at position k matches the pattern represented by b (where "?" matches any character), and 0 otherwise. Therefore, the total number of possible features is \(p = (141-(d-1)) \times 4\times 5^{d-1}\), and a single datapoint has \((141-(d-1)) \times 2^{d-1}\) nonzero elements. For \(d=8\), the entire data matrix requires approximately 3,000 GB of memory even when stored as a sparse matrix, assuming that one nonzero element requires 4 bytes. We examined the cases \(d=8,9,10\) and \(C=0.1,0.01\), setting the capacity of the feature cache to \( 70 \times 2^{d-1}\). To accelerate feature extraction, we asynchronously ran 5 writer threads in this experiment. Furthermore, as a reference, we examined the performance obtained by liblinear; we randomly sampled a subset of the features so that the required amount of memory became 60 GB, as described in Table 1.
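As an illustration (not the authors' code), evaluating one such feature amounts to a wildcard match of the pattern b against the subsequence of the DNA string starting at position k:

```cpp
// An illustrative evaluation of one DNA feature: phi_j(x_i) = 1 if the
// length-d pattern b over {A,T,C,G,?} occurs in the sequence starting at
// position k, where '?' matches any base, and 0 otherwise.
#include <cstddef>
#include <string>

double dna_feature(const std::string& seq, const std::string& pattern,
                   std::size_t k) {
    if (k + pattern.size() > seq.size()) return 0.0;
    for (std::size_t t = 0; t < pattern.size(); ++t)
        if (pattern[t] != '?' && pattern[t] != seq[k + t]) return 0.0;
    return 1.0;
}
```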

Figure 3 plots the objective function value and AUPRC versus elapsed time for different values of C and d. As shown, \(d=8\) achieves the highest AUPRC and the lowest objective value, whereas \(d=10\) achieves lower AUPRC and higher objective values. This ordering contradicts that of the optimal objective values, as the feature space for \(d=8\) is strictly contained in that for \(d=9\), which is in turn contained in that for \(d=10\). This indicates that for \(d=8\) the optimization progresses more efficiently by focusing on important features, whereas for \(d=10\) a more efficient extraction of features is required to accelerate the entire scheme.

Notably, developing even the sampled features used by liblinear required more than two days for each dataset. Our scheme therefore obtained higher AUPRC values in less time than random feature sampling under the same hyper-parameters. In addition, our scheme kept decreasing the objective value and increasing AUPRC for more than a week, which implies that the optimization had not yet finished and that AUPRC could be improved further. This observation also indicates that an even more efficient scheme is desirable for still larger feature spaces.

5 Conclusion

In this paper, we proposed a scheme for solving the L1 regularized problem with a large number of datapoints and an even larger number of features. By developing the data matrix \(\varPhi \) and optimizing the parameters simultaneously, the proposed scheme learns the parameters efficiently while developing only the seemingly important features. The experiments show that the proposed scheme can learn parameters with richer feature spaces than could be used previously. In this work, we demonstrated the performance of our scheme using combinatorial features. In future work, we will consider applications to other types of feature spaces as well as distributed optimization schemes.