1 Introduction

Although modern machine learning approaches achieve remarkable improvements in various real-world fields such as visual recognition (Krizhevsky et al. 2017), one of the key elements in constructing such models is a large training set (Russakovsky et al. 2015; Su et al. 2018). Taking the cost of instance collection and labeling into consideration, learning “rich” knowledge from “small” data is both necessary and important. For example, images of rare species are hard to collect, so a model should be able to perform visual recognition based on a single or a few reference examples (Wang et al. 2018b); it is cumbersome to require a user to record multiple facial expressions in a system in advance (Tan et al. 2006); and for a robot, the ability to imitate from one single human demonstration is a highly desirable characteristic (Finn et al. 2017b). In addition to the difficulty of collecting instances of such “long-tailed” objects, labeling is also costly when dealing with bio-informatics data (Huang et al. 2014) or thousands of newly arriving items (Karlinsky et al. 2017). A task with only a limited number of training examples from each class gives rise to the few-shot learning (FSL) problem.

The training data in a machine learning task reveals the pattern of the data distribution, and the principles of statistical learning require enough training examples from the same distribution to make the model learnable and generalizable. Hence FSL seems to conflict with classical analyses in machine learning, and it appears almost impossible to train a model with only a few training examples.

But how can humans recognize novel objects from a few or even a single image? One possible reason lies in their rich experience, a.k.a. the inductive bias (Baxter 2000), with seen objects (Lake et al. 2015). For example, a programmer adapts rapidly to a new task based on rich experience from related tasks.

A similar learning scheme works in the few-shot classification setting. Specifically, a type of model inductive bias is extracted from the seen classes, and it is then applied to few-shot tasks composed of unseen classes. To this end, meta-learning mimics the few-shot evaluation scenario during training, and figures out a common task configuration over few-shot tasks sampled from the seen classes (Vilalta and Drissi 2002; Maurer et al. 2016; Thrun and Pratt 2012). Feature embedding (Koch et al. 2015; Vinyals et al. 2016; Snell et al. 2017; Triantafillou et al. 2017), embedding adaptation (Ye et al. 2018), and model optimization strategies (Ravi and Larochelle 2017) can all be learned in a meta-learning way.

Model-agnostic meta-learning (MAML) (Finn et al. 2017a; Nichol et al. 2018) is an important thread of meta-learning. Based on the observation that the quality of a model depends heavily on the initial point of its optimization, MAML learns a common model initialization that captures the features shared across a wide range of few-shot tasks; to avoid over-fitting, it then restricts every few-shot model to be optimized by a fixed number of gradient descent steps from this initialization. Although experiments verify the feasibility of MAML for FSL, several problems remain. First, a single initialization point can hardly satisfy diverse tasks. Take a binary task discerning “sunflower” and “dog” as an example. A good initialization tends to be composed of classifiers with discriminative coefficients for plant and animal, respectively. For another task containing “cat” and “rose”, however, this plant-animal initialization cannot be applied to the “reversed” animal-plant case without careful updates. Furthermore, vanilla MAML requires the second-order derivatives of all parameters during training, and recent literature shows that it is hard to apply such a method without special hyper-parameter tuning tricks (Antoniou et al. 2018; Chen et al. 2019).

Fig. 1

The main flow of model-agnostic meta-learning (MAML) and our proposed AdaptiVely InitiAlized Task OptimizeR (Aviator) approach for few-shot learning. MAML figures out an initial point \(\theta _0\) for all tasks, and the final model of each task (say \(\theta _A^*\)) is fine-tuned from \(\theta _0\) through a fixed number of gradient descent steps (say based on \(\nabla _\theta ^A\)). In contrast, Aviator considers the task characteristics and sets a specific initial point for each task

Due to the heterogeneity of classes from one task to another, we propose the AdaptiVely InitiAlized Task OptimizeR (Aviator) approach for few-shot learning, which transforms MAML into a practical FSL approach. Aviator incorporates the task context, i.e., the class information inside a training set, into the construction of the model initial point. Hence different tasks have adaptive model initializations, so that a task-specific classifier can be obtained effectively and efficiently. The notion of Aviator is illustrated in Fig. 1.

To this end, we decouple a model into the embedding part and the top-layer classifier. By treating the current few-shot training set as the input, a mapping is learned to encode the task characteristics and output an adaptive classifier initialization. To enable gradient back-propagation, we turn to the first-order approximation of gradient descent, and apply a simple re-parameterization strategy to implement and simplify the model training. In synthetic experiments, we find that the model initialization output by MAML is too conservative to cover all kinds of tasks, while the initial model used by Aviator handles model diversity effectively. We also validate the superiority of our task-adaptive optimizer approach over the MAML variants on two benchmark data sets, i.e., MiniImageNet and CUB. Concrete ablation studies and discussions illustrate the helpful task-specific feature of Aviator.

The main contributions of our manuscript can be summarized as follows:

  • A practical task-adaptive optimizer for FSL which considers the task context.

  • An effective meta-level optimization strategy with a low computational burden.

  • Promising experimental results on benchmark data sets.

We start with the background and notations of the few-shot learning problem, and then we discuss the related approaches. Before diving into our Aviator approach in detail, we give a brief introduction to the MAML framework. The training and the re-parameterization strategy of Aviator are stated in Sect. 4. Finally, we present the experiments and the conclusion.

2 Related work

Although humans are able to learn a novel concept from a single example (Li et al. 2006; Lake et al. 2011, 2015), it is still difficult to train a model under such a data budget. For example, most visual recognition methods require thousands of training examples to fit huge deep learning models (Russakovsky et al. 2015; Krizhevsky et al. 2017). Considering both the data collection (Li and Zhou 2015) and labeling cost (Huang et al. 2014), it is necessary to equip a machine learning algorithm with this valuable ability: building classifiers with limited training examples, i.e., one-shot and few-shot learning. One naive idea is to fine-tune a pre-trained model or to add strong regularizers. Neither of these solutions achieves satisfying results in practice due to the data scarcity and model complexity.

Tracing back to Baxter (2000) and Vilalta and Drissi (2002), the importance of task-level inductive bias has been proposed and analyzed theoretically. Different from conventional models that predict at the instance level, meta-learning, a.k.a. learning-to-learn, extracts inductive bias across training tasks. A meta-model characterizes the task commonality and generalizes its prediction to unseen tasks from a related environment (Maurer 2009; Maurer et al. 2016; Denevi et al. 2018). Meta-learning approaches have been successfully applied in various fields, such as long-tail class-imbalance classification (Wang et al. 2017b; Ren et al. 2018), domain adaptation (Motiian et al. 2017; Zhang et al. 2018), imitation learning (Yu et al. 2018), density estimation (Reed et al. 2017), unsupervised learning (Garg 2018; Hsu et al. 2018), data compression (Wang et al. 2018a), recommendation systems (Vartak et al. 2017), and hyper-parameter tuning (Franceschi et al. 2017; Probst et al. 2019).

Benefiting from meta-learning’s ability to generalize across tasks, in few-shot learning (FSL) the inductive bias is first learned over few-shot tasks composed of seen classes, and is then evaluated on few-shot tasks with unseen classes. For example, few-shot classification can be implemented in a non-parametric way with the soft nearest neighbor rule (Vinyals et al. 2016) or the nearest center rule (Snell et al. 2017), so that the representation function acts as the task-level inductive bias. The learned embedding pulls similar instances together and pushes dissimilar ones far away, such that a test instance can be classified even with only a few labeled training examples (Koch et al. 2015). Considering the hypothesis complexity, the model training configuration also serves as a type of inductive bias. Andrychowicz et al. (2016) and Ravi and Larochelle (2017) meta-learn the optimization strategy for each task, including the learning rate and the update directions. Other kinds of inductive biases are also explored. Hariharan and Girshick (2017) and Wang et al. (2018b) learn a generation prior to augment examples given a single image; Dai et al. (2017) extract logical derivations from related tasks; Wang et al. (2017a) and Shyam et al. (2017) utilize the prior to attend to images. An empirical study of few-shot learning approaches can be found in Chen et al. (2019).

Model-agnostic meta-learning (MAML) (Finn et al. 2017a) proposes another kind of inductive bias, i.e., the model initialization. After a common model initialization has been meta-learned across tasks, the classifier of a new few-shot task can be fine-tuned with several steps of gradient descent from that initial point. The universality of such MAML-type updates has been proved in Finn and Levine (2018). MAML has been applied in various scenarios, such as uncertainty estimation (Finn et al. 2018), robotics control (Yu et al. 2018; Clavera et al. 2018), neural translation (Gu et al. 2018), and language generation (Huang et al. 2018). In spite of this success, problems with the framework remain. Nichol et al. (2018) handle the high computational burden of MAML with a first-order approximation; Deleu and Bengio (2018) point out possible negative adaptation with too many updates; Antoniou et al. (2018) provide a collection of tricks to tune the MAML framework; and Lee and Choi (2018) decompose the MAML backbone to introduce a task-specific metric. In our paper, we aim to incorporate the task context into the determination of the model initialization. By decoupling the model into the embedding and the top-layer classifier, we propose a re-parameterization strategy which enables the transfer of task information without touching the backbone. Our AdaptiVely InitiAlized Task OptimizeR (Aviator) approach achieves good performance with a low computational burden in practice.

Different from Ye et al. (2018), which transforms the embedding with a transformer-based set function over the embeddings of the training set, Aviator adapts both the top-layer classifier and the embedding via gradient descent. In addition, Triantafillou et al. (2019) construct an embedding learning objective for class prototypes, and update the task embedding in the MAML style. Aviator introduces the task-adaptive property through dynamic initialization, which not only incorporates the task context but also accelerates the update process. More comparisons can be found in the experiments part.

3 Notations and background

In this section, we first describe the Few-Shot Learning (FSL) setting formally, and then present the main flow of the meta-learning methods dealing with FSL tasks. Finally, we discuss the relationship between current meta-learning and the previously analyzed learning-to-learn framework.

3.1 The few-shot learning problem

Following the literature, we define an N-way K-shot task as a classification problem with N classes in total and K examples in each class. In the few-shot scenario, the value K is very small, e.g., \(K=1\) or \(K=5\). The target of the Few-Shot Learning (FSL) task is to obtain an effective classifier based on these NK training examples, which is able to discern an unseen test instance among these N classes. The main difficulty of FSL is the contradiction between the complex model and the scarce data, which makes a model prone to over-fitting.

We denote the training set (a.k.a. support set) of the problem as \({\mathcal {D}}_{\mathbf {train}} = \{(\mathbf{x}_{i}, \mathbf{y}_{i})\}_{i=1}^{NK}\). Each pair \((\mathbf{x}_{i}, \mathbf{y}_{i})\in {\mathcal {D}}_{\mathbf {train}}\) consists of the instance and the label of the ith example, where \(\mathbf{x}_{i}\in {\mathbb {R}}^{D}\), and the label \(\mathbf{y}_{i}\in \{0,1\}^N\) comes from a class set composed of N classes. The index of the value 1 in the label vector \(\mathbf{y}_{i}\) indicates the class of the instance \(\mathbf{x}_{i}\).

Rather than learning a few-shot model over unseen classes from scratch, the inductive bias extracted from related seen-class tasks should be reused. We use a superscript \({\mathcal {S}}\) to emphasize a set or an instance sampled from the seen class set if necessary. Different from unseen few-shot tasks, classes in the seen set often have enough examples. Take a rare-bird classification problem as an example. The target is to train a classifier for unseen rare birds, where only a limited number of images of those birds can be collected in practice. To obtain an inductive bias for bird classification, we can take advantage of the public Caltech-UCSD Birds (CUB) data set (Wah et al. 2011), which contains 200 common bird classes with enough examples in each of them. In other words, the public bird data set serves as the seen class set in this case, while the tasks with rare birds containing limited examples correspond to the unseen class set. The target of FSL is to utilize the seen set to assist the few-shot classification over the unseen set.

Fig. 2

Comparison between few-shot classification and the standard supervised learning paradigm. Left: Different from the standard machine learning paradigm, which trains a model on a large data set, few-shot classification considers the scenario where there is only a limited number of instances in the training set. Right: The general flow of the meta-learning procedure for few-shot classification. By sampling few-shot tasks from the meta-training set (seen classes), the learned task inductive bias can be applied to few-shot tasks from the meta-test set (unseen classes)

3.2 Meta-learning for few-shot learning

Meta-learning is a popular approach to extracting inductive bias from the seen class set, and it has been widely used in few-shot scenarios (Vinyals et al. 2016; Finn et al. 2017a; Snell et al. 2017; Ye et al. 2018). A comparison between standard classifier training and the meta-learning paradigm can be found in Fig. 2.

The main idea of meta-learning is to sample tasks from the seen set to mimic the evaluation scenario, i.e., the N-way K-shot task. In particular, N-way K-shot tasks \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\) are sampled from the seen set in a hierarchical way: first, N classes are chosen randomly from all seen classes, and then K examples in each of the N classes are selected at random. Following this strategy, the seen class set is named “meta-train”, while the unseen set is denoted as “meta-test”.
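The hierarchical sampling above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the procedure used in our experiments: the container `data_by_class`, the function `sample_task`, and the optional `n_query` argument (which draws extra, disjoint examples per class for the query set discussed later in this section) are hypothetical names introduced only for this example.

```python
import random

def sample_task(data_by_class, n_way, k_shot, n_query=0, rng=None):
    """Sample one N-way K-shot task from a pool of seen classes.

    data_by_class: dict mapping a class name to a list of its examples
    (a hypothetical container, not defined in the text).
    Returns (support, query): lists of (example, label) pairs, where the
    label is the index 0..n_way-1 of the class inside this task.
    """
    rng = rng or random.Random()
    # Step 1: choose N classes at random from all seen classes.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Step 2: choose K (plus optional query) examples within the class.
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy pool: 10 seen classes with 20 placeholder examples each.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_task(pool, n_way=5, k_shot=1, n_query=3,
                             rng=random.Random(0))
```

Note that the labels are re-indexed inside every task, so the same seen class may receive different label indices in different sampled tasks.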

A mapping f is constructed based on \({\mathcal {D}}_{\mathbf {train}}\), which outputs a classifier for the given N classes. In detail, for a test instance \(\mathbf{x}_j\), its label can be predicted as

$$\begin{aligned} {\hat{y}}_j = f({\mathcal {D}}_{\mathbf {train}})(\mathbf{x}_j). \end{aligned}$$
(1)

The mapping f is learned with few-shot tasks from the meta-training set. To measure the performance of such a classifier mapping f over the N classes when facing a few-shot training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\), another test set (a.k.a. query set) \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\) with these N classes is sampled. A good classifier mapping f will achieve low loss value after predicting the labels of all instances from the test set. Therefore, the objective of meta-learning can be summarized as

$$\begin{aligned} \min _f \sum _{({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}},{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}})\sim {\mathcal {S}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \ell (f({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j). \end{aligned}$$
(2)

In Eq. 2, \(({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}},{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}})\sim {\mathcal {S}}\) denotes the enumeration of all tasks sampled from the seen class set. The loss function \(\ell (\cdot , \cdot )\) measures the discrepancy between the prediction and the true label for each instance in \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\). By optimizing the objective, a general mapping f is constructed, which maps a training task, even one with only a few instances in each class, to an effective classifier for that particular task (meta-training phase). Hence such a mapping can also be applied to few-shot tasks from the unseen class set (meta-test phase). The whole process is summarized in Alg. 1.

Remark 1

Although there are only limited training examples in each class, the classifier mapping f is shared among a large number of few-shot tasks. Since f is learned from plenty of tasks, a sufficient number of tasks alleviates, to some extent, the within-task sample requirement from the meta-perspective.

Algorithm 1

3.3 Learning embedding for few-shot learning

A direct implementation of the classifier mapping f is the embedding function, \(f=\phi :{\mathbb {R}}^D\rightarrow {\mathbb {R}}^{d}\), which extracts features of the input examples and transforms them into a latent space with d dimensions. If the embedding \(\phi \) places similar objects close to each other and dissimilar ones far apart, then it is suitable for the few-shot classification task (Koch et al. 2015). For a test instance \(\mathbf{x}_j\), the embedding function \(\phi \) makes a prediction based on a soft nearest neighbor rule:

$$\begin{aligned} {\hat{y}}_j = f({\mathcal {D}}_{\mathbf {train}})(\mathbf{x}_j) = \sum _{(\mathbf{x}_i, \mathbf{y}_i)\in {\mathcal {D}}_{\mathbf {train}}} \mathbf {sim}(\mathbf{x}_j, \mathbf{x}_i)\mathbf{y}_i. \end{aligned}$$
(3)

The function \(\mathbf {sim}(\mathbf{x}_j, \mathbf{x}_i)\) measures the similarity between the test instance \(\mathbf{x}_j\) and each training instance \(\mathbf{x}_i\). Such a measurement can be implemented by a normalized version of the cosine similarity (Vinyals et al. 2016) or the negative Euclidean distance (Snell et al. 2017). When there is more than one instance in each class, i.e., \(K>1\), instances in the same class can also be averaged to assist the final decision (Snell et al. 2017). By learning a good embedding, the important features for few-shot classification are stressed, and they will also be used for few-shot tasks from the unseen class set.
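As a minimal sketch of the soft nearest neighbor rule in Eq. 3, the snippet below normalizes the negative squared Euclidean distances with a softmax, which is one possible choice of \(\mathbf {sim}\) and an assumption of this example. The function name `soft_nn_predict` and the `embed` argument (standing in for \(\phi \), the identity by default) are hypothetical.

```python
import math

def soft_nn_predict(x_query, support, embed=lambda x: x):
    """Soft nearest neighbor prediction in the spirit of Eq. 3.

    support: list of (x, y) pairs with y a one-hot label list.
    embed:   stand-in for the embedding phi (identity here, an assumption).
    Returns a list of class scores that sums to 1.
    """
    zq = embed(x_query)
    # Negative squared Euclidean distance to every support instance.
    logits = [-sum((a - b) ** 2 for a, b in zip(zq, embed(x)))
              for x, _ in support]
    # Softmax normalization (shifted by the max for numerical stability).
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    n_classes = len(support[0][1])
    y_hat = [0.0] * n_classes
    for w, (_, y) in zip(weights, support):
        for c in range(n_classes):
            y_hat[c] += (w / total) * y[c]   # sim(x_j, x_i) * y_i
    return y_hat

# A 2-way 1-shot toy task with 2-dimensional instances.
support = [([0.0, 0.0], [1, 0]), ([4.0, 4.0], [0, 1])]
y_hat = soft_nn_predict([0.5, 0.2], support)
```

Replacing each class's support embeddings by their mean before calling the function gives the averaged variant mentioned above for \(K>1\).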

3.4 Discussion: the learning-to-learn framework

Instead of utilizing a test set to evaluate the quality of the mapping f in meta-learning as in Eq. 2, in the learning-to-learn (L2L) framework, it is the training set error that is used to provide the learning signal (Footnote 1).

As discussed in Maurer et al. (2016), the mapping \(f = W\circ \phi \) is a composition of a linear classifier \(W\in {\mathbb {R}}^{d\times N}\) and the embedding \(\phi \). The L2L objective is formulated as a bi-level optimization problem:

$$\begin{aligned}&\min _{W, \phi } \sum _{{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\sim {\mathcal {S}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_i,\mathbf{y}^{{\mathcal {S}}}_i)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}} \ell (W^\top \phi (\mathbf{x}^{{\mathcal {S}}}_i), \mathbf{y}^{{\mathcal {S}}}_i)\nonumber \\&\quad \Rightarrow \min _{\phi }\sum _{{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\sim {\mathcal {S}}}\min _W \sum _{(\mathbf{x}^{{\mathcal {S}}}_i,\mathbf{y}^{{\mathcal {S}}}_i)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}} \ell (W^\top \phi (\mathbf{x}^{{\mathcal {S}}}_i), \mathbf{y}^{{\mathcal {S}}}_i). \end{aligned}$$
(4)

In Eq. 4, all tasks sampled from the meta-training set share the common embedding \(\phi \). On top of the embedding and the training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\), the classifier W for each task is generated by directly optimizing the linear classifier until convergence. Assuming all tasks are sampled from the same task distribution, the generalization ability of the learning-to-learn objective is guaranteed in Maurer et al. (2016). A similar objective is also used in the few-shot learning field (Nichol et al. 2018).

Remark 2

Compared with the learning-to-learn framework (cf. Eq. 4), the meta-learning framework (cf. Eq. 2) has at least three main differences.

  • Instead of focusing only on the few-shot training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\), in meta-learning, another test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\) is also sampled to evaluate the quality of the classifier mapping, which produces a more accurate measurement of the quality of f;

  • That all tasks are sampled from the same task distribution is an important assumption for L2L. Meta-learning weakens this assumption and directly applies the model learned on tasks with seen classes during meta-training to tasks with unseen classes in the meta-test phase;

  • In Eq. 4, only the embedding is updated across different tasks, while in the meta-learning framework, the parameter update of f would be more flexible.

4 Few-shot learning with adaptively initialized optimizer

In this section, we present the main idea of our AdaptiVely InitiAlized Task OptimizeR (Aviator) approach for few-shot learning in detail. Before that, we provide a brief introduction to the model-agnostic meta-learning approach.

4.1 Model-agnostic meta-learning (MAML)

Consider how we obtain a classifier \(f:{\mathbb {R}}^D\rightarrow \{0,1\}^N\) from the training set (Footnote 2). Given a training set \({\mathcal {D}}_{\mathbf {train}}\) of a task, the model can be obtained by optimizing the loss on the training set:

$$\begin{aligned} \min _{f_{\theta }} \sum _{(\mathbf{x}_i, \mathbf{y}_i) \in {\mathcal {D}}_{\mathbf {train}}} \ell (f_\theta (\mathbf{x}_i), \mathbf{y}_i). \end{aligned}$$
(5)

Here \(\theta \) is the set of all learnable parameters in f. A simple way to optimize such an objective is gradient descent. Define a gradient operator \(\varXi _\theta (\cdot )\):

$$\begin{aligned} \varXi _\theta ({\mathcal {D}}_{\mathbf {train}}, \theta _t) = \sum _{(\mathbf{x}_i, \mathbf{y}_i) \in {\mathcal {D}}_{\mathbf {train}}} \nabla _\theta \ell (f_{\theta _t}(\mathbf{x}_i), \mathbf{y}_i), \end{aligned}$$

which computes the gradient w.r.t. \(\theta \) given the training set and the solution \(\theta _t\) at the current tth step. We omit \({\mathcal {D}}_{\mathbf {train}}\) in \(\varXi _\theta \) for notational simplicity. Therefore, the full gradient descent update w.r.t. the variable \(\theta \) to solve the optimization problem in Eq. 5 is:

$$\begin{aligned} \theta _{t+1} = \theta _t - \eta \varXi _\theta (\theta _t) = \theta _t - \eta \sum _{(\mathbf{x}_i, \mathbf{y}_i) \in {\mathcal {D}}_{\mathbf {train}}} \nabla _\theta \ell (f_{\theta _t}(\mathbf{x}_i), \mathbf{y}_i). \end{aligned}$$

\(\eta > 0\) is the step-size. Usually, starting from an initial \(\theta _0\), more than one gradient step will be used to get the final solution. If the value of the objective can be obtained during the optimization, complex step-size strategies can be utilized to accelerate the optimization.
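To make the operator \(\varXi _\theta \) and the update rule concrete, the following sketch runs T steps of full-batch gradient descent on a toy problem. The scalar model \(f_\theta (x) = \theta x\) with the squared loss, as well as the names `grad_operator` and `gradient_descent`, are assumptions chosen only to keep the example self-contained.

```python
def grad_operator(train_set, theta):
    """Xi_theta(D_train, theta): the sum of per-example gradients.

    Assumes the toy model f_theta(x) = theta * x and the loss
    0.5 * (theta * x - y) ** 2, whose gradient is (theta * x - y) * x.
    """
    return sum((theta * x - y) * x for x, y in train_set)

def gradient_descent(train_set, theta0, eta, steps):
    """Apply theta_{t+1} = theta_t - eta * Xi_theta(theta_t) for `steps` steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad_operator(train_set, theta)
    return theta

# Two training examples generated by y = 2 * x.
train_set = [(1.0, 2.0), (2.0, 4.0)]
theta_T = gradient_descent(train_set, theta0=0.0, eta=0.1, steps=50)
```

With this data the minimizer is \(\theta = 2\), and the iterate approaches it geometrically; a much smaller number of steps leaves the solution close to the initial point, which is the regime MAML operates in.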

The general idea of meta-learning is to obtain a classifier mapping f shared among tasks. Given the limited number of training instances in \({\mathcal {D}}_{\mathbf {train}}\), directly optimizing the problem until convergence makes the model over-fit and hard to generalize. In addition, for a learner with a non-convex objective, the quality of its solution depends heavily on its initialization point. Model-agnostic meta-learning (MAML) (Finn et al. 2017a) handles these problems with two strategies. First, an initialization of the model \(\theta _0\) is shared among tasks. Second, to avoid over-fitting, only a fixed number of gradient descent steps (say T steps) are carried out. In other words, given a few-shot task \({\mathcal {D}}_{\mathbf {train}}\), its classifier is obtained by T successive gradient updates starting from the common initialization \(\theta _0\). Therefore, the objective of MAML can be formulated as:

$$\begin{aligned} \min _{\theta _0} \sum _{({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}},{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}})\sim {\mathcal {S}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \ell (f_{\theta _0 - \eta \varXi _\theta (\theta _0)}({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j). \end{aligned}$$
(6)

The MAML objective in Eq. 6 optimizes the initial parameter \(\theta _0\) for the model f. In each task, the task-specific classifier is obtained by one or more steps of gradient descent from \(\theta _0\), using the training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\). This objective also requires that the updated model in each task has a low loss value over instances from the corresponding test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\). Note that in Eq. 6 and the following objectives, we only perform one inner-task gradient descent step with step-size \(\eta \); the objective extends easily to more steps. The output of MAML is the model initialization \(\theta _0\), so for a few-shot task from the unseen class set, its task-specific model can also be optimized by gradient descent from this initial parameter \(\theta _0\).

Similar to the general meta-learning objective in Eq. 2, the loss computed on the test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\) supervises the search of the initialization point \(\theta _0\) among tasks. To optimize such initial parametric model \(f_{\theta _0}\), a meta-level stochastic gradient descent is leveraged. In detail, for a pair of \(({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}},{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}})\), the update of \(\theta _0\) with a meta-level step-size \(\gamma >0\) is:

$$\begin{aligned} \theta _0 = \theta _0 - \gamma \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \nabla _{\theta _0} \ell (f_{\theta _0 - \eta \varXi _\theta (\theta _0)}({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j). \end{aligned}$$

Because the gradient w.r.t. \(\theta _0\) is used for the inner update over the training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\), the meta-update, which differentiates the loss of the updated \(\theta \) over the test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\), involves the second-order derivatives w.r.t. \(\theta _0\) and thus incurs a high computational burden. The MAML training follows the procedure in Algorithm 1.
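The appearance of the second-order term can be seen on a toy problem where every derivative is available in closed form. The sketch below is an illustration under assumptions, not our implementation: each task is represented by two hypothetical scalar targets `(c_train, c_test)`, the loss is \(0.5(\theta - c)^2\), and the names `loss`, `inner_update`, `maml_meta_grad` and `meta_train` are introduced only for this example. With this loss the Hessian of the inner objective equals 1, so the chain rule contributes the factor \(1 - \eta \).

```python
def loss(theta, c):
    # Toy per-task loss (an assumption): 0.5 * (theta - c) ** 2.
    return 0.5 * (theta - c) ** 2

def inner_update(theta0, c_train, eta):
    # One inner step: theta = theta0 - eta * d loss(theta0, c_train) / d theta0.
    return theta0 - eta * (theta0 - c_train)

def maml_meta_grad(theta0, c_train, c_test, eta):
    """Gradient of loss(inner_update(theta0), c_test) w.r.t. theta0."""
    theta = inner_update(theta0, c_train, eta)
    outer = theta - c_test        # d loss(theta, c_test) / d theta
    jac = 1.0 - eta * 1.0         # d theta / d theta0 = 1 - eta * Hessian
    return jac * outer

def meta_train(tasks, theta0, eta, gamma, epochs):
    # Meta-level stochastic gradient descent over (c_train, c_test) pairs.
    for _ in range(epochs):
        for c_train, c_test in tasks:
            theta0 = theta0 - gamma * maml_meta_grad(theta0, c_train, c_test, eta)
    return theta0

tasks = [(1.0, 1.2), (3.0, 2.8), (-2.0, -1.9)]
theta0_star = meta_train(tasks, theta0=5.0, eta=0.1, gamma=0.05, epochs=200)
```

For a deep model the factor `jac` becomes a matrix involving the Hessian of the training loss, which is the source of the computational burden discussed above.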

Remark 3

One simple way to reduce the computational burden of MAML is to use the first-order gradient for the inner-task update. In particular, define \(\mathbf{sg}(\cdot )\) as an operator that stops the gradient of its input variable; we can then use \(\mathbf{sg}(\varXi _\theta (\theta _0))\) to replace the inner gradient. Therefore, although the gradients in the inner update depend on the intermediate variable \(\theta _t\) and the loss on the training set, the updated parameter \(\theta \) of the model can be formulated as \(\theta _0\) minus some constant gradient values. The final objective of the first-order MAML is

$$\begin{aligned} \min _{\theta _0} \sum _{({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}},{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}})\sim {\mathcal {S}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \ell (f_{\theta _0 - \eta \mathbf{sg}(\varXi _\theta (\theta _0))}({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j). \end{aligned}$$
(7)

This first-order MAML incurs only a slight performance degradation on the benchmark evaluation (Finn et al. 2017a), while it accelerates the meta-training process considerably. Another first-order variant of MAML, Reptile (Nichol et al. 2018), returns to the learning-to-learn objective and optimizes the loss on the training set:

$$\begin{aligned} \min _{\theta _0} \sum _{{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}} \ell (f_{\theta _0 - \eta \mathbf{sg}(\varXi _\theta (\theta _0))}({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j). \end{aligned}$$
(8)

Reptile empirically achieves better few-shot classification results than the first-order MAML, but it still leaves many hyper-parameters to tune during meta-training.
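The effect of the stop-gradient operator in Eq. 7 can be checked on a toy scalar problem. The sketch below assumes the loss \(0.5(\theta - c)^2\) and hypothetical scalar targets `c_train` and `c_test`; the function names are likewise introduced only for illustration. Treating the inner gradient as a constant makes the derivative of the updated parameter w.r.t. \(\theta _0\) equal to 1, so the first-order meta-gradient is just the test-loss gradient evaluated at the updated parameter.

```python
def inner_update(theta0, c_train, eta):
    # One inner step on the toy loss 0.5 * (theta - c_train) ** 2.
    return theta0 - eta * (theta0 - c_train)

def full_meta_grad(theta0, c_train, c_test, eta):
    # Full MAML: back-propagates through the inner step, which adds the
    # factor (1 - eta * Hessian); the Hessian is 1 for this toy loss.
    theta = inner_update(theta0, c_train, eta)
    return (1.0 - eta) * (theta - c_test)

def first_order_meta_grad(theta0, c_train, c_test, eta):
    # First-order MAML: sg(.) treats the inner gradient as a constant,
    # so d theta / d theta0 = 1 and no second-order term appears.
    theta = inner_update(theta0, c_train, eta)
    return theta - c_test

g_full = full_meta_grad(2.0, 1.0, 0.5, eta=0.1)
g_fo = first_order_meta_grad(2.0, 1.0, 0.5, eta=0.1)
```

In this scalar case the two gradients differ only by the positive factor \(1 - \eta \) and therefore point in the same direction; for high-dimensional non-convex models the two directions generally differ, so the first-order variant is an approximation.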

Remark 4

We can think of MAML as a surrogate implementation of the bi-level learning-to-learn objective in Eq. 4. First, it is the model initialization that is shared among tasks; second, instead of optimizing the model over the training set until convergence, MAML simplifies this optimization process to a fixed number of gradient steps. So in MAML, a good enough task-specific classifier is assumed to be obtainable thanks to a good initialization. There are two main advantages of such decisions. On the one hand, it gets rid of the inner-task minimization problem and simplifies the overall optimization. On the other hand, the conservative fixed number of gradient descent steps suits the few-shot learning problem, where training data are limited.

Remark 5

There are several hyper-parameters in the training process of MAML. In addition to the number of steps (T) and learning rate (\(\eta \)) in the inner-task optimization, the meta-level learning rate and schedule should also be determined for model training. With so many hyper-parameters, however, it is hard to tune MAML in practice (Chen et al. 2019). Antoniou et al. (2018) provide a lot of tricks and objective transformations to make MAML work better.

Remark 6

MAML optimizes an initial model over tasks of the same form, which can also be used for sample-efficient reinforcement learning (RL). Different from few-shot classification scenarios, where tasks during the model evaluation phase are sampled from the unseen classes, the tasks in RL fit the assumption of a shared task distribution. Finn et al. (2017a) have verified the usage of MAML in several RL problems. In this paper, we focus on applying MAML to few-shot classification.

4.2 Using the adaptively initialized optimizer

We decouple the model \(f_\theta :{\mathbb {R}}^{D}\rightarrow {\mathbb {R}}^N\) into two parts. To make a prediction on an instance, the feature extractor \(\phi :{\mathbb {R}}^{D}\rightarrow {\mathbb {R}}^{d}\) maps the input into an embedding space, and a top-layer classifier \(W\in {\mathbb {R}}^{d\times N}\) carries out the final prediction (Footnote 3). Each column of W corresponds to one of the N classes. In other words, \(f(\mathbf{x}) = W^\top \phi (\mathbf{x})\). The feature embedding extracts general representations among classes (Achille and Soatto 2018), while the top-layer classifier encodes the relationship between an embedding feature and a particular class.

Consider a binary example of classifying dog and cat. The embedding \(\phi \) reveals the features shared among all possible classes, while the columns of W depict the relatedness between features and classes in the task. For example, a classifier for “dog” will have larger weights on the features that discriminate dogs from other classes. Focusing on the top-layer classifier, it is noteworthy that there is a one-to-one correspondence between the columns of the classifier and the classes. In this case, MAML would construct an initial \(W_0\) whose columns are biased towards dog and cat respectively, so that an effective final classifier can be obtained with a limited number of gradient descent steps on a few examples. However, when facing another task that classifies cat and dog, i.e., a class-level permutation of the previous task, the previously learned initial classifier \(W_0\) will have a negative effect because its columns no longer match the classes whose features they stress. In other words, a single model initialization cannot handle heterogeneous tasks effectively.
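The class-permutation issue can be made concrete with a toy computation. The following sketch is purely illustrative: the two-dimensional features, the identity-like \(W_0\), and the helper `predict` are hypothetical and not taken from the experiments.

```python
def predict(W_cols, feature):
    """Return the index of the column of W with the largest inner product."""
    scores = [sum(w_i * f_i for w_i, f_i in zip(col, feature)) for col in W_cols]
    return scores.index(max(scores))


# Hypothetical 2-D embedding: the first coordinate fires for dogs, the second for cats.
dog_feature, cat_feature = [1.0, 0.1], [0.1, 1.0]

# A fixed initialization whose column 0 is tuned to "dog" and column 1 to "cat".
W0 = [[1.0, 0.0], [0.0, 1.0]]

# Task A uses the labels dog=0, cat=1. Task B is the permuted task: cat=0, dog=1.
correct_a = int(predict(W0, dog_feature) == 0) + int(predict(W0, cat_feature) == 1)
correct_b = int(predict(W0, dog_feature) == 1) + int(predict(W0, cat_feature) == 0)
print(correct_a, correct_b)  # 2 0
```

The same initial classifier is perfect on one label ordering and completely wrong on the other, so the inner-task updates must first undo the initialization before they can help.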

Inspired by this observation, we aim to make the learned model initialization adaptable to all kinds of tasks by incorporating the task context. By encoding the task property through a limited number of instances, the model initialization can be better estimated from such a “first impression”. After that, more details can be filled in by inner-task gradient updates. In the “dog and cat” example, the few-shot training set helps to recognize the relationship between each class and a particular classifier, so an adapted initial classifier can be generated easily by fully reusing the previously learned initial model. Therefore, we propose our AdaptiVely InitiAlized Task OptimizeR (Aviator) approach for few-shot learning (Fig. 3).

Fig. 3 The workflow of the model-agnostic meta-learning (MAML) and our proposed AdaptiVely InitiAlized Task OptimizeR (Aviator) approach. Instead of updating the task classifier based on a non-informative initialization, Aviator takes the task characteristics into account to build an adaptively initialized model. \(\varXi (\cdot )\) is the gradient operator

Since the embedding function \(\phi \) extracts general discriminative features among classes, we attribute the main problem of MAML to the non-informative top-layer classifier initialization. In our Aviator approach, we use the training set embedding of the current task as the context, and use another mapping g to generate the initialization of W. Note that we do not treat the output of g as the new classifier, and still use an adaptively updated W for the final classification. In the inner-task optimization, the gradients w.r.t. W (and \(\phi \)) are used to update the model, and g is only used to initialize W.

The main difficulty lies in the fact that g only affects the initial value of W and will be untouched in the following gradient updates and classification process. In other words, after obtaining the updated classifier over the few-shot training set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}\), the loss over the corresponding test set \({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}\) has nothing to do with g. Therefore, this kind of adaptive strategy blocks the gradient flow to g, which makes it hard to update g during the meta-training stage. To solve this problem, we propose to re-parameterize the gradient update flow with the first-order approximation.

4.2.1 Enable the gradient flow

The inner-task updates of the model \(f_\theta \) depend on the initialization and the gradient of the model parameters. Denoting \(\theta = \{W, \phi \}\), we can expand the update process as

$$\begin{aligned} W = W_0 - \eta \varXi _W({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}, W_0), \quad \phi = \phi _0 - \eta \varXi _\phi ({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}, \phi _0). \end{aligned}$$

As in the previous discussion, we list one gradient step here; the extension to more steps is straightforward. The next-step gradient depends on the training set loss evaluated at the previous-step model parameters. When using first-order updates, we stop computing the gradient of \(\varXi _W({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}, W_0)\) (resp. \(\varXi _\phi ({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}, \phi _0)\)) w.r.t. W (resp. \(\phi \)), so the final updated W (resp. \(\phi \)) is a function of \(W_0\) (resp. \(\phi _0\)). Denote by \(\phi ({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}) = \{\phi (\mathbf{x}); \forall \mathbf{x}\in {\mathcal {D}}^{{\mathcal {S}}}_\mathbf{train}\}\) the embedding of the training set; we then apply a function g to \(\phi _0({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})\) to generate the initial top-layer classifier \(W_0\). To enable gradient descent for the initialization generation function g, we use the following three-step re-parameterization strategy. First, the task-specific initial classifier is generated as \(W_0 = g(\phi _0({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}}))\). Second, the value of this initial point is assigned to an intermediate initialization \({\hat{W}}_0 = \mathbf{sg}(W_0)\), which is used to compute the subsequent gradient but is detached from g. Third, after computing the gradient w.r.t. \({\hat{W}}_0\), we re-introduce g to generate the final classifier \(W = g(\phi _0({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})) - \eta \mathbf{sg}(\varXi _W({\hat{W}}_0))\).
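The three-step re-parameterization can be checked on a scalar toy model. The sketch below is an illustration under strong assumptions rather than the Aviator implementation: g is a hypothetical linear map \(g(c) = a\,c\) of a scalar context c, the losses are squared errors on single points, \(\mathbf{sg}\) is read as the stop-gradient operator, and all names and numbers are made up. Because the inner gradient is treated as a constant, the meta-gradient w.r.t. the parameter of g is the query-loss gradient at W multiplied by the derivative of g.

```python
def first_order_update(a, ctx, x_tr, y_tr, eta=0.05, frozen_grad=None):
    """Return (W, inner gradient) with W = g(ctx) - eta * sg(inner gradient)."""
    w0 = a * ctx  # step 1: W_0 = g(context), here a hypothetical linear g
    if frozen_grad is None:
        w0_hat = w0  # step 2: the value of W_0 only, playing the role of sg(W_0)
        frozen_grad = 2.0 * x_tr * (w0_hat * x_tr - y_tr)  # training-loss gradient
    # step 3: re-introduce g and treat the inner gradient as a constant
    return w0 - eta * frozen_grad, frozen_grad


def query_loss(w, x_te, y_te):
    """Squared error of the scalar model on one held-out point."""
    return (w * x_te - y_te) ** 2


def meta_gradient(a, ctx, x_tr, y_tr, x_te, y_te, eta=0.05):
    """d(query loss)/da: the frozen inner gradient contributes nothing, so dW/da = ctx."""
    w, _ = first_order_update(a, ctx, x_tr, y_tr, eta)
    return 2.0 * x_te * (w * x_te - y_te) * ctx


# Hypothetical numbers for illustration only.
a, ctx = 0.5, 2.0
x_tr, y_tr, x_te, y_te = 1.0, 3.0, 2.0, 5.0
analytic = meta_gradient(a, ctx, x_tr, y_tr, x_te, y_te)
print(analytic)
```

In an automatic differentiation framework the same effect would be obtained with that framework's stop-gradient (detach) primitive; the closed form above only makes explicit which path carries the gradient to g.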

In summary, we have the following objective for Aviator:

$$\begin{aligned}&\min _{\phi _0, g}&\quad \sum _{{\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \sum _{(\mathbf{x}^{{\mathcal {S}}}_j,\mathbf{y}^{{\mathcal {S}}}_j)\in {\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {test}}} \ell (f_{\theta _0 - \eta \mathbf{sg}(\varXi _f(\hat{\theta }_0))}({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})(\mathbf{x}^{{\mathcal {S}}}_j), \mathbf{y}^{{\mathcal {S}}}_j)\nonumber \\&\mathrm{s.t.}&\quad \theta _0 = \{W_0, \phi _0\}; \;\hat{\theta }_0 = \{\mathbf{sg}(W_0), \mathbf{sg}(\phi _0)\},W_0 = g(\phi _0({\mathcal {D}}^{{\mathcal {S}}}_\mathbf{train})). \end{aligned}$$
(9)

4.2.2 Implementation of the classifier initialization mapping

The function g maps the embedding of the training set to the corresponding classifier, which generates an adaptive initialization based on the “first impression” of the data. Two main rules guide the design of g. First, there is a strong relationship between the instances of a particular class and its corresponding classifier. Second, the other classes should act as a set, i.e., their influence should be invariant to their permutation. Hence we implement g based on the idea of deep sets (Zaheer et al. 2017), where the effect of a set is depicted as the transformed sum of the embeddings of its instances. Given the training set embedding \(\phi ({\mathcal {D}}^{{\mathcal {S}}}_{\mathbf {train}})\), we use the mean of the instances of a particular class and the mean of the instances of the complementary classes as the context for a classifier. For the nth class, denote

$$\begin{aligned} \mathbf{p}_n = \frac{1}{K} \sum _{\mathbf{y}_i = n} \phi (\mathbf{x}_i),\; \mathbf{p}^\complement _n = \frac{1}{(N-1)K} \sum _{\mathbf{y}_i \ne n} \phi (\mathbf{x}_i). \end{aligned}$$
(10)

Here \(\mathbf{y}_i = n\) and \(\mathbf{y}_i \ne n\) select the instances that are in and not in the nth class, respectively. After concatenating \(\mathbf{p}_n\) and \(\mathbf{p}^\complement _n\), we use a two-layer fully connected network to generate the d-dimensional classifier for the nth class. The bias can be generated in a similar way. We use this simple implementation to show the effectiveness of the Aviator idea. More complicated implementations can be found in the ablation study part of the experiments. The whole flow of our Aviator approach can be found in Algorithm 2.
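A minimal sketch of the mapping g built from Eq. 10 is given below. It is not the authors' code: the helper names, the tiny layer sizes, the random weights, and the omission of bias terms are assumptions made for illustration. Each class context is the concatenation of \(\mathbf{p}_n\) and \(\mathbf{p}^\complement _n\), and the same two-layer network is applied to every class.

```python
import random


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def class_context(embeddings, labels, n):
    """Concatenate p_n (mean of class n) and its complement mean, as in Eq. 10."""
    own = [e for e, y in zip(embeddings, labels) if y == n]
    rest = [e for e, y in zip(embeddings, labels) if y != n]
    return mean_vector(own) + mean_vector(rest)


def two_layer_net(x, W1, W2):
    """Fully connected layer, ReLU, fully connected layer (biases omitted)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def generate_initial_classifier(embeddings, labels, W1, W2):
    """Return {class label: d-dimensional initial classifier column}."""
    return {n: two_layer_net(class_context(embeddings, labels, n), W1, W2)
            for n in sorted(set(labels))}


# Hypothetical toy sizes: d = 3, hidden size 4, a 3-way 2-shot task.
rng = random.Random(0)
d, hidden, n_way, k_shot = 3, 4, 3, 2
W1 = [[rng.uniform(-1, 1) for _ in range(2 * d)] for _ in range(hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
embeddings = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_way * k_shot)]
labels = [0, 0, 1, 1, 2, 2]
W0 = generate_initial_classifier(embeddings, labels, W1, W2)
```

Because each column depends only on the class mean and the mean of the remaining instances, relabeling the classes simply permutes the generated columns, and the same function applies to a task with any number of classes.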

Algorithm 2 The whole flow of the Aviator approach

Remark 7

With the decoupled top-layer classifier W and the embedding \(\phi \), a straightforward way to construct such an adaptive model is to make W the output of g, i.e., \(W = g(\phi ({\mathcal {D}}_\mathbf{train}))\). So we have \(\hat{y_j} = g(\phi ({\mathcal {D}}_\mathbf{train}))^\top \phi (\mathbf{x}_j)\). This incorporates the few-shot training set into the determination of the whole initialization, and increases the complexity through the mapping composition. The main difference between this implementation and ours is that it includes g as a part of the predictor and also updates g during the inner-task optimization. Since g could be complicated, the inner updates over g will not only increase the computational burden but also make the model prone to over-fitting. In Aviator, by contrast, g only helps determine the initialization of the linear top-layer classifier W, and does not participate in the inner-task update. Empirically, we find that this naive task-adaptive solution is hard to train and over-fits easily.

Remark 8

It is notable that due to the fixed initial classifier used in the vanilla MAML, the form of training and evaluation few-shot tasks should be consistent. In other words, when sampling 5-way tasks during the meta-training phase, the initial classifier is learned as a \(d\times 5\) matrix, which is hard to generalize to tasks with other numbers of classes. In contrast, Aviator generates the classifier initialization for each class based on the context of the current class and the holistic effect of other classes. Thus it is easy for Aviator to deal with tasks with more classes, which makes our proposed method practical.

5 Experiments

We verify the effectiveness of our proposed AdaptiVely InitiAlized Task OptimizeR (Aviator) approach on both synthetic and real benchmark data sets. We first introduce the details of the setups, including data sets, implementations, and evaluation protocols. Then we describe the results. Finally, we analyze the model through ablation studies.

5.1 Setups

Here we introduce the general setups for few-shot classification in our experiments.

5.1.1 Data sets

Two benchmarks are used in our experiments. The miniImageNet dataset (Vinyals et al. 2016) is a subset of the ILSVRC-12 dataset (Russakovsky et al. 2015). There are 100 classes in total and 600 examples in each class. For evaluation, we follow the split of Ravi and Larochelle (2017) and use 64 of the 100 classes for meta-training, 16 for validation, and 20 for meta-test (model evaluation). In other words, a model is trained on few-shot tasks sampled from the 64 seen classes during meta-training, and the best model is selected based on the few-shot classification performance over the 16 validation classes. The final model is evaluated on few-shot tasks sampled from the 20 unseen classes. Following Vinyals et al. (2016), Finn et al. (2017a), and Snell et al. (2017), all images are first re-scaled to the fixed size \(3\times 84\times 84\) during pre-processing.

The Caltech-UCSD Birds (CUB) 200-2011 data set (Wah et al. 2011) is designed for fine-grained classification and contains a total of 11,788 images of birds over 200 species. Due to the similarity between classes, it yields difficult classification tasks when only limited training examples are given. Following the configuration of Triantafillou et al. (2017) and Chen et al. (2019), we use the provided bounding boxes of CUB to crop the center object of each image. All images in CUB are also resized to \(3\times 84\times 84\). 100 classes are used for meta-training, 50 classes are used for model selection, and the last 50 classes serve as the unseen set for model evaluation. Since there is no publicly released split from Triantafillou et al. (2017), we randomly shuffle all 200 classes in CUB and choose the seen and unseen sets ourselves.

5.1.2 Evaluation protocols

We use the same evaluation protocol over all benchmark data sets (Vinyals et al. 2016; Finn et al. 2017a; Snell et al. 2017; Triantafillou et al. 2017), i.e., the performance of 1-shot 5-way and 5-shot 5-way classification. We keep the same configuration of tasks between meta-training and meta-test. In other words, for the 1-shot 5-way problem, we keep sampling 1-shot 5-way training sets (a.k.a. support sets in the literature) from the seen class set during meta-training. Besides, to evaluate the optimized classifier for a particular N-way problem, another 15 examples from each of the N classes are sampled as the test set (a.k.a. query set in the literature) to provide the loss and supervision. Most previous methods are evaluated on 600 tasks sampled from the unseen class set (Vinyals et al. 2016; Finn et al. 2017a; Snell et al. 2017; Triantafillou et al. 2017; Antoniou et al. 2018; Chen et al. 2019), which introduces high variance. In contrast, in our experiments we test the model over 10,000 sampled few-shot tasks (Rusu et al. 2018; Ye et al. 2018). The mean accuracy and 95% confidence interval are recorded for comparison.
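The effect of the number of evaluation tasks on the confidence interval can be illustrated numerically. The sketch below assumes the common normal-approximation interval, mean ± 1.96 · s / √n, which the text does not spell out; the per-task accuracies are simulated and are not results from this paper, and `mean_and_ci95` is a hypothetical helper.

```python
import math
import random
import statistics


def mean_and_ci95(accuracies):
    """Return (mean, half-width) using the normal approximation 1.96 * s / sqrt(n)."""
    n = len(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return statistics.mean(accuracies), half_width


rng = random.Random(0)
# Simulated per-task accuracies (hypothetical numbers for illustration only).
task_accuracies = [min(1.0, max(0.0, rng.gauss(0.5, 0.1))) for _ in range(10000)]

mean_600, ci_600 = mean_and_ci95(task_accuracies[:600])
mean_10k, ci_10k = mean_and_ci95(task_accuracies)
print(ci_600, ci_10k)
```

Under this approximation the half-width shrinks with the square root of the number of tasks, so moving from 600 to 10,000 tasks narrows the interval by roughly a factor of four.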

5.1.3 Implementation details

Following the setting of most existing methods (Vinyals et al. 2016; Finn et al. 2017a; Snell et al. 2017; Triantafillou et al. 2017), we use 4 identical neural network blocks to implement the embedding backbone \(\phi \). In each block, four components, i.e., a \(3\times 3\) convolution with 64 filters, a batch normalization (Ioffe and Szegedy 2015), a ReLU activation, and a \(2\times 2\) max-pooling, are stacked on top of each other. Different from previous methods, we add a global max-pooling layer after the last block, so the backbone outputs 64-dimensional embeddings. The global pooling not only handles spatial transformations effectively but also relieves the optimization burden considerably.
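The feature-map sizes implied by this backbone can be traced with a few lines of arithmetic. The sketch assumes that each \(3\times 3\) convolution is padded so that it preserves the spatial size and that each \(2\times 2\) max-pooling halves it with flooring; neither detail is stated in the text, and `backbone_shapes` is a hypothetical helper.

```python
def backbone_shapes(height=84, width=84, channels=64, blocks=4):
    """Return the (channels, height, width) shape after each conv block."""
    shapes = []
    for _ in range(blocks):
        # Assumed: size-preserving convolution, then 2x2 max-pooling with flooring.
        height, width = height // 2, width // 2
        shapes.append((channels, height, width))
    return shapes


shapes = backbone_shapes()
final_channels, final_h, final_w = shapes[-1]
flattened_dim = final_channels * final_h * final_w  # size without global pooling
pooled_dim = final_channels                         # size after global max-pooling
print(shapes, flattened_dim, pooled_dim)
```

Under these assumptions the last block produces a \(64\times 5\times 5\) map, so global max-pooling reduces the embedding from 1600 to 64 dimensions.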

Before the meta-training stage, we try to find a good initialization for the embedding \(\phi \). In particular, we add a linear layer on the backbone output and optimize a 64-way classification problem on the meta-training set with the cross-entropy loss function. Stochastic optimization with ADAM (Kingma and Ba 2014) is used for this step. The 16 classes for model selection also assist the choice of the pre-trained model. After each epoch, we take the current embedding and measure the nearest neighbor based few-shot classification performance on few-shot tasks sampled from these 16 classes. The best-performing embedding function is recorded. After that, the learned backbone is used to initialize the embedding part \(\phi \) of the whole model. For the meta-learning phase, ADAM is used with an initial learning rate of 1e−4 for the backbone part and a 10 times larger rate for the top layers. The learning rate of all layers is halved after optimizing 2000 tasks. During the experiments, we find that the pre-training stage facilitates the learning of the few-shot optimizer a lot, which is consistent with Rusu et al. (2018) and Ye et al. (2018). Besides, we set the inner-task learning rate \(\eta =0.05\), with \(T=15\) gradient descent updates. The sizes of the two hidden layers in the initialization generation function are 128 and 64 respectively.
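The meta-level learning rates can be written as a small schedule function. The sketch assumes that the halving repeats every 2000 tasks, which is one possible reading of the description above; the function name and argument names are hypothetical.

```python
def meta_learning_rates(tasks_seen, base_lr=1e-4, top_multiplier=10.0, halve_every=2000):
    """Return (backbone learning rate, top-layer learning rate) after `tasks_seen` tasks."""
    # Assumed reading: the rates of all layers are halved once per 2000 optimized tasks.
    factor = 0.5 ** (tasks_seen // halve_every)
    return base_lr * factor, base_lr * top_multiplier * factor


for seen in (0, 2000, 4000):
    print(seen, meta_learning_rates(seen))
```

If the rate is instead halved only once, the exponent would be capped at one; the rest of the schedule is unchanged.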

5.2 Synthetic regression tasks

We first use a simple regression task to verify the ability of Aviator to deal with complex tasks. In each few-shot regression task, a limited number of points are provided for curve fitting, and the model is required to predict the values of unseen points (sampled from the same curve) precisely. To increase the difficulty of this task, three curve families are used, namely the linear (\(y = \alpha x + \beta \)), square (\(y=\alpha (x - \beta )^2 + \gamma \)), and sine (\(y=\alpha \sin (\beta x + \gamma )\)) families. The ranges of the parameters in the function families are listed in Table 1.

Table 1 The range of hyper-parameters in three families of curves

Regression tasks in both the meta-training and meta-test stages are sampled in a hierarchical way. First, one of the three curve families is chosen; then the parameters of a curve are sampled from the pre-specified ranges. The training and test points in regression tasks are perturbed from the oracle curve by noise drawn from the normal distribution \({\mathcal {N}}(0, 0.3)\). Five points are used as the training set. After fitting the curve with the square loss, the model is evaluated on 10 test points. Both MAML and Aviator are meta-trained and meta-tested with 10,000 tasks. A 4-layer MLP with ReLU activation and hidden layer dimension 100 is used as the backbone for both methods. The Mean Square Error (MSE) is computed to evaluate the ability of the few-shot regression model. Given the heterogeneity of the tasks, Aviator is expected to get better results.
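The hierarchical task sampling can be sketched as follows. The parameter ranges below are placeholders, since the values of Table 1 are not reproduced here; the input range, the uniform sampling, and the reading of 0.3 as a standard deviation are likewise assumptions, and all names are hypothetical.

```python
import math
import random

# Placeholder ranges: these are NOT the values of Table 1.
RANGES = {
    "linear": {"alpha": (-1.0, 1.0), "beta": (-1.0, 1.0)},
    "square": {"alpha": (-1.0, 1.0), "beta": (-1.0, 1.0), "gamma": (-1.0, 1.0)},
    "sine": {"alpha": (0.5, 2.0), "beta": (0.5, 2.0), "gamma": (0.0, math.pi)},
}

CURVES = {
    "linear": lambda x, p: p["alpha"] * x + p["beta"],
    "square": lambda x, p: p["alpha"] * (x - p["beta"]) ** 2 + p["gamma"],
    "sine": lambda x, p: p["alpha"] * math.sin(p["beta"] * x + p["gamma"]),
}


def sample_regression_task(rng, n_train=5, n_test=10, noise=0.3, x_range=(-5.0, 5.0)):
    """Pick a curve family, then its parameters, then noisy train and test points."""
    family = rng.choice(sorted(CURVES))
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES[family].items()}

    def draw(n):
        xs = [rng.uniform(*x_range) for _ in range(n)]
        # Assumed: 0.3 is used as the standard deviation of the additive noise.
        return [(x, CURVES[family](x, params) + rng.gauss(0.0, noise)) for x in xs]

    return family, params, draw(n_train), draw(n_test)


rng = random.Random(0)
family, params, train_pts, test_pts = sample_regression_task(rng)
print(family, len(train_pts), len(test_pts))
```

Because the family is drawn before the parameters, consecutive tasks can differ in functional form as well as in scale, which is the source of the heterogeneity discussed above.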

After meta-training, MAML and Aviator achieve 3.043 and 1.926 MSE respectively. One regression task for each curve family is randomly chosen for illustration (Fig. 4). MAML uses the same model initialization in all cases, which is non-informative for such complex tasks, whereas the initial curve of Aviator clearly adapts to the different scenarios.

Fig. 4 Initial and updated regression models of MAML and Aviator on 3 different types of curves. Learning with 5 noisy examples (the black dots), the model is asked to fit the curve

5.3 Synthetic classification tasks

To visualize the classification boundary, we test MAML and Aviator on a 2-D classification task. We generate 100 classes in total based on the normal distribution, with means sampled from \([0, 1]^2\) and covariance 0.05I. Half of the 100 classes are used for meta-training, half of the remaining classes are used for model selection, and the last 25 classes are used for meta-test. Equipped with a two-layer MLP as the encoder, MAML and Aviator are trained on 50,000 5-shot 3-way tasks and tested on 10,000 tasks. The average few-shot classification accuracies of MAML and Aviator are listed in Table 2. The results show that Aviator achieves much better classification results than MAML. Besides, to reach the same quality of the few-shot classification model (w.r.t. the test accuracy), Aviator requires far fewer steps. So the adaptively initialized model can effectively reduce the computational burden of the gradient-based meta-learning approach.
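The construction of the synthetic classification tasks can be sketched as below. Uniform sampling of the class means over \([0, 1]^2\) and the use of 15 query points per class are assumptions (the text states neither for this data set), and all function names are hypothetical.

```python
import random


def make_class_means(rng, n_classes=100):
    """Draw one 2-D mean per class, assumed uniform over the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n_classes)]


def split_classes(means):
    """50 classes for meta-training, 25 for model selection, 25 for meta-test."""
    return means[:50], means[50:75], means[75:]


def sample_task(rng, class_means, n_way=3, k_shot=5, n_query=15, variance=0.05):
    """Sample an n_way task; each class is an isotropic Gaussian with covariance variance * I."""
    std = variance ** 0.5
    support, query = [], []
    for label, (mx, my) in enumerate(rng.sample(class_means, n_way)):
        for i in range(k_shot + n_query):
            point = (rng.gauss(mx, std), rng.gauss(my, std))
            (support if i < k_shot else query).append((point, label))
    return support, query


rng = random.Random(0)
train_means, val_means, test_means = split_classes(make_class_means(rng))
support, query = sample_task(rng, train_means)
print(len(train_means), len(val_means), len(test_means), len(support), len(query))
```

Since the label of a class inside a task is just its position in the sampled list, the same three classes can appear under any label ordering, which is exactly the permutation effect visualized in this experiment.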

One 3-way task is randomly selected for visualization, as shown in Fig. 5. The first row in the figure corresponds to the task, while the second row presents the same task with a permutation of classes. 200 instances from each class are drawn to show the range of the class, and the instances and classification boundaries of different classes are denoted by different colors. MAML uses a uniform, non-informative initial classifier, which is difficult to apply to various tasks, so more gradient descent steps are required to update the initial model to fit a specific task. Aviator, in contrast, deals with the two permuted tasks well, and its adaptively generated initial classifier is close to the final updated one.

Table 2 The 3-way classification performance of MAML and Aviator on the synthetic classification data set. The average accuracy using inner-task step-size 0.01 and different numbers of updates are reported
Fig. 5 Initial and updated classification models of MAML and Aviator on 2 different 3-way tasks. Colors of instances denote the class, and the shaded regions show the classification boundaries of the specified model. Each row corresponds to a task, and the two tasks differ only in the order of the classes

Table 3 The average classification accuracy and 95% confidence interval when different methods are evaluated on few-shot tasks sampled from MiniImageNet. All methods use the 4-layer ConvNet Backbone

5.4 Benchmark results

Table 3 shows the results of various few-shot learning methods on the MiniImageNet data set, where both 1-shot 5-way and 5-shot 5-way tasks are investigated. The main results of the comparison methods are cited from a recent empirical study on few-shot learning (Chen et al. 2019). For MAML, there is a noticeable gap between the values reported by Finn et al. (2017a) and the reproduced ones. By adding some specific training strategies, Antoniou et al. (2018) improve the few-shot classification ability of MAML. We also re-implement the ProtoMAML (Triantafillou et al. 2019) approach, which constructs an embedding learning objective based on the relationship between the prototypes and the classifiers. Owing to its adaptive embedding, ProtoMAML improves over vanilla MAML considerably, as reported in their paper. Our Aviator approach gets the best performance among all compared methods, which validates the importance of the task-adaptive initialization in the few-shot inner-task update. Note also that our results are evaluated on 10,000 tasks, which makes them more convincing, with a narrower confidence interval. Similar trends can also be found on the CUB data set in Table 4.

Table 4 The average classification accuracy and 95% confidence interval when different methods are evaluated on few-shot tasks sampled from CUB. All methods use the 4-layer ConvNet Backbone

5.5 Ablation studies and discussions

In this subsection, we analyze various properties of our proposed Aviator approach on the MiniImageNet data set.

5.5.1 The influence of different components in the inner-task update

Various key components are involved in updating the classifier within a task. While MAML and Aviator focus on the model initialization, Meta-SGD (Li et al. 2017) adds another learnable parameter to assist the gradient descent; MT-Net (Lee and Choi 2018) enhances the construction of the backbone by decomposing the embedding function into task-common and task-specific feature generation flows; Antoniou et al. (2018) propose an orthogonal way to train MAML, which uses special tricks such as accumulated annealed step-wise losses. Since ProtoMAML (Triantafillou et al. 2019) dynamically updates the embedding for each task in a MAML way, its further performance improvement verifies the importance of considering the task context and making the method adaptive. Table 5 lists the results of all these methods. Aviator gets the best performance with only the addition of a simple top-layer initialization generation function.

Table 5 Comparison between different MAML-based improvements on MiniImageNet benchmark with the 4-layer ConvNet backbone

5.5.2 Different implementation of the g

In the previous experiments, we use a simple two-layer fully connected network to implement the initialization generation function g. The attention mechanism (Vaswani et al. 2017) can also be introduced to generate the top-layer linear classifier. Denote the embedding prototypes of the training set as \(P = [\mathbf{p}_1, \ldots , \mathbf{p}_N]\in {\mathbb {R}}^{d\times N}\). For the nth class, the corresponding classifier is generated by \(\mathbf{p}_n + \mathbf{Softmax}(\mathbf{p}_n^\top M P)P\). Here M is a learnable matrix, and \(\mathbf{Softmax}(\cdot )\) is used to normalize the similarity values between one class and the other prototypes. The attention can better depict the relationship between classes, which results in the Aviator\(^+\) approach. The results of all methods are shown in Table 6, where Aviator\(^+\) further improves the few-shot classification performance. We also list the results of Feat, which likewise takes advantage of self-attention to adapt the embedding. Note that, different from the embedding-based approach Feat, Aviator\(^+\) incorporates the task context with an optimization-based strategy, and could be further improved with other helpful objectives (Ye et al. 2018).
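The attention-based generation can be sketched in a few lines. Since \(P\in {\mathbb {R}}^{d\times N}\) and the softmax output has N entries, the sketch reads the second term as the attention-weighted sum of the prototype columns; this reading, the helper names, and the toy numbers are assumptions rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_classifier(prototypes, M, n):
    """Return p_n + sum_k a_k * p_k, where a = Softmax(p_n^T M P)."""
    p_n = prototypes[n]
    d = len(p_n)
    # q = M^T p_n, so that the dot product of q and p_k equals p_n^T M p_k.
    q = [sum(p_n[i] * M[i][j] for i in range(d)) for j in range(d)]
    weights = softmax([sum(q[j] * p_k[j] for j in range(d)) for p_k in prototypes])
    mixed = [sum(w * p_k[j] for w, p_k in zip(weights, prototypes)) for j in range(d)]
    return [a + b for a, b in zip(p_n, mixed)]


# Hypothetical 2-D prototypes for a 2-way task and a zero matrix M.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
M_zero = [[0.0, 0.0], [0.0, 0.0]]
print(attention_classifier(prototypes, M_zero, 0))  # [1.5, 0.5]
```

With a zero M the attention is uniform and the classifier is the class prototype plus the mean of all prototypes; a learned M shifts the weights towards the prototypes most related to the nth class.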

Table 6 Comparison between different methods on MiniImageNet benchmark with the 4-layer ConvNet backbone. By using the attention, Aviator\(^+\) achieves better results

5.5.3 Generalization of Aviator to more ways

One promising feature of the embedding-based few-shot learning approach is its ability to generalize to different forms of tasks. As in the matching network (Vinyals et al. 2016) and the prototypical network (Snell et al. 2017), by learning an embedding tailored to few-shot tasks, tasks with various numbers of classes can all be classified via the nearest neighbor rule. For MAML-based approaches, the main obstacle is the strong coupling between the classifier and the number of classes. By adaptively generating the initial classifier from the training set embedding, our Aviator approach handles tasks with different numbers of ways effectively. With models trained on 1-shot 5-way tasks, we test the same model on 1-shot \(\{5, 10, 15, 20\}\)-way tasks. The results of our re-implemented ProtoNet (Snell et al. 2017), Aviator, and Aviator\(^+\) can be found in Fig. 6. Aviator and its variants get the best performance in most scenarios.

Fig. 6 Testing Aviator variants and our re-produced ProtoNet on MiniImageNet with different numbers of ways. After meta-training with 5-way tasks, these models are evaluated on \(\{5,10,15,20\}\)-way 1-shot/5-shot tasks

6 Conclusion

Model-agnostic meta-learning (MAML) is an important line of few-shot learning research, which encapsulates the model information into the gradient and supervises the model update by the loss on the training set. Existing results reveal the difficulty MAML has in tackling complicated tasks, and its weakness in adaptively building the correspondence between a class and an initialized model. We propose our AdaptiVely InitiAlized Task OptimizeR (Aviator) approach to incorporate the task context and make gradient-based meta-learning applicable to various kinds of few-shot learning tasks. With a re-parameterization strategy, the task-specific initialization facilitates the inner-task training considerably. Visualization experiments on synthetic data show that the adaptively generated model initialization of Aviator is consistent with the first impression of the task. Empirical results on benchmark data sets verify the superiority of Aviator as well. The improvement of Aviator over MAML can be applied to almost all fields where MAML works well. We will investigate more real applications in the future.