A Closer Look at Few-shot Classification Again
Abstract

Few-shot classification consists of a training phase where a model is learned on a relatively [...]

[...] design of the training algorithm should prepare for the algorithm used for adaptation. For this reason, pioneering works (Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle, 2017) formalize the problem within a meta-learning framework,
[...] transfer learning literature. Such meta-level understanding can be useful for future few-shot learning research. The analysis for each phase leads to the following key observations:

1. We observed a different neural scaling law in few-shot classification: test error falls off as a power law with the number of training classes, instead of the number of training samples per class. This observation highlights the importance of the number of training classes in few-shot classification and may help future research further understand the crucial difference between few-shot classification and other vision tasks.

2. We found two evaluated datasets on which increasing the scale of the training dataset does not always lead to better few-shot performance. This suggests that it is not realistic to train a model that can solve all possible tasks well just by feeding it a very large amount of data. It also indicates the importance of properly filtering training knowledge for different few-shot classification tasks.

3. We found that standard ImageNet performance is not a good predictor of few-shot performance for supervised models (contrary to previous observations in other vision tasks), but it does predict well for self-supervised models. This observation may become the key to understanding both the difference between few-shot classification and other vision tasks, and the difference between supervised learning and self-supervised learning.

4. We found that, contrary to the common belief that fine-tuning the whole network with few samples would lead to severe overfitting, vanilla fine-tuning performs the best among all adaptation algorithms even when data is extremely scarce, e.g., in 5-way 1-shot tasks. In particular, partial-finetune methods that are designed to overcome the overfitting problem of vanilla finetuning in the few-shot setting perform worse. The advantage of finetuning grows as the number of ways, the number of shots, and the degree of task distribution shift increase. However, finetuning methods suffer from extremely high time complexity. We show that the difference in these factors is the reason why state-of-the-art methods on different few-shot classification benchmarks differ in their adaptation algorithms.

2. The Problem of Few-shot Classification

Few-shot classification aims to learn a model that can quickly adapt to a novel classification task given only a few observations. In the training phase, given a training dataset D^train = {(x_n, y_n)}_{n=1}^{|D^train|} with N_C classes, where x_i ∈ R^D is the i-th image and y_i ∈ [N_C] is its label, a model f_θ is learned via a training algorithm A^train, i.e., A^train(D^train) = f_θ. In the adaptation phase, a series of few-shot classification tasks T = {τ_i}_{i=1}^{N_T} are constructed from the test dataset D^test, whose classes and domains are possibly different from those of D^train. Each task τ consists of a support set S = {(x_i, y_i)}_{i=1}^{N_S} used for adaptation and a query set Q = {(x*_i, y*_i)}_{i=1}^{N_Q} that is used for evaluation and shares the same label space with S. τ is called an N-way K-shot task if there are N classes in the support set S and each class contains exactly K samples. To solve each task τ, the adaptation algorithm A^adapt takes the learned model f_θ and the support set S as inputs, and produces a new classifier g(·; f_θ, S): R^D → [N]. The constructed classifier is evaluated on the query set Q to test its generalization ability. The evaluation metric is the average performance over all sampled tasks. We denote the resulting average accuracy and the radius of the 95% confidence interval as functions of the training and adaptation algorithms: Avg(A^train, A^adapt) and CI(A^train, A^adapt), respectively.

Depending on the form of the training algorithm A^train, the model f_θ can take different forms. For non-meta-learning methods, f_θ: R^D → R^d is simply a feature extractor that takes an image x ∈ R^D as input and outputs a feature vector z ∈ R^d. Thus any visual representation learning algorithm can be used as A^train. For meta-learning methods, the training algorithm directly aims at optimizing the performance of the adaptation algorithm A^adapt in a learning-to-learn fashion. Specifically, meta-learning methods first parameterize the adaptation algorithm as A^adapt_θ so that it becomes optimizable. Then the model f_θ used for training is set equal to A^adapt_θ, i.e., A^train(D^train) = f_θ = A^adapt_θ. The training process consists of constructing pseudo few-shot classification tasks T^train = {(S^train_t, Q^train_t)}_{t=1}^{N_T^train} from D^train that take the same form as the tasks encountered during adaptation. In each iteration t, just as in the adaptation phase, the model f_θ takes S^train_t as input and outputs a classifier g(·; S_t). Images in Q^train_t are then fed into g(·; S_t), yielding a loss that is used to update f_θ. After training, f_θ is directly used as the adaptation algorithm A^adapt_θ. Although different from non-meta-learning methods, most meta-learning algorithms still set the learnable parameters θ to be the parameters of a feature extractor, making it possible to change the algorithm used for adaptation.
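To make the protocol concrete, here is a minimal sketch (not the authors' code) of the adaptation and evaluation loop, with a nearest-centroid classifier playing the role of g(·; f_θ, S). The feature extractor `f` (mapping a batch of images to an (n, d) array) and the task sampler `sample_task` (returning the support and query sets of an N-way K-shot task) are assumed helpers, not part of the paper:

```python
import numpy as np

def nearest_centroid_classifier(f, support_x, support_y):
    """Build g(.; f, S): assign each query to the class with the closest
    centroid in the (L2-normalized) feature space of the frozen extractor f."""
    feats = f(support_x)                                    # (N_S, d) support features
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    classes = np.unique(support_y)
    centroids = np.stack([feats[support_y == c].mean(0) for c in classes])
    def g(query_x):
        q = f(query_x)
        q = q / np.linalg.norm(q, axis=1, keepdims=True)
        dists = ((q[:, None, :] - centroids[None]) ** 2).sum(-1)   # (N_Q, N)
        return classes[dists.argmin(1)]
    return g

def evaluate(f, sample_task, num_tasks=2000, rng=np.random.default_rng(0)):
    """Return Avg and the radius of the 95% confidence interval over tasks."""
    accs = []
    for _ in range(num_tasks):
        (sx, sy), (qx, qy) = sample_task(rng)       # one N-way K-shot task from D^test
        g = nearest_centroid_classifier(f, sx, sy)
        accs.append((g(qx) == qy).mean())
    accs = np.asarray(accs)
    return accs.mean(), 1.96 * accs.std(ddof=1) / np.sqrt(num_tasks)
```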
3. Are Training and Adaptation Algorithms Uncorrelated?

Given a set of training algorithms M^train = {A_i^train}_{i=1}^{m_1} and a set of adaptation algorithms M^adapt = {A_i^adapt}_{i=1}^{m_2}, we say that M^train and M^adapt are uncorrelated if changing algorithms from M^train does not influence the performance ranking of algorithms from M^adapt, and vice versa. To give a precise description, we first define a partial order.
Table 1. Few-shot classification performance of pairwise combinations of a variety of training and adaptation algorithms. All evaluation tasks are 5-way 5-shot tasks sampled from Meta-Dataset (excluding ImageNet). We sample 2000 tasks per dataset in Meta-Dataset and report the average accuracy over all datasets along with the 95% confidence interval. The algorithms are listed according to their partial order (Definition 3.2), from top to bottom and from left to right. *: training algorithm that uses transductive BN (Bronskill et al., 2020), which produces much higher, unfair performance when Finetune and TSA are used as adaptation algorithms. †: TSA and eTT are both architecture-specific partial-finetune algorithms, so TSA can be used only for CNNs and eTT only for the original ViT.
Adaptation algorithm
Training algorithm Training dataset Architecture MatchingNet MetaOpt NCC LR URL CC TSA/eTT† Finetune
PN miniImageNet Conv-4 48.54±0.4 49.84±0.4 51.38±0.4 51.65±0.4 51.82±0.4 51.56±0.4 58.08±0.4 60.88±0.4
MAML∗ miniImageNet Conv-4 53.71±0.4 53.69±0.4 55.01±0.4 55.03±0.4 55.66±0.4 55.63±0.4 62.80±0.4 64.87±0.4
CE miniImageNet Conv-4 54.68±0.4 56.79±0.4 58.54±0.4 58.26±0.4 59.63±0.4 59.20±0.5 64.14±0.4 65.12±0.4
MatchingNet miniImageNet ResNet-12 55.62±0.4 57.20±0.4 58.91±0.4 58.99±0.4 61.20±0.4 60.50±0.4 64.88±0.4 67.93±0.4
MAML∗ miniImageNet ResNet-12 58.42±0.4 58.52±0.4 59.65±0.4 60.04±0.4 60.38±0.4 60.50±0.4 71.15±0.4 73.13±0.4
PN miniImageNet ResNet-12 60.19±0.4 61.70±0.4 63.71±0.4 64.46±0.4 65.64±0.4 65.76±0.4 70.44±0.4 74.23±0.4
MetaOpt miniImageNet ResNet-12 62.06±0.4 63.94±0.4 65.81±0.4 66.03±0.4 67.47±0.4 67.24±0.4 72.07±0.4 74.96±0.4
DeepEMD miniImageNet ResNet-12 62.67±0.4 64.15±0.4 66.14±0.4 66.14±0.4 68.66±0.4 69.76±0.4 74.21±0.4 74.83±0.4
CE miniImageNet ResNet-12 63.27±0.4 64.91±0.4 66.96±0.4 67.14±0.4 69.78±0.4 69.52±0.4 74.30±0.4 74.89±0.4
Meta-Baseline miniImageNet ResNet-12 63.25±0.4 65.02±0.4 67.28±0.4 67.56±0.4 69.84±0.4 69.76±0.4 73.94±0.4 75.04±0.4
COS miniImageNet ResNet-12 63.99±0.4 66.09±0.4 68.31±0.4 69.26±0.4 70.71±0.4 71.03±0.4 75.10±0.4 75.68±0.4
PN ImageNet ResNet-50 63.68±0.4 65.79±0.4 68.40±0.4 68.87±0.4 69.69±0.4 70.81±0.4 74.15±0.4 78.42±0.4
S2M2 miniImageNet WRN-28-10 64.41±0.4 66.59±0.4 68.67±0.4 69.16±0.4 70.88±0.4 71.38±0.4 74.94±0.4 76.89±0.4
FEAT miniImageNet ResNet-12 65.42±0.4 67.15±0.4 69.06±0.4 69.21±0.4 71.24±0.4 72.07±0.4 75.99±0.4 76.83±0.4
IER miniImageNet ResNet-12 65.37±0.4 67.31±0.4 69.30±0.4 70.01±0.4 72.48±0.4 72.85±0.4 76.70±0.4 77.54±0.4
Moco v2 ImageNet ResNet-50 65.47±0.5 68.63±0.4 71.05±0.4 71.49±0.4 74.46±0.4 74.57±0.4 79.70±0.4 79.98±0.4
Exemplar v2 ImageNet ResNet-50 67.70±0.5 70.07±0.4 72.55±0.4 72.93±0.4 75.26±0.4 76.83±0.4 80.22±0.4 81.75±0.4
DINO ImageNet ResNet-50 73.97±0.4 76.45±0.4 78.30±0.4 78.72±0.4 80.73±0.4 81.05±0.4 83.64±0.4 83.20±0.4
CE ImageNet ResNet-50 74.75±0.4 76.94±0.4 78.96±0.4 79.57±0.4 80.89±0.4 81.51±0.4 84.07±0.4 84.92±0.4
BiT-S ImageNet ResNet-50 75.44±0.4 77.86±0.4 79.84±0.4 79.97±0.4 81.79±0.4 81.91±0.4 84.84±0.3 86.40±0.3
CE ImageNet Swin-B 75.17±0.4 77.81±0.4 80.06±0.4 81.04±0.4 82.55±0.4 82.46±0.4 - 88.16±0.3
DeiT ImageNet ViT-B 75.82±0.4 78.34±0.4 80.62±0.4 81.68±0.4 82.80±0.3 83.13±0.4 84.22±0.3 87.62±0.3
CE ImageNet ViT-B 76.78±0.4 78.81±0.4 80.65±0.4 81.13±0.3 82.69±0.3 82.77±0.3 85.60±0.3 88.48±0.3
DINO ImageNet ViT-B 76.44±0.4 79.11±0.4 81.23±0.4 82.01±0.4 84.16±0.3 84.44±0.3 86.25±0.3 88.04±0.3
CLIP WebImageText ViT-B 78.06±0.4 81.20±0.4 83.04±0.3 83.22±0.3 84.11±0.3 84.20±0.3 87.66±0.3 90.26±0.3
Definition 3.1. We say two training algorithms A_a^train and A_b^train have the partial order A_a^train ⪯ A_b^train if, for all i ∈ [m_2],

Avg(A_a^train, A_i^adapt) − CI(A_a^train, A_i^adapt) < Avg(A_b^train, A_i^adapt) + CI(A_b^train, A_i^adapt).   (1)

This inequality holds when the values inside the confidence interval of A_b^train are all larger than, or at least overlap with, those of A_a^train when evaluated with every adaptation algorithm in M^adapt. It implies that, with considerable probability, the performance of A_b^train is no worse than that of A_a^train when combined with any possible adaptation algorithm A_i^adapt, so the ranking of the two training algorithms is not influenced by the adaptation algorithm with high probability. We use ⪯ instead of ≺ to indicate that the defined partial order is not strict, so it is valid for A_a^train ⪯ A_b^train and A_b^train ⪯ A_a^train to hold simultaneously, meaning that the two algorithms are comparable. The partial order inside M^adapt can be defined similarly by exchanging the roles of training and adaptation algorithms above. We are now ready to define what it means for two sets of algorithms to be uncorrelated.

Definition 3.2. M^train and M^adapt are uncorrelated if they are both ordered sets with respect to the partial order relation defined in Definition 3.1.¹

¹ The partial order in Definition 3.1 may not satisfy transitivity, i.e., if A_1^train ⪯ A_2^train and A_2^train ⪯ A_3^train, it is possible that A_1^train ⪯ A_3^train does not hold. However, in our experiments such cases do not occur, so we assume transitivity holds and we can always obtain an ordered set of algorithms from pairwise relations.
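The check behind Definitions 3.1 and 3.2 can be made mechanical. Below is a minimal sketch (not the authors' code) that takes per-pair mean accuracies and CI radii, in the style of Table 1, and tests whether a set of algorithms is ordered; the array names, shapes, and the example numbers are assumptions:

```python
import numpy as np

def precedes(avg, ci, a, b):
    """A_a^train ⪯ A_b^train iff inequality (1) holds for every adaptation algorithm i."""
    return np.all(avg[a] - ci[a] < avg[b] + ci[b])

def is_ordered(avg, ci):
    """Definition 3.2: sort the algorithms (here heuristically by mean accuracy)
    and check that every earlier one precedes every later one."""
    order = sorted(range(avg.shape[0]), key=lambda a: avg[a].mean())
    return all(precedes(avg, ci, order[i], order[j])
               for i in range(len(order)) for j in range(i + 1, len(order)))

# avg[a, i], ci[a, i]: mean accuracy and 95% CI radius of training algorithm a
# combined with adaptation algorithm i (hypothetical numbers, Table-1-shaped).
avg = np.array([[51.4, 60.9], [58.5, 65.1], [66.9, 74.9]])
ci = np.full_like(avg, 0.4)
print(is_ordered(avg, ci))       # training algorithms form an ordered set
print(is_ordered(avg.T, ci.T))   # the same check applied to adaptation algorithms
```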
Now, to see whether training and adaptation algorithms in few-shot classification are uncorrelated, we choose a wide range of training and adaptation algorithms from previous few-shot classification methods, with various training datasets and network architectures, to form M^train and M^adapt. We then conduct experiments on each pair of algorithms, one from M^train and another from M^adapt, to check whether the two sets are ordered sets.

Algorithms evaluated. The selected set of training algorithms M^train encompasses both meta-learning and non-meta-learning methods. For meta-learning methods, we evaluate MAML (Finn et al., 2017), ProtoNet (Snell et al., 2017), MatchingNet (Vinyals et al., 2016), MetaOpt (Lee et al., 2019), FEAT (Ye et al., 2020), DeepEMD (Zhang et al., 2020) and Meta-Baseline (Chen et al., 2021b). For non-meta-learning methods, we evaluate supervised algorithms including the Cross-Entropy baseline (Chen et al., 2019), COS (Luo et al., 2021), S2M2 (Mangla et al., 2020), IER (Rizve et al., 2021), BiT (Kolesnikov et al., 2020), Exemplar v2 (Zhao
et al., 2021) and DeiT (Touvron et al., 2021); unsupervised algorithms including MoCo-v2 (He et al., 2020) and DINO (Caron et al., 2021); and a multimodal pre-training algorithm, CLIP (Radford et al., 2021). M^adapt encompasses adaptation algorithms from meta-learning methods, including MatchingNet, MetaOpt, the Nearest Centroid Classifier (PN), and Finetune (MAML); adaptation algorithms from non-meta-learning methods, including Logistic Regression (Tian et al., 2020), URL (Li et al., 2021), and the Cosine Classifier (Chen et al., 2019); and test-time-only methods TSA (Li et al., 2022b) and eTT (Xu et al., 2022a).

Datasets. For the test dataset, we choose Meta-Dataset (Triantafillou et al., 2020), a dataset of datasets that covers 10 diverse vision datasets from different domains. We remove ImageNet from Meta-Dataset to avoid label leakage from training. For training, we choose three datasets of different scales: the train split of miniImageNet (Vinyals et al., 2016), which contains 38400 images from 64 classes; the train split of ImageNet (Deng et al., 2009), which contains more than 1 million images from 1000 classes; and a large-scale multimodal dataset, WebImageText (Radford et al., 2021), which contains 400 million (image, text) pairs. For completeness, we also show traditional miniImageNet-only experiments in Tables 4-5 in the Appendix.

Results. Table 1 shows the 5-way 5-shot performance of all pairwise combinations of algorithms from M^train and M^adapt. As seen, both the training algorithms and the adaptation algorithms form ordered sets according to Definition 3.2: when we fix any adaptation algorithm (a column in the table), the performance is monotonically increasing (or at least the confidence intervals intersect) as we move from top to bottom; similarly, the adaptation algorithms form an ordered set from left to right. 1-shot results are similar and are given in Table 3 in the Appendix. Since we have covered a broad range of representative few-shot classification algorithms, we can say that, with high probability, training and adaptation algorithms are uncorrelated in few-shot classification.

Remark. According to Definition 3.2, since M^train and M^adapt are uncorrelated, changing algorithms in either M^train or M^adapt along the sequences in the ordered sets always leads to performance improvement. Thus a simple greedy search on either side of the algorithms always leads to the global optimum. A direct consequence is that, if the two phases of algorithms are optimal on their own, their combination is optimal too. For example, from Table 1 we can see that, for 5-way 5-shot tasks on Meta-Dataset, CLIP and Finetune are the optimal training and adaptation algorithms respectively, and their combination is also the optimal combination.

This algorithm-disentangled property greatly simplifies the algorithm design process in few-shot classification. In the next two sections, we will, for the first time, individually analyze each of the two phases of algorithms while fixing the algorithms in the other phase.
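A tiny sketch of the greedy-search consequence of the remark above, with hypothetical values standing in for Table 1 (the numbers are illustrative, not the paper's):

```python
import numpy as np

# avg[a, i]: mean accuracy of training algorithm a with adaptation algorithm i.
avg = np.array([[51.4, 58.1, 60.9],
                [59.6, 64.1, 65.1],
                [80.9, 84.1, 84.9]])

# Greedy search on each side separately: rank training algorithms under one
# fixed adaptation algorithm, rank adaptation algorithms under one fixed
# training algorithm, then combine the two winners.
best_train = int(avg[:, 0].argmax())
best_adapt = int(avg[0].argmax())

# Because the two sets are uncorrelated (both ordered), the greedy choice
# coincides with the exhaustive search over all pairs.
assert (best_train, best_adapt) == tuple(np.unravel_index(avg.argmax(), avg.shape))
```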
4. Training Analysis

Throughout this section, we fix the adaptation algorithm to the Nearest Centroid Classifier and analyze several aspects of interest in the training process of few-shot classification. According to Section 3, the observations would, with high probability, not change if we changed the adaptation algorithm.

4.1. On the Scale of the Training Dataset

We are first interested in understanding how the scale of the training dataset influences few-shot classification performance. In few-shot classification, since the classes seen in training and adaptation do not need to overlap, in addition to increasing the number of samples per class, we can also increase the training dataset size by increasing the number of training classes. This is different from standard vision classification tasks, where studying the effect of increasing the number of samples per class is of more interest.
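To make the two scaling axes concrete, here is a minimal sketch of how one might build the two kinds of subsampled training sets from an ImageNet-style list of (image_path, class_id) pairs; the function, its arguments, and the 5-seed repetition are illustrative assumptions, not the authors' exact pipeline:

```python
import random
from collections import defaultdict

def subsample(samples, num_classes=None, per_class_fraction=None, seed=0):
    """samples: list of (image_path, class_id) pairs.
    Either keep all samples of a random subset of classes (class scaling),
    or keep a random fraction of samples within every class (sample scaling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append((path, cls))
    classes = sorted(by_class)
    if num_classes is not None:                      # scale the number of classes
        classes = rng.sample(classes, num_classes)
    subset = []
    for cls in classes:
        items = by_class[cls]
        if per_class_fraction is not None:           # scale samples per class
            items = rng.sample(items, max(1, int(len(items) * per_class_fraction)))
        subset.extend(items)
    return subset

# e.g., five repetitions of the "100 classes, all samples" setting, where
# `samples` is assumed to be the full ImageNet training list:
# splits = [subsample(samples, num_classes=100, seed=s) for s in range(5)]
```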
We conduct both types of scaling experiments on the training set of ImageNet, a standard vision dataset that is commonly used as a pre-training dataset for downstream tasks. We choose three representative training algorithms that cover the main types of algorithms: (1) Cross-Entropy (CE) training, the standard supervised training for image classification; (2) ProtoNet (PN), a widely used meta-learning algorithm; and (3) MoCo-v2, a strong unsupervised visual representation learning algorithm. For each dataset scale, we randomly select samples or classes 5 times, train a model with the specified training algorithm, and report the average performance and the standard deviation over the 5 training runs. The adaptation datasets we choose include 9 datasets from Meta-Dataset and the standard validation set of ImageNet. We plot the results of varying the number of samples per class in Figure 1 and the results of varying the number of classes in Figure 2. Both axes are plotted in log scale. We also report results evaluated on 9 additional datasets from the BSCD-FSL benchmark and DomainNet in Figures 8-9 in the Appendix. We make the following observations.
[Figure 1: 5-way 5-shot test error (%) vs. the proportion of data used for training, plotted for CE, PN, and MoCo; visible panels include ImageNet-val, Omniglot, Aircraft, Birds, and Textures.]

Figure 1. The effect of sample size per training class on few-shot classification performance. We use all 1000 classes of the training set of ImageNet for training. Both axes are logit-scaled. ImageNet-val means conducting few-shot classification on the original validation set of ImageNet. The average performance is obtained by averaging performance on 9 datasets excluding ImageNet-val. Best viewed in color.
[Figure 2: 5-way 5-shot test error vs. the number of classes used for training (10 to 1000), plotted for CE, PN, and MoCo.]

Figure 2. The effect of the number of training classes on few-shot classification performance. For each randomly selected class in ImageNet, we use all of its samples from the training set for training. Both axes are logit-scaled. Best viewed in color.
Neural scaling laws for training. Comparing Figures 1 and 2, we can see that for supervised models (CE and PN), increasing the number of classes is much more effective than increasing the number of samples per class (we give clearer comparisons in Figures 10-12 in the Appendix). The effect of increasing the number of samples per class plateaus quickly, while increasing the number of classes leads to very stable performance improvements across all scales. We notice that most performance curves of PN and CE in Figure 2 look like straight lines. In Figures 13-15 in the appendix we plot the linear fits, which verify this observation. In fact, the Pearson coefficient between the log of the number of training classes and the log of the average test error is −0.999 for CE and −0.995 for PN, showing strong evidence of linearity. This linearity indicates the existence of a form of neural scaling law in few-shot classification: test error falls off as a power law with the number of training classes, which is different from the neural scaling laws observed in other machine learning tasks (Hestness et al., 2017; Zhai et al., 2022; Kaplan et al., 2020), where test error falls off as a power law with the number of training samples per class. Such a difference reveals an intrinsic difference between few-shot classification and other tasks: while seeing more samples of a training class does help in identifying new samples of the same class, it may not help that much in identifying previously unseen classes in a new task. On the other hand, seeing more classes may help the model learn more potentially useful knowledge that can help distinguish new classes.
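The power-law claim corresponds to fitting a straight line in log-log space; below is a minimal sketch, where the error values are illustrative placeholders rather than the paper's measurements:

```python
import numpy as np

# Number of training classes and the corresponding average 5-way 5-shot test
# error (%); the errors are made-up numbers standing in for the measured ones.
num_classes = np.array([10, 20, 50, 100, 300, 500, 1000])
test_error = np.array([55.0, 48.0, 40.0, 35.0, 28.5, 26.0, 23.0])

log_c, log_e = np.log(num_classes), np.log(test_error)
slope, intercept = np.polyfit(log_c, log_e, 1)   # linear fit in log-log space
pearson_r = np.corrcoef(log_c, log_e)[0, 1]      # close to -1 indicates a power law

# error ≈ exp(intercept) * num_classes ** slope
print(f"power-law exponent: {slope:.3f}, Pearson r: {pearson_r:.3f}")
```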
Bigger is not necessarily better. On most evaluated datasets, test error decreases with more training samples/classes. However, on Omniglot and ISIC (shown in Figures 8-9), the error first goes down and then goes up, especially for supervised models. On the contrary, previous work (Snell et al., 2017) has shown that a simple PN model, both trained and evaluated on Omniglot (class-separated), can easily obtain near-zero error. This indicates that as the number of training samples/classes increases, there is a progressively larger mismatch between the knowledge learned from ImageNet and the knowledge needed to distinguish new classes in these two datasets. Thus training a large model on a big dataset so that it can solve every possible task well is not a realistic hope, unless the training dataset already contains all possible tasks. How to choose a part of the training dataset to train a model on, or how to select positive/useful knowledge from a learned model based on only a small amount of data in the specified adaptation scenario, is an important research direction in few-shot classification.

CE training scales better. As seen from both figures, PN and MoCo perform comparably to CE on small-scale training data, but as more training data comes in, the gap gradually widens. Considering that all algorithms have been fed the same amount of data during training, we can infer that CE training indeed scales better than PN and MoCo. This trend seems to be more pronounced for fine-grained datasets including Aircraft, Birds, Fungi, and VGG Flower. While this phenomenon needs further investigation, we speculate that it is because CE simultaneously differentiates all classes during training, which requires distinguishing all possible fine-grained classes. On the contrary, meta-learning algorithms like PN typically need to distinguish only a limited number of classes during each iteration, and self-supervised models like MoCo do not use labels and thus focus more on global information in images (Zhao et al., 2021), performing less well on fine-grained datasets. We leave it for future work to verify whether this conjecture holds generally.
[Figure 3: per-dataset 5-way 5-shot test error vs. Top-1 error on ImageNet for supervised CE models; panels include ImageNet-val, Omniglot, Aircraft, Birds, Textures, Quick Draw, Fungi, VGG Flower, Traffic Signs, and MSCOCO.]

Figure 3. For supervised models, ImageNet performance is not a good predictor of few-shot classification performance. Each point in a plot refers to a supervised CE model with a specific backbone architecture. Both axes are logit-scaled.
[Figure 4: per-dataset 5-way 5-shot test error vs. KNN Top-1 error on ImageNet for self-supervised models; bottom-row panels include Quick Draw, Fungi, VGG Flower, Traffic Signs, and MSCOCO, each annotated with its correlation coefficient r.]

Figure 4. For self-supervised models, ImageNet performance is a good predictor of few-shot classification performance. Each point in a plot refers to a self-supervised model with a specific training algorithm/architecture. Both axes are logit-scaled. The regression line and a 95% confidence interval are plotted in blue. “r” refers to the correlation coefficient between the two axes of data.
4.2. ImageNet Performance vs Few-shot Performance

We now fix the scale of the training dataset and investigate how changes in the training algorithm and network architecture influence few-shot performance. We pay particular attention to CE-trained and self-supervised models due to their superior performance shown in Table 1. Previous studies have revealed that the standard ImageNet performance of CE models trained on ImageNet is a strong predictor (with a linear relationship) of their performance on a range of vision tasks, including transfer learning (Kornblith et al., 2019), open-set recognition (Vaze et al., 2022) and domain generalization (Taori et al., 2020). We here ask whether this observation also holds for few-shot classification. If it does, we can improve few-shot classification on benchmarks that use ImageNet as the training dataset, like Meta-Dataset, just by waiting for state-of-the-art ImageNet models. For this, we test 36 pre-trained supervised CE models with different network architectures, including VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), MobileNet (Howard et al., 2017), RegNet (Radosavovic et al., 2020), DenseNet (Huang et al., 2017), ViT (Dosovitskiy et al., 2021), Swin Transformer (Liu et al., 2021b) and ConvNext (Liu et al., 2022). We also test 32 self-supervised ImageNet pre-trained models with different algorithms and architectures. The algorithms include MoCo-v2 (He et al., 2020), MoCo-v3 (Chen et al., 2021a), InstDisc (Wu et al., 2018), BYOL (Grill et al., 2020), SwAV (Caron et al., 2020), OBoW (Gidaris et al., 2021), SimSiam (Chen & He, 2021), Barlow Twins (Zbontar et al., 2021), DINO (Caron et al., 2021), MAE (He et al., 2022), iBOT (Zhou et al., 2022) and EsViT (Li et al., 2022a). We use KNN (Caron et al., 2021) to compute top-1 accuracy for these self-supervised models. We plot the results for supervised models in Figure 3 and for self-supervised models in Figure 4.
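As a reference point, here is a minimal sketch of a weighted k-NN top-1 evaluation on frozen features, in the spirit of the k-NN protocol of Caron et al. (2021); the feature arrays, k=20, and the temperature are assumptions (typical defaults), not necessarily the paper's exact settings:

```python
import numpy as np

def knn_top1(train_feats, train_labels, val_feats, val_labels, k=20, temp=0.07):
    """Weighted k-NN top-1 accuracy on L2-normalized features (a sketch)."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    val_feats = val_feats / np.linalg.norm(val_feats, axis=1, keepdims=True)
    num_classes = int(train_labels.max()) + 1
    correct = 0
    for f, y in zip(val_feats, val_labels):
        sims = train_feats @ f                    # cosine similarity to the feature bank
        idx = np.argpartition(-sims, k)[:k]       # indices of the k nearest neighbors
        weights = np.exp(sims[idx] / temp)
        votes = np.bincount(train_labels[idx], weights=weights, minlength=num_classes)
        correct += int(votes.argmax() == y)
    return correct / len(val_labels)
```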
Figure 5. Way and shot experiments for adaptation algorithms on ImageNet and Quick Draw. For the shot experiments, we fix the number of ways to 5 and show test error; for the way experiments, we fix the number of shots to 5 and show test accuracy. Both axes are logit-scaled. Best viewed in color.
Supervised ImageNet models overfit to ImageNet performance. For supervised models, we can observe from Figure 3 that on most datasets sufficiently different from ImageNet, such as Aircraft, Birds, Textures, Fungi, and VGG Flower, the test error of few-shot classification first decreases and then increases as ImageNet performance improves. The critical point is at about 23% Top-1 error on ImageNet, which corresponds to the best ImageNet performance in 2017 (e.g., DenseNet (Huang et al., 2017)). This indicates that recent years of improvement in image classification on ImageNet overfit to ImageNet performance when the downstream task is few-shot classification. We also observe that on datasets like Quick Draw, Traffic Signs, and Omniglot, there is no clear relationship between ImageNet performance and few-shot performance. Since supervised ImageNet performance is usually a strong predictor of other challenging vision tasks, few-shot classification stands out as a special task that needs a different and much better generalization ability. Identifying the reasons behind this difference may lead to a deeper understanding of both few-shot classification and visual representation learning.

ImageNet performance is a good predictor of few-shot performance for self-supervised models. Different from supervised models, for self-supervised models we observe a clear positive correlation between ImageNet performance and few-shot classification performance. The best self-supervised model obtains only 77% top-1 accuracy on ImageNet, but obtains more than 83% average few-shot performance, outperforming all evaluated supervised models. Thus self-supervised algorithms indeed generalize better, and the few-shot learning community should pay more attention to the progress of self-supervised learning.

5. Adaptation Analysis

In this section, we fix the training algorithm to the CE model trained on miniImageNet and analyze adaptation algorithms.

5.1. Way and Shot Analysis

Ways and shots are important variables during the adaptation phase of few-shot classification. For the first time, we analyze how the performance of different adaptation algorithms varies under different choices of ways and shots, with the training algorithm unchanged. For this experiment, we choose ImageNet and Quick Draw as the evaluated datasets because these two datasets have enough classes and images per class to sample from and are representative of in-domain and out-of-domain datasets, respectively. For ImageNet, we remove all classes that appear in miniImageNet.
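For concreteness, here is a minimal sketch (not the authors' implementation) of vanilla fine-tuning used as an adaptation algorithm: all backbone parameters plus a new linear head are updated on the support set. The step count, learning rate, and momentum value are placeholders that would in practice be chosen on validation tasks as described in Appendix B:

```python
import torch
import torch.nn.functional as F

def finetune_adapt(backbone, support_x, support_y, num_classes, steps=50, lr=1e-3):
    """Vanilla fine-tuning on the support set (a sketch with assumed hyperparameters)."""
    backbone.train()
    with torch.no_grad():
        feat_dim = backbone(support_x[:1]).shape[-1]
    head = torch.nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()),
                          lr=lr, momentum=0.9)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(head(backbone(support_x)), support_y)
        loss.backward()
        opt.step()
    backbone.eval()
    return lambda x: head(backbone(x)).argmax(dim=-1)   # the task-specific classifier g

# A frozen-feature alternative (e.g., Logistic Regression) would instead keep
# `backbone` fixed and optimize only `head` on the extracted support features.
```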
Neural scaling laws for adaptation. We notice that for Logistic Regression, Finetune, and MetaOpt, the performance curves approximate straight lines when varying the number of shots. This indicates that, with respect to the scale of the adaptation dataset, the classification error obeys the traditional neural scaling laws (different from what we found for the scale of the training dataset in Section 4.1). While this seems a reasonable phenomenon for Finetune, we found it surprising for Logistic Regression and MetaOpt, which are linear algorithms (for adaptation) built upon frozen features and are thus expected to reach performance saturation quickly. This reveals that even for small-scale models trained on miniImageNet, the learned features are still quite linearly separable for new tasks. However, their growth rates differ, indicating that they differ in their capability to scale.

Backbone adaptation is preferred for high-way, high-shot, or cross-domain tasks. As seen from Figure 5, while Finetune and the partial-finetune algorithm TSA do not significantly outperform other algorithms on 1-shot and 5-shot tasks on ImageNet, their advantages become greater when the number of shots or ways increases or when the task is conducted on Quick Draw. Thus we can infer that backbone adaptation is preferred when the data scale is large enough to avoid overfitting, or when the domain shift is so large that the learned feature space deforms on the new domain.

Query-support matching algorithms scale poorly. Query-support matching algorithms like TSA, MatchingNet, NCC, and URL obtain query predictions by comparing the similarities of query features with support features², different from other algorithms that learn a classifier from the support set.

² Although these algorithms all belong to metric-based algorithms, there exist other metric-based algorithms, like the Cosine Classifier, that are not query-support matching algorithms.
Table 2. The average support set size and the degree of task distribution shift of tasks from each dataset on three benchmarks. The metric measuring the degree of task distribution shift is defined by the deviation of feature importance; see Table 3 of Luo et al. (2022) for details.

Benchmark      Dataset        Mean support set size   Task distribution shift
miniImageNet   miniImageNet   5 or 25                 0.056
BSCD-FSL       ChestX         25, 100, or 250         0.186
BSCD-FSL       ISIC           25, 100, or 250         0.205
BSCD-FSL       ESAT           25, 100, or 250         0.153
BSCD-FSL       CropD          25, 100, or 250         0.101
Meta-Dataset   ILSVRC         374.5                   0.054
Meta-Dataset   Omniglot       88.5                    0.116
Meta-Dataset   Aircraft       337.6                   0.097
Meta-Dataset   Birds          316.0                   0.117
Meta-Dataset   Textures       287.3                   0.100
Meta-Dataset   QuickD         425.2                   0.106
Meta-Dataset   Fungi          361.9                   0.080
Meta-Dataset   Flower         292.5                   0.096
Meta-Dataset   Traffic Sign   421.2                   0.150
Meta-Dataset   COCO           416.1                   0.083
[Figure: adaptation results on ImageNet and Quick Draw; one of the plotted settings is labeled "consistent lr".]
6. Related Work

As an active research field, few-shot learning is considered to be a critical step towards building efficient and brain-like machines (Lake et al., 2017). Meta-learning (Thrun & Pratt, 1998; Schmidhuber, 1987; Naik & Mammone, 1992) was thought to be an ideal framework to approach this goal. Under this framework, methods can be roughly split into three branches: optimization-based methods, black-box methods, and metric-based methods. Optimization-based methods, mainly originating from MAML (Finn et al., 2017), learn the experience of how to optimize a neural network given a few training samples. Variants in this direction consider different parts of the optimization to meta-learn, including the model initialization point (Finn et al., 2017; Rusu et al., 2019; Rajeswaran et al., 2019; Zintgraf et al., 2019; Jamal & Qi, 2019; Lee et al., 2019), the optimization process (Ravi & Larochelle, 2017; Xu et al., 2020; Munkhdalai & Yu, 2017; Li et al., 2017), or both (Baik et al., 2020; Park & Oliva, 2019). Black-box methods (Santoro et al., 2016; Garnelo et al., 2018; Mishra et al., 2018; Requeima et al., 2019) directly model the learning process as a neural network without explicit inductive bias. Metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Yoon et al., 2019; Zhang et al., 2020) meta-learn a feature extractor that can produce a well-shaped feature space equipped with a pre-defined metric. In the context of few-shot image classification, most state-of-the-art meta-learning methods fall into the metric-based and optimization-based branches.

Recently, a number of non-meta-learning methods that utilize supervised (Chen et al., 2019; Tian et al., 2020; Dhillon et al., 2020; Triantafillou et al., 2021; Li et al., 2021) or unsupervised representation learning methods (Rizve et al., 2021; Doersch et al., 2020; Das et al., 2022; Hu et al., 2022; Xu et al., 2022a) to train a feature extractor have emerged to tackle few-shot image classification. In addition, a number of meta-learning methods (Chen et al., 2021b; Zhang et al., 2020; Hu et al., 2022; Ye et al., 2020) learn a model initialized from a pre-trained backbone (our experiments also consider such pretrain+meta-learning training algorithms, e.g., Meta-Baseline, DeepEMD and FEAT, so our conclusions hold generally). Since these methods do not strictly follow the meta-learning framework, the training algorithm does not necessarily have a relationship with the adaptation algorithm, and they are found to be simpler and more efficient than meta-learning methods while achieving better performance. Following this line, our paper further reveals that the training and adaptation phases in few-shot image classification are completely disentangled.

One relevant work (Sbai et al., 2020) also gives a detailed and comprehensive analysis of few-shot learning, especially of the training process. Our study complements this work in several ways: (1) the neural scaling laws that we found have not been discovered before, which demonstrates the importance of the number of classes in few-shot learning. Although Sbai et al. (2020) also discuss the significance of the number of classes from different perspectives, they draw no clear conclusions, so we complement their study; (2) we observed that larger datasets may lead to degraded performance on specific downstream datasets, both when increasing the number of classes and when increasing the number of samples per class. Such findings were not present in Sbai et al. (2020), and hence our study opens new avenues for future research by inspecting specific datasets; (3) there is no clear evidence in Sbai et al. (2020) that simple supervised training scales better than other types of training algorithms; (4) moreover, our paper evaluates 18 datasets, including datasets beyond ImageNet and CUB, which are the only ones studied in Sbai et al. (2020). Thus, our study provides a broader perspective and complements the analysis in Sbai et al. (2020).

7. Discussion

One lesson learned from our analysis is that training by only scaling models and datasets is not a one-fit-all solution. Either the design of the training objective should consider what the adaptation dataset is (instead of the adaptation algorithm), or the adaptation algorithm should select the training knowledge of interest. The former approach limits the trained model to a specific target domain, while the latter cannot easily be realized when only few labeled data are provided in the target task, which makes knowledge selection difficult or even impossible due to the bias of distribution estimation (Luo et al., 2022; Xu et al., 2022b). More effort should be put into aligning the knowledge acquired during training with the knowledge needed during adaptation. Although we have shown that vanilla Finetune performs very well, we believe that such a brute-force, non-selective model adaptation algorithm is not the final solution, and it has other drawbacks such as an extremely high adaptation cost, as shown in Appendix D. Viewed from another perspective, our work points to the plausibility of using few-shot classification as a tool to better understand some key aspects of general visual representation learning.

Acknowledgements

Special thanks to Qi Yong for providing indispensable spiritual support for this work. We would also like to thank all reviewers for constructive comments that helped us improve the paper. This work is supported by the National Key Research and Development Program of China (No. 2018AAA0102200) and the National Natural Science Foundation of China (Grant No. 62122018, No. U22A2097, No. 62020106008, No. 61872064).
Das, D., Yun, S., and Porikli, F. Confess: A framework for single source cross-domain few-shot learning. In ICLR, 2022.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
Dhillon, G. S., Chaudhari, P., Ravichandran, A., and Soatto, S. A baseline for few-shot image classification. In ICLR, 2020.
Doersch, C., Gupta, A., and Zisserman, A. Crosstransformers: spatially-aware few-shot transfer. In NeurIPS, 2020.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In CVPR, 2022.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Ali, M., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
Hu, S. X., Li, D., Stühmer, J., Kim, M., and Hospedales, T. M. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference. In CVPR, 2022.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017.
Jamal, M. A. and Qi, G.-J. Task agnostic meta-learning for few-shot learning. In CVPR, 2019.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big transfer (bit): General visual representation learning. In ECCV. Springer, 2020.
Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? In CVPR, 2019.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 2017.
Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization. In CVPR, 2019.
Li, C., Yang, J., Zhang, P., Gao, M., Xiao, B., Dai, X., Yuan, L., and Gao, J. Efficient self-supervised vision transformers for representation learning. In ICLR, 2022a.
Li, W., Liu, X., and Bilen, H. Universal representation learning from multiple domains for few-shot classification. In ICCV, 2021.
Li, W.-H., Liu, X., and Bilen, H. Cross-domain few-shot learning with task-specific adapters. In CVPR, 2022b.
Li, Z., Zhou, F., Chen, F., and Li, H. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
Liu, Y., Lee, J., Zhu, L., Chen, L., Shi, H., and Yang, Y. A multi-mode modulator for multi-domain few-shot classification. In ICCV, 2021a.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021b.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR, 2022.
Luo, X., Wei, L., Wen, L., Yang, J., Xie, L., Xu, Z., and Tian, Q. Rectifying the shortcut learning of background for few-shot learning. In NeurIPS, 2021.
Luo, X., Xu, J., and Xu, Z. Channel importance matters in few-shot image classification. In ICML, 2022.
Mangla, P., Kumari, N., Sinha, A., Singh, M., Krishnamurthy, B., and Balasubramanian, V. N. Charting the right manifold: Manifold mixup for few-shot learning. In WACV, 2020.
Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. In ICLR, 2018.
Munkhdalai, T. and Yu, H. Meta networks. In ICML, 2017.
Naik, D. K. and Mammone, R. J. Meta-neural networks that learn by learning. In IJCNN, 1992.
Oreshkin, B., Rodríguez López, P., and Lacoste, A. Tadam: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, 2018.
Park, E. and Oliva, J. B. Meta-curvature. In NeurIPS, 2019.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Patacchiola, M., Bronskill, J., Shysheya, A., Hofmann, K., Nowozin, S., and Turner, R. E. Contextual squeeze-and-excitation for efficient few-shot image classification. In NeurIPS, 2022.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. In CVPR, 2020.
Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In NeurIPS, 2019.
Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In ICLR, 2017.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., and Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
Requeima, J., Gordon, J., Bronskill, J., Nowozin, S., and Turner, R. E. Fast and flexible multi-task classification using conditional neural adaptive processes. In NeurIPS, 2019.
Rizve, M. N., Khan, S. H., Khan, F. S., and Shah, M. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In CVPR, 2021.
Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., and Hadsell, R. Meta-learning with latent embedding optimization. In ICLR, 2019.
Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In ICML, 2016.
Sbai, O., Couprie, C., and Aubry, M. Impact of base dataset design on few-shot image classification. In ECCV, 2020.
Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
Shysheya, A., Bronskill, J. F., Patacchiola, M., Nowozin, S., and Turner, R. E. Fit: Parameter efficient few-shot transfer learning for personalized and federated image classification. In ICLR, 2023.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In NeurIPS, 2017.
Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. In NeurIPS, 2020.
Thrun, S. and Pratt, L. Learning to learn: Introduction and overview. In Learning to Learn. Springer, 1998.
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In ECCV, 2020.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P., and Larochelle, H. Meta-dataset: A dataset of datasets for learning to learn from few examples. In ICLR, 2020.
Triantafillou, E., Larochelle, H., Zemel, R. S., and Dumoulin, V. Learning a universal template for few-shot dataset generalization. In ICML, 2021.
Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. Open-set recognition: A good closed-set classifier is all you need. In ICLR, 2022.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In NeurIPS, 2016.
Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
Xu, C., Yang, S., Wang, Y., Wang, Z., Fu, Y., and Xue, X. Exploring efficient few-shot adaptation for vision transformers. Transactions on Machine Learning Research, 2022a.
Xu, J., Ton, J.-F., Kim, H., Kosiorek, A., and Teh, Y. W. Metafun: Meta-learning with iterative functional updates. In ICML, 2020.
Xu, J., Luo, X., Pan, X., Pei, W., Li, Y., and Xu, Z. Alleviating the sample selection bias in few-shot learning by removing projection to the centroid. In NeurIPS, 2022b.
Ye, H.-J., Hu, H., Zhan, D.-C., and Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In CVPR, 2020.
Yoon, S. W., Seo, J., and Moon, J. Tapnet: Neural network augmented with task-adaptive projection for few-shot learning. In ICML, 2019.
Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A. S., Neumann, M., Dosovitskiy, A., et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In CVPR, 2022.
Zhang, C., Cai, Y., Lin, G., and Shen, C. Deepemd: Few-
shot image classification with differentiable earth mover’s
distance and structured classifiers. In CVPR, 2020.
Zhao, N., Wu, Z., Lau, R. W. H., and Lin, S. What makes
instance discrimination good for transfer learning? In
ICLR, 2021.
Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A.,
and Kong, T. ibot: Image bert pre-training with online
tokenizer. In ICLR, 2022.
Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K., and White-
son, S. Fast context adaptation via meta-learning. In
ICML, 2019.
Backbone adaptation in few-shot classification. MAML (Finn et al., 2017) is the first paper that uses Finetune as the
adaptation algorithm. However, all hyperparameters of Finetune are fixed before training and the backbone is weak, so
MAML does not perform well. Later, Tadam (Oreshkin et al., 2018) designs the first adaptation algorithm that partially
adapts the backbone by a black-box meta-learning method. The Baseline algorithm (Chen et al., 2019) is the first one that
uses a combination of non-meta-learning training and Finetune, and achieves surprisingly good results. Another baseline
method (Dhillon et al., 2020) utilizes simple supervised training and Finetune using initialization of the linear layer from
feature prototypes. CNAPs (Requeima et al., 2019) is a partial adaptation meta-learning algorithm that learns on multiple
datasets of Meta-Dataset and achieves SOTA results. After CNAPs came out, several works emerged that adapt the backbone
either by finetuning or partial backbone adaptation in the adaptation phase on Meta-Dataset (Triantafillou et al., 2021; Li
et al., 2022b; Xu et al., 2022a; Patacchiola et al., 2022; Liu et al., 2021a; Bateni et al., 2020; Shysheya et al., 2023). Our
paper reveals that the way the popularity of this line of research first declined and then increased is related to biases in the evaluation protocols of the benchmarks.
Connections of pre-trained models with downstream task performance. Kornblith et al. (2019) showed that ImageNet
performance has a linear relationship with the downstream transfer performance of classification tasks. Similarly, such a
linear relationship was discovered later in domain generalization (Taori et al., 2020) and open-set recognition (Vaze et al.,
2022). Abnar et al. (2022) questioned this result with large-scale experiments and showed that when increasing the upstream
accuracy to a very high value, the performance of downstream tasks can saturate. Our experiments on Omniglot and ISIC further support this observation, even when the training data is at a small scale. Recently, Entezari et al. (2023) found that the choice of the pre-training data source is essential for few-shot classification, but that its role decreases as more data is made
available for fine-tuning, which complements our study.
B. Details of Experiments
We reimplement some of the training algorithms in Table 1, including all PN models, all MAML models, CE models
with Conv-4 and ResNet-12, MetaOpt, Meta-Baseline, COS, and IER. For all other training algorithms, we use existing
checkpoints from official repositories or the Pytorch library (Paszke et al., 2019). All reimplemented models are trained
for 60 epochs using SGD with momentum and cosine learning rate decay without restart. The initial learning rates are all set to 0.1. The training batch size is 4 for meta-learning models and 256 for non-meta-learning models. The input image size
is 84×84 for Conv-4 and ResNet-12 models and 224×224 for other models. We use random crop and horizontal flip as
data augmentation at training. Since some models like PN are trained on normalized features, for a fair comparison, we
normalize the features of all models for the adaptation phase.
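A minimal PyTorch sketch of the optimization setup described above (SGD with momentum, cosine learning-rate decay without restart, initial learning rate 0.1); the backbone, the batch contents, the loss, the step count, and the momentum value of 0.9 are placeholders:

```python
import torch

backbone = torch.nn.Linear(512, 64)      # placeholder standing in for the real backbone
epochs, steps_per_epoch = 60, 100        # step count is illustrative

optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1, momentum=0.9)
# Cosine learning-rate decay without restart: a single cosine cycle over training.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)

for epoch in range(epochs):
    for _ in range(steps_per_epoch):
        x = torch.randn(256, 512)                     # dummy batch in place of real images
        loss = backbone(x).logsumexp(dim=-1).mean()   # dummy loss in place of the real objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```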
For experiments in Section 4.1, to make a fair comparison, we train CE and MoCo for 150 epochs and train PN using a
number of iterations that makes the number of seen samples equal. SGD with momentum and cosine learning rate decay without restart is used. The backbone used is ResNet-18. Learning rates are all set to 0.1. The training batch size is 4 for PN
and 256 for CE and MoCo. The input image size is 84×84. During training, we use random crop and horizontal flip as
data augmentation for CE and PN, and for MoCo, we use the same set of data augmentations as in the MoCo-v2 paper
(Chen et al., 2020). We repeat training 5 times with different sampling of data or classes for each experiment in Section
4.1. All pre-trained supervised models in Section 4.2 are from Pytorch library, and all self-supervised models are from
official repositories. To avoid memory issues, we only use 500000 image features of the training set of ImageNet for KNN
computation of Top-1 ImageNet accuracy for self-supervised models in Section 4.2.
All training algorithms that are evaluated in this paper, including meta-learning algorithms, set the learnable parameters as
Table 3. 5-way 1-shot performance of pairwise combinations of a variety of training and adaptation algorithms on Meta-Dataset. We
exclude MatchingNet from the adaptation algorithms because MatchingNet is equivalent to NCC when the number of shots is one.
Adaptation algorithm
Training algorithm Training dataset Architecture MetaOpt NCC LR URL CC TSA/eTT Finetune
PN miniImageNet Conv-4 38.50±0.5 38.69±0.5 38.23±0.4 38.81±0.4 38.64±0.5 41.27±0.5 42.60±0.5
MAML miniImageNet Conv-4 42.92±0.5 43.00±0.5 42.65±0.5 42.51±0.5 42.97±0.5 44.55±0.5 46.13±0.5
CE miniImageNet Conv-4 44.49±0.5 44.88±0.5 44.88±0.5 44.48±0.5 44.82±0.5 46.20±0.5 46.92±0.5
MatchingNet miniImageNet ResNet-12 45.00±0.5 45.23±0.5 45.24±0.5 44.89±0.5 45.40±0.5 46.18±0.5 48.53±0.5
MAML miniImageNet ResNet-12 46.09±0.5 46.09±0.5 45.81±0.5 45.88±0.5 46.07±0.5 51.95±0.5 53.71±0.5
PN miniImageNet ResNet-12 47.32±0.5 47.53±0.5 47.33±0.5 47.53±0.5 47.65±0.5 49.36±0.5 53.06±0.5
MetaOpt miniImageNet ResNet-12 49.16±0.5 49.52±0.5 49.53±0.5 49.42±0.5 49.73±0.5 52.01±0.5 53.90±0.5
CE miniImageNet ResNet-12 51.09±0.5 51.42±0.5 51.60±0.5 50.94±0.5 51.71±0.5 53.81±0.5 54.68±0.5
Meta-Baseline miniImageNet ResNet-12 51.24±0.5 51.56±0.5 51.67±0.5 51.23±0.5 51.77±0.5 53.87±0.5 54.54±0.5
COS miniImageNet ResNet-12 51.23±0.5 51.53±0.5 51.31±0.5 51.87±0.5 51.72±0.5 54.18±0.5 54.98±0.5
PN ImageNet ResNet-50 52.50±0.5 52.84±0.5 52.71±0.5 52.90±0.5 52.93±0.5 54.34±0.5 57.40±0.5
IER miniImageNet ResNet-12 53.31±0.5 53.63±0.5 53.82±0.5 53.24±0.5 53.98±0.5 56.32±0.5 56.98±0.5
Moco v2 ImageNet ResNet-50 54.89±0.5 55.38±0.5 55.64±0.5 55.77±0.5 55.70±0.5 58.13±0.5 59.99±0.5
DINO ImageNet ResNet-50 60.81±0.5 61.37±0.5 61.61±0.5 61.96±0.5 61.81±0.5 62.69±0.5 63.61±0.5
CE ImageNet ResNet-50 62.34±0.5 62.88±0.5 62.90±0.5 63.55±0.5 63.18±0.5 65.04±0.5 65.87±0.5
BiT-S ImageNet ResNet-50 62.41±0.5 62.95±0.5 63.15±0.5 63.40±0.5 63.40±0.5 65.02±0.5 67.05±0.5
CE ImageNet Swin-B 64.03±0.5 64.46±0.5 64.38±0.5 65.22±0.5 65.01±0.5 - 69.12±0.5
DeiT ImageNet ViT-B 64.20±0.5 64.62±0.5 64.43±0.5 65.31±0.5 65.11±0.5 66.25±0.5 69.12±0.5
DINO ImageNet ViT-B 64.86±0.5 65.36±0.5 65.31±0.5 66.05±0.5 65.91±0.5 67.26±0.5 67.89±0.5
CE ImageNet ViT-B 67.19±0.5 67.61±0.5 67.56±0.5 68.00±0.5 67.85±0.5 69.78±0.5 72.14±0.4
CLIP WebImageText ViT-B 67.95±0.5 68.68±0.5 69.10±0.5 69.85±0.5 68.85±0.5 70.42±0.5 74.96±0.5
the parameters of a feature extractor, and all adaptation algorithms do not have additional parameters that need to be obtained
from training. Thus adapting different training algorithms is as easy as adapting different feature extractors with different
adaptation algorithms. There exist other meta-learning algorithms (Oreshkin et al., 2018; Requeima et al., 2019; Patacchiola
et al., 2022; Ye et al., 2020; Doersch et al., 2020) that meta-learn additional parameters besides a feature extractor, so their
training/adaptation algorithms cannot be combined with other adaptation/training algorithms directly. Thus these algorithms
are not included in our experiments. One solution would be, for each such algorithm, to learn the same additional parameters while
freezing the backbone for every other trained model, and then we can compare all algorithms. We expect that after doing
this, the ranking of both training and adaptation algorithms will remain unchanged, and we leave it for future work to verify
this conjecture.
Throughout the main paper, for all adaptation algorithms that have hyperparameters, we grid search hyperparameters on the
validation dataset of miniImageNet and Meta-Dataset. For Traffic Signs which does not have a validation set, we use the
hyperparameters averaged over the optimal hyperparameters found for all other datasets. For the adaptation analysis experiments
in Section 5, we partition ImageNet and Quick Draw to have a 100-class validation set. The rest is used as the test set.
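A minimal sketch of this kind of grid search; the `evaluate_finetune` helper, the validation task list, and the grid values are assumptions, not the paper's actual search space:

```python
from itertools import product

def grid_search(evaluate, val_tasks, grid):
    """Return the hyperparameter setting with the best mean accuracy on
    validation tasks. `evaluate(val_tasks, **hp)` is an assumed helper."""
    best_hp, best_acc = None, float("-inf")
    for values in product(*grid.values()):
        hp = dict(zip(grid.keys(), values))
        acc = evaluate(val_tasks, **hp)
        if acc > best_acc:
            best_hp, best_acc = hp, acc
    return best_hp, best_acc

# e.g., for the Finetune adaptation algorithm (illustrative grid):
grid = {"lr": [1e-4, 5e-4, 1e-3, 5e-3], "steps": [25, 50, 100]}
# best_hp, _ = grid_search(evaluate_finetune, val_tasks, grid)
```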
Table 4. 5-way 5-shot performance of pairwise combinations of a variety of training and adaptation algorithms conducted on the
miniImageNet benchmark.
Adaptation algorithm
Training algorithm Architecture MatchingNet MetaOpt NCC LR URL CC TSA/eTT Finetune
MAML Conv-4 59.80±0.3 57.99±0.4 58.86±0.2 60.93±0.3 60.81±0.4 61.83±0.3 62.40±0.3 62.03±0.5
PN Conv-4 63.71±0.5 64.12±0.5 63.67±0.5 65.78±0.5 65.78±0.4 65.82±0.5 65.69±0.4 66.35±0.5
CE Conv-4 64.09±0.4 66.41±0.4 67.93±0.3 68.92±0.5 68.63±0.4 69.08±0.5 69.22±0.4 69.51±0.6
MatchingNet ResNet-12 69.48±0.3 69.71±0.3 69.75±0.6 70.92±0.4 70.86±0.4 71.00±0.4 71.15±0.2 72.31±0.4
MAML ResNet-12 70.27±0.3 68.37±0.6 70.09±0.4 71.94±0.4 71.33±0.3 72.10±0.5 75.70±0.5 76.18±0.3
PN ResNet-12 73.64±0.4 74.03±0.4 74.99±0.5 75.46±0.4 75.72±0.4 75.65±0.4 76.99±0.3 79.62±0.2
MetaOpt ResNet-12 75.21±0.4 76.51±0.5 77.69±0.4 78.09±0.5 78.36±0.4 78.43±0.4 80.55±0.2 81.44±0.2
CE ResNet-12 76.66±0.4 77.66±0.4 79.97±0.4 80.01±0.5 80.11±0.5 80.34±0.5 80.65±0.1 80.84±0.2
Meta-Baseline ResNet-12 77.06±0.4 77.59±0.4 79.85±0.2 80.54±0.5 80.52±0.4 80.77±0.4 80.97±0.3 81.42±0.2
COS ResNet-12 79.70±0.3 80.07±0.4 81.01±0.3 81.28±0.4 81.54±0.4 81.52±0.5 81.97±0.2 83.26±0.2
IER ResNet-12 80.37±0.3 81.33±0.3 82.80±0.3 83.71±0.3 83.83±0.3 84.04±0.3 83.53±0.3 84.02±0.2
Table 5. 5-way 1-shot performance of pairwise combinations of a variety of training and adaptation algorithms conducted on the
miniImageNet benchmark.
Adaptation algorithm
Training algorithm Architecture MetaOpt NCC LR URL CC TSA/eTT Finetune
MAML Conv-4 45.97±0.4 46.24±0.5 47.62±0.5 46.81±0.6 47.40±0.5 47.55±0.4 47.40±0.3
PN Conv-4 49.79±0.4 50.95±0.4 50.89±0.4 51.01±0.5 50.95±0.4 50.97±0.3 50.65±0.4
CE Conv-4 51.28±0.5 51.68±0.7 51.07±0.6 52.18±0.6 51.86±0.7 52.88±0.3 51.87±0.4
MatchingNet ResNet-12 54.52±0.5 54.96±0.5 54.85±0.5 54.84±0.6 54.89±0.5 55.27±0.4 55.52±0.4
MAML ResNet-12 56.43±0.4 55.80±0.7 57.14±0.7 56.06±0.8 57.03±0.7 57.86±0.4 58.49±0.2
PN ResNet-12 59.91±0.4 60.25±0.7 60.26±0.7 60.01±0.6 60.26±0.7 60.37±0.5 60.67±0.2
MetaOpt ResNet-12 60.40±0.3 60.82±0.5 60.40±0.5 61.79±0.5 60.91±0.5 61.89±0.4 62.58±0.4
CE ResNet-12 62.53±0.6 62.88±0.6 62.55±0.6 63.15±0.6 62.94±0.6 63.46±0.4 63.33±0.4
Meta-Baseline ResNet-12 63.99±0.2 64.92±0.7 64.84±0.7 64.55±0.7 64.91±0.7 64.92±0.3 64.97±0.2
COS ResNet-12 64.06±0.3 64.73±0.9 64.71±0.9 64.60±0.8 64.70±0.9 64.92±0.4 65.01±0.4
IER ResNet-12 65.05±0.1 66.45±0.6 66.17±0.6 66.68±0.6 66.48±0.6 66.25±0.3 65.86±0.4
few-shot performance may not improve if we use larger datasets. One difference, however, is that for MoCo, few-shot performance always improves on ISIC but not always on Omniglot. Also, MoCo performs well on ChestX while falling behind CE and PN on all other datasets. This shows that the knowledge learned by MoCo is somewhat different from that learned by PN and CE, and that this knowledge is useful for classification tasks on ISIC and ChestX. Previous work (Zhao et al., 2021) has shown that contrastive learning models like MoCo tend to learn more low-level visual features that are easier to transfer. We thus conjecture that low-level knowledge is more important for tasks from datasets such as ChestX and ISIC. This indicates that the design of the training objective should take the test dataset into account, so a one-size-fits-all solution may not exist. We also notice that all datasets in DomainNet except Quick Draw exhibit similar scaling patterns. In DomainNet, every dataset shares the same set of classes but differs in domain. We can thus infer that the test domain is not the key factor determining the required training knowledge; the choice of classes is. Luo et al. (2022) define a task distribution shift that measures the difference between tasks while taking classes into consideration. Whether task distribution shift is the key factor determining the required training knowledge for each test dataset is left for future work.
Figures 10-12 compare the two data scaling approaches for CE, PN, and MoCo. For CE and PN, increasing the number of classes is far more effective than increasing the number of samples per class. For MoCo, however, the two data scaling approaches yield similar performance at every data ratio used for training. We can thus infer that self-supervised algorithms, which do not use labels for supervision, are indeed not influenced by the number of classes. Because self-supervised algorithms do not rely on labels, they treat every sample equally, particularly contrastive learning methods, so the total number of training samples is the only variable of interest. While this makes self-supervised algorithms particularly suitable for learning on datasets with few classes, it also hinders them from scaling well to datasets with a large number of classes, e.g., ImageNet-21K or JFT (Sun et al., 2017).
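To make the two data scaling protocols concrete, the sketch below shows one plausible way to subsample a labeled training set either by sample ratio (fewer samples per class) or by class ratio (fewer classes); the helper is our own illustration rather than the paper's released code.

import random
from collections import defaultdict

def subsample(dataset, ratio, mode, seed=0):
    # dataset: iterable of (image, label) pairs.
    # mode="samples": keep all classes, keep `ratio` of the samples per class.
    # mode="classes": keep `ratio` of the classes, keep all of their samples.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append((x, y))
    if mode == "samples":
        subset = []
        for items in by_class.values():
            subset.extend(rng.sample(items, max(1, int(len(items) * ratio))))
    elif mode == "classes":
        kept = rng.sample(sorted(by_class), max(1, int(len(by_class) * ratio)))
        subset = [pair for c in kept for pair in by_class[c]]
    else:
        raise ValueError(mode)
    return subset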
Figures 13-15 plot the linear fit of few-shot performance against the number of training classes on logit-transformed axes. The linear relationship is evident in most cases (most correlation coefficients exceed 0.9), which verifies the neural scaling law with respect to the number of training classes.
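As a reference for how such a fit can be obtained, the sketch below fits a line to the logit of accuracy against the log of the class count and reports the correlation coefficient; this is our reading of “logit-transformed axes”, and the example data points are hypothetical.

import numpy as np

def fit_scaling(num_classes, accuracy):
    # Linear fit of logit(accuracy) vs log(#classes), plus the correlation
    # coefficient "r" between the two transformed axes.
    x = np.log(np.asarray(num_classes, dtype=float))
    p = np.asarray(accuracy, dtype=float)
    y = np.log(p / (1.0 - p))  # logit transform of accuracies in (0, 1)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Hypothetical accuracies as the number of training classes grows.
slope, intercept, r = fit_scaling([10, 20, 50, 100, 300, 500, 1000],
                                  [0.52, 0.58, 0.66, 0.71, 0.78, 0.80, 0.83])
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")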
[Figure 8: visible panel titles ChestX, Plant Disease, ESAT, ISIC, Infograph; y-axis: 5-way 5-shot test error (%); x-axis: proportion of data used for training; curves: CE, PN, MoCo.]
Figure 8. Results on the other 9 datasets from the BSCD-FSL benchmark and DomainNet, showing the effect of the sample size per training class on few-shot classification performance. The plot follows Figure 1.
[Figure 9: visible panel titles Painting, Real, Sketch, Clipart; y-axes: 5-way 5-shot test error (%) and average test error (%) over 9 datasets; x-axis: number of classes used for training; curves: CE, PN, MoCo.]
Figure 9. Results on the other 9 datasets from the BSCD-FSL benchmark and DomainNet, showing the effect of the number of training classes on few-shot classification performance. The plot follows Figure 2.
[Figure 10: panels ImageNet-val, Omniglot, Aircraft, Birds, Textures, Quick Draw, Fungi, VGG Flower, Traffic Signs, MSCOCO; y-axis: 5-way 5-shot test error (%); x-axis: proportion of data used for training (CE); curves: sample ratio, class ratio.]
Figure 10. Comparisons of the two data scaling approaches for CE: scaling with sample size per training class and scaling with the number
of training classes.
[Figure 11: visible panel titles Quick Draw, Fungi, VGG Flower, Traffic Signs, MSCOCO; y-axis: 5-way 5-shot test error (%); x-axis: proportion of data used for training (PN); curves: sample ratio, class ratio.]
Figure 11. Comparisons of the two data scaling approaches for PN.
set. In practical scenarios, few-shot learning usually requires a real-time response, so such a long wait for a single task is intolerable.
[Figure 12: y-axis: average test error (%) over 9 datasets; x-axis: proportion of data used for training (MoCo); curves: sample ratio, class ratio.]
Figure 12. Comparisons of the two data scaling approaches for MoCo.
[Figure 13: x-axis: number of classes used for training (CE).]
Figure 13. Linear fit of the few-shot performance of CE vs the number of training classes on logit-transformed axes. “r” denotes the correlation coefficient between the data on the two axes.
[Figure 14: visible panel titles Quick Draw, Fungi, VGG Flower, Traffic Signs, MSCOCO; y-axis: 5-way 5-shot test error (%); x-axis: number of classes used for training (PN).]
Figure 14. Linear fit of the few-shot performance of PN vs the number of training classes on logit-transformed axes. “r” denotes the correlation coefficient between the data on the two axes.
[Figure 15: visible panel titles Quick Draw, Fungi, VGG Flower, Traffic Signs, MSCOCO; y-axis: 5-way 5-shot test error (%).]
Figure 15. Linear fit of the few-shot performance of MoCo vs the number of training classes on logit-transformed axes. “r” denotes the correlation coefficient between the data on the two axes.
Table 6. Detailed results of supervised CE models in Figure 3. Bold/underline is the best/second best in each column.
Architecture ImageNet Top-1 Avg few-shot ImageNet-val Omniglot Aircraft Birds Textures Quick Draw Fungi VGG Flower Traffic Signs MSCOCO
ResNet-18 68.55 79.29 96.76 92.73 59.19 90.95 79.81 70.16 73.97 94.31 78.22 74.24
ResNet-34 72.50 79.18 97.66 92.76 58.65 91.71 81.57 68.57 73.54 93.80 76.27 75.77
ResNet-50 75.27 79.33 98.15 92.93 59.51 92.02 82.26 67.67 72.68 94.33 75.17 77.41
ResNet-101 76.74 79.89 98.46 92.98 60.10 92.90 81.97 69.10 74.09 94.54 75.50 77.84
ResNet-152 77.73 79.02 98.62 91.33 57.20 93.36 82.36 68.12 73.85 94.26 72.37 78.37
Swin-T 80.74 80.86 99.14 94.17 58.26 93.40 82.70 73.70 74.77 95.23 76.30 79.20
Swin-S 82.59 79.41 99.33 93.17 56.94 91.89 81.07 74.14 72.01 93.25 72.68 79.58
Swin-B 83.00 79.27 99.33 94.87 55.26 91.25 80.63 74.54 70.71 93.99 72.32 79.82
ViT-B 80.74 80.36 98.92 94.98 58.16 92.23 80.48 73.02 71.71 93.45 81.33 77.83
ViT-L 79.50 80.34 98.80 93.85 59.26 93.04 81.32 74.53 72.07 94.80 78.21 76.02
DenseNet-121 73.60 80.78 97.52 94.88 61.62 92.89 81.62 71.95 74.30 94.73 79.58 75.49
DenseNet-161 76.44 81.42 98.05 93.92 65.87 93.00 82.21 70.71 74.42 95.40 80.09 77.12
DenseNet-169 75.07 80.65 97.78 93.60 61.71 92.43 81.77 69.55 74.28 94.98 81.21 76.29
DenseNet-201 75.86 81.40 97.97 94.91 61.97 93.32 82.24 73.31 73.08 95.33 81.33 77.09
RegNetY-1.6GF 76.01 81.53 97.88 94.19 62.72 93.85 82.84 72.00 77.08 95.97 77.82 77.31
RegNetY-3.2GF 77.63 81.49 98.22 93.84 63.25 94.07 82.70 72.26 77.66 95.84 75.89 77.93
RegNetY-16GF 79.39 81.21 98.57 94.82 62.16 94.02 82.46 72.34 75.79 95.68 75.03 78.62
RegNetY-32GF 79.79 80.37 98.69 94.24 59.72 93.57 82.23 72.41 74.37 95.80 72.06 78.94
RegNetX-400MF 71.45 79.10 97.16 93.20 57.76 91.57 80.91 70.06 73.46 94.25 75.50 75.14
RegNetX-800MF 73.86 80.24 97.65 93.62 59.13 92.36 82.33 69.69 75.78 95.07 77.49 76.70
MobileNetV2 70.54 80.90 96.86 94.26 61.03 91.87 80.61 73.30 76.13 95.56 80.64 74.70
MobileNetV3-L 72.91 80.48 94.71 94.91 56.63 91.45 80.68 76.11 74.65 96.49 81.22 72.21
MobileNetV3-S 66.10 78.06 91.78 93.45 53.79 88.05 77.03 74.64 72.50 94.16 80.21 68.72
VGG-11 67.97 75.99 93.13 93.08 54.21 85.19 78.89 65.61 70.57 93.59 72.81 69.95
VGG-11-BN 69.54 77.90 93.99 94.14 58.48 87.45 81.01 64.46 73.28 94.83 76.76 70.66
VGG-13 68.93 76.78 93.96 93.98 54.94 87.16 79.71 66.61 70.64 93.29 73.91 70.78
VGG-13-BN 70.64 78.01 94.64 92.84 58.83 88.87 81.56 64.81 74.26 94.83 74.43 71.64
VGG-16 70.86 77.24 95.63 92.66 55.63 89.91 79.88 62.25 72.16 93.62 76.00 73.02
VGG-16-BN 72.68 78.56 96.33 91.65 60.85 91.32 81.84 62.08 74.45 93.82 76.70 74.33
VGG-19 71.41 77.76 96.25 94.96 57.42 90.78 79.52 64.16 71.08 91.43 76.43 74.03
VGG-19-BN 73.26 79.58 96.77 92.18 64.29 91.80 81.57 65.23 73.43 93.15 79.82 74.80
ConvNeXt-T 81.69 78.22 97.91 94.76 54.78 91.22 78.45 72.74 65.88 93.62 77.44 75.12
ConvNeXt-S 82.84 77.41 98.42 95.54 53.58 88.60 78.18 72.53 67.26 92.63 76.10 72.29
ConvNeXt-B 83.35 77.37 98.66 95.65 54.58 89.09 76.79 71.86 66.99 92.16 74.45 74.73
ConvNeXt-L 83.69 76.62 98.99 94.31 53.50 88.72 76.92 69.44 66.04 92.22 72.32 76.10
Table 7. Detailed results of self-supervised models in Figure 4. Bold/underline is the best/second best in each column.
Algorithm Architecture ImageNet Top-1 Avg few-shot ImageNet-val Omniglot Aircraft Birds Textures Quick Draw Fungi VGG Flower Traffic Signs MSCOCO
BYOL ResNet-50 62.20 77.91 92.72 92.96 52.99 80.78 83.81 73.34 70.77 96.25 81.04 69.30
SwAV ResNet-50 62.10 74.53 93.37 92.66 45.37 71.12 85.20 65.71 69.84 95.18 73.72 71.98
SwAV ResNet-50-x2 62.59 74.93 92.57 94.71 45.41 68.11 85.17 68.34 70.16 95.70 75.40 71.37
SwAV ResNet-50-x4 63.65 74.60 92.40 93.89 44.99 66.26 85.71 67.71 70.16 95.53 76.38 70.80
SwAV ResNet-50-x5 61.37 75.99 93.38 92.71 46.41 69.65 86.77 67.27 72.16 96.60 79.72 72.62
DINO ViT-S/8 76.94 83.33 98.16 96.61 61.20 95.33 85.93 73.48 80.06 98.10 80.50 78.71
DINO ViT-S/16 72.48 81.52 97.28 95.01 56.88 94.80 85.63 73.00 79.51 97.88 74.81 76.19
DINO ViT-B/16 74.15 81.39 97.91 95.77 50.72 92.39 86.15 73.58 79.77 98.28 78.25 77.63
DINO ViT-B/8 75.74 82.85 98.34 96.83 64.67 89.71 87.02 72.39 78.83 98.21 79.03 78.96
DINO ResNet-50 64.09 77.37 93.98 93.72 51.60 77.48 84.78 65.07 75.51 96.98 78.84 72.37
MoCo-v1 ResNet-50 41.27 67.67 87.98 88.05 41.44 61.77 77.96 61.06 61.69 89.39 62.64 65.01
MoCo-v2-200epoch ResNet-50 51.72 70.33 93.10 90.79 36.12 65.43 82.28 67.49 60.52 91.00 68.52 70.81
MoCo-v2 ResNet-50 59.19 71.24 94.70 89.73 34.38 70.32 84.03 66.13 61.74 91.92 70.78 72.10
MoCo-v3 ResNet-50 66.61 79.95 94.91 94.61 55.45 87.31 84.75 72.27 72.75 96.68 83.44 72.32
MoCo-v3 ViT-S 65.46 76.75 94.22 93.41 45.94 84.66 83.77 73.21 69.56 94.99 72.81 72.39
MoCo-v3 ViT-B 69.32 78.40 95.80 94.66 47.08 85.29 84.74 75.19 72.53 96.04 76.33 73.70
SimSiam ResNet-50 53.57 73.88 92.07 92.87 44.38 68.01 81.84 70.05 66.67 94.67 77.05 69.37
Barlow Twins ResNet-50 63.26 77.04 93.83 92.23 49.89 79.07 84.73 68.31 71.23 96.35 81.04 70.53
MAE ViT-B 20.66 46.77 39.94 93.45 26.89 35.54 33.04 72.07 30.66 52.6 41.64 35.01
MAE ViT-L 42.63 60.38 72.40 95.61 40.42 49.91 61.76 75.85 46.74 77.07 43.40 52.70
MAE ViT-H 38.50 61.43 72.32 95.36 40.96 50.97 63.64 75.11 48.91 80.02 44.64 53.27
IBOT Swin-T/7 73.61 81.26 97.74 97.05 52.37 88.36 85.40 77.16 77.05 97.46 79.16 77.37
IBOT Swin-T/14 74.50 81.79 97.97 96.65 51.67 93.21 85.62 76.86 79.64 97.75 76.90 77.83
IBOT ViT-S 73.12 81.25 97.54 95.67 53.97 93.91 85.32 73.77 78.23 97.66 75.82 76.86
IBOT ViT-B 75.28 80.16 98.04 95.8 47.01 91.57 85.21 73.78 76.57 97.81 75.63 78.02
IBOT ViT-L 76.37 78.59 98.27 96.18 45.60 84.78 84.02 76.27 72.93 97.18 70.92 79.46
EsViT ResNet-50 69.91 75.14 97.21 88.21 42.87 80.45 84.85 62.87 70.33 95.04 75.90 75.70
EsViT Swin-T 74.32 81.31 97.84 96.25 50.78 94.44 85.75 74.80 78.57 97.83 75.64 77.72
EsViT Swin-S 76.19 79.43 98.55 94.93 46.50 86.50 85.52 72.77 76.41 97.15 75.71 79.33
EsViT Swin-B 77.33 77.77 98.77 95.59 37.74 83.57 83.76 71.88 73.98 96.62 76.64 80.19
oBoW ResNet-50 59.09 70.93 93.79 92.85 37.98 68.85 78.86 67.91 62.93 89.45 67.09 72.49
InstDisc ResNet-50 38.13 66.85 84.70 87.18 43.25 60.72 74.23 63.84 61.34 89.54 59.42 62.14