A Good Feature Extractor Is All You Need For Weakly Supervised Learning in Histopathology
Georg Wölflein1,2,∗ Dyke Ferber2,3 Asier R. Meneghetti2 Omar S. M. El Nahhas2 Daniel Truhn4
Zunamys I. Carrero2 David J. Harrison1,5 Ognjen Arandjelović1 Jakob N. Kather2,3,6
1 University of St Andrews   2 EKFZ for Digital Health, TU Dresden   3 University of Heidelberg
4 University Hospital Aachen   5 Lothian NHS University Hospitals   6 University Hospital Dresden
Figure 1. Stain normalisation and image augmentations do not impact downstream performance. We empirically evaluate ten feature
extractors across nine weakly supervised pathology tasks, observing no benefit in employing stain normalisation (a) or augmentations (b,
c) before feature extraction. The best models (d), Lunit-DINO and CTransPath, are particularly robust, unlike ImageNet baselines (bold).
In computational pathology, stain normalisation has traditionally been a standard preprocessing step which was crucial in order to account for variations in scanners and haematoxylin and eosin (H&E) stains by adjusting WSIs to match a reference image [58, 63, 82]. Yet, with the shift from ImageNet CNNs to SSL models trained on vast and varied data from multiple centres, it is worth reconsidering its need. Beyond stain normalisation, image augmentations are a broad category of image-to-image transformations that may be applied during training, such as random flips, rotations, and colour transformations. Some augmentations, like rotation, are particularly well-suited for pathology due to the rotational invariance of micrographs [68]. SSL feature extractors that have been trained on a wide variety of images from multiple international sites might therefore extract diagnostically/prognostically relevant features irrespective of site- or scanner-specific traits. This leads to our primary research question: with SSL feature extractors trained on rich datasets, is there still a need for image augmentations and stain normalisation to improve the generalisability of weakly supervised learning models? Our study approaches this question in two ways:
1. We assess the latent space similarity between original patches and their stain-normalised/augmented counterparts in Sec. 3. Our analysis reveals that many augmentations induce only minor perturbations in the extracted features, especially compared to ImageNet backbones.
2. In the most comprehensive robustness evaluation of publicly available pathology SSL feature extractors to date, we compare over 6,000 trained models, both with and without normalisation/augmentation, across multiple externally validated slide-level tasks to determine whether the increased preprocessing effort holds merit in terms of downstream performance (Fig. 1 and Sec. 4).
Our analysis has implications for computational pathology practitioners and researchers alike, given the overhead incurred by employing image augmentations and stain normalisation in feature extraction pipelines.

Figure 2. Common setup for weakly supervised learning on WSIs. In the preprocessing stage, the input image is split into patches that undergo independent image augmentations a before feature extraction. The feature aggregator and classifier are trained jointly as a single neural network hθ ◦ gθ which, given the feature vectors as inputs, predicts the output y. For stain normalisation (shown here), the same a(·) is applied every time, though in general, the augmentation function may vary between patches and epochs.

1.1. Problem formulation

We consider the weakly supervised setting in which a model M must produce a prediction y ∈ Y from a bag of patches representing a WSI. It is computationally infeasible to parametrise M using a single deep neural network trained end-to-end. Instead, the common approach in the literature is a two-step process consisting of preprocessing (feature extraction) and training (aggregation and classification), outlined in Fig. 2. The preprocessing stage often entails stain normalisation, and may optionally include image augmentations as well.

We first consider the simple case with a predetermined augmentation function a : X → X that is applied independently to each patch xi to obtain the augmented patches x̂i = a(xi) for i = 1, 2, …, n. Then, we apply the feature extractor f : X → R^dx, which for each patch x̂i outputs a dx-dimensional feature vector zi = f(x̂i). Now, we have n feature vectors, z1, z2, …, zn, which are aggregated into a single vector z̄ ∈ R^dz (usually dx = dz) via an aggregation function gθ : R^(n×dx) → R^dz with learnable parameters θ. Finally, the aggregated feature vector z̄ passes through a classifier hθ : R^dz → Y to obtain the final prediction. In summary, we can express the process M : X^n → Y of obtaining a prediction y from a bag of patches {xi}_{i=1}^n as

    y = M({xi}_{i=1}^n) = hθ(gθ(z1, …, zn)),  where zi = f(a(xi)),

in which computing the zi constitutes the preprocessing stage, while gθ and hθ are learned during training.
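To make this formulation concrete, below is a minimal PyTorch sketch of the two-step pipeline. It is illustrative rather than the code used in our experiments: the mean-pooling choice for gθ and all class and variable names are ours.

```python
import torch
import torch.nn as nn

class MILClassifier(nn.Module):
    """h_theta composed with g_theta: aggregate a bag of patch features, then classify."""
    def __init__(self, d_x: int, d_z: int, n_classes: int):
        super().__init__()
        self.proj = nn.Linear(d_x, d_z)               # part of g_theta
        self.classifier = nn.Linear(d_z, n_classes)   # h_theta

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n, d_x) frozen patch features z_i = f(a(x_i))
        z_bar = self.proj(z).mean(dim=0)              # g_theta: mean pooling
        return self.classifier(z_bar).softmax(dim=-1) # prediction y

# Preprocessing happens once, offline: a frozen extractor f (e.g. one of
# the benchmarked models) maps every augmented patch to a d_x-dim vector.
z = torch.randn(500, 384)   # pretend the z_i = f(a(x_i)) are precomputed
model = MILClassifier(d_x=384, d_z=384, n_classes=2)
y = model(z)                # slide-level prediction
```

Only gθ and hθ receive gradients during training; f never does, which is what makes caching the patch features across epochs (and augmentations) worthwhile.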
In general, however, the augmentation applied to a patch may be drawn from a set of augmentations A, varying between patches and epochs. While just a small modification in terms of notation, this change incurs a significant increase in time and memory complexity of the preprocessing task by a factor of |A|, since augmentation and feature extraction must be performed for all possible augmentations ai ∈ A for every patch³. As a result of this overhead, practitioners must carefully choose which augmentations to apply, if any. We address this problem by assessing the performance benefit obtained by different augmentations on our benchmark tasks.

³ If the number of augmentations |A| is smaller than the number of training epochs, it is cheaper to pre-compute all augmentations before training. Otherwise, it is better to sample the augmentations for every patch and epoch before training, and just pre-compute for those combinations.

2. Related work

Weakly supervised WSI classification. Early work on WSI classification with slide-level labels employed CNNs such as ResNet [37] which were pretrained on ImageNet [26] and then fine-tuned on the classification task using slide-level labels as patch-level supervision [23, 47]. Recognising that this approach introduces excessive noise in the patch-level supervision to the detriment of the training process, later work [42, 89] reframed this task as an embedding-based multiple instance learning (MIL) problem [29]. In this line of work, a feature vector is extracted for every patch using a CNN (f in Fig. 2), and these feature vectors are aggregated and classified via a learnable pooling function and classifier (hθ ◦ gθ in Fig. 2). Initially, the entire network, including feature extraction, was trained end-to-end [42]. However, end-to-end training becomes intractable as MIL approaches scale to larger datasets, so more recent models operate on frozen features extracted using ImageNet pretrained models [56]. The frozen feature approach is now widely adopted for weakly supervised learning on WSIs, albeit with better feature extractors trained using SSL.

SSL in pathology. The goal of SSL is to learn useful representations for downstream tasks from unlabelled data. Unlike supervised learning, SSL leverages structures inherent in data through pretext tasks, without needing explicit labels. The development of SSL models is an active area of research, from which a variety of algorithms like contrastive learning [17, 20, 38], non-contrastive learning [35, 98] and clustering-based methods [10, 11] have emerged in recent years, each with unique advantages and challenges. These models have quickly found adoption in the pathology field, which is well-situated to benefit from SSL due to the availability of large datasets that lack patch-level labels. Indeed, SSL feature extractors pretrained on pathology data have been shown to outperform ImageNet pretrained models on downstream pathology tasks [8, 14, 44, 71, 74]. It is also not surprising that obtaining more diverse data (e.g. from different centres and countries) further improves generalisability [71]. In the last three years, a number of SSL models have been developed [2, 15, 16, 31, 44, 57, 86, 90–92] that were pretrained on large-scale multi-centre pathology datasets such as The Cancer Genome Atlas (TCGA) [93]. Wang et al. [90, 91] proposed CTransPath, a Swin Transformer [52] feature extractor trained using semantically-relevant contrastive learning (SRCL), a novel SSL technique based on MoCo v3 [20] that is specifically tailored to pathology. Previously, they had put forth RetCCL [92], a ResNet-50 model trained using an SSL technique they termed clustering-guided contrastive learning (CCL) based on MoCo [38]. Owkin [31] evaluated different ViT variants [49] using the iBOT framework [100], and later termed their best ViT-B variant "Phikon". Lunit [44] benchmarked various SSL techniques including Barlow Twins [98], SwAV [11], MoCo v2 [19], and DINO [12] for pathology by training them on TCGA. All of the aforementioned models are available publicly, and – with one exception⁴ – form the basis of our study. We refer the reader to Appendix B for a detailed overview of the evaluated feature extractors.

⁴ To save computational resources, we excluded Lunit's MoCo model because both CTransPath and RetCCL already employ MoCo.

This year, a number of pathology foundation models have emerged [2, 8, 16, 57, 86] that were trained on considerably larger datasets. Unfortunately, we could not include these in our study since their weights remain proprietary, yet we provide a more detailed account of them in Appendix B.1.

Stain normalisation. Different medical sites employ different microscopes, scanners, protocols, and dyes, resulting in variations in the appearance of WSIs. For over 20 years [58, 63, 66], stain normalisation has been commonplace in digital pathology workflows to account for these factors by adjusting colours to match a reference image. Classical techniques [58, 63, 82] achieve this by performing colour deconvolution, standardising stain intensity, and then transforming the colour space of the input images to that of a reference image. More recently, GAN-based approaches have been proposed to this end as well [60, 87, 97]. Boschman et al. [5] compared eight classical and GAN-based stain normalisation techniques, concluding that stain normalisation, especially the methods of Vahadane et al. [82] and Macenko et al. [58], can indeed bolster slide-level classification performance when validating on external datasets. However, their approach aggregated patch-level predictions through a simplistic majority vote and did not integrate SSL feature extractors. In contrast, we contend that with SSL feature extractors, stain normalisation becomes obsolete. To show this, we focus our analysis on Macenko normalisation [58], the technique most widely adopted in the literature [21, 30, 32, 70].

Image augmentations. As a common regularisation technique for neural network training in general [24], image augmentations have unsurprisingly found widespread adoption in histopathology as well [68]. In this field, the most popular augmentations include flipping, scaling, rotating, and colour alterations due to the nature of pathology slides [68], though a recent line of research introduces "stain augmentation" as a combination of stain normalisation and image augmentations to increase data diversity as well [59, 72, 79]. In this work, we study 26 image augmentations, focusing our analysis on those popular in pathology.
Robustness of feature extractors in pathology. Assessing the robustness and generalisation ability of deep learning models for pathology in the face of domain shift and out of distribution (OOD) data is an active area of research [33, 43, 69, 99] and an important undertaking, considering the stakes may be human life. Our work builds upon Lunit's aforementioned SSL benchmarking initiative [44], which involves training and evaluating four pathology-oriented SSL feature extractors; we have integrated three of these into our study⁴. Lunit's evaluation, however, is confined to patch classification and nuclei segmentation. While such tile-based tasks are scientifically interesting and the predominant means of evaluation in the literature [44, 77, 80], it has been suggested [8] that for evaluations to have greater clinical relevance, they should instead focus on slide-level tasks – predicting patient variables such as prognostic outcomes and biomarkers – and validate results on independent external cohorts. In response to this, we evaluate a total of six SSL feature extractors across nine slide-level classification targets (whose clinical utility we detail in Appendix A), and use external cohorts that were unseen during training (including both SSL pretraining and downstream training).

Similar to our work, Tellez et al. [80] explore the influence of stain normalisation and image augmentations on the generalisability of pathology models. However, their 2019 study predates SSL models trained on expansive pathology datasets akin to those employed in our evaluation; their analysis is limited to CNNs trained from scratch on narrow patch classification tasks. Springenberg et al. [77] empirically assess the robustness of CNNs and ViTs in pathology with and without self-supervised pretraining (CTransPath [91] and RetCCL [92]), but their evaluation, again, is confined to patch classification. Sikaroudi et al. [74] compare the OOD generalisability of pathology pretrained models (focusing on supervised and self-supervised models trained on natural images as well as a non-SSL pathology-specific model [64], the latter achieving the best results), but also only consider patch classification.

3. Effect on latent space

Figure 3. Latent space visualisations (t-SNE [83]) of features extracted with Lunit-DINO (left) vs. ImageNet baseline (right). Colours represent tissue classes [45]. Both feature extractors use the same architecture (ViT-S), but the left was trained on pathology images using SSL. Each dot represents a feature vector in latent space extracted from an unaltered image patch, and we draw a line from that dot to the corresponding stain-normalised version.

An ideal feature extractor for pathology extracts meaningful features from a patch. More specifically, it should:
1. be invariant to factors we deem unimportant, e.g. stain, orientation, etc.; and
2. vary with properties we are interested in, e.g. tissue type, cell type, and many other factors not known a priori.
For example, a good feature extractor will produce a similar embedding for a particular patch and its stain-normalised version (as we want the feature extractor to be invariant to this factor), but yield very different embeddings for two patches of different tissue classes (i.e. normal vs. tumour).

In this section, we study the effect of various augmentations on the latent space, beginning with stain normalisation. We employ the NCT-CRC-HE-100K dataset [45, 46], comprising 100,000 patches extracted from H&E images of colorectal cancer (CRC) without stain normalisation. This dataset includes patch-level labels of the tissue type which enables more fine-grained analysis and visualisation.

3.1. Stain normalisation

How similar are feature vectors extracted from image patches to those derived from their stain-normalised counterparts? We contend that simply looking at the average distance between original embeddings and their stain-normalised versions does not provide enough information to make claims about the quality of a feature extractor. To obtain a more nuanced view of how stain normalisation affects embeddings, we present a dimensionality-reduced latent space visualisation of Lunit's DINO feature extractor in Fig. 3. This feature extractor is highlighted due to its superior downstream performance (see Fig. 1d and analysis in Sec. 4.1). In our visualisation, each point corresponds to a feature vector, with a line connecting each original feature vector to its stain-normalised version. Notably, Lunit-DINO clusters tissue types in latent space, and the displacement of the feature vectors induced by stain normalisation is largely confined to these clusters. In contrast, a baseline extractor using the same ViT-S architecture [49] but trained via supervised learning on ImageNet demonstrates less effective clustering and exhibits a different pattern: some features move hardly at all while others make large jumps between clusters, as indicated by the longer inter-cluster lines in Fig. 3, right. In fact, this pattern is consistent across various feature extractors: those pretrained on pathology data are less prone to "jump" between tissue type clusters compared to their ImageNet-pretrained counterparts when undergoing stain normalisation, further detailed in Appendix D.
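As a concrete sketch of the measurement behind Fig. 3 and the boxplots that follow, the snippet below computes cosine displacements. It uses random stand-in arrays: in practice, z_orig and z_norm would hold each patch's embedding before and after Macenko normalisation, and labels the tissue types.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine distance 1 - cos(a_i, b_i)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - (a * b).sum(axis=1)

rng = np.random.default_rng(0)
z_orig = rng.normal(size=(1000, 384))   # embeddings of original patches
z_norm = rng.normal(size=(1000, 384))   # ... of stain-normalised patches
labels = rng.integers(0, 9, size=1000)  # tissue type of each patch

displacement = cosine_distance(z_orig, z_norm)  # original vs. stain norm

# Reference scales: distances between random patch pairs of the same
# (intra-class) and of differing (inter-class) tissue types.
i, j = rng.integers(0, 1000, size=(2, 10_000))
pair = cosine_distance(z_orig[i], z_orig[j])
intra = pair[labels[i] == labels[j]]
inter = pair[labels[i] != labels[j]]
print(np.median(displacement), np.median(intra), np.median(inter))
```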
Figure 4. Boxplot of cosine distances between patch embeddings and their stain-normalised versions, as well as between embeddings of randomly chosen patches of the same or of differing tissue types. Feature extractors are grouped by architecture (ImageNet baselines are bold). Whiskers represent 95% of the distances.

In Fig. 4, we compare the cosine distances of the embedding displacement caused by stain normalisation across all ten feature extractors. Despite the important difference in terms of intra-cluster vs. inter-cluster jumps identified in the latent space visualisation above, Lunit-DINO and ViT-S exhibit similar averages (cf. their medians in Fig. 4, blue). This observation highlights the importance of examining the distribution of distances, not merely their averages: the boxplot in Fig. 4 reflects this difference by the increased range of the whiskers of ViT-S compared to Lunit-DINO.

We note that an analysis that considers embedding displacement only from the perspective of stain normalisation is insufficient to make meaningful claims about feature extractor utility. For example, an extractor that maps all images to a single point in latent space would negate any embedding displacement induced by stain normalisation and prevent inter-cluster jumps, yet its features would be wholly useless to the downstream model. This observation leads us to also consider the second key criterion outlined at the beginning of this section: the ability of feature extractors to vary embeddings according to characteristics critical for downstream tasks. We select tissue type as a surrogate marker to investigate this second criterion. However, it is important to recognise as a limitation of this analysis that there are numerous other potentially significant characteristics that remain unidentified at this stage, and for which specific labels are unavailable. Nonetheless, we posit that feature vectors from similar tissue types (indicated in blue in Fig. 4) should be closer in latent space compared to those from different tissue types (shown in green). Upon examining the disparity between these distance measures, we find that the ImageNet baselines tend to lump all features more closely together, regardless of tissue type. In contrast, the SSL models show better differentiation, as indicated by a greater separation between the blue and green boxes in the boxplot. Furthermore, the extent to which patches of different tissue types are distanced in the latent space (green) also provides a useful scale for contextualising the original vs. stain-normalised distances (blue). These findings suggest that the choice of pretraining data influences the stability of feature vectors in the context of stain normalisation. More specifically, feature extractors that have seen diverse stains as part of their SSL pretraining can learn to become more robust to variations in stain, while still preserving variations in aspects relevant to downstream tasks, i.e. tissue type.

3.2. Image augmentations

Figure 5. Boxplot of embedding displacement induced by image augmentations for Lunit-DINO and ViT-S. Dashed lines represent the average distance between randomly selected patches (without augmentation), indicating how 'dispersed' the latent spaces are.

In principle, the methodology presented above is suitable to study how any transformation of the input patches, not just stain normalisation, manifests itself in latent space. Here, we consider 26 common image augmentations, for which we provide representative examples in Appendix D. For Lunit-DINO and the ViT-S baseline, we compare the magnitudes of the embedding displacement across augmentations in Fig. 5. We observe that Lunit-DINO's embeddings are more robust to image augmentations compared to the ImageNet baseline: for all augmentations except 'Cutout' [27] and 'warp perspective', the cosine distances tend to be smaller in Lunit-DINO. That is even though Lunit-DINO's embeddings are spread out more in latent space, i.e. the average distance between any two randomly selected non-augmented patches is greater, indicated by the dashed lines in Fig. 5. When normalising the distances by this average, Cutout remains as the only (minor) exception.

We observe that Lunit-DINO excels in terms of robustness to right-angle rotations and flips – a much desired property considering that WSIs, unlike natural images, lack a canonical orientation. In fact, in selecting augmentations for generating positive pairs for SSL pretraining of the Lunit feature extractors, Kang et al. [44] employed the aforementioned augmentations for this precise reason, incentivising rotated/flipped embeddings to be close in latent space. On the other hand, the ImageNet baseline is significantly less robust to such augmentations. Interestingly, it is more robust to horizontal flips than vertical flips, which may be explained by the fact that it was trained on natural images.
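To make the dashed-line normalisation in Fig. 5 concrete, here is a sketch with stand-in arrays, where z_aug would hold the embeddings of the augmented patches:

```python
import numpy as np

rng = np.random.default_rng(0)
z_orig = rng.normal(size=(1000, 384))  # embeddings of unaugmented patches
z_aug = rng.normal(size=(1000, 384))   # embeddings after one augmentation

def cos_dist(a, b):  # row-wise cosine distance, as in the previous sketch
    na = a / np.linalg.norm(a, axis=1, keepdims=True)
    nb = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - (na * nb).sum(axis=1)

# 'Dispersion' of the latent space: average distance between randomly
# selected non-augmented patches (the dashed lines in Fig. 5).
i, j = rng.integers(0, len(z_orig), size=(2, 10_000))
dispersion = cos_dist(z_orig[i], z_orig[j]).mean()

# Displacement induced by the augmentation, relative to that dispersion;
# comparing relative values avoids penalising a model merely for
# spreading its embeddings over a larger region of latent space.
relative_displacement = cos_dist(z_orig, z_aug) / dispersion
```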
Although Lunit-DINO is remarkably robust to rotating by angles that are multiples of 90°, non-right angles cause the greatest displacement in latent space aside from perspective warp (penultimate row in Fig. 5). To investigate this further, we visualise the latent space in Fig. 6. As expected, for 90° rotation (top row), Lunit-DINO's latent space remains largely unchanged, as opposed to ViT-S. However, for random angle rotations (middle row), we observe a high degree of chaotic jumps in both feature extractors, indicating neither is robust to this augmentation. We hypothesise that this is caused by the loss of pixels at the edges of the patches in off-angle rotations, and design an ablation study to investigate this phenomenon. To eliminate the black pixel problem, we perform a centre crop on the original and augmented patches in a manner that ensures there are no black pixels in any rotation. The corresponding latent space visualisations at the bottom of Fig. 6 confirm our assumption: Lunit-DINO's latent space remains unchanged whereas ViT-S's embeddings move significantly. Similar reasoning may explain the poor robustness regarding 'random affine', 'warp perspective', and 'Cutout' [27].

Figure 6. Visualisations of latent space transformations caused by rotation-based augmentations (rows) in Lunit-DINO (left) and ViT-S (right). Colours and lines are as explained in Fig. 3. Top row: 90° rotation. Middle: rotating by a random angle. Bottom (ablation): each line represents the transformation from (a) the embedding of the 1.5× zoomed version of a patch to (b) the embedding obtained by randomly rotating before the 1.5× zoom.

4. Impact on downstream performance

Motivated by the findings above – that some augmentations [...] experiments, we parametrise hθ(·) as a linear layer with softmax activation over the number of classes for the particular task.

Tasks and datasets. In selecting downstream tasks, we prioritise those with clinical utility and whose underlying variables are also available in adequately sized public datasets. Training on TCGA-BRCA [93] and testing on CPTAC-BRCA [50], we predict four breast cancer targets: subtype as well as the CDH1, TP53, and PIK3CA genetic mutations. Furthermore, we predict lymph node status in the CAMELYON17 breast cancer dataset [4] (which contains data from five centres – we used one of the centres for testing and the others for training). Finally, we predict four markers in colorectal cancer: MSI status as well as BRAF, KRAS, and SMAD4 genetic mutations (training on TCGA-CRC [93] and testing on CPTAC-COAD [84]). We elaborate on these variables, their clinical relevancy, and the underlying datasets in Appendix A.

The aforementioned choice of tasks and datasets uses external cohorts for testing, so that we can assess generalisability to unseen datasets. We were also diligent in ensuring no data leakage occurred between the SSL pretraining and downstream test datasets. Notably, given that all evaluated pathology feature extractors included TCGA in their pretraining, we deliberately chose other datasets for testing.

Training details. We train each model using the AdamW optimiser [55] with an initial learning rate of 0.001 which is decayed using cosine annealing [54] for up to 30 epochs, though training typically ends sooner due to our use of early stopping (when the validation loss fails to improve for ten consecutive epochs). For this, we allocate 80% of the training set for model training and 20% for validation. We conduct training with five distinct random seeds for the cartesian product of the ten feature extractors, nine tasks, three downstream models, and six preprocessing/augmentation setups (slidewise or patchwise stain normalisation, rotate/flip, all augmentations, or none), resulting in over 6,000 trained models.
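In PyTorch, this optimisation recipe corresponds roughly to the sketch below; the stand-in model and the dummy run_epoch are placeholders for the actual training loop.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(384, 2)               # stand-in for h_theta, g_theta
optimiser = AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimiser, T_max=30)  # cosine decay, <=30 epochs

def run_epoch(epoch: int) -> float:
    """Placeholder for one pass over the 80%/20% train/validation split."""
    return 1.0 / (epoch + 1) if epoch < 5 else 0.2   # dummy validation loss

best_val, epochs_without_improvement = float("inf"), 0
for epoch in range(30):
    val_loss = run_epoch(epoch)
    scheduler.step()
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:  # early stopping criterion
            break
```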
Feature extractor | BRCA subtype | BRCA CDH1 | BRCA TP53 | BRCA PIK3CA | BRCA LN status | CRC MSI | CRC KRAS | CRC BRAF | CRC SMAD4 | Average
Swin [52] 0.07 ± 0.02 0.17 ± 0.03 0.28 ± 0.02 0.07 ± 0.04 0.17 ± 0.08 0.18 ± 0.04 0.14 ± 0.04 0.14 ± 0.07 0.16 ± 0.05 0.15 ± 0.05
CTransPath [91] 0.00 ± 0.00 0.01 ± 0.01 0.01 ± 0.01 0.04 ± 0.03 0.06 ± 0.07 0.08 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.04 ± 0.03
ViT-B [49] 0.08 ± 0.04 0.11 ± 0.02 0.15 ± 0.03 0.07 ± 0.03 0.17 ± 0.06 0.15 ± 0.03 0.03 ± 0.04 0.18 ± 0.07 0.01 ± 0.01 0.11 ± 0.04
Phikon [31] 0.09 ± 0.02 0.09 ± 0.02 0.09 ± 0.03 0.09 ± 0.03 0.07 ± 0.06 0.06 ± 0.04 0.07 ± 0.04 0.07 ± 0.06 0.17 ± 0.08 0.09 ± 0.05
ViT-S [49] 0.13 ± 0.03 0.08 ± 0.03 0.14 ± 0.05 0.08 ± 0.05 0.19 ± 0.09 0.18 ± 0.04 0.06 ± 0.03 0.19 ± 0.04 0.08 ± 0.08 0.13 ± 0.05
Lunit-DINO [44] 0.08 ± 0.03 0.03 ± 0.03 0.03 ± 0.02 0.02 ± 0.03 0.07 ± 0.04 0.00 ± 0.00 0.06 ± 0.04 0.02 ± 0.02 0.02 ± 0.02 0.04 ± 0.03
ResNet-50 [37] 0.15 ± 0.03 0.09 ± 0.04 0.11 ± 0.03 0.01 ± 0.02 0.18 ± 0.08 0.22 ± 0.04 0.11 ± 0.03 0.23 ± 0.07 0.21 ± 0.09 0.15 ± 0.05
RetCCL [92] 0.07 ± 0.03 0.04 ± 0.02 0.04 ± 0.03 0.05 ± 0.03 0.07 ± 0.06 0.08 ± 0.03 0.03 ± 0.02 0.14 ± 0.03 0.06 ± 0.03 0.06 ± 0.03
Lunit-BT [44] 0.13 ± 0.03 0.06 ± 0.04 0.02 ± 0.01 0.13 ± 0.04 0.34 ± 0.15 0.28 ± 0.13 0.03 ± 0.04 0.35 ± 0.13 0.25 ± 0.03 0.18 ± 0.08
Lunit-SwAV [44] 0.06 ± 0.02 0.06 ± 0.03 0.06 ± 0.02 0.13 ± 0.06 0.07 ± 0.05 0.10 ± 0.03 0.13 ± 0.06 0.07 ± 0.07 0.14 ± 0.08 0.09 ± 0.05
Table 1. Comparative evaluation of feature extractors. This table presents the normalised differential AUROC scores (lower is better) for
all feature extractors, across the evaluated targets using the AttMIL [42] aggregation model. The scores reflect the expected decrease in
test AUROC when selecting a given feature extractor relative to the best-performing one for each task-model combination (see Sec. 4.1).
The training and validation splits are kept fixed per-task across the seeds for all tasks except for lymph node status classification. This latter task uses the CAMELYON17 dataset, allowing us to perform leave-one-hospital-out cross-validation with a different random seed for each of the five hospitals. For the experiments involving augmentations, we apply these augmentations only on the images of the training datasets, never the test datasets (except for the stain normalisation experiments, where we ensure the same normalisation is applied to training and test datasets). We perform feature extraction once before training, caching for every patch in every dataset its original feature vector as well as the feature vectors of all 27 augmented versions of that patch, including stain normalisation. More details are provided in Appendix F.2.

4.1. Lunit-DINO and CTransPath extract the most useful features

Having trained a large number of downstream models based on ten feature extractors across a diverse set of tasks, we are in a position to identify the most effective feature extractor overall. We present these findings first and focus our later discussion on these feature extractors in particular.

First, let us consider how to determine the best feature extractor for a given task and downstream aggregator (such as predicting BRCA CDH1 with AttMIL aggregation). For any such task-model pair, we trained 50 models – spanning the ten feature extractors across five random seeds. We define a 'trial' as one particular configuration where each feature extractor is paired with a random seed, leading to 5^10 (≈ 10 million) unique trials. Within each trial, we evaluate the feature extractors based on the difference between their test area under the receiver operating characteristic curve (AUROC) and the highest test AUROC observed, thus assigning a score of zero to the top performer. By calculating the mean of these scores across all 5^10 trials, we derive the 'normalised differential AUROC score' – a measure that captures the relative efficacy of the feature extractors and allows fair comparisons across tasks of varying difficulty. The outcomes of this analysis, when considering the downstream AttMIL model and no augmentations, are detailed for all tasks individually in Tab. 1 and averaged across tasks in Fig. 1d. Notably, Lunit-DINO and CTransPath are tied in achieving the best task-averaged performance. Indeed, they consistently perform best, regardless of the downstream aggregation model and the type of input augmentation, as we show in the extended data table in Appendix E. Moreover, we find the ImageNet baselines perform worse than the pathology models (with the exception of Lunit-BT which performs very poorly indeed), which is in line with many previous works [8, 14, 22, 25, 44, 74].

4.2. Stain normalisation does not impact downstream performance

We quantify the effect of stain normalisation on downstream model performance by determining the expected difference in test AUROC between models trained with stain normalisation vs. without. Given a feature extractor and downstream aggregation model, e.g. Lunit-DINO with AttMIL, we must compare 45 models trained with stain normalisation (nine tasks times five random seeds) with another 45 models trained without stain normalisation. To estimate the difference in AUROC, we perform bootstrapping. For each of the 45 task-seed pairs, we generate 25 random resamples of the respective test datasets with replacement, totalling 45 × 25 = 1,125 bootstraps. Since each bootstrap is associated with a particular task-seed combination, it corresponds to two trained models: one trained with stain normalisation and one without. We deploy both models on the given bootstrap, computing the difference in AUROC. Repeating for all bootstraps, we obtain a distribution of 1,125 AUROC differences which we present as a boxplot in Fig. 1a, with a separate box for every feature extractor (we focus on the AttMIL [42] aggregation model because it is the most widely used, but provide analogous plots for the other two in Appendix E). We find no clear AUROC difference between the two groups, for any feature extractor: all 95% confidence intervals (and interquartile ranges) include zero. Surprisingly, this holds even for ImageNet extractors, whose latent spaces we previously identified as more susceptible to larger displacements due to stain normalisation.

Slidewise versus patchwise normalisation. While in Fig. 1a we perform stain normalisation on a per-slide basis, a more computationally efficient⁵ alternative is normalising each patch individually. [...]

⁵ Macenko normalisation [58] requires an eigenvalue decomposition across all pixels in the image, scaling cubically in the number of pixels. For a slide with n patches of k pixels each, the complexity is in O(n³k³) for slidewise normalisation, compared to O(nk³) when normalising patchwise.
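The bootstrap estimate of Sec. 4.2 can be sketched as follows for a single task-seed pair; the function and variable names are illustrative, with y_true the test labels and p_with/p_without the predicted scores of the two corresponding models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_differences(y_true, p_with, p_without, n_boot=25, seed=0):
    """AUROC(with stain norm.) - AUROC(without) on resampled test sets."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUROC needs both classes
            continue
        diffs.append(roc_auc_score(y_true[idx], p_with[idx])
                     - roc_auc_score(y_true[idx], p_without[idx]))
    return np.array(diffs)

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)            # test labels (one task-seed)
p_with, p_without = rng.random(200), rng.random(200)
diffs = bootstrap_auroc_differences(y_true, p_with, p_without)
# Pooling such diffs over all 45 task-seed pairs yields the 1,125
# AUROC differences summarised per feature extractor in Fig. 1a.
```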
We employ the technique from Sec. 4.1, but instead of keeping fixed the downstream model and determining the best feature extractor, we choose a feature extractor and vary the downstream aggregation model. As shown in Fig. 7, AttMIL performs best, closely followed by the two-layer transformer, and finally mean average pooling, but we note the differences are small and exhibit high variance.

Figure 7. AUROC deterioration vs. best for the three downstream aggregation models (AttMIL, Transformer, mean pooling).
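For reference, AttMIL denotes the attention-based MIL pooling of Ilse et al. [42]; a compact sketch of its (non-gated) variant as the aggregation function gθ follows.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-based MIL pooling [42]: a weighted average of patch
    features, with the weights predicted from the features themselves."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n, d) bag of patch features
        a = torch.softmax(self.attention(z), dim=0)  # (n, 1) patch weights
        return (a * z).sum(dim=0)                    # (d,) slide-level feature

pool = AttentionMILPooling(d=384)
z_bar = pool(torch.randn(500, 384))  # aggregate a bag of 500 patch features
```

Mean pooling corresponds to replacing the learned weights with uniform ones.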
direction for future research into the development of pathol- eosin-stained pathology images. J. Pathol., 256(1):15–24,
ogy feature extractors and foundation models [2, 8, 16, 57, 2022. 3
86]: the importance of not only scaling size and diversity of [6] Bruno Buecher, Wulfran Cacheux, Etienne Rouleau, Bar-
pretraining datasets, but also tailoring SSL methods to the bara Dieumegard, Emmanuel Mitry, and Astrid Lièvre.
pathology domain, in order to effectively leverage this data. Role of microsatellite instability in the management of col-
Looking ahead, we aim to investigate whether the inef- orectal cancers. Dig. Liver Dis., 45(6):441–449, 2013. 1
fectiveness of augmentations persists in limited-data scenar- [7] Gabriele Campanella, Matthew G Hanna, Luke Genes-
ios, and how slide magnification impacts extracted features. law, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J
Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and
Acknowledgements: GW is supported by Lothian NHS. This Thomas J Fuchs. Clinical-grade computational pathology
project received funding from the European Union’s Horizon using weakly supervised deep learning on whole slide im-
2020 research and innovation programme under Grant Agreement ages. Nat. Med., 25(8):1301–1309, 2019. 1
No. 101017453 as part of the KATY project. This work is sup-
[8] Gabriele Campanella, Ricky Kwan, Eugene Fluder, Jen-
ported in part by the Industrial Centre for AI Research in Digital
nifer Zeng, Aryeh Stock, Brandon Veremis, Alexan-
Diagnostics (iCAIRD) which is funded by Innovate UK on behalf
dros D Polydorides, Cyrus Hedvat, Adam Schoenfeld,
of UK Research and Innovation (UKRI) (project number 104690).
Chad Vanderbilt, Patricia Kovatch, Carlos Cordon-Cardo,
and Thomas J Fuchs. Computational pathology at health
system scale – Self-Supervised foundation models from
References three billion images. 2023. 1, 3, 4, 7, 8, 9, 2
[9] F Cardoso, S Kyriakides, S Ohno, F Penault-Llorca, P
[1] Fabrice André, Eva Ciruelos, Gabor Rubovszky, Mario Poortmans, I T Rubio, S Zackrisson, E Senkus, and ESMO
Campone, Sibylle Loibl, Hope S Rugo, Hiroji Iwata, Pier- Guidelines Committee. Electronic address: clinicalguide-
franco Conte, Ingrid A Mayer, Bella Kaufman, Toshinari [email protected]. Early breast cancer: ESMO clinical prac-
Yamashita, Yen-Shen Lu, Kenichi Inoue, Masato Taka- tice guidelines for diagnosis, treatment and follow-up†.
hashi, Zsuzsanna Pápai, Anne-Sophie Longin, David Mills, Ann. Oncol., 30(8):1194–1220, 2019. 1
Celine Wilke, Samit Hirawat, and Dejan Juric. Alpelisib for
[10] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and
PIK3CA-Mutated, hormone Receptor–Positive advanced
Matthijs Douze. Deep clustering for unsupervised learning
breast cancer. N. Engl. J. Med., 380(20):1929–1940, 2019.
of visual features. In Proceedings of the European confer-
1
ence on computer vision (ECCV), pages 132–149, 2018. 3
[2] Shekoofeh Azizi, Laura Culp, Jan Freyberg, Basil Mustafa, [11] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal,
Sebastien Baur, Simon Kornblith, Ting Chen, Nenad Toma- Piotr Bojanowski, and Armand Joulin. Unsupervised learn-
sev, Jovana Mitrović, Patricia Strachan, S Sara Mahdavi, ing of visual features by contrasting cluster assignments.
Ellery Wulczyn, Boris Babenko, Megan Walker, Aaron Adv. Neural Inf. Process. Syst., 33:9912–9924, 2020. 3, 2
Loh, Po-Hsuan Cameron Chen, Yuan Liu, Pinal Bav- [12] Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Je-
ishi, Scott Mayer McKinney, Jim Winkens, Abhijit Guha gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.
Roy, Zach Beaver, Fiona Ryan, Justin Krogue, Mozziyar Emerging properties in self-supervised vision transformers.
Etemadi, Umesh Telang, Yun Liu, Lily Peng, Greg S Cor- In 2021 IEEE/CVF International Conference on Computer
rado, Dale R Webster, David Fleet, Geoffrey Hinton, Neil Vision (ICCV). IEEE, 2021. 3, 2
Houlsby, Alan Karthikesalingam, Mohammad Norouzi,
[13] M Chalabi, Y L Verschoor, J Van den Berg, and others.
and Vivek Natarajan. Robust and data-efficient general-
LBA7 neoadjuvant immune checkpoint inhibition in lo-
ization of self-supervised machine learning for diagnostic
cally advanced MMR-deficient colon cancer: The NICHE-
imaging. Nat Biomed Eng, 7(6):756–779, 2023. 1, 3, 9, 2
2 study. Annals of, 2022. 1
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. [14] Richard J Chen and Rahul G Krishnan. Self-Supervised
Layer normalization. arXiv preprint arXiv:1607.06450, vision transformers learn visual concepts in histopathol-
2016. 5 ogy. Learning Meaningful Representations of Life, NeurIPS
[4] Peter Bandi, Oscar Geessink, Quirine Manson, Mar- 2021, 2021. 1, 3, 7, 8
cory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, [15] Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y
Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Chen, Andrew D Trister, Rahul G Krishnan, and Faisal
Paeng, Aoxiao Zhong, et al. From detection of individ- Mahmood. Scaling vision transformers to gigapixel im-
ual metastases to classification of lymph node status at the ages via hierarchical self-supervised learning. In 2022
patient level: the camelyon17 challenge. IEEE transactions IEEE/CVF Conference on Computer Vision and Pattern
on medical imaging, 38(2):550–560, 2018. 6, 1 Recognition (CVPR), pages 16144–16155. IEEE, 2022. 1,
[5] Jeffrey Boschman, Hossein Farahani, Amirali Darbandsari, 3
Pouya Ahmadvand, Ashley Van Spankeren, David Farnell, [16] Richard J Chen, Tong Ding, Ming Y Lu, Drew F K
Adrian B Levine, Julia R Naso, Andrew Churg, Steven Jm Williamson, Guillaume Jaume, Bowen Chen, Andrew
Jones, Stephen Yip, Martin Köbel, David G Huntsman, Zhang, Daniel Shao, Andrew H Song, Muhammad Shaban,
C Blake Gilks, and Ali Bashashati. The utility of color Mane Williams, Anurag Vaidya, Sharifa Sahai, Lukas Old-
normalization for AI-based diagnosis of hematoxylin and enburg, Luca L Weishaupt, Judy J Wang, Walt Williams,
9
Long Phi Le, Georg Gerber, and Faisal Mahmood. A [30] Omar S M El Nahhas, Chiara M L Loeffler, Zunamys I
General-Purpose Self-Supervised model for computational Carrero, Marko van Treeck, Fiona R Kolbinger, Kather-
pathology. 2023. 1, 3, 9, 2 ine J Hewitt, Hannah S Muti, Mara Graziani, Qinghe
[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- Zeng, Julien Calderaro, Nadina Ortiz-Brüchle, Tanwei
offrey Hinton. A simple framework for contrastive learning Yuan, Michael Hoffmeister, Hermann Brenner, Alexan-
of visual representations. In Proceedings of the 37th In- der Brobeil, Jorge S Reis-Filho, and Jakob Nikolas
ternational Conference on Machine Learning, pages 1597– Kather. Regression-based Deep-Learning predicts molecu-
1607. PMLR, 2020. 3, 2 lar biomarkers from pathology slides. arXiv preprint arXiv,
[18] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad 2023. 1, 3
Norouzi, and Geoffrey E Hinton. Big Self-Supervised [31] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul
models are strong Semi-Supervised learners. In Advances Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and
in Neural Information Processing Systems, pages 22243– Jean-Baptiste Schiratti. Scaling self-supervised learning for
22255. Curran Associates, Inc., 2020. 2 histopathology with masked image modeling. 2023. 1, 3,
[19] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 7, 2
Improved baselines with momentum contrastive learning. [32] Narmin Ghaffari Laleh, Hannah Sophie Muti, Chiara
2020. 3 Maria Lavinia Loeffler, Amelie Echle, Oliver Lester Sal-
[20] Xinlei Chen, Saining Xie, and Kaiming He. An empiri- danha, Faisal Mahmood, Ming Y Lu, Christian Trautwein,
cal study of training self-supervised vision transformers. In Rupert Langer, Bastian Dislich, Roman D Buelow,
2021 IEEE/CVF International Conference on Computer Vi- Heike Irmgard Grabsch, Hermann Brenner, Jenny Chang-
sion (ICCV). IEEE, 2021. 3, 2 Claude, Elizabeth Alwers, Titus J Brinker, Firas Khader,
[21] Philip Chikontwe, Hyun Jung Sung, Jaehoon Jeong, Mee- Daniel Truhn, Nadine T Gaisa, Peter Boor, Michael
jeong Kim, Heounjeong Go, Soo Jeong Nam, and Sang Hoffmeister, Volkmar Schulz, and Jakob Nikolas Kather.
Hyun Park. Weakly supervised segmentation on neural Benchmarking weakly-supervised deep learning pipelines
compressed histopathology with self-equivariant regular- for whole slide classification in computational pathology.
ization. Med. Image Anal., 80:102482, 2022. 3 Med. Image Anal., 79:102474, 2022. 1, 3
[22] Ozan Ciga, Tony Xu, and Anne Louise Martel. Self super- [33] Narmin Ghaffari Laleh, Daniel Truhn, Gregory Patrick
vised contrastive learning for digital histopathology. Ma- Veldhuizen, Tianyu Han, Marko van Treeck, Roman D
chine Learning with Applications, 7:100198, 2022. 1, 7 Buelow, Rupert Langer, Bastian Dislich, Peter Boor, Volk-
[23] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakel- mar Schulz, and Jakob Nikolas Kather. Adversarial attacks
laropoulos, Navneet Narula, Matija Snuderl, David Fenyö, and adversarial robustness in computational pathology. Nat.
Andre L Moreira, Narges Razavian, and Aristotelis Tsiri- Commun., 13(1):5711, 2022. 4
gos. Classification and mutation prediction from non–small [34] A Goldhirsch, E P Winer, A S Coates, R D Gelber, M
cell lung cancer histopathology images using deep learning. Piccart-Gebhart, B Thürlimann, H-J Senn, and Panel mem-
Nat. Med., 24(10):1559–1567, 2018. 1, 3 bers. Personalizing the treatment of women with early
[24] Ekin Dogus Cubuk, Ethan S Dyer, Rapha Gontijo Lopes, breast cancer: highlights of the st gallen international ex-
and Sylvia Smullin. Tradeoffs in data augmentation: An pert consensus on the primary therapy of early breast cancer
empirical study. In ICLR, 2021. 3 2013. Ann. Oncol., 24(9):2206–2223, 2013. 1
[25] Olivier Dehaene, Axel Camara, Olivier Moindrot, Axel [35] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin
de Lavergne, and Pierre Courtiol. Self-Supervision closes Tallec, Pierre Richemond, Elena Buchatskaya, Carl Do-
the gap between weak and strong supervision in histology. ersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad
2020. 1, 7 Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi
[26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Munos, and Michal Valko. Bootstrap your own latent - a
Li Fei-Fei. ImageNet: A large-scale hierarchical image new approach to Self-Supervised learning. In Advances
database. In 2009 IEEE Conference on Computer Vision in Neural Information Processing Systems, pages 21271–
and Pattern Recognition, pages 248–255, 2009. 1, 3 21284. Curran Associates, Inc., 2020. 3
[27] Terrance DeVries and Graham W Taylor. Improved regular- [36] Yonghang Guan, Jun Zhang, Kuan Tian, Sen Yang, Pei
ization of convolutional neural networks with cutout. 2017. Dong, Jinxi Xiang, Wei Yang, Junzhou Huang, Yuyao
5, 6, 2 Zhang, and Xiao Han. Node-aligned graph convolutional
[28] R Dienstmann, M J Mason, F A Sinicrope, A I Phipps, network for whole-slide image representation and classifi-
S Tejpar, A Nesbakken, S A Danielsen, A Sveen, D D cation. In 2022 IEEE/CVF Conference on Computer Vi-
Buchanan, M Clendenning, C Rosty, B Bot, S R Alberts, sion and Pattern Recognition (CVPR), pages 18813–18823.
J Milburn Jessup, R A Lothe, M Delorenzi, P A Newcomb, IEEE, 2022. 1
D Sargent, and J Guinney. Prediction of overall survival in [37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
stage II and III colon cancer beyond TNM system: a ret- Deep residual learning for image recognition. In 2016 IEEE
rospective, pooled biomarker study. Ann. Oncol., 28(5): Conference on Computer Vision and Pattern Recognition,
1023–1031, 2017. 1 CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages
[29] Thomas G Dietterich, Richard H Lathrop, and Tomás 770–778. IEEE Computer Society, 2016. 1, 3, 7, 2
Lozano-Pérez. Solving the multiple instance problem with [38] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
axis-parallel rectangles. Artif. Intell., 89(1):31–71, 1997. 3 Girshick. Momentum contrast for unsupervised visual rep-
10
resentation learning. In 2020 IEEE/CVF Conference on Thomas Rösch, Rene Werner, Jie Tian, Elodie Puybareau,
Computer Vision and Pattern Recognition (CVPR). IEEE, Matteo Bovio, Xiufeng Zhang, Yifeng Zhu, Se Young
2020. 3, 2 Chun, Won-Ki Jeong, Peom Park, and Jinwook Choi. PAIP
[39] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Pi- 2019: Liver cancer segmentation challenge. Med. Image
otr Dollár, and Ross B. Girshick. Masked autoencoders are Anal., 67:101854, 2021. 2
scalable vision learners. In IEEE/CVF Conference on Com- [49] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weis-
puter Vision and Pattern Recognition, CVPR 2022, New senborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer,
Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Syl-
IEEE, 2022. 2 vain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An im-
[40] Dan Hendrycks and Kevin Gimpel. Gaussian error linear age is worth 16x16 words: Transformers for image recog-
units (gelus). arXiv preprint arXiv:1606.08415, 2016. 5 nition at scale. In International Conference on Learning
[41] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Representations, 2021. 3, 7, 2
Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A [50] Karsten Krug, Eric J Jaehnig, Shankha Satpathy, Lili Blu-
simple data processing method to improve robustness and menberg, Alla Karpova, Meenakshi Anurag, George Miles,
uncertainty. Proceedings of the International Conference Philipp Mertins, Yifat Geffen, Lauren C Tang, et al. Pro-
on Learning Representations (ICLR), 2020. 2 teogenomic landscape of breast cancer tumorigenesis and
[42] Maximilian Ilse, Jakub Tomczak, and Max Welling. targeted therapy. Cell, 183(5):1436–1456, 2020. 6, 1
Attention-based deep multiple instance learning. In Pro- [51] Yang Liu, Nilay S Sethi, Toshinori Hinoue, Barbara G
ceedings of the 35th International Conference on Machine Schneider, Andrew D Cherniack, Francisco Sanchez-Vega,
Learning, pages 2127–2136. PMLR, 2018. 3, 6, 7, 8, 4, 5 Jose A Seoane, Farshad Farshidfar, Reanne Bowlby, Mi-
[43] Mostafa Jahanifar, Manahil Raza, Kesi Xu, Trinh Vuong, razul Islam, et al. Comparative molecular analysis of gas-
Rob Jewsbury, Adam Shephard, Neda Zamanitajeddin, trointestinal adenocarcinomas. Cancer cell, 33(4):721–735,
Jin Tae Kwak, Shan E Ahmed Raza, Fayyaz Minhas, and 2018. 1
Nasir Rajpoot. Domain generalization in computational [52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng
pathology: Survey and guidelines. 2023. 4 Zhang, Stephen Lin, and Baining Guo. Swin transformer:
[44] Mingu Kang, Heon Song, Seonwook Park, Donggeun Yoo, Hierarchical vision transformer using shifted windows. In
and Sérgio Pereira. Benchmarking self-supervised learning 2021 IEEE/CVF International Conference on Computer Vi-
on diverse pathology datasets. In 2023 IEEE/CVF Confer- sion (ICCV). IEEE, 2021. 3, 4, 7, 2
ence on Computer Vision and Pattern Recognition (CVPR), [53] Chiara Maria Lavinia Loeffler, Omar S M El Nahhas,
pages 3344–3354. IEEE, 2023. 1, 3, 4, 6, 7, 8, 2 Hannah Sophie Muti, Tobias Seibel, Didem Cifci, Marko
[45] Jakob Nikolas Kather, Niels Halama, and Alexander Marx. van Treeck, Marco Gustav, Zunamys I Carrero, Nadine T
100,000 histological images of human colorectal cancer Gaisa, Kjong-Van Lehmann, Alexandra Leary, Pier Se-
and healthy tissue, 2018. 4 lenica, Jorge S Reis-Filho, Nadina Ortiz Bruechle, and
[46] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Jakob Nikolas Kather. Direct prediction of homologous re-
Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron combination deficiency from routine histology in ten dif-
Weis, Timo Gaiser, Alexander Marx, Nektarios A Val- ferent tumor types with attention-based multiple instance
ous, Dyke Ferber, Lina Jansen, Constantino Carlos Reyes- learning: a development and validation study. medRxiv,
Aldasoro, Inka Zörnig, Dirk Jäger, Hermann Brenner, 2023. 1
Jenny Chang-Claude, Michael Hoffmeister, and Niels Ha- [54] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradi-
lama. Predicting survival from colorectal cancer histol- ent descent with warm restarts. In International Conference
ogy slides using deep learning: A retrospective multicenter on Learning Representations, 2017. 6, 5
study. PLoS Med., 16(1):e1002730, 2019. 4 [55] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
[47] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, regularization. In 7th International Conference on Learning
Dirk Jäger, Jeremias Krause, Sven H Loosen, Alexander Representations, ICLR 2019, New Orleans, LA, USA, May
Marx, Peter Boor, Frank Tacke, Ulf Peter Neumann, Heike I 6-9, 2019, 2019. 6, 5
Grabsch, Takaki Yoshikawa, Hermann Brenner, Jenny [56] Ming Y Lu, Drew F K Williamson, Tiffany Y Chen,
Chang-Claude, Michael Hoffmeister, Christian Trautwein, Richard J Chen, Matteo Barbieri, and Faisal Mahmood.
and Tom Luedde. Deep learning can predict microsatellite Data-efficient and weakly supervised computational pathol-
instability directly from histology in gastrointestinal cancer. ogy on whole-slide images. Nat Biomed Eng, 5(6):555–
Nat. Med., 25(7):1054–1056, 2019. 1, 3 570, 2021. 1, 3
[48] Yoo Jung Kim, Hyungjoon Jang, Kyoungbun Lee, [57] Ming Y Lu, Bowen Chen, Drew F K Williamson, Richard J
Seongkeun Park, Sung-Gyu Min, Choyeon Hong, Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor
Jeong Hwan Park, Kanggeun Lee, Jisoo Kim, Wonjae Odintsov, Andrew Zhang, Long Phi Le, Georg Gerber,
Hong, Hyun Jung, Yanling Liu, Haran Rajkumar, Ma- Anil V Parwani, and Faisal Mahmood. Towards a Visual-
hendra Khened, Ganapathy Krishnamurthi, Sen Yang, Language foundation model for computational pathology.
Xiyue Wang, Chang Hee Han, Jin Tae Kwak, Jianqiang 2023. 1, 3, 9, 2
Ma, Zhe Tang, Bahram Marami, Jack Zeineh, Zixu Zhao, [58] Marc Macenko, Marc Niethammer, J S Marron, David Bor-
Pheng-Ann Heng, Rüdiger Schmitz, Frederic Madesta, land, John T Woosley, Xiaojun Guan, Charles Schmitt, and
11
Nancy E Thomas. A method for normalizing histology [66] A C Ruifrok and D A Johnston. Quantification of histo-
slides for quantitative analysis. In 2009 IEEE International chemical staining by color deconvolution. Anal. Quant. Cy-
Symposium on Biomedical Imaging: From Nano to Macro, tol. Histol., 23(4):291–299, 2001. 3
pages 1107–1110, 2009. 2, 3, 7, 8, 10 [67] Oliver Lester Saldanha, Chiara M L Loeffler, Jan Moritz
[59] Niccolò Marini, Sebastian Otalora, Marek Wodzinski, Se- Niehues, Marko van Treeck, Tobias P Seraphin, Kather-
lene Tomassini, Aldo Franco Dragoni, Stephane Marchand- ine Jane Hewitt, Didem Cifci, Gregory Patrick Veldhuizen,
Maillet, Juan Pedro Dominguez Morales, Lourdes Duran- Siddhi Ramesh, Alexander T Pearson, and Jakob Nikolas
Lopez, Simona Vatrano, Henning Müller, and Manfredo Kather. Self-supervised attention-based deep learning for
Atzori. Data-driven color augmentation for H&E stained pan-cancer mutation prediction from histopathology. NPJ
images in computational pathology. J. Pathol. Inform., 14: Precis Oncol, 7(1):35, 2023. 1
100183, 2023. 4 [68] Massimo Salvi, U Rajendra Acharya, Filippo Molinari, and
[60] Haseeb Nazki, Ognjen Arandjelovic, In Hwa Um, and Kristen M Meiburger. The impact of pre- and post-image
David Harrison. MultiPathGAN: Structure preserving stain processing techniques on deep learning frameworks: A
normalization using unsupervised multi-domain adversarial comprehensive review for digital pathology image analysis.
network with perception loss. In Proceedings of the 38th Comput. Biol. Med., 128:104129, 2021. 2, 3, 4
ACM/SIGAPP Symposium on Applied Computing, pages 1197–1204, New York, NY, USA, 2023. Association for Computing Machinery. 3

[61] Jan Moritz Niehues, Philip Quirke, Nicholas P West, Heike I Grabsch, Marko van Treeck, Yoni Schirris, Gregory P Veldhuizen, Gordon G A Hutchins, Susan D Richman, Sebastian Foersch, Titus J Brinker, Junya Fukuoka, Andrey Bychkov, Wataru Uegami, Daniel Truhn, Hermann Brenner, Alexander Brobeil, Michael Hoffmeister, and Jakob Nikolas Kather. Generalizable biomarker prediction from cancer pathology slides with self-supervised deep learning: A retrospective multi-centric study. Cell Rep Med, 4(4):100980, 2023. 1

[62] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. 2023. 2

[63] E Reinhard, M Adhikhmin, B Gooch, and P Shirley. Color transfer between images. IEEE Comput. Graph. Appl., 21(5):34–41, 2001. 2, 3

[64] Abtin Riasatian, Morteza Babaie, Danial Maleki, Shivam Kalra, Mojtaba Valipour, Sobhan Hemati, Manit Zaveri, Amir Safarpoor, Sobhan Shafiei, Mehdi Afshari, Maral Rasoolijaberi, Milad Sikaroudi, Mohd Adnan, Sultaan Shah, Charles Choi, Savvas Damaskinos, Clinton Jv Campbell, Phedias Diamandis, Liron Pantanowitz, Hany Kashani, Ali Ghodsi, and H R Tizhoosh. Fine-Tuning and training of densenet for histopathology image representation using TCGA diagnostic slides. Med. Image Anal., 70:102032, 2021. 4

[65] Arnaud D Roth, Sabine Tejpar, Mauro Delorenzi, Pu Yan, Roberto Fiocca, Dirk Klingbiel, Daniel Dietrich, Bart Biesmans, György Bodoky, Carlo Barone, Enrique Aranda, Bernard Nordlinger, Laura Cisar, Roberto Labianca, David Cunningham, Eric Van Cutsem, and Fred Bosman. Prognostic role of KRAS and BRAF in stage II and III resected colon cancer: results of the translational study on the PETACC-3, EORTC 40993, SAKK 60-00 trial. J. Clin. Oncol., 28(3):466–474, 2010. 1

[69] Birgid Schömig-Markiefka, Alexey Pryalukhin, Wolfgang Hulla, Andrey Bychkov, Junya Fukuoka, Anant Madabhushi, Viktor Achter, Lech Nieroda, Reinhard Büttner, Alexander Quaas, and Yuri Tolkach. Quality control stress test for deep learning-based diagnostic model in digital pathology. Mod. Pathol., 34(12):2098–2108, 2021. 4

[70] Peter Leonard Schrammen, Narmin Ghaffari Laleh, Amelie Echle, Daniel Truhn, Volkmar Schulz, Titus J Brinker, Hermann Brenner, Jenny Chang-Claude, Elizabeth Alwers, Alexander Brobeil, Matthias Kloor, Lara R Heij, Dirk Jäger, Christian Trautwein, Heike I Grabsch, Philip Quirke, Nicholas P West, Michael Hoffmeister, and Jakob Nikolas Kather. Weakly supervised annotation-free cancer detection and prediction of genotype in routine histopathology. J. Pathol., 256(1):50–60, 2022. 3

[71] Zhuchen Shao, Liuxi Dai, Jitendra Jonnagaddala, Yang Chen, Yifeng Wang, Zijie Fang, and Yongbing Zhang. Generalizability of Self-Supervised training models for digital pathology: A multicountry comparison in colorectal cancer. JCO Clin Cancer Inform, 7:e2200178, 2023. 3, 8

[72] Yiqing Shen, Yulin Luo, Dinggang Shen, and Jing Ke. RandStainNA: Learning Stain-Agnostic features from histology slides by bridging stain augmentation and normalization. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 212–221. Springer Nature Switzerland, 2022. 4

[73] Artem Shmatko, Narmin Ghaffari Laleh, Moritz Gerstung, and Jakob Nikolas Kather. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat Cancer, 3(9):1026–1038, 2022. 1

[74] Milad Sikaroudi, Maryam Hosseini, Ricardo Gonzalez, Shahryar Rahnamayan, and H R Tizhoosh. Generalization of vision pre-trained models for histopathology. Sci. Rep., 13(1):6065, 2023. 1, 3, 4, 7, 8

[75] T C Smyrk, P Watson, K Kaul, and H T Lynch. Tumor-infiltrating lymphocytes are a marker for microsatellite instability in colorectal carcinoma. Cancer, 91(12):2417–2422, 2001. 1

[76] T Sørlie, C M Perou, R Tibshirani, T Aas, S Geisler, H Johnsen, T Hastie, M B Eisen, M van de Rijn, S S Jeffrey, T Thorsen, H Quist, J C Matese, P O Brown, D Botstein, P E Lønning, and A L Børresen-Dale. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. U. S. A., 98(19):10869–10874, 2001. 1
[77] Maximilian Springenberg, Annika Frommholz, Markus Wenzel, Eva Weicken, Jackie Ma, and Nils Strodthoff. From modern CNNs to vision transformers: Assessing the performance, robustness, and classification strategies of deep learning models in histopathology. Med. Image Anal., 87:102809, 2023. 4

[78] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. 2021. 2, 3

[79] David Tellez, Maschenka Balkenhol, Nico Karssemeijer, Geert Litjens, Jeroen van der Laak, and Francesco Ciompi. H and E stain augmentation improves generalization of convolutional networks for histopathological mitosis detection. In Medical Imaging 2018: Digital Pathology, pages 264–270. SPIE, 2018. 4

[80] David Tellez, Geert Litjens, Péter Bándi, Wouter Bulten, John-Melle Bokhorst, Francesco Ciompi, and Jeroen van der Laak. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Med. Image Anal., 58:101544, 2019. 4

[81] United States Food and Drug Administration. FDA grants accelerated approval to pembrolizumab for first tissue/site agnostic indication. 2017. 1

[82] Abhishek Vahadane, Tingying Peng, Amit Sethi, Shadi Albarqouni, Lichao Wang, Maximilian Baust, Katja Steiger, Anna Melissa Schlitter, Irene Esposito, and Nassir Navab. Structure-Preserving color normalization and sparse stain separation for histological images. IEEE Trans. Med. Imaging, 35(8):1962–1971, 2016. 2, 3

[83] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. J. Mach. Learn. Res., 9(11), 2008. 4, 3

[84] Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A Petyuk, Sara R Savage, Bo Wen, Yongchao Dou, Yun Zhang, Zhiao Shi, Osama A Arshad, et al. Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities. Cell, 177(4):1035–1049, 2019. 6, 1

[85] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017. 6, 5

[86] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D Kunz, Matthew C H Lee, Jan Bernhard, Ran A Godrich, Gerard Oakley, Ewan Millar, Matthew Hanna, Juan Retamero, William A Moye, Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Rothrock, and Thomas J Fuchs. Virchow: A Million-Slide digital pathology foundation model. 2023. 1, 3, 9, 2

[87] Sophia J Wagner, Nadieh Khalili, Raghav Sharma, Melanie Boxberg, Carsten Marr, Walter de Back, and Tingying Peng. Structure-Preserving multi-domain stain color augmentation using Style-Transfer with disentangled representations. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pages 257–266. Springer International Publishing, 2021. 3

[88] Sophia J Wagner, Daniel Reisenbüchler, Nicholas P West, Jan Moritz Niehues, Jiefu Zhu, Sebastian Foersch, Gregory Patrick Veldhuizen, Philip Quirke, Heike I Grabsch, Piet A van den Brandt, Gordon G A Hutchins, Susan D Richman, Tanwei Yuan, Rupert Langer, Josien C A Jenniskens, Kelly Offermans, Wolfram Mueller, Richard Gray, Stephen B Gruber, Joel K Greenson, Gad Rennert, Joseph D Bonner, Daniel Schmolze, Jitendra Jonnagaddala, Nicholas J Hawkins, Robyn L Ward, Dion Morton, Matthew Seymour, Laura Magill, Marta Nowak, Jennifer Hay, Viktor H Koelzer, David N Church, TransSCOT consortium, Christian Matek, Carol Geppert, Chaolong Peng, Cheng Zhi, Xiaoming Ouyang, Jacqueline A James, Maurice B Loughrey, Manuel Salto-Tellez, Hermann Brenner, Michael Hoffmeister, Daniel Truhn, Julia A Schnabel, Melanie Boxberg, Tingying Peng, and Jakob Nikolas Kather. Transformer-based biomarker prediction from colorectal cancer histology: A large-scale multicentric study. Cancer Cell, 2023. 1, 6, 3, 4, 5

[89] Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and Wenyu Liu. Revisiting multiple instance neural networks. Pattern Recognit., 74:15–24, 2018. 3

[90] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Junzhou Huang, Wei Yang, and Xiao Han. TransPath: Transformer-Based self-supervised learning for histopathological image classification. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, pages 186–195. Springer International Publishing, 2021. 1, 3, 2

[91] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal., 81:102559, 2022. 3, 4, 7, 8, 2

[92] Xiyue Wang, Yuexi Du, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal., 83:102645, 2023. 1, 3, 4, 7, 2

[93] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013. 3, 6, 1, 2

[94] Georg Wölflein, Lucie Charlotte Magister, Pietro Liò, David J Harrison, and Ognjen Arandjelović. Deep multiple instance learning with Distance-Aware Self-Attention. 2023. 1, 6

[95] Jinxi Xiang, Xiyue Wang, Xinran Wang, Jun Zhang, Sen Yang, Wei Yang, Xiao Han, and Yueping Liu. Automatic diagnosis and grading of prostate cancer with weakly supervised learning on whole slide images. Comput. Biol. Med., 152:106340, 2023. 1

[96] Zhongyi Yang, Xiyue Wang, Jinxi Xiang, Jun Zhang, Sen Yang, Xinran Wang, Wei Yang, Zhongyu Li, Xiao Han, and Yueping Liu. The devil is in the details: a small-lesion sensitive weakly supervised learning framework for prostate cancer detection and grading. Virchows Arch., 482(3):525–538, 2023. 1
[97] Farhad Ghazvinian Zanjani, Svitlana Zinger, Babak Ehteshami Bejnordi, Jeroen A W M van der Laak, and Peter H N de With. Stain normalization of histopathology images using generative adversarial networks. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 573–577. IEEE, 2018. 3

[98] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stephane Deny. Barlow twins: Self-Supervised learning via redundancy reduction. In Proceedings of the 38th International Conference on Machine Learning, pages 12310–12320. PMLR, 2021. 3, 2

[99] Yunlong Zhang, Yuxuan Sun, Honglin Li, Sunyi Zheng, Chenglu Zhu, and Lin Yang. Benchmarking the robustness of deep neural networks to common corruptions in digital pathology. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 242–252. Springer Nature Switzerland, 2022. 4

[100] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Image BERT pre-training with online tokenizer. In International Conference on Learning Representations, 2022. 3
A Good Feature Extractor Is All You Need
for Weakly Supervised Learning in Histopathology
Supplementary Material
A. Downstream tasks and their clinical relevance

Targets We extensively evaluated the models on nine downstream tasks, summarised in Tab. 2. All of the targets were treated as binary variables, except for breast cancer subtype, which is a five-way classification target determined by immunohistochemistry: Luminal A (HR+/HER2-/low Ki-67), Luminal B (HR+/HER2+/high Ki-67), HER2 overexpressed (HR-), Basal (which is a subgroup of triple-negative breast cancer), or Normal breast-like (a subtype for which the clinical and molecular characteristics remain largely undefined throughout the existing scientific literature) [76]. This molecular subtyping of early-stage invasive breast cancer has become an essential procedure in clinical management due to its implications for treatment recommendations and the valuable prognostic insights it provides into a patient's survival [9, 34]. In addition, our investigation also included analysis of prevalent mutations in CDH1 and TP53 as well as PIK3CA, the latter of which opens new possibilities for targeted therapies in advanced disease stages [1]. Microsatellite instability (MSI) status is a key marker in colorectal cancer owing to its profound implications for a patient's prognosis and responsiveness to immunotherapies [13, 81]. It is driven by either spontaneous or germline (hence hereditary) mutations in DNA-repair-related genes [6] and leads to phenotypic changes in the tumour tissue [75]. Therefore, the performance of various AI models is commonly evaluated based on their ability to predict MSI from routine histopathology [47], often in conjunction with other prevalent genetic markers such as KRAS and BRAF: these are key driver mutations in colorectal cancer that shape a patient's survival chances and strongly influence the selection of the targeted therapies best suited to each individual patient [28, 65]. Given the high clinical relevance and availability of robust ground truth data, we have strategically selected these particular tasks for our analysis.

Target | Training and validation | Test dataset
Subtype, CDH1 mutation, TP53 mutation, PIK3CA mutation | TCGA-BRCA [93] (833 train, 208 val samples) | CPTAC-BRCA [50] (120 samples)
LN status | CAMELYON17 [4] (centre-wise cross-validation; 320 train, 80 val samples) | CAMELYON17 [4] (centre-wise cross-validation; 100 samples)
MSI status, KRAS mutation, BRAF mutation, SMAD4 mutation | TCGA-CRC [93] (558 samples) | CPTAC-COAD [84] (110 samples)

Table 2. Overview of the evaluated downstream tasks. Dataset sizes are shown in parentheses. The first five targets (Subtype, CDH1, TP53, PIK3CA, LN status) are related to breast cancer, while the remaining four (MSI, KRAS, BRAF, SMAD4) are related to colorectal cancer.
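To make the five-way subtype rule above concrete, the following is a minimal sketch of the marker-to-subtype mapping; the boolean marker encoding is an illustrative assumption, since the study takes its subtype labels from [76] rather than computing them this way.

```python
def ihc_subtype(hr_pos: bool, her2_pos: bool, high_ki67: bool) -> str:
    """Illustrative mapping from IHC markers to the subtypes listed above."""
    if hr_pos and not her2_pos and not high_ki67:
        return "Luminal A"            # HR+/HER2-/low Ki-67
    if hr_pos and her2_pos and high_ki67:
        return "Luminal B"            # HR+/HER2+/high Ki-67
    if not hr_pos and her2_pos:
        return "HER2 overexpressed"   # HR-
    if not hr_pos and not her2_pos:
        # Triple-negative-like: Basal and Normal breast-like cannot be
        # separated from these three markers alone.
        return "Basal or Normal breast-like"
    raise ValueError("marker combination not covered by the rules above")
```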
Data Here, we provide additional details about where we obtained the data for the downstream tasks, further to what is mentioned in Sec. 4.

We predict LN status using the CAMELYON17 dataset [4], which contains data from five centres. For this dataset, we perform centre-wise cross-validation, where we use one of the centres for testing and the others for training (each of the five random seeds uses a different centre for testing). The training and validation sets are an 80%/20% split of the other four centres. We treat LN status as a binary classification task, where the positive class corresponds to the presence of metastatic cancer cells in the lymph nodes. Each slide in the dataset is of a lymph node tissue section, and we treat each slide as a single sample, i.e. a separate patient. This is slightly different to the original CAMELYON17 challenge [4], where groups of five slides were arranged into "virtual patients" (though the slides themselves may be from different actual patients), and the task was to predict a virtual patient-level label based on a specific rule for aggregating the slide-level predictions. We do not use the virtual patient labels, but instead use the slide-level labels provided in the dataset.

For all other targets, we use either TCGA-BRCA [93] or TCGA-CRC [93] for training, and respectively either CPTAC-BRCA [50] or CPTAC-COAD [84] for testing. We obtain the patient-level labels from the respective studies via cbioportal.org. The only exception is MSI status, which is not available for TCGA-CRC on cbioportal.org but is provided in the supplementary material of Liu et al. [51].
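A minimal sketch of the centre-wise cross-validation above follows; the `slides` list and its `(slide_id, centre, label)` layout are hypothetical stand-ins for the actual data-loading code.

```python
import random

CENTRES = [0, 1, 2, 3, 4]  # the five CAMELYON17 contributing centres

def centre_wise_split(slides, seed):
    """slides: list of (slide_id, centre, label); returns train/val/test."""
    test_centre = CENTRES[seed % len(CENTRES)]  # a different held-out centre per seed
    test = [s for s in slides if s[1] == test_centre]
    rest = [s for s in slides if s[1] != test_centre]
    rng = random.Random(seed)
    rng.shuffle(rest)
    cut = int(0.8 * len(rest))  # 80%/20% split of the other four centres
    return rest[:cut], rest[cut:], test
```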
B. Feature extractors

In Tab. 3, we provide an overview of the SSL feature extractors evaluated in this study. We use the weights from the respective authors' GitHub repositories. The feature extractor called Lunit-DINO in our paper corresponds to Kang et al.'s DINO_{p=16} model [44].

B.1. Foundation models

This year, a number of foundation models have emerged for pathology that were trained on datasets of unprecedented size. Unfortunately, we could not include these in our study since their weights remain proprietary.
Name | Architecture | SSL method | SSL dataset (magnification) | Embedding size (d_x)
CTransPath [90, 91] | Swin Transformer [52] | semantically-relevant contrastive learning [91] based on MoCo v3 [20] | TCGA [93] and PAIP [48] (20×) | 768
Phikon [31] | ViT-B [49] | iBOT [100] | TCGA [93] (20×) | 768
Lunit-DINO [44] | ViT-S [49] | DINO [12] | TCGA [93] and non-public TULIP [44] (20×, 40×) | 384
RetCCL [92] | ResNet-50 [37] | clustering-guided contrastive learning [92] based on MoCo [38] | TCGA [93] and PAIP [48] (20×) | 2048
Lunit-BT [44] | ResNet-50 [37] | Barlow Twins [98] | TCGA [93] and non-public TULIP [44] (20×, 40×) | 2048
Lunit-SwAV [44] | ResNet-50 [37] | SwAV [11] | TCGA [93] and non-public TULIP [44] (20×, 40×) | 2048

Table 3. Overview of the SSL feature extractors evaluated in this study, their architecture, SSL method, pretraining dataset, and embedding size. As baselines, we additionally compare against the respective ImageNet-pretrained backbones: Swin Transformer [52], ViT-B [49], ViT-S [49, 78] and ResNet-50 [37].
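As an illustration of the frozen feature-extraction step, the sketch below embeds patches with the ImageNet ResNet-50 baseline via torchvision; loading the SSL checkpoints from the respective authors' repositories is not shown here.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()     # drop the classifier head -> 2048-d features
model.eval()
preprocess = weights.transforms()  # resizing + ImageNet normalisation

@torch.no_grad()
def embed_patches(patches):
    """patches: list of PIL.Image H&E tiles -> (n, d_x) tensor, d_x = 2048."""
    batch = torch.stack([preprocess(p) for p in patches])
    return model(batch)
```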
(Figure 8 panel titles: Swin, CTransPath, ViT-B, Phikon, and the remaining feature extractors.)
Figure 8. Latent space visualisations (t-SNE [83]), showing the effect of stain normalisation [58]. This figure extends Fig. 3, which depicts only two feature extractors, Lunit-DINO [44] and its ViT-S [49, 78] ImageNet baseline; here, we show the other eight. Colours are as in Fig. 3.
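A visualisation in the spirit of Fig. 8 can be sketched as follows, assuming `feats_orig` and `feats_norm` are hypothetical (n, d_x) arrays of embeddings of the same patches before and after stain normalisation.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_coords(feats_orig: np.ndarray, feats_norm: np.ndarray):
    """Embed both point clouds jointly so their 2-D layouts are comparable."""
    stacked = np.concatenate([feats_orig, feats_norm], axis=0)
    coords = TSNE(n_components=2, random_state=0).fit_transform(stacked)
    n = len(feats_orig)
    return coords[:n], coords[n:]  # colour the two groups as in Fig. 3
```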
Figure 9. (Only fragments of this figure survive extraction: its axis labels list the feature extractors Swin, CTransPath, ViT-B, Phikon, ViT-S, Lunit-DINO, ResNet-50, RetCCL, Lunit-BT, and Lunit-SwAV, and the surviving caption fragment reads "… case we normalise the test set in the same way as the training set)".)
Figure 10. Examples of original and augmented patches (columns) from the NCT-CRC-HE-100K dataset [45, 46]. Each row corresponds to a representative patch from a different patch class (STR, ADI, MUS, NORM, TUM, BACK, LYM, DEB, MUC).
(Figure 11 panels, repeated for each downstream model: (a) stain normalisation, (b) rotate/flip, (c) all augmentations, (d) no augmentation. The y-axes show the change in test AUROC and the AUROC deterioration vs. the best configuration.)
Figure 11. Extended version of Fig. 1 showing the main results for all three downstream models: AttMIL [42] (top, same as Fig. 1), a
two-layer transformer [88] (middle), and mean average pooling (bottom).
results, independent of the choice of downstream aggregation model and augmentation group.

Absolute AUROC scores While the normalised differential AUROC score provides a relative performance measure to facilitate a fair comparison between feature extractors, we also provide the seed-averaged absolute test AUROC scores for all tasks, feature extractors, downstream models, and augmentation groups in Tabs. 9 to 13 (one table per augmentation group). Looking at these absolute scores, we find that predicting the PIK3CA target is the most difficult task across the board for all feature extractors and downstream models, while the LN status and MSI status targets are the easiest. However, we emphasise that the normalised differential AUROC score is the more meaningful metric for comparing feature extractors, since it is independent of the task difficulty and accounts for the variance across seeds (see Sec. 4.1).
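The "mean ± standard deviation over five seeds" entries in the tables below can be reproduced along these lines; `runs` is a hypothetical list of per-seed test predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def seed_averaged_auroc(runs):
    """runs: [(y_true, y_score), ...], one pair per random seed."""
    scores = [roc_auc_score(y_true, y_score) for y_true, y_score in runs]
    return float(np.mean(scores)), float(np.std(scores))  # e.g. (0.82, 0.03)
```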
patch embeddings, where the weights are computed inde-
F. Training and implementation details pendently for each embedding. More formally, the slide-
level embedding is given by
Training For downstream model training, we use the
n
AdamW [55] optimiser with an initial learning rate of 10−3 , X
weight decay of 10−2 , and a batch size of one. The learn- g¯θ ({xi }ni=1 ) = αi xi , (5)
i=1
ing rate is decayed using a cosine annealing learning sched-
ule [54] over 30 epochs, but we halt training when the vali- where the attention6 weights αi ∈ R are obtained via a two-
dation loss does not improve for ten epochs. layer network with 256 tanh-activated hidden units that is
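A minimal sketch of this optimisation setup is shown below; the two-layer network and the synthetic tensors are placeholders for the real aggregator and bags, and only the optimiser, schedule, and early-stopping logic mirror the description above.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 768), torch.randint(0, 2, (64,))  # dummy bag features

best_val, stale = float("inf"), 0
for epoch in range(30):                     # cosine annealing over 30 epochs
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()         # the paper trains with one bag per step
    optimizer.step()
    scheduler.step()
    val_loss = loss_fn(model(x), y).item()  # stand-in for a real validation pass
    if val_loss < best_val:
        best_val, stale = val_loss, 0
    else:
        stale += 1
        if stale >= 10:                     # halt after ten epochs without improvement
            break
```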
In MIL terminology, we refer to the patient as the bag, and the patches as the instances. Note that some datasets have multiple WSIs per patient; in these cases, we simply mix the patches from all WSIs into a single bag. An epoch represents a full pass over all patients in the training set. At every step, we sample a maximum of 8,192 patches per patient, though most patients have fewer patches.

Mean pool This model takes the mean of the patch embeddings, i.e.

\bar{g}_\theta(\{x_i\}_{i=1}^n) = \frac{1}{n} \sum_{i=1}^n x_i .   (4)

AttMIL [42] This model takes a weighted average of the patch embeddings, where the weights are computed independently for each embedding. More formally, the slide-level embedding is given by

\bar{g}_\theta(\{x_i\}_{i=1}^n) = \sum_{i=1}^n \alpha_i x_i ,   (5)

where the attention weights \alpha_i \in \mathbb{R} are obtained via a two-layer network with 256 tanh-activated hidden units that is applied to each patch embedding x_i independently and then normalised across all patches using a softmax function, i.e.

e_i = W_2 \tanh(W_1 x_i + b_1) + b_2 ,   (6)
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^n \exp(e_j)} .   (7)
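Eqs. (4) to (7) translate directly into the following PyTorch sketch; the dimension names are ours, and the 256 hidden units match the description above.

```python
import torch
from torch import nn

def mean_pool(x: torch.Tensor) -> torch.Tensor:
    """Eq. (4): x has shape (n, d_x); returns the slide-level embedding."""
    return x.mean(dim=0)

class AttMILPool(nn.Module):
    """Eqs. (5)-(7): attention-weighted average over patch embeddings."""

    def __init__(self, d_x: int, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(d_x, hidden)  # W1, b1 in Eq. (6)
        self.fc2 = nn.Linear(hidden, 1)    # W2, b2 in Eq. (6)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e = self.fc2(torch.tanh(self.fc1(x)))  # Eq. (6): one score per patch
        alpha = torch.softmax(e, dim=0)        # Eq. (7): normalise over patches
        return (alpha * x).sum(dim=0)          # Eq. (5): weighted average

# Example: pool a bag of 100 CTransPath embeddings into one slide vector.
# slide_vec = AttMILPool(d_x=768)(torch.randn(100, 768))
```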
We precompute and cache the feature vectors for every bag (i.e. all patches), all augmentations a_i, and all feature extractors f. During training, we only need to load the features from disk (d_x floating point values per patch, e.g. d_x = 768 in the case of CTransPath), as opposed to loading the patches directly (224 × 224 × 3 byte values) and having to perform augmentation and feature extraction on the fly (very expensive).
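As a rough worked example (assuming the cached features are stored as 32-bit floats, which the text above does not specify): a CTransPath embedding occupies 768 × 4 = 3,072 bytes per patch, whereas the raw RGB patch occupies 224 × 224 × 3 = 150,528 bytes, so the cached features are roughly 49× smaller even before accounting for the cost of re-running augmentation and feature extraction.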
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
AttMIL Swin 0.07 ± 0.02 0.17 ± 0.03 0.28 ± 0.02 0.07 ± 0.04 0.17 ± 0.08 0.18 ± 0.04 0.14 ± 0.04 0.14 ± 0.07 0.16 ± 0.05 0.15 ± 0.05
CTransPath 0.00 ± 0.00 0.01 ± 0.01 0.01 ± 0.01 0.04 ± 0.03 0.06 ± 0.07 0.08 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.08 ± 0.04 0.11 ± 0.02 0.15 ± 0.03 0.07 ± 0.03 0.17 ± 0.06 0.15 ± 0.03 0.03 ± 0.04 0.18 ± 0.07 0.01 ± 0.01 0.11 ± 0.04
Phikon 0.09 ± 0.02 0.09 ± 0.02 0.09 ± 0.03 0.09 ± 0.03 0.07 ± 0.06 0.06 ± 0.04 0.07 ± 0.04 0.07 ± 0.06 0.17 ± 0.08 0.09 ± 0.05
ViT-S 0.13 ± 0.03 0.08 ± 0.03 0.14 ± 0.05 0.08 ± 0.05 0.19 ± 0.09 0.18 ± 0.04 0.06 ± 0.03 0.19 ± 0.04 0.08 ± 0.08 0.13 ± 0.05
Lunit-DINO 0.08 ± 0.03 0.03 ± 0.03 0.03 ± 0.02 0.02 ± 0.03 0.07 ± 0.04 0.00 ± 0.00 0.06 ± 0.04 0.02 ± 0.02 0.02 ± 0.02 0.04 ± 0.03
ResNet-50 0.15 ± 0.03 0.09 ± 0.04 0.11 ± 0.03 0.01 ± 0.02 0.18 ± 0.08 0.22 ± 0.04 0.11 ± 0.03 0.23 ± 0.07 0.21 ± 0.09 0.15 ± 0.05
RetCCL 0.07 ± 0.03 0.04 ± 0.02 0.04 ± 0.03 0.05 ± 0.03 0.07 ± 0.06 0.08 ± 0.03 0.03 ± 0.02 0.14 ± 0.03 0.06 ± 0.03 0.06 ± 0.03
Lunit-BT 0.13 ± 0.03 0.06 ± 0.04 0.02 ± 0.01 0.13 ± 0.04 0.34 ± 0.15 0.28 ± 0.13 0.03 ± 0.04 0.35 ± 0.13 0.25 ± 0.03 0.18 ± 0.08
Lunit-SwAV 0.06 ± 0.02 0.06 ± 0.03 0.06 ± 0.02 0.13 ± 0.06 0.07 ± 0.05 0.10 ± 0.03 0.13 ± 0.06 0.07 ± 0.07 0.14 ± 0.08 0.09 ± 0.05
Transformer Swin 0.09 ± 0.04 0.11 ± 0.03 0.21 ± 0.04 0.09 ± 0.03 0.16 ± 0.08 0.19 ± 0.07 0.09 ± 0.04 0.17 ± 0.05 0.14 ± 0.05 0.14 ± 0.05
CTransPath 0.01 ± 0.02 0.01 ± 0.02 0.03 ± 0.03 0.08 ± 0.07 0.07 ± 0.07 0.02 ± 0.02 0.04 ± 0.04 0.08 ± 0.06 0.09 ± 0.05 0.05 ± 0.05
ViT-B 0.08 ± 0.03 0.10 ± 0.02 0.17 ± 0.04 0.11 ± 0.02 0.21 ± 0.07 0.18 ± 0.05 0.13 ± 0.05 0.20 ± 0.08 0.06 ± 0.05 0.14 ± 0.05
Phikon 0.13 ± 0.04 0.08 ± 0.05 0.08 ± 0.03 0.05 ± 0.03 0.07 ± 0.05 0.05 ± 0.04 0.05 ± 0.04 0.11 ± 0.07 0.12 ± 0.06 0.08 ± 0.05
ViT-S 0.10 ± 0.02 0.07 ± 0.03 0.22 ± 0.07 0.11 ± 0.06 0.21 ± 0.09 0.16 ± 0.06 0.08 ± 0.04 0.23 ± 0.09 0.03 ± 0.02 0.13 ± 0.06
Lunit-DINO 0.04 ± 0.03 0.06 ± 0.03 0.03 ± 0.02 0.02 ± 0.02 0.05 ± 0.04 0.01 ± 0.01 0.06 ± 0.03 0.02 ± 0.04 0.02 ± 0.03 0.03 ± 0.03
ResNet-50 0.13 ± 0.04 0.10 ± 0.07 0.15 ± 0.03 0.04 ± 0.07 0.19 ± 0.08 0.19 ± 0.07 0.11 ± 0.04 0.19 ± 0.06 0.30 ± 0.11 0.16 ± 0.07
RetCCL 0.09 ± 0.04 0.04 ± 0.04 0.02 ± 0.02 0.09 ± 0.06 0.07 ± 0.06 0.15 ± 0.03 0.12 ± 0.05 0.22 ± 0.11 0.06 ± 0.04 0.10 ± 0.06
Lunit-BT 0.04 ± 0.03 0.05 ± 0.03 0.02 ± 0.02 0.10 ± 0.04 0.07 ± 0.07 0.02 ± 0.02 0.02 ± 0.02 0.13 ± 0.05 0.07 ± 0.02 0.06 ± 0.04
Lunit-SwAV 0.08 ± 0.04 0.04 ± 0.05 0.05 ± 0.03 0.11 ± 0.05 0.07 ± 0.06 0.06 ± 0.03 0.08 ± 0.03 0.07 ± 0.05 0.17 ± 0.07 0.08 ± 0.05
Mean pool Swin 0.08 ± 0.01 0.10 ± 0.04 0.13 ± 0.05 0.05 ± 0.02 0.17 ± 0.12 0.17 ± 0.02 0.02 ± 0.02 0.13 ± 0.03 0.10 ± 0.02 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.02 0.02 ± 0.02 0.00 ± 0.01 0.15 ± 0.11 0.03 ± 0.02 0.11 ± 0.05 0.06 ± 0.03 0.09 ± 0.02 0.06 ± 0.05
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.07 ± 0.02 0.09 ± 0.02 0.15 ± 0.11 0.15 ± 0.02 0.07 ± 0.04 0.18 ± 0.04 0.02 ± 0.02 0.10 ± 0.04
Phikon 0.11 ± 0.02 0.04 ± 0.03 0.13 ± 0.03 0.06 ± 0.03 0.11 ± 0.11 0.07 ± 0.03 0.12 ± 0.03 0.09 ± 0.07 0.11 ± 0.05 0.09 ± 0.05
ViT-S 0.11 ± 0.01 0.03 ± 0.03 0.13 ± 0.02 0.07 ± 0.03 0.15 ± 0.11 0.19 ± 0.03 0.03 ± 0.02 0.21 ± 0.04 0.07 ± 0.03 0.11 ± 0.04
Lunit-DINO 0.08 ± 0.01 0.04 ± 0.02 0.01 ± 0.02 0.05 ± 0.03 0.09 ± 0.09 0.00 ± 0.00 0.09 ± 0.02 0.00 ± 0.00 0.01 ± 0.02 0.04 ± 0.04
ResNet-50 0.08 ± 0.01 0.00 ± 0.01 0.09 ± 0.02 0.03 ± 0.02 0.21 ± 0.09 0.22 ± 0.03 0.03 ± 0.04 0.24 ± 0.02 0.13 ± 0.05 0.11 ± 0.04
RetCCL 0.01 ± 0.00 0.03 ± 0.01 0.06 ± 0.02 0.06 ± 0.02 0.15 ± 0.11 0.10 ± 0.04 0.03 ± 0.03 0.15 ± 0.01 0.06 ± 0.02 0.07 ± 0.04
Lunit-BT 0.06 ± 0.03 0.04 ± 0.01 0.06 ± 0.04 0.07 ± 0.02 0.18 ± 0.11 0.08 ± 0.02 0.03 ± 0.03 0.21 ± 0.09 0.03 ± 0.02 0.08 ± 0.05
Lunit-SwAV 0.07 ± 0.00 0.03 ± 0.02 0.04 ± 0.02 0.11 ± 0.02 0.13 ± 0.13 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.01 0.13 ± 0.04 0.08 ± 0.05
Table 4. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing no augmentations.
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
AttMIL Swin 0.08 ± 0.04 0.23 ± 0.03 0.27 ± 0.03 0.07 ± 0.05 0.18 ± 0.08 0.19 ± 0.05 0.15 ± 0.02 0.11 ± 0.07 0.16 ± 0.05 0.16 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.04 0.04 ± 0.03 0.03 ± 0.02 0.06 ± 0.08 0.08 ± 0.03 0.08 ± 0.04 0.07 ± 0.06 0.08 ± 0.03 0.05 ± 0.04
ViT-B 0.08 ± 0.04 0.16 ± 0.03 0.12 ± 0.02 0.04 ± 0.03 0.16 ± 0.07 0.15 ± 0.04 0.10 ± 0.04 0.10 ± 0.05 0.01 ± 0.02 0.10 ± 0.04
Phikon 0.09 ± 0.04 0.08 ± 0.04 0.08 ± 0.04 0.10 ± 0.03 0.09 ± 0.08 0.06 ± 0.03 0.13 ± 0.04 0.12 ± 0.05 0.07 ± 0.05 0.09 ± 0.05
ViT-S 0.12 ± 0.05 0.13 ± 0.04 0.11 ± 0.04 0.04 ± 0.03 0.17 ± 0.09 0.20 ± 0.06 0.09 ± 0.04 0.15 ± 0.05 0.10 ± 0.11 0.12 ± 0.06
Lunit-DINO 0.04 ± 0.04 0.03 ± 0.02 0.04 ± 0.03 0.02 ± 0.02 0.06 ± 0.07 0.00 ± 0.01 0.05 ± 0.04 0.01 ± 0.02 0.09 ± 0.06 0.04 ± 0.04
ResNet-50 0.13 ± 0.04 0.20 ± 0.04 0.16 ± 0.03 0.03 ± 0.02 0.16 ± 0.08 0.22 ± 0.04 0.17 ± 0.05 0.15 ± 0.05 0.11 ± 0.07 0.15 ± 0.05
RetCCL 0.07 ± 0.04 0.02 ± 0.02 0.03 ± 0.03 0.05 ± 0.03 0.09 ± 0.06 0.06 ± 0.03 0.01 ± 0.02 0.13 ± 0.05 0.08 ± 0.02 0.06 ± 0.04
Lunit-BT 0.13 ± 0.06 0.04 ± 0.03 0.06 ± 0.08 0.12 ± 0.03 0.27 ± 0.17 0.17 ± 0.15 0.06 ± 0.06 0.34 ± 0.07 0.23 ± 0.07 0.16 ± 0.09
Lunit-SwAV 0.07 ± 0.04 0.02 ± 0.02 0.02 ± 0.02 0.05 ± 0.04 0.07 ± 0.06 0.13 ± 0.04 0.12 ± 0.04 0.05 ± 0.05 0.13 ± 0.06 0.07 ± 0.04
Transformer Swin 0.09 ± 0.03 0.15 ± 0.04 0.20 ± 0.03 0.04 ± 0.03 0.17 ± 0.09 0.21 ± 0.09 0.12 ± 0.06 0.21 ± 0.05 0.16 ± 0.08 0.15 ± 0.06
CTransPath 0.04 ± 0.03 0.01 ± 0.02 0.05 ± 0.05 0.08 ± 0.04 0.05 ± 0.07 0.02 ± 0.02 0.06 ± 0.03 0.03 ± 0.03 0.17 ± 0.09 0.06 ± 0.05
ViT-B 0.12 ± 0.04 0.14 ± 0.03 0.17 ± 0.03 0.02 ± 0.02 0.20 ± 0.08 0.22 ± 0.06 0.11 ± 0.04 0.23 ± 0.11 0.04 ± 0.03 0.14 ± 0.06
Phikon 0.11 ± 0.02 0.08 ± 0.02 0.09 ± 0.04 0.03 ± 0.02 0.09 ± 0.08 0.04 ± 0.03 0.06 ± 0.06 0.09 ± 0.08 0.03 ± 0.03 0.07 ± 0.05
ViT-S 0.09 ± 0.02 0.15 ± 0.04 0.15 ± 0.05 0.05 ± 0.03 0.15 ± 0.09 0.22 ± 0.08 0.10 ± 0.03 0.15 ± 0.04 0.04 ± 0.03 0.12 ± 0.05
Lunit-DINO 0.02 ± 0.03 0.06 ± 0.04 0.02 ± 0.03 0.02 ± 0.02 0.06 ± 0.05 0.01 ± 0.02 0.10 ± 0.05 0.04 ± 0.05 0.07 ± 0.07 0.04 ± 0.04
ResNet-50 0.15 ± 0.03 0.20 ± 0.07 0.16 ± 0.04 0.03 ± 0.02 0.22 ± 0.07 0.21 ± 0.04 0.13 ± 0.03 0.13 ± 0.07 0.20 ± 0.13 0.16 ± 0.06
RetCCL 0.07 ± 0.05 0.06 ± 0.03 0.03 ± 0.02 0.06 ± 0.04 0.10 ± 0.04 0.09 ± 0.03 0.11 ± 0.04 0.21 ± 0.09 0.08 ± 0.04 0.09 ± 0.05
Lunit-BT 0.03 ± 0.02 0.03 ± 0.02 0.02 ± 0.03 0.05 ± 0.02 0.06 ± 0.06 0.04 ± 0.03 0.02 ± 0.02 0.15 ± 0.07 0.05 ± 0.03 0.05 ± 0.04
Lunit-SwAV 0.07 ± 0.03 0.02 ± 0.02 0.04 ± 0.04 0.06 ± 0.05 0.08 ± 0.09 0.13 ± 0.06 0.14 ± 0.05 0.15 ± 0.10 0.18 ± 0.08 0.10 ± 0.06
Mean pool Swin 0.07 ± 0.01 0.14 ± 0.02 0.15 ± 0.04 0.03 ± 0.01 0.20 ± 0.09 0.18 ± 0.03 0.05 ± 0.05 0.13 ± 0.06 0.08 ± 0.03 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.02 ± 0.01 0.03 ± 0.03 0.00 ± 0.00 0.14 ± 0.10 0.03 ± 0.02 0.07 ± 0.05 0.06 ± 0.03 0.08 ± 0.02 0.05 ± 0.04
ViT-B 0.05 ± 0.01 0.08 ± 0.01 0.08 ± 0.03 0.04 ± 0.01 0.14 ± 0.11 0.16 ± 0.02 0.08 ± 0.03 0.13 ± 0.07 0.00 ± 0.01 0.08 ± 0.05
Phikon 0.13 ± 0.01 0.03 ± 0.02 0.11 ± 0.04 0.06 ± 0.02 0.12 ± 0.11 0.01 ± 0.01 0.02 ± 0.02 0.11 ± 0.05 0.07 ± 0.03 0.07 ± 0.05
ViT-S 0.08 ± 0.01 0.09 ± 0.02 0.12 ± 0.04 0.05 ± 0.03 0.18 ± 0.11 0.17 ± 0.06 0.02 ± 0.02 0.21 ± 0.04 0.03 ± 0.03 0.11 ± 0.05
Lunit-DINO 0.06 ± 0.01 0.02 ± 0.02 0.05 ± 0.04 0.05 ± 0.02 0.07 ± 0.08 0.02 ± 0.01 0.05 ± 0.03 0.03 ± 0.04 0.05 ± 0.02 0.04 ± 0.04
ResNet-50 0.08 ± 0.01 0.11 ± 0.04 0.15 ± 0.03 0.03 ± 0.01 0.21 ± 0.10 0.18 ± 0.04 0.07 ± 0.04 0.15 ± 0.03 0.06 ± 0.06 0.12 ± 0.05
RetCCL 0.02 ± 0.00 0.01 ± 0.01 0.07 ± 0.04 0.05 ± 0.01 0.12 ± 0.10 0.05 ± 0.02 0.02 ± 0.02 0.13 ± 0.03 0.05 ± 0.01 0.06 ± 0.04
Lunit-BT 0.09 ± 0.03 0.01 ± 0.00 0.04 ± 0.03 0.07 ± 0.01 0.21 ± 0.10 0.16 ± 0.04 0.06 ± 0.05 0.20 ± 0.08 0.02 ± 0.01 0.10 ± 0.05
Lunit-SwAV 0.08 ± 0.01 0.01 ± 0.01 0.02 ± 0.03 0.15 ± 0.02 0.13 ± 0.11 0.15 ± 0.01 0.17 ± 0.02 0.01 ± 0.02 0.13 ± 0.03 0.10 ± 0.04
Table 5. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing slidewise stain normalisation [58].
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
AttMIL Swin 0.08 ± 0.03 0.20 ± 0.04 0.24 ± 0.03 0.05 ± 0.03 0.16 ± 0.07 0.14 ± 0.03 0.11 ± 0.04 0.11 ± 0.07 0.20 ± 0.03 0.14 ± 0.04
CTransPath 0.00 ± 0.00 0.02 ± 0.03 0.03 ± 0.02 0.04 ± 0.01 0.04 ± 0.05 0.07 ± 0.06 0.07 ± 0.03 0.06 ± 0.04 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.11 ± 0.04 0.12 ± 0.03 0.12 ± 0.02 0.04 ± 0.04 0.16 ± 0.12 0.14 ± 0.04 0.10 ± 0.04 0.13 ± 0.06 0.02 ± 0.02 0.10 ± 0.05
Phikon 0.11 ± 0.03 0.04 ± 0.01 0.09 ± 0.04 0.06 ± 0.02 0.09 ± 0.09 0.03 ± 0.03 0.05 ± 0.05 0.09 ± 0.05 0.06 ± 0.06 0.07 ± 0.05
ViT-S 0.10 ± 0.03 0.10 ± 0.03 0.10 ± 0.04 0.01 ± 0.02 0.19 ± 0.07 0.16 ± 0.04 0.06 ± 0.05 0.18 ± 0.07 0.07 ± 0.03 0.11 ± 0.05
Lunit-DINO 0.04 ± 0.03 0.01 ± 0.01 0.04 ± 0.03 0.02 ± 0.02 0.06 ± 0.06 0.01 ± 0.02 0.07 ± 0.04 0.02 ± 0.04 0.05 ± 0.03 0.04 ± 0.03
ResNet-50 0.17 ± 0.04 0.18 ± 0.04 0.17 ± 0.05 0.02 ± 0.01 0.17 ± 0.07 0.18 ± 0.03 0.13 ± 0.03 0.17 ± 0.07 0.15 ± 0.07 0.15 ± 0.05
RetCCL 0.09 ± 0.04 0.03 ± 0.02 0.03 ± 0.04 0.03 ± 0.02 0.10 ± 0.09 0.07 ± 0.03 0.02 ± 0.03 0.14 ± 0.04 0.07 ± 0.03 0.06 ± 0.04
Lunit-BT 0.11 ± 0.04 0.04 ± 0.01 0.02 ± 0.02 0.13 ± 0.02 0.25 ± 0.13 0.33 ± 0.07 0.08 ± 0.06 0.28 ± 0.10 0.15 ± 0.08 0.16 ± 0.07
Lunit-SwAV 0.05 ± 0.03 0.03 ± 0.03 0.04 ± 0.02 0.05 ± 0.03 0.08 ± 0.07 0.12 ± 0.04 0.12 ± 0.08 0.07 ± 0.05 0.11 ± 0.05 0.08 ± 0.05
Transformer Swin 0.11 ± 0.04 0.19 ± 0.05 0.20 ± 0.05 0.09 ± 0.04 0.19 ± 0.08 0.19 ± 0.04 0.15 ± 0.04 0.22 ± 0.07 0.09 ± 0.06 0.16 ± 0.05
CTransPath 0.01 ± 0.02 0.05 ± 0.04 0.02 ± 0.02 0.06 ± 0.04 0.06 ± 0.07 0.04 ± 0.04 0.08 ± 0.05 0.07 ± 0.07 0.08 ± 0.05 0.05 ± 0.05
ViT-B 0.11 ± 0.03 0.13 ± 0.03 0.18 ± 0.03 0.08 ± 0.03 0.16 ± 0.10 0.21 ± 0.07 0.13 ± 0.07 0.21 ± 0.04 0.09 ± 0.03 0.14 ± 0.05
Phikon 0.08 ± 0.03 0.09 ± 0.03 0.07 ± 0.03 0.07 ± 0.04 0.07 ± 0.06 0.04 ± 0.02 0.05 ± 0.06 0.08 ± 0.04 0.04 ± 0.03 0.06 ± 0.04
ViT-S 0.11 ± 0.03 0.11 ± 0.06 0.18 ± 0.03 0.07 ± 0.05 0.16 ± 0.09 0.16 ± 0.02 0.04 ± 0.05 0.19 ± 0.04 0.05 ± 0.06 0.12 ± 0.05
Lunit-DINO 0.04 ± 0.03 0.05 ± 0.04 0.02 ± 0.02 0.03 ± 0.03 0.04 ± 0.05 0.02 ± 0.03 0.10 ± 0.04 0.09 ± 0.07 0.06 ± 0.06 0.05 ± 0.04
ResNet-50 0.16 ± 0.05 0.18 ± 0.10 0.23 ± 0.04 0.04 ± 0.05 0.14 ± 0.08 0.21 ± 0.06 0.13 ± 0.05 0.16 ± 0.05 0.30 ± 0.11 0.17 ± 0.07
RetCCL 0.06 ± 0.03 0.06 ± 0.04 0.04 ± 0.04 0.06 ± 0.02 0.08 ± 0.06 0.08 ± 0.04 0.09 ± 0.06 0.15 ± 0.08 0.07 ± 0.04 0.08 ± 0.05
Lunit-BT 0.03 ± 0.03 0.04 ± 0.03 0.05 ± 0.03 0.07 ± 0.04 0.05 ± 0.06 0.05 ± 0.03 0.09 ± 0.06 0.15 ± 0.04 0.07 ± 0.06 0.07 ± 0.04
Lunit-SwAV 0.06 ± 0.03 0.01 ± 0.01 0.03 ± 0.02 0.08 ± 0.04 0.07 ± 0.06 0.08 ± 0.04 0.15 ± 0.05 0.07 ± 0.10 0.12 ± 0.02 0.08 ± 0.05
Mean pool Swin 0.06 ± 0.01 0.12 ± 0.02 0.11 ± 0.04 0.01 ± 0.01 0.20 ± 0.11 0.11 ± 0.03 0.04 ± 0.03 0.15 ± 0.04 0.04 ± 0.01 0.09 ± 0.04
CTransPath 0.00 ± 0.00 0.01 ± 0.01 0.02 ± 0.02 0.01 ± 0.01 0.18 ± 0.10 0.03 ± 0.03 0.09 ± 0.05 0.07 ± 0.04 0.05 ± 0.02 0.05 ± 0.04
ViT-B 0.03 ± 0.00 0.09 ± 0.01 0.07 ± 0.03 0.03 ± 0.01 0.17 ± 0.10 0.17 ± 0.04 0.10 ± 0.05 0.16 ± 0.06 0.02 ± 0.02 0.09 ± 0.05
Phikon 0.11 ± 0.01 0.01 ± 0.01 0.11 ± 0.03 0.08 ± 0.04 0.16 ± 0.15 0.02 ± 0.03 0.05 ± 0.03 0.09 ± 0.03 0.07 ± 0.06 0.08 ± 0.06
ViT-S 0.06 ± 0.01 0.05 ± 0.04 0.09 ± 0.05 0.02 ± 0.02 0.17 ± 0.12 0.17 ± 0.03 0.02 ± 0.01 0.22 ± 0.06 0.07 ± 0.04 0.10 ± 0.05
Lunit-DINO 0.05 ± 0.01 0.02 ± 0.01 0.04 ± 0.04 0.04 ± 0.02 0.11 ± 0.12 0.04 ± 0.04 0.07 ± 0.04 0.00 ± 0.00 0.03 ± 0.02 0.04 ± 0.05
ResNet-50 0.08 ± 0.00 0.11 ± 0.04 0.07 ± 0.03 0.03 ± 0.01 0.22 ± 0.11 0.15 ± 0.05 0.03 ± 0.03 0.21 ± 0.04 0.11 ± 0.10 0.11 ± 0.06
RetCCL 0.01 ± 0.00 0.02 ± 0.01 0.05 ± 0.03 0.03 ± 0.01 0.14 ± 0.10 0.04 ± 0.03 0.05 ± 0.05 0.14 ± 0.05 0.03 ± 0.01 0.06 ± 0.04
Lunit-BT 0.06 ± 0.03 0.02 ± 0.01 0.03 ± 0.03 0.05 ± 0.01 0.18 ± 0.12 0.11 ± 0.04 0.02 ± 0.03 0.18 ± 0.03 0.00 ± 0.01 0.07 ± 0.05
Lunit-SwAV 0.06 ± 0.00 0.02 ± 0.01 0.04 ± 0.02 0.12 ± 0.01 0.12 ± 0.11 0.12 ± 0.03 0.15 ± 0.02 0.04 ± 0.03 0.09 ± 0.02 0.08 ± 0.04
Table 6. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing patchwise stain normalisation [58].
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
AttMIL Swin 0.05 ± 0.03 0.16 ± 0.05 0.28 ± 0.03 0.07 ± 0.02 0.14 ± 0.08 0.11 ± 0.03 0.13 ± 0.03 0.10 ± 0.04 0.20 ± 0.03 0.14 ± 0.04
CTransPath 0.00 ± 0.00 0.02 ± 0.03 0.03 ± 0.02 0.01 ± 0.01 0.05 ± 0.05 0.04 ± 0.03 0.07 ± 0.05 0.07 ± 0.02 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.07 ± 0.05 0.10 ± 0.02 0.15 ± 0.03 0.08 ± 0.03 0.16 ± 0.06 0.13 ± 0.03 0.09 ± 0.08 0.13 ± 0.04 0.01 ± 0.02 0.10 ± 0.04
Phikon 0.07 ± 0.03 0.07 ± 0.03 0.06 ± 0.06 0.11 ± 0.03 0.07 ± 0.06 0.04 ± 0.03 0.07 ± 0.04 0.09 ± 0.08 0.19 ± 0.09 0.08 ± 0.06
ViT-S 0.06 ± 0.03 0.04 ± 0.02 0.14 ± 0.04 0.06 ± 0.04 0.21 ± 0.10 0.19 ± 0.06 0.05 ± 0.04 0.19 ± 0.05 0.07 ± 0.08 0.11 ± 0.05
Lunit-DINO 0.06 ± 0.03 0.04 ± 0.03 0.02 ± 0.02 0.01 ± 0.02 0.05 ± 0.06 0.00 ± 0.00 0.06 ± 0.02 0.01 ± 0.02 0.04 ± 0.03 0.03 ± 0.03
ResNet-50 0.13 ± 0.03 0.10 ± 0.04 0.13 ± 0.04 0.03 ± 0.03 0.15 ± 0.10 0.22 ± 0.05 0.14 ± 0.05 0.22 ± 0.06 0.29 ± 0.08 0.16 ± 0.06
RetCCL 0.05 ± 0.03 0.04 ± 0.03 0.03 ± 0.03 0.04 ± 0.03 0.07 ± 0.07 0.06 ± 0.03 0.03 ± 0.04 0.16 ± 0.03 0.06 ± 0.03 0.06 ± 0.04
Lunit-BT 0.08 ± 0.04 0.03 ± 0.03 0.04 ± 0.05 0.12 ± 0.03 0.29 ± 0.20 0.25 ± 0.12 0.08 ± 0.08 0.34 ± 0.11 0.21 ± 0.05 0.16 ± 0.09
Lunit-SwAV 0.06 ± 0.03 0.06 ± 0.03 0.07 ± 0.04 0.10 ± 0.05 0.07 ± 0.06 0.06 ± 0.03 0.08 ± 0.05 0.05 ± 0.05 0.11 ± 0.04 0.07 ± 0.04
Transformer Swin 0.06 ± 0.03 0.10 ± 0.04 0.24 ± 0.04 0.04 ± 0.03 0.15 ± 0.09 0.10 ± 0.03 0.05 ± 0.05 0.19 ± 0.08 0.13 ± 0.06 0.12 ± 0.05
CTransPath 0.01 ± 0.01 0.02 ± 0.02 0.04 ± 0.04 0.08 ± 0.03 0.05 ± 0.05 0.08 ± 0.05 0.07 ± 0.05 0.13 ± 0.08 0.06 ± 0.03 0.06 ± 0.05
ViT-B 0.07 ± 0.03 0.08 ± 0.04 0.15 ± 0.04 0.08 ± 0.04 0.17 ± 0.06 0.20 ± 0.03 0.11 ± 0.02 0.14 ± 0.06 0.05 ± 0.04 0.12 ± 0.04
Phikon 0.06 ± 0.04 0.09 ± 0.03 0.08 ± 0.02 0.08 ± 0.03 0.06 ± 0.04 0.05 ± 0.04 0.04 ± 0.03 0.05 ± 0.04 0.15 ± 0.05 0.07 ± 0.04
ViT-S 0.06 ± 0.03 0.04 ± 0.03 0.17 ± 0.04 0.10 ± 0.06 0.18 ± 0.07 0.19 ± 0.02 0.10 ± 0.04 0.16 ± 0.05 0.04 ± 0.04 0.12 ± 0.04
Lunit-DINO 0.03 ± 0.03 0.08 ± 0.03 0.03 ± 0.02 0.02 ± 0.02 0.04 ± 0.04 0.00 ± 0.01 0.07 ± 0.03 0.02 ± 0.02 0.06 ± 0.06 0.04 ± 0.03
ResNet-50 0.09 ± 0.02 0.09 ± 0.05 0.18 ± 0.04 0.04 ± 0.05 0.18 ± 0.06 0.24 ± 0.05 0.09 ± 0.04 0.17 ± 0.07 0.32 ± 0.05 0.16 ± 0.05
RetCCL 0.07 ± 0.05 0.06 ± 0.04 0.02 ± 0.02 0.09 ± 0.04 0.06 ± 0.05 0.19 ± 0.05 0.12 ± 0.07 0.16 ± 0.06 0.11 ± 0.08 0.10 ± 0.06
Lunit-BT 0.02 ± 0.02 0.05 ± 0.04 0.05 ± 0.04 0.06 ± 0.03 0.07 ± 0.06 0.04 ± 0.04 0.02 ± 0.03 0.12 ± 0.04 0.05 ± 0.02 0.05 ± 0.04
Lunit-SwAV 0.06 ± 0.04 0.04 ± 0.03 0.05 ± 0.02 0.12 ± 0.04 0.07 ± 0.05 0.08 ± 0.05 0.10 ± 0.03 0.05 ± 0.06 0.18 ± 0.06 0.08 ± 0.05
Mean pool Swin 0.08 ± 0.01 0.10 ± 0.03 0.17 ± 0.04 0.06 ± 0.03 0.15 ± 0.11 0.14 ± 0.02 0.05 ± 0.05 0.13 ± 0.02 0.13 ± 0.03 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.02 0.04 ± 0.03 0.01 ± 0.02 0.16 ± 0.11 0.03 ± 0.02 0.10 ± 0.03 0.04 ± 0.02 0.06 ± 0.03 0.05 ± 0.04
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.10 ± 0.02 0.09 ± 0.02 0.17 ± 0.09 0.13 ± 0.03 0.09 ± 0.05 0.16 ± 0.03 0.01 ± 0.01 0.10 ± 0.04
Phikon 0.08 ± 0.01 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.03 0.12 ± 0.12 0.01 ± 0.01 0.13 ± 0.04 0.08 ± 0.08 0.09 ± 0.02 0.08 ± 0.05
ViT-S 0.08 ± 0.01 0.03 ± 0.02 0.15 ± 0.03 0.09 ± 0.02 0.14 ± 0.08 0.15 ± 0.02 0.02 ± 0.02 0.21 ± 0.05 0.07 ± 0.03 0.10 ± 0.03
Lunit-DINO 0.09 ± 0.01 0.04 ± 0.02 0.00 ± 0.01 0.05 ± 0.03 0.09 ± 0.09 0.01 ± 0.02 0.10 ± 0.04 0.00 ± 0.01 0.01 ± 0.02 0.04 ± 0.03
ResNet-50 0.08 ± 0.01 0.00 ± 0.01 0.12 ± 0.02 0.04 ± 0.03 0.19 ± 0.10 0.21 ± 0.05 0.04 ± 0.03 0.23 ± 0.04 0.12 ± 0.04 0.12 ± 0.04
RetCCL 0.01 ± 0.00 0.04 ± 0.01 0.08 ± 0.03 0.07 ± 0.03 0.14 ± 0.12 0.11 ± 0.04 0.07 ± 0.05 0.14 ± 0.01 0.05 ± 0.01 0.08 ± 0.05
Lunit-BT 0.06 ± 0.02 0.04 ± 0.01 0.06 ± 0.04 0.08 ± 0.02 0.22 ± 0.08 0.09 ± 0.05 0.02 ± 0.02 0.16 ± 0.01 0.02 ± 0.01 0.08 ± 0.04
Lunit-SwAV 0.07 ± 0.00 0.04 ± 0.02 0.08 ± 0.04 0.11 ± 0.03 0.13 ± 0.12 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.02 0.11 ± 0.04 0.08 ± 0.05
Table 7. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing rotation/flipping augmentations.
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
AttMIL Swin 0.04 ± 0.03 0.14 ± 0.02 0.21 ± 0.02 0.07 ± 0.04 0.13 ± 0.08 0.15 ± 0.04 0.10 ± 0.05 0.16 ± 0.08 0.17 ± 0.05 0.13 ± 0.05
CTransPath 0.00 ± 0.00 0.01 ± 0.02 0.00 ± 0.01 0.04 ± 0.03 0.03 ± 0.03 0.10 ± 0.04 0.06 ± 0.03 0.09 ± 0.07 0.07 ± 0.03 0.04 ± 0.03
ViT-B 0.04 ± 0.03 0.10 ± 0.04 0.12 ± 0.03 0.08 ± 0.04 0.14 ± 0.06 0.13 ± 0.04 0.06 ± 0.03 0.16 ± 0.05 0.02 ± 0.02 0.09 ± 0.04
Phikon 0.13 ± 0.05 0.09 ± 0.03 0.10 ± 0.05 0.12 ± 0.05 0.07 ± 0.07 0.06 ± 0.03 0.10 ± 0.05 0.18 ± 0.08 0.13 ± 0.06 0.11 ± 0.05
ViT-S 0.08 ± 0.03 0.07 ± 0.02 0.14 ± 0.02 0.08 ± 0.04 0.17 ± 0.09 0.15 ± 0.03 0.04 ± 0.03 0.19 ± 0.06 0.06 ± 0.05 0.11 ± 0.05
Lunit-DINO 0.05 ± 0.04 0.03 ± 0.03 0.04 ± 0.02 0.04 ± 0.04 0.06 ± 0.06 0.00 ± 0.01 0.07 ± 0.04 0.01 ± 0.02 0.05 ± 0.05 0.04 ± 0.04
ResNet-50 0.09 ± 0.03 0.07 ± 0.03 0.14 ± 0.03 0.01 ± 0.01 0.16 ± 0.08 0.24 ± 0.05 0.14 ± 0.03 0.24 ± 0.08 0.30 ± 0.11 0.15 ± 0.06
RetCCL 0.06 ± 0.04 0.03 ± 0.03 0.03 ± 0.02 0.07 ± 0.03 0.06 ± 0.05 0.11 ± 0.06 0.04 ± 0.05 0.18 ± 0.05 0.06 ± 0.02 0.07 ± 0.04
Lunit-BT 0.17 ± 0.05 0.11 ± 0.07 0.20 ± 0.20 0.17 ± 0.03 0.40 ± 0.07 0.21 ± 0.10 0.12 ± 0.05 0.24 ± 0.09 0.20 ± 0.05 0.20 ± 0.09
Lunit-SwAV 0.07 ± 0.03 0.03 ± 0.02 0.07 ± 0.04 0.10 ± 0.04 0.07 ± 0.06 0.08 ± 0.03 0.07 ± 0.05 0.13 ± 0.07 0.11 ± 0.05 0.08 ± 0.04
Transformer Swin 0.07 ± 0.02 0.13 ± 0.06 0.21 ± 0.03 0.03 ± 0.03 0.13 ± 0.09 0.13 ± 0.03 0.06 ± 0.06 0.07 ± 0.04 0.11 ± 0.03 0.10 ± 0.05
CTransPath 0.02 ± 0.02 0.06 ± 0.02 0.03 ± 0.02 0.04 ± 0.03 0.04 ± 0.04 0.06 ± 0.04 0.08 ± 0.03 0.07 ± 0.08 0.13 ± 0.06 0.06 ± 0.04
ViT-B 0.04 ± 0.03 0.11 ± 0.04 0.15 ± 0.03 0.09 ± 0.02 0.18 ± 0.13 0.15 ± 0.02 0.16 ± 0.05 0.23 ± 0.07 0.03 ± 0.03 0.13 ± 0.06
Phikon 0.12 ± 0.03 0.10 ± 0.04 0.06 ± 0.03 0.11 ± 0.03 0.08 ± 0.05 0.05 ± 0.04 0.04 ± 0.03 0.01 ± 0.02 0.15 ± 0.05 0.08 ± 0.04
ViT-S 0.06 ± 0.03 0.06 ± 0.03 0.14 ± 0.05 0.08 ± 0.03 0.19 ± 0.05 0.17 ± 0.05 0.06 ± 0.04 0.20 ± 0.04 0.02 ± 0.02 0.11 ± 0.04
Lunit-DINO 0.04 ± 0.03 0.05 ± 0.03 0.02 ± 0.01 0.04 ± 0.03 0.06 ± 0.06 0.01 ± 0.01 0.09 ± 0.05 0.06 ± 0.04 0.02 ± 0.02 0.04 ± 0.04
ResNet-50 0.09 ± 0.03 0.12 ± 0.04 0.10 ± 0.03 0.01 ± 0.02 0.18 ± 0.08 0.18 ± 0.03 0.04 ± 0.03 0.18 ± 0.05 0.27 ± 0.07 0.13 ± 0.05
RetCCL 0.03 ± 0.03 0.06 ± 0.04 0.01 ± 0.01 0.09 ± 0.04 0.08 ± 0.07 0.12 ± 0.07 0.13 ± 0.05 0.24 ± 0.08 0.13 ± 0.07 0.10 ± 0.06
Lunit-BT 0.03 ± 0.03 0.03 ± 0.03 0.04 ± 0.03 0.10 ± 0.03 0.09 ± 0.08 0.07 ± 0.06 0.03 ± 0.02 0.13 ± 0.04 0.05 ± 0.02 0.06 ± 0.04
Lunit-SwAV 0.08 ± 0.03 0.02 ± 0.03 0.03 ± 0.03 0.10 ± 0.03 0.07 ± 0.06 0.10 ± 0.04 0.10 ± 0.04 0.06 ± 0.04 0.16 ± 0.06 0.08 ± 0.04
Mean pool Swin 0.06 ± 0.01 0.10 ± 0.03 0.16 ± 0.03 0.04 ± 0.01 0.19 ± 0.12 0.15 ± 0.02 0.03 ± 0.04 0.18 ± 0.05 0.13 ± 0.04 0.12 ± 0.05
CTransPath 0.00 ± 0.00 0.03 ± 0.02 0.04 ± 0.02 0.00 ± 0.00 0.15 ± 0.11 0.04 ± 0.03 0.08 ± 0.03 0.04 ± 0.02 0.09 ± 0.03 0.05 ± 0.04
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.10 ± 0.02 0.08 ± 0.01 0.18 ± 0.08 0.17 ± 0.02 0.11 ± 0.05 0.20 ± 0.02 0.02 ± 0.02 0.11 ± 0.04
Phikon 0.11 ± 0.01 0.02 ± 0.02 0.13 ± 0.03 0.07 ± 0.04 0.12 ± 0.11 0.02 ± 0.02 0.11 ± 0.05 0.09 ± 0.07 0.12 ± 0.03 0.09 ± 0.05
ViT-S 0.11 ± 0.01 0.03 ± 0.02 0.16 ± 0.02 0.06 ± 0.01 0.16 ± 0.11 0.20 ± 0.03 0.04 ± 0.02 0.23 ± 0.03 0.06 ± 0.03 0.12 ± 0.04
Lunit-DINO 0.09 ± 0.01 0.02 ± 0.02 0.01 ± 0.02 0.04 ± 0.03 0.08 ± 0.09 0.01 ± 0.02 0.09 ± 0.02 0.00 ± 0.00 0.00 ± 0.01 0.04 ± 0.03
ResNet-50 0.08 ± 0.01 0.01 ± 0.01 0.11 ± 0.02 0.02 ± 0.01 0.23 ± 0.10 0.22 ± 0.03 0.01 ± 0.01 0.27 ± 0.05 0.15 ± 0.06 0.12 ± 0.04
RetCCL 0.01 ± 0.01 0.03 ± 0.01 0.07 ± 0.02 0.06 ± 0.01 0.14 ± 0.11 0.10 ± 0.05 0.08 ± 0.07 0.16 ± 0.03 0.06 ± 0.02 0.08 ± 0.05
Lunit-BT 0.08 ± 0.04 0.04 ± 0.01 0.10 ± 0.05 0.09 ± 0.02 0.29 ± 0.09 0.12 ± 0.07 0.03 ± 0.02 0.19 ± 0.02 0.09 ± 0.14 0.11 ± 0.07
Lunit-SwAV 0.07 ± 0.00 0.02 ± 0.01 0.03 ± 0.02 0.10 ± 0.02 0.15 ± 0.13 0.05 ± 0.02 0.13 ± 0.04 0.11 ± 0.05 0.13 ± 0.05 0.09 ± 0.05
Table 8. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing all augmentations.
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4
AttMIL Swin 0.75 ± 0.01 0.65 ± 0.02 0.54 ± 0.02 0.60 ± 0.02 0.74 ± 0.09 0.72 ± 0.04 0.51 ± 0.05 0.63 ± 0.07 0.55 ± 0.05
CTransPath 0.82 ± 0.02 0.81 ± 0.02 0.80 ± 0.02 0.62 ± 0.01 0.86 ± 0.08 0.82 ± 0.03 0.60 ± 0.03 0.71 ± 0.01 0.65 ± 0.02
ViT-B 0.74 ± 0.04 0.70 ± 0.01 0.66 ± 0.03 0.59 ± 0.01 0.74 ± 0.06 0.75 ± 0.03 0.62 ± 0.05 0.59 ± 0.08 0.70 ± 0.03
Phikon 0.73 ± 0.01 0.73 ± 0.02 0.72 ± 0.03 0.57 ± 0.02 0.85 ± 0.08 0.84 ± 0.05 0.59 ± 0.05 0.70 ± 0.06 0.54 ± 0.08
ViT-S 0.69 ± 0.02 0.73 ± 0.02 0.68 ± 0.06 0.58 ± 0.04 0.73 ± 0.10 0.72 ± 0.04 0.59 ± 0.03 0.58 ± 0.03 0.63 ± 0.08
Lunit-DINO 0.74 ± 0.02 0.78 ± 0.04 0.79 ± 0.03 0.64 ± 0.02 0.85 ± 0.03 0.90 ± 0.02 0.59 ± 0.04 0.76 ± 0.04 0.69 ± 0.02
ResNet-50 0.67 ± 0.02 0.73 ± 0.04 0.70 ± 0.03 0.65 ± 0.04 0.74 ± 0.09 0.68 ± 0.04 0.54 ± 0.04 0.55 ± 0.07 0.50 ± 0.10
RetCCL 0.76 ± 0.03 0.78 ± 0.01 0.78 ± 0.03 0.62 ± 0.01 0.85 ± 0.07 0.82 ± 0.03 0.63 ± 0.03 0.63 ± 0.02 0.66 ± 0.02
Lunit-BT 0.69 ± 0.03 0.75 ± 0.04 0.80 ± 0.00 0.54 ± 0.03 0.58 ± 0.17 0.62 ± 0.15 0.62 ± 0.05 0.43 ± 0.15 0.46 ± 0.03
Lunit-SwAV 0.76 ± 0.01 0.75 ± 0.03 0.76 ± 0.02 0.54 ± 0.06 0.84 ± 0.06 0.80 ± 0.03 0.53 ± 0.06 0.70 ± 0.08 0.58 ± 0.09
Transformer Swin 0.74 ± 0.04 0.70 ± 0.02 0.61 ± 0.03 0.54 ± 0.04 0.76 ± 0.09 0.69 ± 0.08 0.56 ± 0.03 0.60 ± 0.04 0.57 ± 0.05
CTransPath 0.81 ± 0.03 0.80 ± 0.01 0.80 ± 0.03 0.55 ± 0.08 0.85 ± 0.09 0.86 ± 0.02 0.60 ± 0.04 0.68 ± 0.07 0.62 ± 0.05
ViT-B 0.74 ± 0.03 0.71 ± 0.02 0.65 ± 0.03 0.52 ± 0.01 0.71 ± 0.07 0.70 ± 0.06 0.51 ± 0.05 0.56 ± 0.08 0.65 ± 0.06
Phikon 0.69 ± 0.04 0.73 ± 0.05 0.75 ± 0.02 0.59 ± 0.03 0.85 ± 0.06 0.83 ± 0.04 0.60 ± 0.04 0.65 ± 0.07 0.59 ± 0.06
ViT-S 0.72 ± 0.01 0.74 ± 0.03 0.60 ± 0.08 0.52 ± 0.07 0.71 ± 0.10 0.72 ± 0.07 0.57 ± 0.04 0.53 ± 0.10 0.68 ± 0.03
Lunit-DINO 0.78 ± 0.04 0.75 ± 0.03 0.79 ± 0.01 0.62 ± 0.02 0.87 ± 0.05 0.87 ± 0.02 0.59 ± 0.02 0.74 ± 0.05 0.69 ± 0.03
ResNet-50 0.69 ± 0.04 0.71 ± 0.08 0.67 ± 0.02 0.59 ± 0.08 0.73 ± 0.09 0.69 ± 0.07 0.54 ± 0.03 0.57 ± 0.06 0.41 ± 0.12
RetCCL 0.73 ± 0.03 0.77 ± 0.05 0.80 ± 0.04 0.55 ± 0.06 0.85 ± 0.07 0.73 ± 0.03 0.53 ± 0.05 0.55 ± 0.11 0.65 ± 0.06
Lunit-BT 0.78 ± 0.03 0.76 ± 0.03 0.80 ± 0.01 0.53 ± 0.05 0.85 ± 0.08 0.86 ± 0.02 0.63 ± 0.03 0.63 ± 0.04 0.65 ± 0.02
Lunit-SwAV 0.74 ± 0.05 0.77 ± 0.06 0.77 ± 0.02 0.53 ± 0.06 0.85 ± 0.06 0.82 ± 0.03 0.57 ± 0.03 0.69 ± 0.05 0.54 ± 0.07
Mean pool Swin 0.73 ± 0.01 0.68 ± 0.04 0.62 ± 0.05 0.59 ± 0.02 0.67 ± 0.13 0.72 ± 0.02 0.66 ± 0.02 0.67 ± 0.03 0.61 ± 0.02
CTransPath 0.82 ± 0.00 0.74 ± 0.02 0.72 ± 0.02 0.64 ± 0.02 0.69 ± 0.12 0.86 ± 0.02 0.58 ± 0.06 0.73 ± 0.04 0.62 ± 0.02
ViT-B 0.75 ± 0.01 0.71 ± 0.01 0.68 ± 0.02 0.56 ± 0.01 0.69 ± 0.11 0.74 ± 0.02 0.61 ± 0.04 0.61 ± 0.04 0.69 ± 0.02
Phikon 0.71 ± 0.02 0.74 ± 0.03 0.61 ± 0.03 0.59 ± 0.03 0.73 ± 0.12 0.82 ± 0.04 0.57 ± 0.03 0.70 ± 0.07 0.60 ± 0.05
ViT-S 0.71 ± 0.01 0.76 ± 0.04 0.61 ± 0.01 0.57 ± 0.02 0.69 ± 0.11 0.70 ± 0.04 0.65 ± 0.03 0.58 ± 0.05 0.64 ± 0.02
Lunit-DINO 0.74 ± 0.01 0.74 ± 0.02 0.73 ± 0.02 0.60 ± 0.03 0.75 ± 0.12 0.89 ± 0.02 0.60 ± 0.01 0.79 ± 0.01 0.70 ± 0.03
ResNet-50 0.74 ± 0.01 0.78 ± 0.01 0.65 ± 0.01 0.61 ± 0.01 0.63 ± 0.09 0.67 ± 0.03 0.66 ± 0.04 0.56 ± 0.03 0.58 ± 0.05
RetCCL 0.81 ± 0.00 0.75 ± 0.01 0.68 ± 0.02 0.58 ± 0.01 0.69 ± 0.12 0.79 ± 0.05 0.66 ± 0.03 0.64 ± 0.01 0.65 ± 0.00
Lunit-BT 0.76 ± 0.03 0.75 ± 0.00 0.69 ± 0.05 0.57 ± 0.01 0.66 ± 0.12 0.81 ± 0.02 0.66 ± 0.03 0.58 ± 0.10 0.68 ± 0.01
Lunit-SwAV 0.75 ± 0.00 0.75 ± 0.02 0.70 ± 0.02 0.53 ± 0.01 0.71 ± 0.15 0.84 ± 0.01 0.56 ± 0.03 0.76 ± 0.01 0.58 ± 0.05
Table 9. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing no
augmentations.
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4
AttMIL Swin 0.74 ± 0.02 0.58 ± 0.03 0.55 ± 0.03 0.58 ± 0.05 0.73 ± 0.09 0.72 ± 0.04 0.55 ± 0.01 0.66 ± 0.05 0.55 ± 0.05
CTransPath 0.82 ± 0.04 0.77 ± 0.05 0.78 ± 0.03 0.62 ± 0.01 0.84 ± 0.10 0.84 ± 0.00 0.61 ± 0.04 0.69 ± 0.05 0.63 ± 0.03
ViT-B 0.74 ± 0.02 0.65 ± 0.03 0.69 ± 0.02 0.61 ± 0.04 0.75 ± 0.08 0.77 ± 0.03 0.60 ± 0.03 0.67 ± 0.02 0.70 ± 0.04
Phikon 0.73 ± 0.02 0.72 ± 0.04 0.73 ± 0.04 0.55 ± 0.03 0.82 ± 0.08 0.85 ± 0.03 0.57 ± 0.04 0.65 ± 0.01 0.65 ± 0.05
ViT-S 0.70 ± 0.04 0.68 ± 0.04 0.70 ± 0.04 0.61 ± 0.03 0.74 ± 0.10 0.72 ± 0.06 0.61 ± 0.04 0.62 ± 0.03 0.62 ± 0.13
Lunit-DINO 0.78 ± 0.02 0.77 ± 0.02 0.78 ± 0.03 0.63 ± 0.01 0.84 ± 0.08 0.91 ± 0.04 0.65 ± 0.04 0.76 ± 0.06 0.62 ± 0.07
ResNet-50 0.69 ± 0.03 0.61 ± 0.05 0.66 ± 0.04 0.62 ± 0.02 0.74 ± 0.08 0.69 ± 0.03 0.53 ± 0.05 0.62 ± 0.02 0.60 ± 0.07
RetCCL 0.76 ± 0.03 0.78 ± 0.03 0.78 ± 0.03 0.60 ± 0.03 0.82 ± 0.06 0.85 ± 0.03 0.69 ± 0.01 0.63 ± 0.02 0.64 ± 0.01
Lunit-BT 0.69 ± 0.05 0.76 ± 0.04 0.75 ± 0.10 0.53 ± 0.02 0.64 ± 0.19 0.75 ± 0.17 0.63 ± 0.08 0.42 ± 0.07 0.49 ± 0.07
Lunit-SwAV 0.75 ± 0.03 0.78 ± 0.02 0.79 ± 0.02 0.60 ± 0.05 0.83 ± 0.06 0.79 ± 0.04 0.58 ± 0.03 0.71 ± 0.04 0.58 ± 0.07
Transformer Swin 0.73 ± 0.03 0.66 ± 0.04 0.61 ± 0.03 0.58 ± 0.03 0.74 ± 0.10 0.69 ± 0.10 0.57 ± 0.06 0.53 ± 0.04 0.55 ± 0.09
CTransPath 0.79 ± 0.03 0.79 ± 0.03 0.76 ± 0.05 0.54 ± 0.05 0.87 ± 0.08 0.88 ± 0.02 0.63 ± 0.03 0.71 ± 0.05 0.54 ± 0.09
ViT-B 0.70 ± 0.04 0.67 ± 0.03 0.64 ± 0.03 0.60 ± 0.03 0.71 ± 0.09 0.68 ± 0.07 0.58 ± 0.04 0.52 ± 0.11 0.67 ± 0.04
Phikon 0.72 ± 0.01 0.73 ± 0.01 0.72 ± 0.04 0.59 ± 0.01 0.82 ± 0.09 0.86 ± 0.03 0.63 ± 0.07 0.66 ± 0.08 0.68 ± 0.04
ViT-S 0.73 ± 0.01 0.65 ± 0.05 0.66 ± 0.06 0.57 ± 0.03 0.76 ± 0.10 0.68 ± 0.09 0.59 ± 0.03 0.60 ± 0.02 0.67 ± 0.03
Lunit-DINO 0.81 ± 0.03 0.74 ± 0.04 0.79 ± 0.03 0.60 ± 0.03 0.86 ± 0.06 0.89 ± 0.03 0.59 ± 0.07 0.71 ± 0.06 0.64 ± 0.07
ResNet-50 0.68 ± 0.03 0.61 ± 0.07 0.64 ± 0.04 0.59 ± 0.02 0.70 ± 0.08 0.69 ± 0.04 0.56 ± 0.03 0.62 ± 0.06 0.51 ± 0.14
RetCCL 0.76 ± 0.05 0.75 ± 0.04 0.78 ± 0.02 0.56 ± 0.05 0.81 ± 0.04 0.81 ± 0.02 0.58 ± 0.04 0.54 ± 0.09 0.63 ± 0.03
Lunit-BT 0.80 ± 0.03 0.78 ± 0.02 0.78 ± 0.03 0.57 ± 0.01 0.85 ± 0.06 0.86 ± 0.02 0.67 ± 0.02 0.60 ± 0.07 0.66 ± 0.01
Lunit-SwAV 0.76 ± 0.03 0.79 ± 0.01 0.77 ± 0.04 0.56 ± 0.06 0.83 ± 0.10 0.78 ± 0.06 0.55 ± 0.05 0.59 ± 0.11 0.53 ± 0.09
Mean pool Swin 0.76 ± 0.01 0.62 ± 0.02 0.60 ± 0.04 0.61 ± 0.01 0.62 ± 0.09 0.73 ± 0.03 0.63 ± 0.05 0.67 ± 0.07 0.63 ± 0.03
CTransPath 0.83 ± 0.00 0.74 ± 0.00 0.71 ± 0.01 0.64 ± 0.01 0.67 ± 0.09 0.89 ± 0.01 0.60 ± 0.05 0.74 ± 0.03 0.62 ± 0.02
ViT-B 0.78 ± 0.01 0.68 ± 0.01 0.67 ± 0.02 0.60 ± 0.01 0.67 ± 0.12 0.75 ± 0.02 0.60 ± 0.04 0.67 ± 0.07 0.70 ± 0.01
Phikon 0.70 ± 0.01 0.73 ± 0.02 0.64 ± 0.03 0.58 ± 0.02 0.69 ± 0.12 0.91 ± 0.02 0.65 ± 0.03 0.69 ± 0.06 0.63 ± 0.03
ViT-S 0.75 ± 0.01 0.68 ± 0.02 0.63 ± 0.03 0.59 ± 0.03 0.63 ± 0.11 0.74 ± 0.06 0.65 ± 0.03 0.59 ± 0.04 0.67 ± 0.03
Lunit-DINO 0.76 ± 0.01 0.74 ± 0.03 0.70 ± 0.05 0.60 ± 0.01 0.75 ± 0.12 0.89 ± 0.01 0.63 ± 0.03 0.77 ± 0.05 0.65 ± 0.02
ResNet-50 0.74 ± 0.01 0.65 ± 0.05 0.60 ± 0.02 0.61 ± 0.01 0.61 ± 0.10 0.73 ± 0.04 0.61 ± 0.04 0.65 ± 0.02 0.65 ± 0.06
RetCCL 0.80 ± 0.00 0.76 ± 0.01 0.68 ± 0.03 0.59 ± 0.00 0.69 ± 0.10 0.86 ± 0.01 0.65 ± 0.02 0.67 ± 0.03 0.66 ± 0.00
Lunit-BT 0.73 ± 0.03 0.75 ± 0.00 0.71 ± 0.04 0.57 ± 0.00 0.60 ± 0.10 0.76 ± 0.04 0.61 ± 0.05 0.60 ± 0.08 0.68 ± 0.01
Lunit-SwAV 0.74 ± 0.01 0.75 ± 0.01 0.72 ± 0.02 0.49 ± 0.02 0.69 ± 0.11 0.76 ± 0.01 0.51 ± 0.02 0.78 ± 0.02 0.57 ± 0.04
Table 10. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
slidewise stain normalisation [58].
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4
AttMIL Swin 0.73 ± 0.02 0.61 ± 0.05 0.57 ± 0.03 0.60 ± 0.03 0.75 ± 0.08 0.76 ± 0.02 0.57 ± 0.04 0.65 ± 0.08 0.51 ± 0.02
CTransPath 0.81 ± 0.03 0.78 ± 0.04 0.78 ± 0.02 0.60 ± 0.01 0.88 ± 0.07 0.83 ± 0.06 0.61 ± 0.03 0.70 ± 0.02 0.65 ± 0.02
ViT-B 0.71 ± 0.03 0.69 ± 0.03 0.69 ± 0.01 0.60 ± 0.05 0.75 ± 0.13 0.76 ± 0.04 0.58 ± 0.04 0.63 ± 0.06 0.69 ± 0.02
Phikon 0.70 ± 0.02 0.76 ± 0.01 0.72 ± 0.04 0.59 ± 0.02 0.82 ± 0.10 0.87 ± 0.03 0.62 ± 0.05 0.66 ± 0.03 0.65 ± 0.06
ViT-S 0.72 ± 0.02 0.70 ± 0.03 0.71 ± 0.04 0.63 ± 0.02 0.72 ± 0.07 0.75 ± 0.04 0.62 ± 0.06 0.58 ± 0.07 0.64 ± 0.03
Lunit-DINO 0.77 ± 0.02 0.79 ± 0.01 0.77 ± 0.03 0.62 ± 0.02 0.85 ± 0.07 0.89 ± 0.03 0.61 ± 0.04 0.73 ± 0.07 0.66 ± 0.03
ResNet-50 0.64 ± 0.03 0.62 ± 0.04 0.64 ± 0.06 0.63 ± 0.01 0.75 ± 0.07 0.72 ± 0.02 0.55 ± 0.03 0.59 ± 0.07 0.57 ± 0.07
RetCCL 0.73 ± 0.03 0.77 ± 0.02 0.78 ± 0.05 0.62 ± 0.02 0.82 ± 0.10 0.83 ± 0.03 0.66 ± 0.04 0.62 ± 0.02 0.64 ± 0.03
Lunit-BT 0.70 ± 0.03 0.76 ± 0.01 0.79 ± 0.03 0.51 ± 0.02 0.66 ± 0.14 0.57 ± 0.08 0.60 ± 0.06 0.48 ± 0.10 0.56 ± 0.11
Lunit-SwAV 0.76 ± 0.01 0.78 ± 0.03 0.77 ± 0.01 0.59 ± 0.03 0.83 ± 0.08 0.78 ± 0.04 0.55 ± 0.08 0.69 ± 0.05 0.60 ± 0.05
Transformer Swin 0.71 ± 0.04 0.63 ± 0.05 0.61 ± 0.05 0.56 ± 0.03 0.72 ± 0.09 0.71 ± 0.04 0.53 ± 0.02 0.55 ± 0.07 0.61 ± 0.07
CTransPath 0.80 ± 0.02 0.76 ± 0.04 0.80 ± 0.02 0.59 ± 0.04 0.85 ± 0.08 0.86 ± 0.05 0.60 ± 0.04 0.69 ± 0.08 0.62 ± 0.06
ViT-B 0.70 ± 0.03 0.69 ± 0.02 0.64 ± 0.03 0.57 ± 0.02 0.75 ± 0.11 0.69 ± 0.08 0.54 ± 0.07 0.55 ± 0.03 0.61 ± 0.03
Phikon 0.74 ± 0.03 0.73 ± 0.03 0.74 ± 0.03 0.58 ± 0.03 0.84 ± 0.07 0.86 ± 0.02 0.62 ± 0.06 0.69 ± 0.03 0.67 ± 0.04
ViT-S 0.71 ± 0.03 0.70 ± 0.06 0.63 ± 0.02 0.59 ± 0.05 0.75 ± 0.10 0.74 ± 0.02 0.63 ± 0.08 0.57 ± 0.03 0.65 ± 0.07
Lunit-DINO 0.78 ± 0.03 0.77 ± 0.04 0.79 ± 0.01 0.62 ± 0.04 0.87 ± 0.06 0.88 ± 0.04 0.58 ± 0.03 0.68 ± 0.09 0.64 ± 0.07
ResNet-50 0.66 ± 0.05 0.64 ± 0.11 0.58 ± 0.04 0.61 ± 0.07 0.77 ± 0.09 0.69 ± 0.06 0.54 ± 0.04 0.61 ± 0.04 0.40 ± 0.12
RetCCL 0.76 ± 0.03 0.76 ± 0.05 0.77 ± 0.04 0.59 ± 0.01 0.83 ± 0.07 0.82 ± 0.05 0.58 ± 0.05 0.62 ± 0.08 0.64 ± 0.05
Lunit-BT 0.78 ± 0.03 0.77 ± 0.03 0.77 ± 0.03 0.58 ± 0.04 0.86 ± 0.07 0.85 ± 0.03 0.59 ± 0.06 0.62 ± 0.02 0.63 ± 0.07
Lunit-SwAV 0.75 ± 0.03 0.80 ± 0.02 0.78 ± 0.04 0.57 ± 0.04 0.84 ± 0.06 0.82 ± 0.04 0.52 ± 0.04 0.69 ± 0.13 0.59 ± 0.01
Mean pool Swin 0.74 ± 0.01 0.65 ± 0.02 0.61 ± 0.04 0.61 ± 0.01 0.65 ± 0.11 0.78 ± 0.02 0.64 ± 0.04 0.65 ± 0.03 0.64 ± 0.01
CTransPath 0.80 ± 0.00 0.77 ± 0.01 0.70 ± 0.02 0.62 ± 0.02 0.67 ± 0.11 0.87 ± 0.02 0.59 ± 0.06 0.72 ± 0.03 0.64 ± 0.02
ViT-B 0.77 ± 0.00 0.68 ± 0.01 0.65 ± 0.02 0.60 ± 0.01 0.68 ± 0.11 0.73 ± 0.03 0.58 ± 0.06 0.63 ± 0.06 0.66 ± 0.03
Phikon 0.69 ± 0.01 0.76 ± 0.01 0.61 ± 0.02 0.55 ± 0.04 0.68 ± 0.16 0.88 ± 0.05 0.63 ± 0.03 0.70 ± 0.03 0.62 ± 0.07
ViT-S 0.74 ± 0.01 0.72 ± 0.04 0.63 ± 0.05 0.61 ± 0.02 0.67 ± 0.13 0.73 ± 0.02 0.67 ± 0.02 0.58 ± 0.06 0.61 ± 0.04
Lunit-DINO 0.76 ± 0.01 0.75 ± 0.02 0.68 ± 0.05 0.59 ± 0.02 0.73 ± 0.15 0.85 ± 0.03 0.61 ± 0.04 0.79 ± 0.03 0.65 ± 0.03
ResNet-50 0.73 ± 0.00 0.66 ± 0.05 0.65 ± 0.01 0.60 ± 0.01 0.63 ± 0.11 0.75 ± 0.05 0.66 ± 0.03 0.58 ± 0.04 0.58 ± 0.11
RetCCL 0.79 ± 0.00 0.75 ± 0.01 0.67 ± 0.02 0.60 ± 0.01 0.71 ± 0.10 0.85 ± 0.01 0.63 ± 0.05 0.66 ± 0.05 0.65 ± 0.01
Lunit-BT 0.75 ± 0.04 0.75 ± 0.01 0.69 ± 0.05 0.57 ± 0.01 0.67 ± 0.12 0.79 ± 0.03 0.66 ± 0.03 0.61 ± 0.01 0.68 ± 0.01
Lunit-SwAV 0.74 ± 0.00 0.75 ± 0.01 0.68 ± 0.01 0.51 ± 0.01 0.73 ± 0.14 0.78 ± 0.02 0.53 ± 0.01 0.75 ± 0.02 0.60 ± 0.02
Table 11. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
patchwise stain normalisation [58].
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4
AttMIL Swin 0.76 ± 0.02 0.66 ± 0.06 0.54 ± 0.03 0.59 ± 0.01 0.77 ± 0.08 0.77 ± 0.03 0.53 ± 0.02 0.68 ± 0.04 0.52 ± 0.03
CTransPath 0.81 ± 0.04 0.79 ± 0.03 0.80 ± 0.01 0.65 ± 0.03 0.86 ± 0.06 0.85 ± 0.03 0.59 ± 0.05 0.71 ± 0.01 0.65 ± 0.02
ViT-B 0.74 ± 0.04 0.71 ± 0.01 0.67 ± 0.02 0.58 ± 0.03 0.76 ± 0.06 0.76 ± 0.03 0.57 ± 0.09 0.66 ± 0.03 0.70 ± 0.04
Phikon 0.74 ± 0.02 0.74 ± 0.02 0.76 ± 0.08 0.55 ± 0.03 0.84 ± 0.07 0.85 ± 0.03 0.59 ± 0.04 0.70 ± 0.09 0.52 ± 0.10
ViT-S 0.75 ± 0.01 0.77 ± 0.02 0.69 ± 0.04 0.59 ± 0.03 0.71 ± 0.11 0.70 ± 0.06 0.61 ± 0.04 0.60 ± 0.05 0.64 ± 0.09
Lunit-DINO 0.76 ± 0.02 0.77 ± 0.03 0.80 ± 0.01 0.64 ± 0.01 0.86 ± 0.07 0.88 ± 0.02 0.59 ± 0.02 0.77 ± 0.04 0.68 ± 0.03
ResNet-50 0.68 ± 0.02 0.71 ± 0.04 0.69 ± 0.04 0.63 ± 0.02 0.76 ± 0.11 0.66 ± 0.05 0.52 ± 0.06 0.57 ± 0.06 0.43 ± 0.08
RetCCL 0.77 ± 0.02 0.77 ± 0.03 0.80 ± 0.02 0.61 ± 0.02 0.84 ± 0.08 0.82 ± 0.03 0.62 ± 0.05 0.63 ± 0.02 0.65 ± 0.01
Lunit-BT 0.73 ± 0.02 0.78 ± 0.02 0.78 ± 0.05 0.53 ± 0.02 0.62 ± 0.22 0.64 ± 0.14 0.57 ± 0.10 0.44 ± 0.12 0.50 ± 0.05
Lunit-SwAV 0.75 ± 0.01 0.75 ± 0.03 0.75 ± 0.04 0.56 ± 0.05 0.84 ± 0.07 0.82 ± 0.02 0.57 ± 0.06 0.73 ± 0.05 0.60 ± 0.04
Transformer Swin 0.74 ± 0.04 0.70 ± 0.03 0.58 ± 0.03 0.60 ± 0.02 0.76 ± 0.09 0.79 ± 0.04 0.61 ± 0.06 0.56 ± 0.09 0.59 ± 0.07
CTransPath 0.79 ± 0.02 0.78 ± 0.02 0.78 ± 0.05 0.57 ± 0.02 0.87 ± 0.06 0.82 ± 0.06 0.59 ± 0.06 0.62 ± 0.09 0.66 ± 0.01
ViT-B 0.74 ± 0.04 0.72 ± 0.03 0.67 ± 0.04 0.57 ± 0.04 0.74 ± 0.06 0.70 ± 0.04 0.54 ± 0.01 0.61 ± 0.07 0.67 ± 0.05
Phikon 0.75 ± 0.05 0.71 ± 0.02 0.74 ± 0.02 0.57 ± 0.02 0.86 ± 0.04 0.84 ± 0.04 0.61 ± 0.02 0.70 ± 0.05 0.57 ± 0.04
ViT-S 0.75 ± 0.03 0.76 ± 0.01 0.65 ± 0.04 0.55 ± 0.06 0.74 ± 0.08 0.71 ± 0.01 0.55 ± 0.04 0.59 ± 0.05 0.68 ± 0.04
Lunit-DINO 0.78 ± 0.03 0.72 ± 0.03 0.79 ± 0.02 0.63 ± 0.03 0.87 ± 0.04 0.89 ± 0.02 0.59 ± 0.03 0.73 ± 0.03 0.66 ± 0.07
ResNet-50 0.72 ± 0.01 0.71 ± 0.05 0.64 ± 0.04 0.61 ± 0.07 0.74 ± 0.07 0.65 ± 0.05 0.57 ± 0.03 0.58 ± 0.07 0.39 ± 0.05
RetCCL 0.74 ± 0.06 0.74 ± 0.04 0.80 ± 0.04 0.55 ± 0.04 0.86 ± 0.07 0.71 ± 0.06 0.54 ± 0.08 0.59 ± 0.06 0.61 ± 0.09
Lunit-BT 0.79 ± 0.02 0.75 ± 0.04 0.77 ± 0.04 0.58 ± 0.02 0.84 ± 0.06 0.86 ± 0.04 0.63 ± 0.04 0.63 ± 0.03 0.67 ± 0.01
Lunit-SwAV 0.74 ± 0.05 0.76 ± 0.05 0.77 ± 0.01 0.53 ± 0.04 0.84 ± 0.05 0.82 ± 0.05 0.56 ± 0.03 0.70 ± 0.08 0.54 ± 0.06
Mean pool Swin 0.75 ± 0.01 0.69 ± 0.03 0.60 ± 0.04 0.59 ± 0.02 0.69 ± 0.12 0.74 ± 0.02 0.63 ± 0.06 0.65 ± 0.01 0.57 ± 0.03
CTransPath 0.82 ± 0.00 0.75 ± 0.02 0.73 ± 0.02 0.64 ± 0.03 0.69 ± 0.12 0.85 ± 0.02 0.59 ± 0.03 0.75 ± 0.02 0.64 ± 0.03
ViT-B 0.76 ± 0.01 0.71 ± 0.01 0.67 ± 0.01 0.56 ± 0.01 0.68 ± 0.09 0.75 ± 0.03 0.59 ± 0.06 0.63 ± 0.03 0.69 ± 0.01
Phikon 0.74 ± 0.01 0.74 ± 0.02 0.64 ± 0.02 0.61 ± 0.03 0.73 ± 0.13 0.87 ± 0.01 0.56 ± 0.04 0.71 ± 0.09 0.61 ± 0.02
ViT-S 0.74 ± 0.01 0.77 ± 0.03 0.62 ± 0.02 0.56 ± 0.01 0.70 ± 0.08 0.73 ± 0.01 0.66 ± 0.03 0.57 ± 0.05 0.63 ± 0.03
Lunit-DINO 0.73 ± 0.01 0.75 ± 0.02 0.77 ± 0.02 0.60 ± 0.02 0.76 ± 0.11 0.87 ± 0.02 0.58 ± 0.04 0.78 ± 0.02 0.69 ± 0.02
ResNet-50 0.74 ± 0.01 0.79 ± 0.02 0.65 ± 0.01 0.61 ± 0.03 0.66 ± 0.10 0.67 ± 0.05 0.64 ± 0.03 0.55 ± 0.04 0.58 ± 0.04
RetCCL 0.81 ± 0.00 0.75 ± 0.00 0.69 ± 0.02 0.58 ± 0.02 0.70 ± 0.13 0.77 ± 0.04 0.61 ± 0.05 0.65 ± 0.01 0.65 ± 0.00
Lunit-BT 0.76 ± 0.02 0.75 ± 0.00 0.71 ± 0.05 0.57 ± 0.01 0.63 ± 0.08 0.80 ± 0.05 0.66 ± 0.01 0.62 ± 0.00 0.68 ± 0.00
Lunit-SwAV 0.75 ± 0.00 0.75 ± 0.01 0.69 ± 0.04 0.53 ± 0.01 0.71 ± 0.15 0.83 ± 0.02 0.55 ± 0.03 0.76 ± 0.02 0.59 ± 0.05
Table 12. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
rotation/flipping augmentations.
Model | Feature extractor | Subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4
AttMIL Swin 0.77 ± 0.01 0.66 ± 0.02 0.61 ± 0.02 0.59 ± 0.03 0.79 ± 0.09 0.74 ± 0.04 0.56 ± 0.06 0.63 ± 0.06 0.54 ± 0.04
CTransPath 0.81 ± 0.03 0.79 ± 0.02 0.82 ± 0.01 0.62 ± 0.02 0.89 ± 0.05 0.79 ± 0.03 0.60 ± 0.03 0.70 ± 0.05 0.65 ± 0.02
ViT-B 0.77 ± 0.01 0.70 ± 0.05 0.70 ± 0.03 0.58 ± 0.03 0.78 ± 0.06 0.76 ± 0.04 0.60 ± 0.02 0.63 ± 0.02 0.70 ± 0.04
Phikon 0.68 ± 0.05 0.71 ± 0.03 0.72 ± 0.06 0.54 ± 0.04 0.84 ± 0.07 0.84 ± 0.03 0.56 ± 0.08 0.61 ± 0.06 0.59 ± 0.07
ViT-S 0.73 ± 0.02 0.73 ± 0.02 0.68 ± 0.02 0.58 ± 0.04 0.74 ± 0.10 0.75 ± 0.02 0.61 ± 0.03 0.60 ± 0.03 0.65 ± 0.06
Lunit-DINO 0.76 ± 0.03 0.77 ± 0.03 0.78 ± 0.03 0.62 ± 0.03 0.86 ± 0.06 0.89 ± 0.03 0.59 ± 0.03 0.78 ± 0.07 0.67 ± 0.06
ResNet-50 0.72 ± 0.01 0.74 ± 0.03 0.68 ± 0.03 0.65 ± 0.04 0.76 ± 0.09 0.65 ± 0.04 0.52 ± 0.02 0.55 ± 0.06 0.41 ± 0.13
RetCCL 0.75 ± 0.03 0.77 ± 0.04 0.79 ± 0.03 0.59 ± 0.01 0.85 ± 0.05 0.79 ± 0.07 0.62 ± 0.06 0.61 ± 0.02 0.65 ± 0.01
Lunit-BT 0.64 ± 0.05 0.69 ± 0.07 0.62 ± 0.22 0.49 ± 0.01 0.51 ± 0.07 0.68 ± 0.11 0.54 ± 0.05 0.55 ± 0.08 0.52 ± 0.06
Lunit-SwAV 0.74 ± 0.01 0.77 ± 0.01 0.75 ± 0.04 0.56 ± 0.04 0.84 ± 0.06 0.82 ± 0.02 0.58 ± 0.05 0.66 ± 0.05 0.61 ± 0.05
Transformer Swin 0.73 ± 0.01 0.67 ± 0.06 0.60 ± 0.03 0.62 ± 0.03 0.80 ± 0.10 0.76 ± 0.03 0.60 ± 0.08 0.69 ± 0.03 0.60 ± 0.03
CTransPath 0.79 ± 0.04 0.74 ± 0.01 0.78 ± 0.03 0.61 ± 0.03 0.89 ± 0.04 0.83 ± 0.04 0.58 ± 0.03 0.69 ± 0.08 0.58 ± 0.07
ViT-B 0.76 ± 0.02 0.70 ± 0.03 0.66 ± 0.03 0.56 ± 0.02 0.75 ± 0.14 0.74 ± 0.01 0.50 ± 0.06 0.53 ± 0.08 0.68 ± 0.04
Phikon 0.69 ± 0.03 0.71 ± 0.03 0.75 ± 0.03 0.54 ± 0.03 0.85 ± 0.06 0.84 ± 0.05 0.63 ± 0.04 0.75 ± 0.04 0.56 ± 0.05
ViT-S 0.75 ± 0.02 0.74 ± 0.02 0.66 ± 0.05 0.56 ± 0.03 0.74 ± 0.05 0.72 ± 0.05 0.60 ± 0.05 0.56 ± 0.03 0.69 ± 0.01
Lunit-DINO 0.77 ± 0.02 0.75 ± 0.02 0.79 ± 0.01 0.61 ± 0.03 0.87 ± 0.07 0.88 ± 0.02 0.58 ± 0.05 0.71 ± 0.04 0.69 ± 0.04
ResNet-50 0.71 ± 0.03 0.69 ± 0.04 0.71 ± 0.03 0.64 ± 0.02 0.75 ± 0.08 0.71 ± 0.02 0.63 ± 0.03 0.59 ± 0.05 0.44 ± 0.07
RetCCL 0.78 ± 0.02 0.75 ± 0.06 0.79 ± 0.01 0.56 ± 0.04 0.85 ± 0.08 0.77 ± 0.08 0.54 ± 0.05 0.53 ± 0.09 0.58 ± 0.08
Lunit-BT 0.77 ± 0.03 0.78 ± 0.04 0.76 ± 0.03 0.55 ± 0.03 0.84 ± 0.09 0.82 ± 0.06 0.64 ± 0.03 0.63 ± 0.03 0.65 ± 0.02
Lunit-SwAV 0.72 ± 0.02 0.78 ± 0.02 0.77 ± 0.03 0.55 ± 0.03 0.86 ± 0.07 0.79 ± 0.05 0.57 ± 0.05 0.70 ± 0.03 0.55 ± 0.07
Mean pool Swin 0.77 ± 0.01 0.68 ± 0.04 0.62 ± 0.03 0.60 ± 0.02 0.66 ± 0.12 0.75 ± 0.02 0.65 ± 0.04 0.61 ± 0.05 0.58 ± 0.04
CTransPath 0.83 ± 0.00 0.75 ± 0.02 0.73 ± 0.01 0.64 ± 0.01 0.70 ± 0.12 0.86 ± 0.03 0.61 ± 0.03 0.75 ± 0.02 0.61 ± 0.02
ViT-B 0.76 ± 0.01 0.70 ± 0.01 0.68 ± 0.01 0.56 ± 0.01 0.68 ± 0.08 0.72 ± 0.02 0.58 ± 0.05 0.59 ± 0.01 0.69 ± 0.01
Phikon 0.71 ± 0.01 0.76 ± 0.03 0.65 ± 0.03 0.56 ± 0.04 0.73 ± 0.12 0.88 ± 0.02 0.57 ± 0.05 0.70 ± 0.07 0.59 ± 0.02
ViT-S 0.72 ± 0.02 0.75 ± 0.02 0.62 ± 0.01 0.57 ± 0.00 0.69 ± 0.11 0.69 ± 0.03 0.65 ± 0.04 0.56 ± 0.03 0.65 ± 0.02
Lunit-DINO 0.74 ± 0.01 0.76 ± 0.02 0.77 ± 0.03 0.59 ± 0.03 0.77 ± 0.12 0.88 ± 0.03 0.59 ± 0.02 0.79 ± 0.01 0.70 ± 0.03
ResNet-50 0.75 ± 0.01 0.77 ± 0.01 0.67 ± 0.02 0.61 ± 0.01 0.62 ± 0.10 0.67 ± 0.03 0.68 ± 0.01 0.52 ± 0.05 0.55 ± 0.06
RetCCL 0.82 ± 0.00 0.75 ± 0.01 0.71 ± 0.01 0.57 ± 0.01 0.71 ± 0.12 0.79 ± 0.05 0.61 ± 0.07 0.63 ± 0.03 0.65 ± 0.00
Lunit-BT 0.74 ± 0.04 0.74 ± 0.00 0.68 ± 0.06 0.55 ± 0.02 0.57 ± 0.09 0.77 ± 0.07 0.66 ± 0.01 0.60 ± 0.01 0.61 ± 0.16
Lunit-SwAV 0.76 ± 0.00 0.76 ± 0.02 0.75 ± 0.01 0.54 ± 0.02 0.70 ± 0.15 0.85 ± 0.01 0.55 ± 0.04 0.68 ± 0.05 0.58 ± 0.05
Table 13. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
all augmentations.