
A Good Feature Extractor Is All You Need for Weakly Supervised Learning in Histopathology

Georg Wölflein¹,²,∗ Dyke Ferber²,³ Asier R. Meneghetti² Omar S. M. El Nahhas² Daniel Truhn⁴ Zunamys I. Carrero² David J. Harrison¹,⁵ Ognjen Arandjelović¹ Jakob N. Kather²,³,⁶

¹University of St Andrews ²EKFZ for Digital Health, TU Dresden ³University of Heidelberg ⁴University Hospital Aachen ⁵Lothian NHS University Hospitals ⁶University Hospital Dresden
arXiv:2311.11772v2 [cs.CV] 22 Nov 2023

[Figure 1: four panels. (a) stain normalisation, (b) rotate/flip, (c) all augmentations: boxplots of change in test AUROC (−0.4 to 0.4); (d) no augmentation: AUROC deterioration vs. best, lower is better (0.00 to 0.25). Legend (feature extractors): Swin, CTransPath, ViT-B, Phikon, ViT-S, Lunit-DINO, ResNet-50, RetCCL, Lunit-BT, Lunit-SwAV.]

Figure 1. Stain normalisation and image augmentations do not impact downstream performance. We empirically evaluate ten feature
extractors across nine weakly supervised pathology tasks, observing no benefit in employing stain normalisation (a) or augmentations (b,
c) before feature extraction. The best models (d), Lunit-DINO and CTransPath, are particularly robust, unlike ImageNet baselines (bold).

Abstract

Deep learning is revolutionising pathology, offering novel opportunities in disease prognosis and personalised treatment. Historically, stain normalisation has been a crucial preprocessing step in computational pathology pipelines, and persists into the deep learning era. Yet, with the emergence of feature extractors trained using self-supervised learning (SSL) on diverse pathology datasets, we call this practice into question. In an empirical evaluation of publicly available feature extractors, we find that omitting stain normalisation and image augmentations does not compromise downstream performance, while incurring substantial savings in memory and compute. Further, we show that the top-performing feature extractors are remarkably robust to variations in stain and augmentations like rotation in their latent space. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level prediction tasks in a weakly supervised setting with external validation cohorts. This work represents the most comprehensive robustness evaluation of public pathology SSL feature extractors to date, involving more than 6,000 training runs across nine tasks, five datasets, three downstream architectures, and various preprocessing setups. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.

1. Introduction

There has been a recent surge in studies using deep learning in oncology to predict clinical variables such as genetic alterations and survival directly from routinely available histopathology whole slide images (WSIs) [23, 30, 32, 47, 53, 56, 61, 88, 96]. Due to their immense size, reaching billions of pixels at 20× magnification, these images are first divided into small, non-overlapping patches. What follows can be broken down into two stages: (i) feature extraction, where a feature vector is obtained separately for each patch, and (ii) feature aggregation, where the extracted feature vectors are combined to form the slide-level prediction [7, 73]. Both steps are parametrised using neural networks; usually, the feature extractor is a deep backbone architecture whose parameters are frozen¹, while the aggregator is shallower, but trainable. In the past, convolutional neural networks (CNNs) such as ResNet-50 [37] pretrained on ImageNet [26] were used to perform feature extraction. Recent advances in SSL have made it possible to train powerful feature extractors without labels, a development that is gaining traction in the field of computational pathology, where large quantities of images are available but annotated data is sparse. As such, the last few years have witnessed the emergence of several SSL models trained on large-scale pathology datasets [2, 15, 16, 31, 44, 57, 86, 90–92]. These models produce better representations for downstream tasks than their ImageNet-pretrained counterparts [8, 14, 22, 25, 44, 74], and are quickly establishing themselves as the leading choice for feature extraction [30, 36, 61, 67, 88, 94, 95].

∗ georgwoelflein.de
¹ Employing a frozen feature extractor has significant computational benefits because the feature vectors can be pre-computed before training.
In computational pathology, stain normalisation has traditionally been a standard preprocessing step which was crucial in order to account for variations in scanners and haematoxylin and eosin (H&E) stains by adjusting WSIs to match a reference image [58, 63, 82]. Yet, with the shift from ImageNet CNNs to SSL models trained on vast and varied data from multiple centres, it is worth reconsidering its need. Beyond stain normalisation, image augmentations are a broad category of image-to-image transformations that may be applied during training, such as random flips, rotations, and colour transformations. Some augmentations, like rotation, are particularly well-suited for pathology due to the rotational invariance of micrographs [68]. SSL feature extractors that have been trained on a wide variety of images from multiple international sites might therefore extract diagnostically/prognostically relevant features irrespective of site- or scanner-specific traits. This leads to our primary research question: with SSL feature extractors trained on rich datasets, is there still a need for image augmentations and stain normalisation to improve the generalisability of weakly supervised learning models? Our study approaches this question in two ways:
1. We assess the latent space similarity between original patches and their stain-normalised/augmented counterparts in Sec. 3. Our analysis reveals that many augmentations induce only minor perturbations in the extracted features, especially compared to ImageNet backbones.
2. In the most comprehensive robustness evaluation of publicly available pathology SSL feature extractors to date, we compare over 6,000 trained models, both with and without normalisation/augmentation, across multiple externally validated slide-level tasks to determine whether the increased preprocessing effort holds merit in terms of downstream performance (Fig. 1 and Sec. 4).
Our analysis has implications for computational pathology practitioners and researchers alike, given the overhead incurred by employing image augmentations and stain normalisation in feature extraction pipelines.

[Figure 2: pipeline diagram. Preprocessing: split into patches (deterministic), augment ((non-)deterministic), extract (frozen). Training: feature aggregator and MLP classifier (trainable), producing the prediction y.]
Figure 2. Common setup for weakly supervised learning on WSIs. In the preprocessing stage, the input image is split into patches that undergo independent image augmentations a before feature extraction. The feature aggregator and classifier are trained jointly as a single neural network hθ ◦ gθ, which, given the feature vectors as inputs, predicts the output y. For stain normalisation (shown here), the same a(·) is applied every time, though in general, the augmentation function may vary between patches and epochs.

1.1. Problem formulation

In a WSI classification task, we have a dataset of labelled WSIs. Each WSI X ∈ R^{W×H×3} is an RGB image of width W and height H, though dimensions vary between slides. It is associated with a ground truth label y ∈ Y = R^c for a c-way classification problem. However, due to their large size, we usually consider each WSI as a bag of patches, framing the WSI classification problem as a weakly supervised learning task. More specifically, we split each WSI X into a set of n non-overlapping patches {x_1, x_2, . . . , x_n} where each x_i ∈ X = R^{P×P×3} for a fixed patch size P. Here, n varies depending on the particular slide's dimensions and usually lies between 1,000 and 10,000 for slides at 10× magnification with patch size P = 224. The task is to find a model M : X^n → Y that predicts the label given a bag of patches representing a WSI.

It is computationally infeasible to parametrise M using a single deep neural network trained end-to-end. Instead, the common approach in the literature is a two-step process consisting of preprocessing (feature extraction) and training (aggregation and classification), outlined in Fig. 2. The preprocessing stage often entails stain normalisation, and may optionally include image augmentations as well.

We first consider the simple case with a predetermined augmentation function a : X → X that is applied independently to each patch x_i to obtain the augmented patches x̂_i = a(x_i) for i = 1, 2, . . . , n. Then, we apply the feature extractor f : X → R^{d_x}, which for each patch x̂_i outputs a d_x-dimensional feature vector z_i = f(x̂_i). Now, we have n feature vectors, z_1, z_2, . . . , z_n, which are aggregated into a single vector z̄ ∈ R^{d_z} (usually d_x = d_z) via an aggregation function g_θ : R^{n×d_x} → R^{d_z} with learnable parameters θ. Finally, the aggregated feature vector z̄ passes through a classifier h_θ : R^{d_z} → Y to obtain the final prediction. In summary, we can express the process M : X^n → Y of obtaining a prediction y from a bag of patches {x_i}_{i=1}^n as

$$M(\{x_i\}_{i=1}^n) = \underbrace{(h_\theta \circ g_\theta)}_{\text{training}}\big(\overbrace{\{(f \circ a)(x_i)\}_{i=1}^n}^{\text{preprocessing}}\big), \tag{1}$$

where ◦ denotes function composition. Notice that f ◦ a is independent of the learnable parameters θ and thus can be pre-computed for all patches x_i before training.

In the general case, we define a set of augmentation functions A ⊆ X^X before training (X^X is the set of functions from X to X). During training, for every patch x_i, we uniformly sample² an augmentation a_i ∼ A. Then, the augmented patch is x̂_i = a_i(x_i), so Eq. (1) becomes

$$M(\{x_i\}_{i=1}^n) = (h_\theta \circ g_\theta)\big(\{(f \circ a_i)(x_i)\}_{i=1}^n\big). \tag{2}$$

² The augmentation is resampled for every patch at every epoch.
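For illustration, Eq. (1) can be realised in a few lines of PyTorch. The following is a minimal sketch: the backbone `f`, the identity default for the augmentation `a`, and the mean-pooling aggregator (the simple baseline evaluated later in Sec. 4) are placeholders, not the exact models used in this work.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def precompute_features(f: nn.Module, patches: torch.Tensor,
                        a=lambda x: x) -> torch.Tensor:
    """Apply (f ∘ a) to a bag of patches of shape (n, 3, P, P); since
    neither f nor a has learnable parameters, the resulting (n, d_x)
    feature matrix can be cached before training."""
    f.eval()
    return f(a(patches))

class MeanPoolClassifier(nn.Module):
    """h_θ ∘ g_θ: mean average pooling followed by a linear classifier."""
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(d, num_classes)  # h_θ (softmax applied in loss)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_bar = z.mean(dim=0)   # g_θ: aggregate n feature vectors into z̄
        return self.head(z_bar)
```

Because `precompute_features` carries no gradient state, it runs once per dataset, and only the lightweight `MeanPoolClassifier` (or a more expressive aggregator) is optimised during training.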
While just a small modification in terms of notation, this change incurs a significant increase in the time and memory complexity of the preprocessing task by a factor of |A|, since augmentation and feature extraction must be performed for all possible augmentations a_i ∈ A for every patch³. As a result of this overhead, practitioners must carefully choose which augmentations to apply, if any. We address this problem by assessing the performance benefit obtained by different augmentations on our benchmark tasks.

³ If the number of augmentations |A| is smaller than the number of training epochs, it is cheaper to pre-compute all augmentations before training. Otherwise, it is better to sample the augmentations for every patch and epoch before training, and just pre-compute for those combinations.

2. Related work

Weakly supervised WSI classification. Early work on WSI classification with slide-level labels employed CNNs such as ResNet [37] which were pretrained on ImageNet [26] and then fine-tuned on the classification task using slide-level labels as patch-level supervision [23, 47]. Recognising that this approach introduces excessive noise in the patch-level supervision to the detriment of the training process, later work [42, 89] reframed this task as an embedding-based multiple instance learning (MIL) problem [29]. In this line of work, a feature vector is extracted for every patch using a CNN (f in Fig. 2), and these feature vectors are aggregated and classified via a learnable pooling function and classifier (hθ ◦ gθ in Fig. 2). Initially, the entire network, including feature extraction, was trained end-to-end [42]. However, end-to-end training becomes intractable as MIL approaches scale to larger datasets, so more recent models operate on frozen features extracted using ImageNet-pretrained models [56]. The frozen feature approach is now widely adopted for weakly supervised learning on WSIs, albeit with better feature extractors trained using SSL.

SSL in pathology. The goal of SSL is to learn useful representations for downstream tasks from unlabelled data. Unlike supervised learning, SSL leverages structures inherent in data through pretext tasks, without needing explicit labels. The development of SSL models is an active area of research, from which a variety of algorithms like contrastive learning [17, 20, 38], non-contrastive learning [35, 98] and clustering-based methods [10, 11] have emerged in recent years, each with unique advantages and challenges. These models have quickly found adoption in the pathology field, which is well-situated to benefit from SSL due to the availability of large datasets that lack patch-level labels. Indeed, SSL feature extractors pretrained on pathology data have been shown to outperform ImageNet-pretrained models on downstream pathology tasks [8, 14, 44, 71, 74]. It is also not surprising that obtaining more diverse data (e.g. from different centres and countries) further improves generalisability [71]. In the last three years, a number of SSL models have been developed [2, 15, 16, 31, 44, 57, 86, 90–92] that were pretrained on large-scale multi-centre pathology datasets such as The Cancer Genome Atlas (TCGA) [93]. Wang et al. [90, 91] proposed CTransPath, a Swin Transformer [52] feature extractor trained using semantically-relevant contrastive learning (SRCL), a novel SSL technique based on MoCo v3 [20] that is specifically tailored to pathology. Previously, they had put forth RetCCL [92], a ResNet-50 model trained using an SSL technique they termed clustering-guided contrastive learning (CCL), based on MoCo [38]. Owkin [31] evaluated different ViT variants [49] using the iBOT framework [100], and later termed their best ViT-B variant "Phikon". Lunit [44] benchmarked various SSL techniques including Barlow Twins [98], SwAV [11], MoCo v2 [19], and DINO [12] for pathology by training them on TCGA. All of the aforementioned models are publicly available, and – with one exception⁴ – form the basis of our study. We refer the reader to Appendix B for a detailed overview of the evaluated feature extractors.

⁴ To save computational resources, we excluded Lunit's MoCo model because both CTransPath and RetCCL already employ MoCo.

This year, a number of pathology foundation models have emerged [2, 8, 16, 57, 86] that were trained on considerably larger datasets. Unfortunately, we could not include these in our study since their weights remain proprietary, yet we provide a more detailed account of them in Appendix B.1.

Stain normalisation. Different medical sites employ different microscopes, scanners, protocols, and dyes, resulting in variations in the appearance of WSIs. For over 20 years [58, 63, 66], stain normalisation has been commonplace in digital pathology workflows to account for these factors by adjusting colours to match a reference image. Classical techniques [58, 63, 82] achieve this by performing colour deconvolution, standardising stain intensity, and then transforming the colour space of the input images to that of a reference image. More recently, GAN-based approaches have been proposed to this end as well [60, 87, 97]. Boschman et al. [5] compared eight classical and GAN-based stain normalisation techniques, concluding that stain normalisation, especially the methods of Vahadane et al. [82] and Macenko et al. [58], can indeed bolster slide-level classification performance when validating on external datasets. However, their approach aggregated patch-level predictions through a simplistic majority vote and did not integrate SSL feature extractors. In contrast, we contend that with SSL feature extractors, stain normalisation becomes obsolete. To show this, we focus our analysis on Macenko normalisation [58], the technique most widely adopted in the literature [21, 30, 32, 70].

Image augmentations. As a common regularisation technique for neural network training in general [24], image augmentations have unsurprisingly found widespread adoption in histopathology as well [68]. In this field, the most popular augmentations include flipping, scaling, rotating, and colour alterations due to the nature of pathology slides [68], though a recent line of research introduces "stain augmentation" as a combination of stain normalisation and image augmentations to increase data diversity as well [59, 72, 79]. In this work, we study 26 image augmentations, focusing our analysis on those popular in pathology.
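For concreteness, the classical Macenko recipe referenced under "Stain normalisation" above (Beer-Lambert colour deconvolution, robust stain-vector estimation, recomposition against a reference) can be sketched in NumPy. This is a rough illustration with common default parameters (Io, alpha, beta) and a heuristic stain ordering; it is not the implementation used in this work.

```python
import numpy as np

def macenko_normalise(img: np.ndarray, HE_ref: np.ndarray, maxC_ref: np.ndarray,
                      Io: float = 255, alpha: float = 1, beta: float = 0.15):
    """Normalise one RGB patch (H, W, 3, uint8). HE_ref (3, 2) and
    maxC_ref (2,) are the stain matrix and max concentrations of a
    reference image, estimated with the same procedure."""
    h, w, _ = img.shape
    od = -np.log((img.reshape(-1, 3).astype(float) + 1) / Io)  # optical density
    od_hat = od[np.all(od > beta, axis=1)]        # discard background pixels
    eigvecs = np.linalg.eigh(np.cov(od_hat.T))[1][:, 1:3]  # top-2 OD plane
    plane = od_hat @ eigvecs
    phi = np.arctan2(plane[:, 1], plane[:, 0])
    lo, hi = np.percentile(phi, alpha), np.percentile(phi, 100 - alpha)
    v = eigvecs @ np.array([[np.cos(lo), np.cos(hi)],
                            [np.sin(lo), np.sin(hi)]])
    HE = v if v[0, 0] > v[0, 1] else v[:, ::-1]   # heuristic H/E ordering
    C = np.linalg.lstsq(HE, od.T, rcond=None)[0]  # per-pixel concentrations
    C *= (maxC_ref / np.percentile(C, 99, axis=1))[:, None]
    out = Io * np.exp(-HE_ref @ C)                # recompose against reference
    return np.clip(out.T.reshape(h, w, 3), 0, 255).astype(np.uint8)
```

The eigen-decomposition over all foreground pixels is what makes slidewise normalisation expensive relative to the patchwise variant discussed in Sec. 4.2.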
Robustness of feature extractors in pathology. Assessing the robustness and generalisation ability of deep learning models for pathology in the face of domain shift and out-of-distribution (OOD) data is an active area of research [33, 43, 69, 99] and an important undertaking, considering the stakes may be human life. Our work builds upon Lunit's aforementioned SSL benchmarking initiative [44], which involves training and evaluating four pathology-oriented SSL feature extractors; we have integrated three of these into our study⁴. Lunit's evaluation, however, is confined to patch classification and nuclei segmentation. While such tile-based tasks are scientifically interesting and the predominant means of evaluation in the literature [44, 77, 80], it has been suggested [8] that for evaluations to have greater clinical relevance, they should instead focus on slide-level tasks – predicting patient variables such as prognostic outcomes and biomarkers – and validate results on independent external cohorts. In response to this, we evaluate a total of six SSL feature extractors across nine slide-level classification targets (whose clinical utility we detail in Appendix A), and use external cohorts that were unseen during training (including both SSL pretraining and downstream training).

Similar to our work, Tellez et al. [80] explore the influence of stain normalisation and image augmentations on the generalisability of pathology models. However, their 2019 study predates SSL models trained on expansive pathology datasets akin to those employed in our evaluation; their analysis is limited to CNNs trained from scratch on narrow patch classification tasks. Springenberg et al. [77] empirically assess the robustness of CNNs and ViTs in pathology with and without self-supervised pretraining (CTransPath [91] and RetCCL [92]), but their evaluation, again, is confined to patch classification. Sikaroudi et al. [74] compare the OOD generalisability of pathology-pretrained models (focusing on supervised and self-supervised models trained on natural images as well as a non-SSL pathology-specific model [64], the latter achieving the best results), but also only consider patch classification.

3. Effect on latent space

An ideal feature extractor for pathology extracts meaningful features from a patch. More specifically, it should:
1. be invariant to factors we deem unimportant, e.g. stain, orientation, etc.; and
2. vary with properties we are interested in, e.g. tissue type, cell type, and many other factors not known a priori.
For example, a good feature extractor will produce a similar embedding for a particular patch and its stain-normalised version (as we want the feature extractor to be invariant to this factor), but yield very different embeddings for two patches of different tissue classes (i.e. normal vs. tumour).

[Figure 3: t-SNE latent space plots for Lunit-DINO (left) and ViT-S (right); legend: tissue classes TUM, STR, NORM, MUS, MUC, LYM, DEB, ADI.]
Figure 3. Latent space visualisations (t-SNE [83]) of features extracted with Lunit-DINO (left) vs. ImageNet baseline (right). Colours represent tissue classes [45]. Both feature extractors use the same architecture (ViT-S), but the left was trained on pathology images using SSL. Each dot represents a feature vector in latent space extracted from an unaltered image patch, and we draw a line from that dot to the corresponding stain-normalised version.

In this section, we study the effect of various augmentations on the latent space, beginning with stain normalisation. We employ the NCT-CRC-HE-100K dataset [45, 46], comprising 100,000 patches extracted from H&E images of colorectal cancer (CRC) without stain normalisation. This dataset includes patch-level labels of the tissue type, which enables more fine-grained analysis and visualisation.

3.1. Stain normalisation

How similar are feature vectors extracted from image patches to those derived from their stain-normalised counterparts? We contend that simply looking at the average distance between original embeddings and their stain-normalised versions does not provide enough information to make claims about the quality of a feature extractor. To obtain a more nuanced view of how stain normalisation affects embeddings, we present a dimensionality-reduced latent space visualisation of Lunit's DINO feature extractor in Fig. 3. This feature extractor is highlighted due to its superior downstream performance (see Fig. 1d and analysis in Sec. 4.1). In our visualisation, each point corresponds to a feature vector, with a line connecting each original feature vector to its stain-normalised version. Notably, Lunit-DINO clusters tissue types in latent space, and the displacement of the feature vectors induced by stain normalisation is largely confined to these clusters. In contrast, a baseline extractor using the same ViT-S architecture [52] but trained via supervised learning on ImageNet demonstrates less effective clustering and exhibits a different pattern: some features move hardly at all while others make large jumps between clusters, as indicated by the longer inter-cluster lines in Fig. 3, right. In fact, this pattern is consistent across various feature extractors: those pretrained on pathology data are less prone to "jump" between tissue type clusters compared to their ImageNet-pretrained counterparts when undergoing stain normalisation, further detailed in Appendix D.
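The distance measurements underlying this analysis (and the boxplots that follow) amount to a few lines of NumPy. The sketch below assumes embeddings and tissue labels are already available as arrays; the variable names are illustrative.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine distance between two (n, d) embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)

def displacement_stats(z_orig, z_norm, labels, seed=0):
    """z_orig/z_norm: (n, d) embeddings of the same patches before/after
    stain normalisation; labels: (n,) tissue types (e.g. NCT-CRC-HE-100K)."""
    rng = np.random.default_rng(seed)
    d_stain = cosine_distance(z_orig, z_norm)        # original ↔ stain norm
    i, j = rng.integers(0, len(z_orig), (2, len(z_orig)))
    d_pairs = cosine_distance(z_orig[i], z_orig[j])  # random patch pairs
    return (d_stain,
            d_pairs[labels[i] == labels[j]],         # intra-class distances
            d_pairs[labels[i] != labels[j]])         # inter-class distances
```

Comparing the three returned distributions, rather than a single mean displacement, is what distinguishes a usefully invariant extractor from one that collapses everything together.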
[Figure 4: boxplots of cosine distance (0.00 to 1.00) per feature extractor for three distributions: original ↔ stain norm, intra-class, and inter-class.]
Figure 4. Boxplot of cosine distances between patch embeddings and their stain-normalised versions, as well as between embeddings of randomly chosen patches of the same or of differing tissue types. Feature extractors are grouped by architecture (ImageNet baselines are bold). Whiskers represent 95% of the distances.

In Fig. 4, we compare the cosine distances of the embedding displacement caused by stain normalisation across all ten feature extractors. Despite the important difference in terms of intra-cluster vs. inter-cluster jumps identified in the latent space visualisation above, Lunit-DINO and ViT-S exhibit similar averages (cf. their medians in Fig. 4, blue). This observation highlights the importance of examining the distribution of distances, not merely their averages: the boxplot in Fig. 4 reflects this difference by the increased range of the whiskers of ViT-S compared to Lunit-DINO.

We note that an analysis that considers embedding displacement only from the perspective of stain normalisation is insufficient to make meaningful claims about feature extractor utility. For example, an extractor that maps all images to a single point in latent space would negate any embedding displacement induced by stain normalisation and prevent inter-cluster jumps, yet its features would be wholly useless to the downstream model. This observation leads us to also consider the second key criterion outlined at the beginning of this section: the ability of feature extractors to vary embeddings according to characteristics critical for downstream tasks. We select tissue type as a surrogate marker to investigate this second criterion. However, it is important to recognise as a limitation of this analysis that there are numerous other potentially significant characteristics that remain unidentified at this stage, and for which specific labels are unavailable. Nonetheless, we posit that feature vectors from similar tissue types (indicated in blue in Fig. 4) should be closer in latent space compared to those from different tissue types (shown in green). Upon examining the disparity between these distance measures, we find that the ImageNet baselines tend to lump all features more closely together, regardless of tissue type. In contrast, the SSL models show better differentiation, as indicated by a greater separation between the blue and green boxes in the boxplot. Furthermore, the extent to which patches of different tissue types are distanced in the latent space (green) also provides a useful scale for contextualising the original vs. stain-normalised distances (blue). These findings suggest that the choice of pretraining data influences the stability of feature vectors in the context of stain normalisation. More specifically, feature extractors that have seen diverse stains as part of their SSL pretraining can learn to become more robust to variations in stain, while still preserving variations in aspects relevant to downstream tasks, i.e. tissue type.

[Figure 5: boxplots of embedding displacement (cosine distance, 0.0 to 0.7) for Lunit-DINO and ViT-S across the 27 studied transformations: low contrast, low saturation, low brightness, flip vertical, high saturation, flip horizontal, rotate 180°, high contrast, gamma 0.5, colour jitter, gamma 2.0, rotate 270°, rotate 90°, gaussian blur, AugMix, jigsaw, zoom 1.5×, sharpen, high brightness, zoom 1.75×, Macenko, Cutout, median blur, zoom 2×, affine, random rotation, warp perspective.]
Figure 5. Boxplot of embedding displacement induced by image augmentations for Lunit-DINO and ViT-S. Dashed lines represent the average distance between randomly selected patches (without augmentation), indicating how 'dispersed' the latent spaces are.

3.2. Image augmentations

In principle, the methodology presented above is suitable to study how any transformation of the input patches, not just stain normalisation, manifests itself in latent space. Here, we consider 26 common image augmentations, for which we provide representative examples in Appendix D. For Lunit-DINO and the ViT-S baseline, we compare the magnitudes of the embedding displacement across augmentations in Fig. 5. We observe that Lunit-DINO's embeddings are more robust to image augmentations compared to the ImageNet baseline: for all augmentations except 'Cutout' [27] and 'warp perspective', the cosine distances tend to be smaller in Lunit-DINO. That is even though Lunit-DINO's embeddings are spread out more in latent space, i.e. the average distance between any two randomly selected non-augmented patches is greater, as indicated by the dashed lines in Fig. 5. When normalising the distances by this average, Cutout remains the only (minor) exception.

We observe that Lunit-DINO excels in terms of robustness to right-angle rotations and flips – a much desired property considering that WSIs, unlike natural images, lack a canonical orientation. In fact, in selecting augmentations for generating positive pairs for SSL pretraining of the Lunit feature extractors, Kang et al. [44] employed the aforementioned augmentations for this precise reason, incentivising rotated/flipped embeddings to be close in latent space. On the other hand, the ImageNet baseline is significantly less robust to such augmentations. Interestingly, it is more robust to horizontal flips than vertical flips, which may be explained by the fact that it was trained on natural images.
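One way to quantify such rotational (in)variance, anticipating the black-border ablation described next, is to centre-crop both the original and the rotated patch so that no black corner pixels can appear at any angle. The sketch below uses torchvision; the crop factor of 1/√2 is an assumption (the paper's own ablation uses a 1.5× zoom instead).

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def rotation_displacement(f, patch: torch.Tensor, angle: float) -> float:
    """patch: (3, P, P) tensor; f: frozen feature extractor.
    Returns the cosine distance between embeddings of the centre crop
    and its rotated counterpart, with no black corner pixels."""
    P = patch.shape[-1]
    crop = int(P / 2 ** 0.5)   # largest square valid under any rotation
    with torch.no_grad():
        z0 = f(TF.center_crop(patch, crop).unsqueeze(0))
        rotated = TF.rotate(patch, angle, expand=False)
        z1 = f(TF.center_crop(rotated, crop).unsqueeze(0))
    return 1.0 - F.cosine_similarity(z0, z1).item()
```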
Although Lunit-DINO is remarkably robust to rotating by angles that are multiples of 90°, non-right angles cause the greatest displacement in latent space aside from perspective warp (penultimate row in Fig. 5). To investigate this further, we visualise the latent space in Fig. 6. As expected, for 90° rotation (top row), Lunit-DINO's latent space remains largely unchanged, as opposed to ViT-S. However, for random angle rotations (middle row), we observe a high degree of chaotic jumps in both feature extractors, indicating neither is robust to this augmentation. We hypothesise that this is caused by the loss of pixels at the edges of the patches in off-angle rotations, and design an ablation study to investigate this phenomenon. To eliminate the black pixel problem, we perform a centre crop on the original and augmented patches in a manner that ensures there are no black pixels in any rotation. The corresponding latent space visualisations at the bottom of Fig. 6 confirm our assumption: Lunit-DINO's latent space remains unchanged whereas ViT-S's embeddings move significantly. Similar reasoning may explain the poor robustness regarding 'random affine', 'warp perspective', and 'Cutout' [27].

[Figure 6: latent space plots for Lunit-DINO (left) and ViT-S (right); rows: rotate 90°, random rotation, ablation.]
Figure 6. Visualisations of latent space transformations caused by rotation-based augmentations (rows) in Lunit-DINO (left) and ViT-S (right). Colours and lines are as explained in Fig. 3. Top row: 90° rotation. Middle: rotating by a random angle. Bottom (ablation): each line represents the transformation from (a) the embedding of the 1.5× zoomed version of a patch to (b) the embedding obtained by randomly rotating before the 1.5× zoom.

4. Impact on downstream performance

Motivated by the findings above – that some augmentations have larger effects than others on the latent representations – we investigate in the remainder of this paper how stain normalisation and augmentations affect downstream performance. To do so, we train weakly supervised models on nine downstream tasks using publicly available datasets.

Models. We compare three parametrisations of the downstream aggregation model gθ(·) in Eq. (2): (1) AttMIL [42], the most common approach in the literature, (2) a two-layer decoder-only transformer [85], which is gaining popularity in recent works [88, 94], and (3) a simple baseline performing mean average pooling across features, $g_\theta(\{x_i\}_{i=1}^n) = \frac{1}{n}\sum_{i=1}^{n} x_i$, where each x_i is a feature vector. In our experiments, we parametrise hθ(·) as a linear layer with softmax activation over the number of classes for the particular task.

Tasks and datasets. In selecting downstream tasks, we prioritise those with clinical utility and whose underlying variables are also available in adequately sized public datasets. Training on TCGA-BRCA [93] and testing on CPTAC-BRCA [50], we predict four breast cancer targets: subtype as well as the CDH1, TP53, and PIK3CA genetic mutations. Furthermore, we predict lymph node status in the CAMELYON17 breast cancer dataset [4] (which contains data from five centres – we used one of the centres for testing and the others for training). Finally, we predict four markers in colorectal cancer: MSI status as well as BRAF, KRAS, and SMAD4 genetic mutations (training on TCGA-CRC [93] and testing on CPTAC-COAD [84]). We elaborate on these variables, their clinical relevancy, and the underlying datasets in Appendix A.

The aforementioned choice of tasks and datasets uses external cohorts for testing, so that we can assess generalisability to unseen datasets. We were also diligent in ensuring no data leakage occurred between the SSL pretraining and downstream test datasets. Notably, given that all evaluated pathology feature extractors included TCGA in their pretraining, we deliberately chose other datasets for testing.

Training details. We train each model using the AdamW optimiser [55] with an initial learning rate of 0.001, which is decayed using cosine annealing [54] for up to 30 epochs, though training typically ends sooner due to our use of early stopping (when the validation loss fails to improve for ten consecutive epochs). For this, we allocate 80% of the training set for model training and 20% for validation. We conduct training with five distinct random seeds for the cartesian product of the ten feature extractors, nine tasks, three downstream models, and six preprocessing/augmentation setups (slidewise or patchwise stain normalisation, rotate/flip, all augmentations, or none), resulting in over 6,000 trained models. The training and validation splits are kept fixed per-task across the seeds for all tasks except for lymph node status classification. This latter task uses the CAMELYON17 dataset, allowing us to perform leave-one-hospital-out cross-validation with a different random seed for each of the five hospitals. For the experiments involving augmentations, we apply these augmentations only on the images of the training datasets, never the test datasets (except for the stain normalisation experiments, where we ensure the same normalisation is applied to training and test datasets). We perform feature extraction once before training, caching for every patch in every dataset its original feature vector as well as the feature vectors of all 27 augmented versions of that patch, including stain normalisation. More details are provided in Appendix F.2.
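A sketch of this training recipe follows; `model` stands in for the aggregator-classifier (hθ ◦ gθ), `train_bags` for an iterator over pre-extracted feature bags, and `validate` for the held-out validation loss. These names are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

best_val, bad_epochs, patience = float("inf"), 0, 10
for epoch in range(30):
    for feats, label in train_bags:            # feats: (n, d), label: ()
        optimizer.zero_grad()
        logits = model(feats)                  # slide-level logits
        loss = F.cross_entropy(logits.unsqueeze(0), label.view(1))
        loss.backward()
        optimizer.step()
    scheduler.step()                           # cosine annealing [54]
    val_loss = validate(model)                 # 20% held-out split
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # early stopping
            break
```

Because features are cached ahead of time, each of the 6,000+ runs only optimises the shallow aggregator, which is what makes a sweep of this size tractable.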
Feature extractor | subtype | CDH1 | TP53 | PIK3CA | LN status | MSI | KRAS | BRAF | SMAD4 | Average
Swin [52] | 0.07 ± 0.02 | 0.17 ± 0.03 | 0.28 ± 0.02 | 0.07 ± 0.04 | 0.17 ± 0.08 | 0.18 ± 0.04 | 0.14 ± 0.04 | 0.14 ± 0.07 | 0.16 ± 0.05 | 0.15 ± 0.05
CTransPath [91] | 0.00 ± 0.00 | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.04 ± 0.03 | 0.06 ± 0.07 | 0.08 ± 0.03 | 0.06 ± 0.03 | 0.06 ± 0.03 | 0.06 ± 0.03 | 0.04 ± 0.03
ViT-B [49] | 0.08 ± 0.04 | 0.11 ± 0.02 | 0.15 ± 0.03 | 0.07 ± 0.03 | 0.17 ± 0.06 | 0.15 ± 0.03 | 0.03 ± 0.04 | 0.18 ± 0.07 | 0.01 ± 0.01 | 0.11 ± 0.04
Phikon [31] | 0.09 ± 0.02 | 0.09 ± 0.02 | 0.09 ± 0.03 | 0.09 ± 0.03 | 0.07 ± 0.06 | 0.06 ± 0.04 | 0.07 ± 0.04 | 0.07 ± 0.06 | 0.17 ± 0.08 | 0.09 ± 0.05
ViT-S [49] | 0.13 ± 0.03 | 0.08 ± 0.03 | 0.14 ± 0.05 | 0.08 ± 0.05 | 0.19 ± 0.09 | 0.18 ± 0.04 | 0.06 ± 0.03 | 0.19 ± 0.04 | 0.08 ± 0.08 | 0.13 ± 0.05
Lunit-DINO [44] | 0.08 ± 0.03 | 0.03 ± 0.03 | 0.03 ± 0.02 | 0.02 ± 0.03 | 0.07 ± 0.04 | 0.00 ± 0.00 | 0.06 ± 0.04 | 0.02 ± 0.02 | 0.02 ± 0.02 | 0.04 ± 0.03
ResNet-50 [37] | 0.15 ± 0.03 | 0.09 ± 0.04 | 0.11 ± 0.03 | 0.01 ± 0.02 | 0.18 ± 0.08 | 0.22 ± 0.04 | 0.11 ± 0.03 | 0.23 ± 0.07 | 0.21 ± 0.09 | 0.15 ± 0.05
RetCCL [92] | 0.07 ± 0.03 | 0.04 ± 0.02 | 0.04 ± 0.03 | 0.05 ± 0.03 | 0.07 ± 0.06 | 0.08 ± 0.03 | 0.03 ± 0.02 | 0.14 ± 0.03 | 0.06 ± 0.03 | 0.06 ± 0.03
Lunit-BT [44] | 0.13 ± 0.03 | 0.06 ± 0.04 | 0.02 ± 0.01 | 0.13 ± 0.04 | 0.34 ± 0.15 | 0.28 ± 0.13 | 0.03 ± 0.04 | 0.35 ± 0.13 | 0.25 ± 0.03 | 0.18 ± 0.08
Lunit-SwAV [44] | 0.06 ± 0.02 | 0.06 ± 0.03 | 0.06 ± 0.02 | 0.13 ± 0.06 | 0.07 ± 0.05 | 0.10 ± 0.03 | 0.13 ± 0.06 | 0.07 ± 0.07 | 0.14 ± 0.08 | 0.09 ± 0.05

Table 1. Comparative evaluation of feature extractors. This table presents the normalised differential AUROC scores (lower is better) for all feature extractors across the evaluated targets (breast cancer: subtype, CDH1, TP53, PIK3CA, LN status; colorectal cancer: MSI, KRAS, BRAF, SMAD4) using the AttMIL [42] aggregation model. The scores reflect the expected decrease in test AUROC when selecting a given feature extractor relative to the best-performing one for each task-model combination (see Sec. 4.1).
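The AttMIL aggregation model used throughout Table 1 can be sketched as attention-based MIL pooling in the style of Ilse et al. [42]; the hidden size of 128 below is an illustrative assumption, not necessarily the configuration used here.

```python
import torch
import torch.nn as nn

class AttMIL(nn.Module):
    """Sketch of attention-based MIL pooling (Ilse et al. [42])."""
    def __init__(self, d: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(d, num_classes)        # h_θ

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n, d) bag of pre-extracted patch features for one slide
        a = torch.softmax(self.attention(z), dim=0)  # (n, 1) weights
        z_bar = (a * z).sum(dim=0)                   # g_θ: weighted mean
        return self.head(z_bar)                      # slide-level logits
```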

4.1. Lunit-DINO and CTransPath extract the most useful features

Having trained a large number of downstream models based on ten feature extractors across a diverse set of tasks, we are in a position to identify the most effective feature extractor overall. We present these findings first and focus our later discussion on these feature extractors in particular.

First, let us consider how to determine the best feature extractor for a given task and downstream aggregator (such as predicting CDH1 in breast cancer with AttMIL aggregation). For any such task-model pair, we trained 50 models – spanning the ten feature extractors across five random seeds. We define a 'trial' as one particular configuration where each feature extractor is paired with a random seed, leading to 5¹⁰ (≈ 10 million) unique trials. Within each trial, we evaluate the feature extractors based on the difference between their test area under the receiver operating characteristic curve (AUROC) and the highest test AUROC observed, thus assigning a score of zero to the top performer. By calculating the mean of these scores across all 5¹⁰ trials, we derive the 'normalised differential AUROC score' – a measure that captures the relative efficacy of the feature extractors and allows fair comparisons across tasks of varying difficulty. The outcomes of this analysis, when considering the downstream AttMIL model and no augmentations, are detailed for all tasks individually in Tab. 1 and averaged across tasks in Fig. 1d. Notably, Lunit-DINO and CTransPath are tied in achieving the best task-averaged performance. Indeed, they consistently perform best, regardless of the downstream aggregation model and the type of input augmentation, as we show in the extended data table in Appendix E. Moreover, we find the ImageNet baselines perform worse than the pathology models (with the exception of Lunit-BT which performs very poorly indeed), which is in line with many previous works [8, 14, 22, 25, 44, 74].

4.2. Stain normalisation does not impact downstream performance

We quantify the effect of stain normalisation on downstream model performance by determining the expected difference in test AUROC between models trained with stain normalisation vs. without. Given a feature extractor and downstream aggregation model, e.g. Lunit-DINO with AttMIL, we must compare 45 models trained with stain normalisation (nine tasks times five random seeds) with another 45 models trained without stain normalisation. To estimate the difference in AUROC, we perform bootstrapping. For each of the 45 task-seed pairs, we generate 25 random resamples of the respective test datasets with replacement, totalling 45 × 25 = 1,125 bootstraps. Since each bootstrap is associated with a particular task-seed combination, it corresponds to two trained models: one trained with stain normalisation and one without. We deploy both models on the given bootstrap, computing the difference in AUROC. Repeating for all bootstraps, we obtain a distribution of 1,125 AUROC differences which we present as a boxplot in Fig. 1a, with a separate box for every feature extractor (we focus on the AttMIL [42] aggregation model because it is the most widely used, but provide analogous plots for the other two in Appendix E). We find no clear AUROC difference between the two groups for any feature extractor: all 95% confidence intervals (and interquartile ranges) include zero. Surprisingly, this holds even for ImageNet extractors, whose latent spaces we previously identified as more susceptible to larger displacements due to stain normalisation.

Slidewise versus patchwise normalisation. While in Fig. 1a we perform stain normalisation on a per-slide basis, a more computationally efficient⁵ alternative is normalising each patch individually. However, in this approach, adjacent patches might experience different colour transformations, potentially affecting consistency across the slide. We perform an ablation study, detailed in Appendix C, where we employ the bootstrapping procedure from above with patchwise instead of slidewise normalisation, but find no consistent performance differences between the two methods. Therefore, we recommend the patchwise approach for practitioners still seeking to employ stain normalisation in their preprocessing pipelines, due to its computational benefit.

⁵ Macenko normalisation [58] requires an eigenvalue decomposition across all pixels in the image, scaling cubically in the number of pixels. For a slide with n patches of k pixels each, the complexity is in O(n³k³) for slidewise normalisation, but O(nk³) for patchwise normalisation. Moreover, the latter is embarrassingly parallel across the n dimension.
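The bootstrapping procedure for one task-seed pair can be sketched as follows; array names are illustrative, and pooling the outputs over all 45 pairs yields the 1,125 differences plotted in Fig. 1a.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_diffs(y_true, p_with, p_without, n_boot=25, seed=0):
    """Resample the test set with replacement n_boot times and record
    the AUROC difference between the model trained with stain
    normalisation (p_with) and the one trained without (p_without)."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # skip degenerate resamples with a single class
        diffs.append(roc_auc_score(y_true[idx], p_with[idx])
                     - roc_auc_score(y_true[idx], p_without[idx]))
    return np.array(diffs)
```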
4.3. Augmentations do not impact performance

Having emphasised rotation and reflection as augmentations of particular relevance to pathology in our investigation of the latent space, we now study their downstream impact on performance. To this end, we trained a batch of models where at each epoch, each patch is randomly flipped (horizontally or vertically) or rotated by a right angle before feature extraction. Analogous to our analysis of stain normalisation, we perform a bootstrapped quantification of the difference in performance incurred by employing the augmented versus non-augmented features in Fig. 1b. Again, we observe no consistent benefit in employing this type of augmentation. Interestingly, we find the variance in the differences to be even smaller in Lunit-DINO and CTransPath compared to stain normalisation in Fig. 1a. Furthermore, expanding the set of augmentations to all 27 studied transformations (with each being equally likely to be selected for every patch at every epoch) yields similar results (Fig. 1c). While Fig. 1 employs AttMIL [42], we come to the same conclusion for the other downstream aggregation architectures, for which we present extended results in Appendix E.

4.4. Downstream aggregation models

In Sec. 4.1, we identified Lunit-DINO and CTransPath as the best feature extractors in terms of achieving the lowest normalised differential AUROC scores averaged across all tasks, and found this to be the case for all three downstream models. Yet, it remains to be seen which downstream model achieves the best results. To answer this question, we employ the technique from Sec. 4.1, but instead of keeping fixed the downstream model and determining the best feature extractor, we choose a feature extractor and vary the downstream aggregation model. As shown in Fig. 7, AttMIL performs best, closely followed by the two-layer transformer, and finally mean average pooling, but we note the differences are small and exhibit high variance.

[Figure 7: bars of AUROC deterioration vs. best (0.00 to 0.12) for AttMIL, Transformer, and Mean pool, shown for Lunit-DINO and CTransPath.]
Figure 7. Performance impact of choosing a particular downstream aggregation model (lower is better).

5. Discussion

We dedicate this section to answering key questions that may arise among computational pathology researchers about employing SSL feature extractors for slide-level prediction.

What is the best feature extractor for pathology? We recommend Lunit-DINO and CTransPath for feature extraction, since they consistently achieve the best task-averaged downstream performance (Fig. 1d), independent of the employed augmentations and downstream aggregation model. In general, we find pathology-specific extractors to outperform their ImageNet baselines, adding to the body of evidence suggesting that SSL models pretrained on pathology data produce more useful features [8, 14, 44, 71, 74].

Which aggregation model should we use? For Lunit-DINO and CTransPath, we notice slight benefits in employing the AttMIL aggregation model downstream, though the differences are small and exhibit high variance (Fig. 7). The primary factor remains the choice of feature extractor.

Should we perform stain normalisation and augmentations? Our data does not support the necessity of either – they do not markedly improve outcomes, yet introduce significant preprocessing overhead. For those still interested in stain normalisation, we recommend the patchwise approach due to its lower computational cost. Image augmentations, on the other hand, should be avoided because, in addition to the preprocessing cost, they considerably increase training time (Appendix F.2), and the top extractors resist pathology-relevant augmentations in their latent space (Sec. 3.2).

6. Conclusion and future work

In this work, we perform the most comprehensive robustness evaluation of publicly available pathology feature extractors for slide-level prediction to date, spanning ten feature extractors, three aggregation models, stain normalisation, and numerous image augmentations, on nine downstream weakly supervised learning tasks with external validation cohorts. Among these factors, we identify the choice of feature extractor as most consequential for downstream performance, and observe no benefit in employing stain normalisation or image augmentations.

Our latent space analysis reveals a remarkable robustness to stain variations and image augmentations in the top-performing feature extractors, Lunit-DINO [44] and CTransPath [91], which employ domain-specific knowledge in their SSL training regimes. This underlines a key direction for future research into the development of pathology feature extractors and foundation models [2, 8, 16, 57, 86]: the importance of not only scaling the size and diversity of pretraining datasets, but also tailoring SSL methods to the pathology domain, in order to effectively leverage this data.

Looking ahead, we aim to investigate whether the ineffectiveness of augmentations persists in limited-data scenarios, and how slide magnification impacts extracted features.
Acknowledgements: GW is supported by Lothian NHS. This project received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 101017453 as part of the KATY project. This work is supported in part by the Industrial Centre for AI Research in Digital Diagnostics (iCAIRD) which is funded by Innovate UK on behalf of UK Research and Innovation (UKRI) (project number 104690).

References

[1] Fabrice André, Eva Ciruelos, Gabor Rubovszky, Mario Campone, Sibylle Loibl, Hope S Rugo, Hiroji Iwata, Pierfranco Conte, Ingrid A Mayer, Bella Kaufman, Toshinari Yamashita, Yen-Shen Lu, Kenichi Inoue, Masato Takahashi, Zsuzsanna Pápai, Anne-Sophie Longin, David Mills, Celine Wilke, Samit Hirawat, and Dejan Juric. Alpelisib for PIK3CA-mutated, hormone receptor-positive advanced breast cancer. N. Engl. J. Med., 380(20):1929–1940, 2019.
[2] Shekoofeh Azizi, Laura Culp, Jan Freyberg, Basil Mustafa, Sebastien Baur, Simon Kornblith, Ting Chen, Nenad Tomasev, Jovana Mitrović, Patricia Strachan, S Sara Mahdavi, Ellery Wulczyn, Boris Babenko, Megan Walker, Aaron Loh, Po-Hsuan Cameron Chen, Yuan Liu, Pinal Bavishi, Scott Mayer McKinney, Jim Winkens, Abhijit Guha Roy, Zach Beaver, Fiona Ryan, Justin Krogue, Mozziyar Etemadi, Umesh Telang, Yun Liu, Lily Peng, Greg S Corrado, Dale R Webster, David Fleet, Geoffrey Hinton, Neil Houlsby, Alan Karthikesalingam, Mohammad Norouzi, and Vivek Natarajan. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng., 7(6):756–779, 2023.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[4] Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging, 38(2):550–560, 2018.
[5] Jeffrey Boschman, Hossein Farahani, Amirali Darbandsari, Pouya Ahmadvand, Ashley Van Spankeren, David Farnell, Adrian B Levine, Julia R Naso, Andrew Churg, Steven JM Jones, Stephen Yip, Martin Köbel, David G Huntsman, C Blake Gilks, and Ali Bashashati. The utility of color normalization for AI-based diagnosis of hematoxylin and eosin-stained pathology images. J. Pathol., 256(1):15–24, 2022.
[6] Bruno Buecher, Wulfran Cacheux, Etienne Rouleau, Barbara Dieumegard, Emmanuel Mitry, and Astrid Lièvre. Role of microsatellite instability in the management of colorectal cancers. Dig. Liver Dis., 45(6):441–449, 2013.
[7] Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med., 25(8):1301–1309, 2019.
[8] Gabriele Campanella, Ricky Kwan, Eugene Fluder, Jennifer Zeng, Aryeh Stock, Brandon Veremis, Alexandros D Polydorides, Cyrus Hedvat, Adam Schoenfeld, Chad Vanderbilt, Patricia Kovatch, Carlos Cordon-Cardo, and Thomas J Fuchs. Computational pathology at health system scale – self-supervised foundation models from three billion images. 2023.
[9] F Cardoso, S Kyriakides, S Ohno, F Penault-Llorca, P Poortmans, I T Rubio, S Zackrisson, E Senkus, and ESMO Guidelines Committee. Early breast cancer: ESMO clinical practice guidelines for diagnosis, treatment and follow-up. Ann. Oncol., 30(8):1194–1220, 2019.
[10] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132–149, 2018.
[11] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst., 33:9912–9924, 2020.
[12] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021.
[13] M Chalabi, Y L Verschoor, J Van den Berg, and others. LBA7 Neoadjuvant immune checkpoint inhibition in locally advanced MMR-deficient colon cancer: the NICHE-2 study. Annals of Oncology, 2022.
[14] Richard J Chen and Rahul G Krishnan. Self-supervised vision transformers learn visual concepts in histopathology. Learning Meaningful Representations of Life, NeurIPS 2021, 2021.
[15] Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16144–16155. IEEE, 2022.
[16] Richard J Chen, Tong Ding, Ming Y Lu, Drew F K Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H Song, Muhammad Shaban, Mane Williams, Anurag Vaidya, Sharifa Sahai, Lukas Oldenburg, Luca L Weishaupt, Judy J Wang, Walt Williams,
Long Phi Le, Georg Gerber, and Faisal Mahmood. A general-purpose self-supervised model for computational pathology. 2023.
[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[18] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems, pages 22243–22255. Curran Associates, Inc., 2020.
[19] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. 2020.
[20] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021.
[21] Philip Chikontwe, Hyun Jung Sung, Jaehoon Jeong, Meejeong Kim, Heounjeong Go, Soo Jeong Nam, and Sang Hyun Park. Weakly supervised segmentation on neural compressed histopathology with self-equivariant regularization. Med. Image Anal., 80:102482, 2022.
[22] Ozan Ciga, Tony Xu, and Anne Louise Martel. Self supervised contrastive learning for digital histopathology. Machine Learning with Applications, 7:100198, 2022.
[23] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L Moreira, Narges Razavian, and Aristotelis Tsirigos. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med., 24(10):1559–1567, 2018.
[24] Ekin Dogus Cubuk, Ethan S Dyer, Rapha Gontijo Lopes, and Sylvia Smullin. Tradeoffs in data augmentation: An empirical study. In ICLR, 2021.
[25] Olivier Dehaene, Axel Camara, Olivier Moindrot, Axel de Lavergne, and Pierre Courtiol. Self-supervision closes the gap between weak and strong supervision in histology. 2020.
[26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[27] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. 2017.
[28] R Dienstmann, M J Mason, F A Sinicrope, A I Phipps, S Tejpar, A Nesbakken, S A Danielsen, A Sveen, D D Buchanan, M Clendenning, C Rosty, B Bot, S R Alberts, J Milburn Jessup, R A Lothe, M Delorenzi, P A Newcomb, D Sargent, and J Guinney. Prediction of overall survival in stage II and III colon cancer beyond TNM system: a retrospective, pooled biomarker study. Ann. Oncol., 28(5):1023–1031, 2017.
[29] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell., 89(1):31–71, 1997.
[30] Omar S M El Nahhas, Chiara M L Loeffler, Zunamys I Carrero, Marko van Treeck, Fiona R Kolbinger, Katherine J Hewitt, Hannah S Muti, Mara Graziani, Qinghe Zeng, Julien Calderaro, Nadina Ortiz-Brüchle, Tanwei Yuan, Michael Hoffmeister, Hermann Brenner, Alexander Brobeil, Jorge S Reis-Filho, and Jakob Nikolas Kather. Regression-based deep-learning predicts molecular biomarkers from pathology slides. arXiv preprint, 2023.
[31] Alexandre Filiot, Ridouane Ghermi, Antoine Olivier, Paul Jacob, Lucas Fidon, Alice Mac Kain, Charlie Saillard, and Jean-Baptiste Schiratti. Scaling self-supervised learning for histopathology with masked image modeling. 2023.
[32] Narmin Ghaffari Laleh, Hannah Sophie Muti, Chiara Maria Lavinia Loeffler, Amelie Echle, Oliver Lester Saldanha, Faisal Mahmood, Ming Y Lu, Christian Trautwein, Rupert Langer, Bastian Dislich, Roman D Buelow, Heike Irmgard Grabsch, Hermann Brenner, Jenny Chang-Claude, Elizabeth Alwers, Titus J Brinker, Firas Khader, Daniel Truhn, Nadine T Gaisa, Peter Boor, Michael Hoffmeister, Volkmar Schulz, and Jakob Nikolas Kather. Benchmarking weakly-supervised deep learning pipelines for whole slide classification in computational pathology. Med. Image Anal., 79:102474, 2022.
[33] Narmin Ghaffari Laleh, Daniel Truhn, Gregory Patrick Veldhuizen, Tianyu Han, Marko van Treeck, Roman D Buelow, Rupert Langer, Bastian Dislich, Peter Boor, Volkmar Schulz, and Jakob Nikolas Kather. Adversarial attacks and adversarial robustness in computational pathology. Nat. Commun., 13(1):5711, 2022.
[34] A Goldhirsch, E P Winer, A S Coates, R D Gelber, M Piccart-Gebhart, B Thürlimann, H-J Senn, and Panel members. Personalizing the treatment of women with early breast cancer: highlights of the St Gallen international expert consensus on the primary therapy of early breast cancer 2013. Ann. Oncol., 24(9):2206–2223, 2013.
[35] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent – a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, pages 21271–21284. Curran Associates, Inc., 2020.
[36] Yonghang Guan, Jun Zhang, Kuan Tian, Sen Yang, Pei Dong, Jinxi Xiang, Wei Yang, Junzhou Huang, Yuyao Zhang, and Xiao Han. Node-aligned graph convolutional network for whole-slide image representation and classification. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18813–18823. IEEE, 2022.
[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE Computer Society, 2016.
[38] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[39] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988. IEEE, 2022.
[40] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[41] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[42] Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In Proceedings of the 35th International Conference on Machine Learning, pages 2127–2136. PMLR, 2018.
[43] Mostafa Jahanifar, Manahil Raza, Kesi Xu, Trinh Vuong, Rob Jewsbury, Adam Shephard, Neda Zamanitajeddin, Jin Tae Kwak, Shan E Ahmed Raza, Fayyaz Minhas, and Nasir Rajpoot. Domain generalization in computational pathology: Survey and guidelines. 2023.
[44] Mingu Kang, Heon Song, Seonwook Park, Donggeun Yoo, and Sérgio Pereira. Benchmarking self-supervised learning on diverse pathology datasets. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3344–3354. IEEE, 2023.
[45] Jakob Nikolas Kather, Niels Halama, and Alexander Marx. 100,000 histological images of human colorectal cancer and healthy tissue, 2018.
[46] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, Lina Jansen, Constantino Carlos Reyes-Aldasoro, Inka Zörnig, Dirk Jäger, Hermann Brenner, Jenny Chang-Claude, Michael Hoffmeister, and Niels Halama. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Med., 16(1):e1002730, 2019.
[48] […] Thomas Rösch, Rene Werner, Jie Tian, Elodie Puybareau, Matteo Bovio, Xiufeng Zhang, Yifeng Zhu, Se Young Chun, Won-Ki Jeong, Peom Park, and Jinwook Choi. PAIP 2019: Liver cancer segmentation challenge. Med. Image Anal., 67:101854, 2021.
[49] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[50] Karsten Krug, Eric J Jaehnig, Shankha Satpathy, Lili Blumenberg, Alla Karpova, Meenakshi Anurag, George Miles, Philipp Mertins, Yifat Geffen, Lauren C Tang, et al. Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy. Cell, 183(5):1436–1456, 2020.
[51] Yang Liu, Nilay S Sethi, Toshinori Hinoue, Barbara G Schneider, Andrew D Cherniack, Francisco Sanchez-Vega, Jose A Seoane, Farshad Farshidfar, Reanne Bowlby, Mirazul Islam, et al. Comparative molecular analysis of gastrointestinal adenocarcinomas. Cancer Cell, 33(4):721–735, 2018.
[52] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021.
[53] Chiara Maria Lavinia Loeffler, Omar S M El Nahhas, Hannah Sophie Muti, Tobias Seibel, Didem Cifci, Marko van Treeck, Marco Gustav, Zunamys I Carrero, Nadine T Gaisa, Kjong-Van Lehmann, Alexandra Leary, Pier Selenica, Jorge S Reis-Filho, Nadina Ortiz Bruechle, and Jakob Nikolas Kather. Direct prediction of homologous recombination deficiency from routine histology in ten different tumor types with attention-based multiple instance learning: a development and validation study. medRxiv, 2023.
[54] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
lama. Predicting survival from colorectal cancer histol- ent descent with warm restarts. In International Conference
ogy slides using deep learning: A retrospective multicenter on Learning Representations, 2017. 6, 5
study. PLoS Med., 16(1):e1002730, 2019. 4 [55] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
[47] Jakob Nikolas Kather, Alexander T Pearson, Niels Halama, regularization. In 7th International Conference on Learning
Dirk Jäger, Jeremias Krause, Sven H Loosen, Alexander Representations, ICLR 2019, New Orleans, LA, USA, May
Marx, Peter Boor, Frank Tacke, Ulf Peter Neumann, Heike I 6-9, 2019, 2019. 6, 5
Grabsch, Takaki Yoshikawa, Hermann Brenner, Jenny [56] Ming Y Lu, Drew F K Williamson, Tiffany Y Chen,
Chang-Claude, Michael Hoffmeister, Christian Trautwein, Richard J Chen, Matteo Barbieri, and Faisal Mahmood.
and Tom Luedde. Deep learning can predict microsatellite Data-efficient and weakly supervised computational pathol-
instability directly from histology in gastrointestinal cancer. ogy on whole-slide images. Nat Biomed Eng, 5(6):555–
Nat. Med., 25(7):1054–1056, 2019. 1, 3 570, 2021. 1, 3
[48] Yoo Jung Kim, Hyungjoon Jang, Kyoungbun Lee, [57] Ming Y Lu, Bowen Chen, Drew F K Williamson, Richard J
Seongkeun Park, Sung-Gyu Min, Choyeon Hong, Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor
Jeong Hwan Park, Kanggeun Lee, Jisoo Kim, Wonjae Odintsov, Andrew Zhang, Long Phi Le, Georg Gerber,
Hong, Hyun Jung, Yanling Liu, Haran Rajkumar, Ma- Anil V Parwani, and Faisal Mahmood. Towards a Visual-
hendra Khened, Ganapathy Krishnamurthi, Sen Yang, Language foundation model for computational pathology.
Xiyue Wang, Chang Hee Han, Jin Tae Kwak, Jianqiang 2023. 1, 3, 9, 2
Ma, Zhe Tang, Bahram Marami, Jack Zeineh, Zixu Zhao, [58] Marc Macenko, Marc Niethammer, J S Marron, David Bor-
Pheng-Ann Heng, Rüdiger Schmitz, Frederic Madesta, land, John T Woosley, Xiaojun Guan, Charles Schmitt, and

11
Nancy E Thomas. A method for normalizing histology [66] A C Ruifrok and D A Johnston. Quantification of histo-
slides for quantitative analysis. In 2009 IEEE International chemical staining by color deconvolution. Anal. Quant. Cy-
Symposium on Biomedical Imaging: From Nano to Macro, tol. Histol., 23(4):291–299, 2001. 3
pages 1107–1110, 2009. 2, 3, 7, 8, 10 [67] Oliver Lester Saldanha, Chiara M L Loeffler, Jan Moritz
[59] Niccolò Marini, Sebastian Otalora, Marek Wodzinski, Se- Niehues, Marko van Treeck, Tobias P Seraphin, Kather-
lene Tomassini, Aldo Franco Dragoni, Stephane Marchand- ine Jane Hewitt, Didem Cifci, Gregory Patrick Veldhuizen,
Maillet, Juan Pedro Dominguez Morales, Lourdes Duran- Siddhi Ramesh, Alexander T Pearson, and Jakob Nikolas
Lopez, Simona Vatrano, Henning Müller, and Manfredo Kather. Self-supervised attention-based deep learning for
Atzori. Data-driven color augmentation for H&E stained pan-cancer mutation prediction from histopathology. NPJ
images in computational pathology. J. Pathol. Inform., 14: Precis Oncol, 7(1):35, 2023. 1
100183, 2023. 4 [68] Massimo Salvi, U Rajendra Acharya, Filippo Molinari, and
[60] Haseeb Nazki, Ognjen Arandjelovic, In Hwa Um, and Kristen M Meiburger. The impact of pre- and post-image
David Harrison. MultiPathGAN: Structure preserving stain processing techniques on deep learning frameworks: A
normalization using unsupervised multi-domain adversarial comprehensive review for digital pathology image analysis.
network with perception loss. In Proceedings of the 38th Comput. Biol. Med., 128:104129, 2021. 2, 3, 4
ACM/SIGAPP Symposium on Applied Computing, pages [69] Birgid Schömig-Markiefka, Alexey Pryalukhin, Wolfgang
1197–1204, New York, NY, USA, 2023. Association for Hulla, Andrey Bychkov, Junya Fukuoka, Anant Madab-
Computing Machinery. 3 hushi, Viktor Achter, Lech Nieroda, Reinhard Büttner,
[61] Jan Moritz Niehues, Philip Quirke, Nicholas P West, Alexander Quaas, and Yuri Tolkach. Quality control stress
Heike I Grabsch, Marko van Treeck, Yoni Schirris, Gre- test for deep learning-based diagnostic model in digital
gory P Veldhuizen, Gordon G A Hutchins, Susan D Rich- pathology. Mod. Pathol., 34(12):2098–2108, 2021. 4
man, Sebastian Foersch, Titus J Brinker, Junya Fukuoka, [70] Peter Leonard Schrammen, Narmin Ghaffari Laleh, Amelie
Andrey Bychkov, Wataru Uegami, Daniel Truhn, Her- Echle, Daniel Truhn, Volkmar Schulz, Titus J Brinker,
mann Brenner, Alexander Brobeil, Michael Hoffmeister, Hermann Brenner, Jenny Chang-Claude, Elizabeth Alw-
and Jakob Nikolas Kather. Generalizable biomarker pre- ers, Alexander Brobeil, Matthias Kloor, Lara R Heij, Dirk
diction from cancer pathology slides with self-supervised Jäger, Christian Trautwein, Heike I Grabsch, Philip Quirke,
deep learning: A retrospective multi-centric study. Cell Rep Nicholas P West, Michael Hoffmeister, and Jakob Nikolas
Med, 4(4):100980, 2023. 1 Kather. Weakly supervised annotation-free cancer detec-
[62] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy tion and prediction of genotype in routine histopathology.
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, J. Pathol., 256(1):50–60, 2022. 3
Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby,
[71] Zhuchen Shao, Liuxi Dai, Jitendra Jonnagaddala, Yang
Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Rus-
Chen, Yifeng Wang, Zijie Fang, and Yongbing Zhang. Gen-
sell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra,
eralizability of Self-Supervised training models for digital
Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu
pathology: A multicountry comparison in colorectal cancer.
Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand
JCO Clin Cancer Inform, 7:e2200178, 2023. 3, 8
Joulin, and Piotr Bojanowski. DINOv2: Learning robust
visual features without supervision. 2023. 2 [72] Yiqing Shen, Yulin Luo, Dinggang Shen, and Jing Ke.
RandStainNA: Learning Stain-Agnostic features from his-
[63] E Reinhard, M Adhikhmin, B Gooch, and P Shirley. Color
tology slides by bridging stain augmentation and normal-
transfer between images. IEEE Comput. Graph. Appl., 21
ization. In Medical Image Computing and Computer
(5):34–41, 2001. 2, 3
Assisted Intervention – MICCAI 2022, pages 212–221.
[64] Abtin Riasatian, Morteza Babaie, Danial Maleki, Shivam
Springer Nature Switzerland, 2022. 4
Kalra, Mojtaba Valipour, Sobhan Hemati, Manit Zaveri,
Amir Safarpoor, Sobhan Shafiei, Mehdi Afshari, Maral Ra- [73] Artem Shmatko, Narmin Ghaffari Laleh, Moritz Ger-
soolijaberi, Milad Sikaroudi, Mohd Adnan, Sultaan Shah, stung, and Jakob Nikolas Kather. Artificial intelligence in
Charles Choi, Savvas Damaskinos, Clinton Jv Campbell, histopathology: enhancing cancer research and clinical on-
Phedias Diamandis, Liron Pantanowitz, Hany Kashani, Ali cology. Nat Cancer, 3(9):1026–1038, 2022. 1
Ghodsi, and H R Tizhoosh. Fine-Tuning and training [74] Milad Sikaroudi, Maryam Hosseini, Ricardo Gonzalez,
of densenet for histopathology image representation using Shahryar Rahnamayan, and H R Tizhoosh. Generalization
TCGA diagnostic slides. Med. Image Anal., 70:102032, of vision pre-trained models for histopathology. Sci. Rep.,
2021. 4 13(1):6065, 2023. 1, 3, 4, 7, 8
[65] Arnaud D Roth, Sabine Tejpar, Mauro Delorenzi, Pu Yan, [75] T C Smyrk, P Watson, K Kaul, and H T Lynch. Tumor-
Roberto Fiocca, Dirk Klingbiel, Daniel Dietrich, Bart Bies- infiltrating lymphocytes are a marker for microsatellite in-
mans, György Bodoky, Carlo Barone, Enrique Aranda, stability in colorectal carcinoma. Cancer, 91(12):2417–
Bernard Nordlinger, Laura Cisar, Roberto Labianca, David 2422, 2001. 1
Cunningham, Eric Van Cutsem, and Fred Bosman. Prog- [76] T Sørlie, C M Perou, R Tibshirani, T Aas, S Geisler, H
nostic role of KRAS and BRAF in stage II and III re- Johnsen, T Hastie, M B Eisen, M van de Rijn, S S Jeffrey,
sected colon cancer: results of the translational study on T Thorsen, H Quist, J C Matese, P O Brown, D Botstein,
the PETACC-3, EORTC 40993, SAKK 60-00 trial. J. Clin. P E Lønning, and A L Børresen-Dale. Gene expression
Oncol., 28(3):466–474, 2010. 1 patterns of breast carcinomas distinguish tumor subclasses

12
with clinical implications. Proc. Natl. Acad. Sci. U. S. A., Assisted Intervention – MICCAI 2021, pages 257–266.
98(19):10869–10874, 2001. 1 Springer International Publishing, 2021. 3
[77] Maximilian Springenberg, Annika Frommholz, Markus [88] Sophia J Wagner, Daniel Reisenbüchler, Nicholas P West,
Wenzel, Eva Weicken, Jackie Ma, and Nils Strodthoff. Jan Moritz Niehues, Jiefu Zhu, Sebastian Foersch, Gre-
From modern CNNs to vision transformers: Assessing the gory Patrick Veldhuizen, Philip Quirke, Heike I Grab-
performance, robustness, and classification strategies of sch, Piet A van den Brandt, Gordon G A Hutchins, Su-
deep learning models in histopathology. Med. Image Anal., san D Richman, Tanwei Yuan, Rupert Langer, Josien C A
87:102809, 2023. 4 Jenniskens, Kelly Offermans, Wolfram Mueller, Richard
[78] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Gray, Stephen B Gruber, Joel K Greenson, Gad Rennert,
Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How Joseph D Bonner, Daniel Schmolze, Jitendra Jonnagad-
to train your ViT? data, augmentation, and regularization in dala, Nicholas J Hawkins, Robyn L Ward, Dion Morton,
vision transformers. 2021. 2, 3 Matthew Seymour, Laura Magill, Marta Nowak, Jennifer
[79] David Tellez, Maschenka Balkenhol, Nico Karssemeijer, Hay, Viktor H Koelzer, David N Church, TransSCOT con-
Geert Litjens, Jeroen van der Laak, and Francesco Ciompi. sortium, Christian Matek, Carol Geppert, Chaolong Peng,
H and E stain augmentation improves generalization of con- Cheng Zhi, Xiaoming Ouyang, Jacqueline A James, Mau-
volutional networks for histopathological mitosis detection. rice B Loughrey, Manuel Salto-Tellez, Hermann Bren-
In Medical Imaging 2018: Digital Pathology, pages 264– ner, Michael Hoffmeister, Daniel Truhn, Julia A Schn-
270. SPIE, 2018. 4 abel, Melanie Boxberg, Tingying Peng, and Jakob Nikolas
[80] David Tellez, Geert Litjens, Péter Bándi, Wouter Bul- Kather. Transformer-based biomarker prediction from col-
ten, John-Melle Bokhorst, Francesco Ciompi, and Jeroen orectal cancer histology: A large-scale multicentric study.
van der Laak. Quantifying the effects of data augmentation Cancer Cell, 2023. 1, 6, 3, 4, 5
and stain color normalization in convolutional neural net- [89] Xinggang Wang, Yongluan Yan, Peng Tang, Xiang Bai, and
works for computational pathology. Med. Image Anal., 58: Wenyu Liu. Revisiting multiple instance neural networks.
101544, 2019. 4 Pattern Recognit., 74:15–24, 2018. 3
[81] United States Food and Drug Administration. FDA grants [90] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang,
accelerated approval to pembrolizumab for first tissue/site Jing Zhang, Junzhou Huang, Wei Yang, and Xiao Han.
agnostic indication. 2017. 1 TransPath: Transformer-Based self-supervised learning for
[82] Abhishek Vahadane, Tingying Peng, Amit Sethi, Shadi Al- histopathological image classification. In Medical Image
barqouni, Lichao Wang, Maximilian Baust, Katja Steiger, Computing and Computer Assisted Intervention – MICCAI
Anna Melissa Schlitter, Irene Esposito, and Nassir Navab. 2021, pages 186–195. Springer International Publishing,
Structure-Preserving color normalization and sparse stain 2021. 1, 3, 2
separation for histological images. IEEE Trans. Med. Imag- [91] Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang,
ing, 35(8):1962–1971, 2016. 2, 3 Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han.
[83] Laurens Van der Maaten and Geoffrey Hinton. Visualizing Transformer-based unsupervised contrastive learning for
data using t-SNE. J. Mach. Learn. Res., 9(11), 2008. 4, 3 histopathological image classification. Med. Image Anal.,
[84] Suhas Vasaikar, Chen Huang, Xiaojing Wang, Vladislav A 81:102559, 2022. 3, 4, 7, 8, 2
Petyuk, Sara R Savage, Bo Wen, Yongchao Dou, Yun [92] Xiyue Wang, Yuexi Du, Sen Yang, Jun Zhang, Minghui
Zhang, Zhiao Shi, Osama A Arshad, et al. Proteogenomic Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao
analysis of human colon cancer reveals new therapeutic op- Han. RetCCL: Clustering-guided contrastive learning for
portunities. Cell, 177(4):1035–1049, 2019. 6, 1 whole-slide image retrieval. Med. Image Anal., 83:102645,
[85] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob 2023. 1, 3, 4, 7, 2
Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, [93] John N Weinstein, Eric A Collisson, Gordon B Mills,
and Illia Polosukhin. Attention is all you need. In Ad- Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya
vances in Neural Information Processing Systems. Curran Shmulevich, Chris Sander, and Joshua M Stuart. The cancer
Associates, Inc., 2017. 6, 5 genome atlas pan-cancer analysis project. Nature genetics,
[86] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George 45(10):1113–1120, 2013. 3, 6, 1, 2
Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Math- [94] Georg Wölflein, Lucie Charlotte Magister, Pietro Liò,
ieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric David J Harrison, and Ognjen Arandjelović. Deep multi-
Robert, Yi Kan Wang, Jeremy D Kunz, Matthew C H Lee, ple instance learning with Distance-Aware Self-Attention.
Jan Bernhard, Ran A Godrich, Gerard Oakley, Ewan Mil- 2023. 1, 6
lar, Matthew Hanna, Juan Retamero, William A Moye, [95] Jinxi Xiang, Xiyue Wang, Xinran Wang, Jun Zhang, Sen
Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Yang, Wei Yang, Xiao Han, and Yueping Liu. Automatic
Rothrock, and Thomas J Fuchs. Virchow: A Million-Slide diagnosis and grading of prostate cancer with weakly super-
digital pathology foundation model. 2023. 1, 3, 9, 2 vised learning on whole slide images. Comput. Biol. Med.,
[87] Sophia J Wagner, Nadieh Khalili, Raghav Sharma, Melanie 152:106340, 2023. 1
Boxberg, Carsten Marr, Walter de Back, and Tingying [96] Zhongyi Yang, Xiyue Wang, Jinxi Xiang, Jun Zhang, Sen
Peng. Structure-Preserving multi-domain stain color aug- Yang, Xinran Wang, Wei Yang, Zhongyu Li, Xiao Han, and
mentation using Style-Transfer with disentangled repre- Yueping Liu. The devil is in the details: a small-lesion sen-
sentations. In Medical Image Computing and Computer sitive weakly supervised learning framework for prostate

13
cancer detection and grading. Virchows Arch., 482(3):525–
538, 2023. 1
[97] Farhad Ghazvinian Zanjani, Svitlana Zinger, Babak Eht-
eshami Bejnordi, Jeroen A W M van der Laak, and Pe-
ter H N de With. Stain normalization of histopathology
images using generative adversarial networks. In 2018
IEEE 15th International Symposium on Biomedical Imag-
ing (ISBI 2018), pages 573–577. IEEE, 2018. 3
[98] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and
Stephane Deny. Barlow twins: Self-Supervised learning
via redundancy reduction. In Proceedings of the 38th Inter-
national Conference on Machine Learning, pages 12310–
12320. PMLR, 2021. 3, 2
[99] Yunlong Zhang, Yuxuan Sun, Honglin Li, Sunyi Zheng,
Chenglu Zhu, and Lin Yang. Benchmarking the robust-
ness of deep neural networks to common corruptions in
digital pathology. In Medical Image Computing and Com-
puter Assisted Intervention – MICCAI 2022, pages 242–
252. Springer Nature Switzerland, 2022. 4
[100] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Ci-
hang Xie, Alan Yuille, and Tao Kong. Image BERT pre-
training with online tokenizer. In International Conference
on Learning Representations, 2022. 3

14
A Good Feature Extractor Is All You Need
for Weakly Supervised Learning in Histopathology
Supplementary Material
A. Downstream tasks and their clinical relevance

Targets We extensively evaluated the models on nine downstream tasks, summarised in Tab. 2. All of the targets were treated as binary variables, except for breast cancer subtype, which is a five-way classification target determined by immunohistochemistry: Luminal A (HR+/HER2-/low Ki-67), Luminal B (HR+/HER2+/high Ki-67), HER2 overexpressed (HR-), Basal (a subgroup of triple-negative breast cancer), or Normal breast-like (a subtype for which the clinical and molecular characteristics remain largely undefined throughout the existing scientific literature) [76]. This molecular subtyping of early-stage invasive breast cancer has become an essential procedure in clinical management due to its implications for treatment recommendations and the valuable prognostic insights it provides for a patient's survival [9, 34]. In addition, our investigation also included analysis of prevalent mutations in CDH1 and TP53 as well as PIK3CA, the latter of which opens new possibilities for targeted therapies in advanced disease stages [1]. Microsatellite instability (MSI) status is a key marker in colorectal cancer owing to its profound implications in shaping a patient's prognosis and responsiveness to immunotherapies [13, 81]. It is driven by either spontaneous or germline (hence hereditary) mutations in DNA-repair related genes [6] and leads to phenotypic changes in the tumour tissue [75]. Therefore, the performance of various AI models is commonly evaluated based on their ability to predict MSI from routine histopathology [47], often in conjunction with other prevalent genetic markers such as KRAS and BRAF: these are key driver mutations in colorectal cancer that shape a patient's survival chances and strongly influence the selection of targeted therapies best suited to each individual patient [28, 65]. Given the high clinical relevance and availability of robust ground truth data, we have strategically selected these particular tasks for our analysis.

Target                            | Training and validation                  | Test dataset
Subtype, CDH1 mutation,           | TCGA-BRCA [93]                           | CPTAC-BRCA [50]
TP53 mutation, PIK3CA mutation    | (833 train, 208 val samples)             | (120 samples)
LN status                         | CAMELYON17 [4] (centre-wise              | CAMELYON17 [4] (centre-wise
                                  | cross-validation; 320 train,             | cross-validation;
                                  | 80 val samples)                          | 100 samples)
MSI status, KRAS mutation,        | TCGA-CRC [93]                            | CPTAC-COAD [84]
BRAF mutation, SMAD4 mutation     | (558 samples)                            | (110 samples)

Table 2. Overview of the evaluated downstream tasks. Dataset sizes are shown in parentheses. The first five targets are related to breast cancer, while the remaining four are related to colorectal cancer.

Data Here, we provide additional details about where we obtained data for the downstream tasks, further to what is mentioned in Sec. 4.

We predict LN status using the CAMELYON17 dataset [4], which contains data from five centres. For this dataset, we perform centre-wise cross-validation, where we use one of the centres for testing and the others for training (each of the five random seeds uses a different centre for testing). The training and validation sets are an 80%/20% split of the other four centres. We treat LN status as a binary classification task, where the positive class corresponds to the presence of metastatic cancer cells in the lymph nodes. Each slide in the dataset is of a lymph node tissue section, and we treat each slide as a single sample, i.e. a separate patient. This is slightly different to the original CAMELYON17 challenge [4], where groups of five slides were arranged into "virtual patients" (though the slides themselves may be from different actual patients), and the task was to predict a virtual patient-level label based on a specific rule for aggregating the slide-level predictions. We do not use the virtual patient labels, but instead use the slide-level labels provided in the dataset.

For all other targets, we use either TCGA-BRCA [93] or TCGA-CRC [93] for training, and respectively either CPTAC-BRCA [50] or CPTAC-COAD [84] for testing. We obtain the patient-level labels from the respective studies via cbioportal.org. The only exception is MSI status, which is not available for TCGA-CRC on cbioportal.org, but is provided in the supplementary material of Liu et al. [51].
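To make the centre-wise cross-validation concrete, the following is a minimal sketch of the splitting procedure described above. The function and variable names are our own illustration rather than the authors' code; only the split proportions and centre counts come from the text.

```python
import random

def camelyon17_centrewise_split(slide_to_centre, seed):
    """Centre-wise cross-validation: hold out one of the five CAMELYON17
    centres for testing and split the slides of the remaining four
    centres 80%/20% into training and validation."""
    test_centre = seed % 5  # each of the five seeds tests on a different centre
    test = [s for s, c in slide_to_centre.items() if c == test_centre]
    rest = [s for s, c in slide_to_centre.items() if c != test_centre]

    rng = random.Random(seed)
    rng.shuffle(rest)
    n_train = int(0.8 * len(rest))
    return rest[:n_train], rest[n_train:], test  # train, val, test

# CAMELYON17 provides 100 labelled slides per centre.
slides = {f"centre{c}_slide{i}": c for c in range(5) for i in range(100)}
train, val, test = camelyon17_centrewise_split(slides, seed=0)
print(len(train), len(val), len(test))  # -> 320 80 100
```

The printed sizes match those reported in Tab. 2 (320 train, 80 validation, 100 test samples).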
B. Feature extractors

In Tab. 3, we provide an overview of the SSL feature extractors evaluated in this study. We use the weights from the respective authors' GitHub repositories. The feature extractor called Lunit-DINO in our paper corresponds to Kang et al.'s DINO_{p=16} model [44].

Name               | Architecture          | SSL method                                                        | SSL dataset, magnification                  | Embedding size (d_x)
CTransPath [90, 91]| Swin Transformer [52] | semantically-relevant contrastive learning [91] based on MoCo v3 [20] | TCGA [93] and PAIP [48] (20×)          | 768
Phikon [31]        | ViT-B [49]            | semantically-relevant contrastive learning [91] based on MoCo v3 [20] | TCGA [93] (20×)                        | 768
Lunit-DINO [44]    | ViT-S [49]            | DINO [12]                                                         | TCGA [93] and non-public TULIP [44] (20×, 40×) | 384
RetCCL [92]        | ResNet-50 [37]        | clustering-guided contrastive learning [92] based on MoCo [38]    | TCGA [93] and PAIP [48] (20×)               | 2048
Lunit-BT [44]      | ResNet-50 [37]        | Barlow Twins [98]                                                 | TCGA [93] and non-public TULIP [44] (20×, 40×) | 2048
Lunit-SwAV [44]    | ResNet-50 [37]        | SwAV [11]                                                         | TCGA [93] and non-public TULIP [44] (20×, 40×) | 2048

Table 3. Overview of SSL feature extractors evaluated in this study, their architecture, SSL method, pretraining dataset, and embedding size. As baselines, we additionally compare against the respective ImageNet pretrained backbones: Swin Transformer [52], ViT-B [49], ViT-S [49, 78] and ResNet-50 [37].
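All of these backbones are used in the same way: as frozen encoders mapping a 224 × 224 patch to a d_x-dimensional embedding. As an illustration, here is a minimal sketch using the ImageNet-pretrained ViT-S baseline via the timm library; loading timm's `vit_small_patch16_224` is our stand-in for convenience, not necessarily how the authors load their weights.

```python
import timm
import torch

# num_classes=0 removes the classification head, so the model returns
# the pooled embedding directly (d_x = 384 for ViT-S).
model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)
model.eval()

# The feature extractor stays frozen throughout; only the downstream
# aggregation model (Sec. F.1) is trained.
for p in model.parameters():
    p.requires_grad = False

with torch.inference_mode():
    patches = torch.rand(32, 3, 224, 224)  # a batch of patches
    feats = model(patches)                 # shape: (32, 384)
print(feats.shape)
```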

B.1. Foundation models

This year, a number of foundation models have emerged for pathology that were trained on datasets of unprecedented size. Unfortunately, we could not include these in our study since their weights remain proprietary. Notably, UNI [16] has been trained on a dataset exceeding 100,000 slides, while Virchow [86] utilises an even larger corpus of 1.5 million slides, both employing the DINOv2 framework [62]. Moreover, Campanella et al. [8] trained two foundation models on over 400,000 slides using DINO [12] and MAE [39]. On the other hand, Azizi et al. [2] integrate both medical and non-medical images to train their foundation model, REMEDIS, using SimCLR/BiT [17, 18]. Furthermore, Lu et al. [57] made use of 1.17 million image–caption pairs to develop a vision-language foundation model named CONCH. In stark contrast, the publicly available models employ orders of magnitude fewer WSIs, as TCGA contains around 30,000 diagnostic and tissue slides in total [93].

C. Stain normalisation

In Fig. 8, we show the effect of stain normalisation on the latent space of the remaining eight feature extractors that were not depicted in Fig. 3.

Patchwise versus slidewise stain normalisation In Sec. 4.2, we state that there is no consistent improvement obtained by employing stain normalisation, regardless of whether it is performed on a per-patch or per-slide basis. Further to the results in Fig. 1a, which show only slidewise stain normalisation, we perform an ablation study where we normalise each patch individually. We provide an analogous boxplot for both types of stain normalisation in Fig. 9, which shows that the conclusion holds in either case.
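For reference, the following is a minimal NumPy sketch of Macenko stain normalisation [58] in the spirit of the widely circulated reference implementation; the target stain matrix and maximum concentrations below are illustrative defaults from that implementation, not necessarily the values used in our pipeline. For the slidewise variant, the estimated stain matrix and percentile concentrations would be computed once from the whole slide and reused for all of its patches, whereas the patchwise variant estimates them per patch as shown.

```python
import numpy as np

# Reference H&E stain matrix and maximum concentrations (assumed targets).
HE_REF = np.array([[0.5626, 0.2159],
                   [0.7201, 0.8012],
                   [0.4062, 0.5581]])
MAX_C_REF = np.array([1.9705, 1.0308])

def macenko_normalise(img, Io=240, alpha=1, beta=0.15):
    """Normalise one RGB patch (H, W, 3, uint8) following Macenko et al. [58]."""
    od = -np.log((img.reshape(-1, 3).astype(float) + 1) / Io)  # optical density
    od_hat = od[~np.any(od < beta, axis=1)]                    # drop transparent pixels

    # Plane spanned by the two principal eigenvectors of the OD covariance.
    _, eigvecs = np.linalg.eigh(np.cov(od_hat.T))
    plane = od_hat @ eigvecs[:, 1:3]

    # Robust extreme angles in that plane give the two stain directions.
    phi = np.arctan2(plane[:, 1], plane[:, 0])
    v1 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, alpha)),
                                     np.sin(np.percentile(phi, alpha))])
    v2 = eigvecs[:, 1:3] @ np.array([np.cos(np.percentile(phi, 100 - alpha)),
                                     np.sin(np.percentile(phi, 100 - alpha))])
    he = np.array([v1, v2]).T if v1[0] > v2[0] else np.array([v2, v1]).T

    # Per-pixel stain concentrations, rescaled to the reference maxima.
    conc, *_ = np.linalg.lstsq(he, od.T, rcond=None)
    conc *= (MAX_C_REF / np.percentile(conc, 99, axis=1))[:, None]

    out = Io * np.exp(-HE_REF @ conc)
    return np.clip(out.T, 0, 255).astype(np.uint8).reshape(img.shape)
```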
[Figure 8: two rows of t-SNE panels, titled Swin, CTransPath, ViT-B, Phikon (top) and ResNet-50, RetCCL, Lunit-BT, Lunit-SwAV (bottom).]

Figure 8. Latent space visualisations (t-SNE [83]), showing the effect of stain normalisation [58]. This figure extends Fig. 3, which depicts only two feature extractors, Lunit-DINO [44] and its ViT-S [49, 78] ImageNet baseline; here, we show the other eight. Colours are as in Fig. 3.

[Figure 9: boxplots of the change in test AUROC for each feature extractor when adding slidewise (blue) or patchwise (orange) stain normalisation; the y-axis ranges from −0.2 to 0.3.]

Figure 9. Improvement obtained by employing slidewise (blue, boxes are as in Fig. 1a) or patchwise (orange) stain normalisation compared to no normalisation. There is no clear benefit or detriment in applying either type of stain normalisation (all confidence intervals cross zero). While this figure reports results only for the downstream AttMIL model, the conclusion holds for the other models as well, as reported in Tabs. 5 and 6.

D. Augmentations

Including patchwise stain normalisation, we study 27 image augmentations in this work. We provide representative examples of these augmentations in Fig. 10, and describe them below:
• macenko: Macenko stain normalisation [58] (patchwise)
• rotate {90°, 180°, 270°}: rotate by the specified angle
• random rotation: rotate by an angle β sampled uniformly such that (β mod 90) ∈ [10, 80], i.e. forcing an off-axis rotation
• flip {horizontal, vertical}: flip along the specified axis
• zoom {1.5×, 1.75×, 2×}: enlarge the patch by the specified factor and crop the centre
• affine: random affine transformation with a maximum rotation of 10°, maximum translation of 20% of the patch size, maximum scaling of 20%, and maximum shear of 10°
• warp perspective: random perspective transformation with a maximum distortion of 0.2
• jigsaw: cut the patch into a 4 × 4 grid and randomly permute the tiles
• Cutout: randomly erase a rectangle that covers between 2% and 25% of the total area [27]
• AugMix: see Hendrycks et al. [41]
• {low, high} brightness: reduce the brightness by a factor of 0.7 or increase it by a factor of 1.5
• {low, high} contrast: reduce the contrast by a factor of 0.7 or increase it by a factor of 1.5
• {low, high} saturation: reduce the saturation by a factor of 0.7 or increase it by a factor of 1.5
• colour jitter: randomly adjust the brightness, contrast, saturation, and hue by maximum factors of 0.4, 0.4, 0.4, and 0.1, respectively
• gamma {0.5, 2.0}: apply a gamma correction with the specified exponent
• sharpen: sharpen the image by a factor of 5
• Gaussian blur: apply a Gaussian blur with a kernel size of 5 and a standard deviation of 2.0
• median blur: apply a median blur with a kernel size of 5

Augmentation groups In Sec. 4, we study the effect of various groups of augmentations on downstream performance. These groups are defined as follows (a sketch of how a group is applied per patch follows this list):
• none: no augmentations, i.e. the original patches are used
• Macenko (patchwise): Macenko stain normalisation [58] is applied on a per-patch basis
• Macenko (slidewise): Macenko stain normalisation [58] is applied on a per-slide basis
• rotation/flipping: each patch is randomly rotated by a right angle or flipped along the horizontal or vertical axis, with equal probability
• all: any of the 27 augmentations, or no augmentation, is applied to each patch with equal probability
We apply no augmentations to the test set (except when applying slidewise or patchwise stain normalisation, in which case we normalise the test set in the same way as the training set).
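As referenced above, here is a minimal sketch of how an augmentation group could be applied to a single training patch. The augmentation callables and their names are hypothetical stand-ins for our preprocessing helpers, and the Macenko groups are assumed to be handled during feature-extraction preprocessing rather than in this sampling step.

```python
import random

def off_axis_angle(rng):
    """Sample β uniformly such that (β mod 90) ∈ [10, 80], i.e. the
    'random rotation' augmentation never lands within 10° of a right angle."""
    return rng.randrange(4) * 90 + rng.uniform(10, 80)

def apply_group(patch, group, augmentations, rng):
    """Apply one augmentation group to a single training patch.
    `augmentations` maps the 27 augmentation names to callables."""
    if group == "none":
        return patch
    if group == "rotation/flipping":
        op = rng.choice(["rotate 90°", "rotate 180°", "rotate 270°",
                         "flip horizontal", "flip vertical"])
        return augmentations[op](patch)
    if group == "all":
        # Any of the 27 augmentations, or no augmentation, equally likely.
        op = rng.choice(sorted(augmentations) + [None])
        return patch if op is None else augmentations[op](patch)
    raise ValueError(f"Macenko groups are applied during preprocessing: {group}")

# Toy usage with identity augmentations standing in for the real ones.
rng = random.Random(0)
identity = {f"aug{i}": (lambda p: p) for i in range(27)}
print(off_axis_angle(rng), apply_group("patch", "all", identity, rng))
```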

E. Extended data tables and figures

In much of our discussion in Secs. 4 and 5, we focus on particular augmentations, models or feature extractors. Here, we produce extended versions of figures and tables from the main text, providing more results for different choices of the above.

Figure 11 summarises the main results for all three downstream aggregation models: AttMIL [42], the two-layer transformer as employed by Wagner et al. [88], and the mean average pooling baseline.

Normalised differential AUROC scores In Tabs. 4 to 8, we present the normalised differential AUROC scores for all tasks, feature extractors, downstream models, and augmentation groups (one table per augmentation group). This extends Tab. 1 from the main text, which only shows the results for the AttMIL [42] aggregation model without augmentations (corresponding to the first ten rows in Tab. 4). We observe that Lunit-DINO [44] and CTransPath [91] consistently achieve the best task-averaged results, independent of the choice of downstream aggregation model and augmentation group.
[Figure 10: a grid of patch images with one column per augmentation (original, macenko, rotations, random rotation, flips, zooms, affine, warp perspective, jigsaw, Cutout, AugMix, brightness, contrast, saturation, colour jitter, gamma, sharpen, Gaussian blur, median blur) and one row per patch class: STR, ADI, MUS, NORM, TUM, BACK, LYM, DEB, MUC.]

Figure 10. Examples of original and augmented patches (columns) from the NCT-CRC-HE-100K dataset [45, 46]. Each row corresponds to a representative patch from a different patch class.

[Figure 11: the panels of Fig. 1 — (a) stain normalisation, (b) rotate/flip, (c) all augmentations, (d) no augmentation — repeated for each of the three downstream models, with the same feature extractor legend (Swin, CTransPath, ViT-B, Phikon, ViT-S, Lunit-DINO, ResNet-50, RetCCL, Lunit-BT, Lunit-SwAV).]

Figure 11. Extended version of Fig. 1 showing the main results for all three downstream models: AttMIL [42] (top, same as Fig. 1), a two-layer transformer [88] (middle), and mean average pooling (bottom).

Absolute AUROC scores While the normalised differential AUROC score provides a relative performance measure to facilitate a fair comparison between feature extractors, we also provide the seed-averaged absolute test AUROC scores for all tasks, feature extractors, downstream models, and augmentation groups in Tabs. 9 to 13 (one table per augmentation group). Looking at these absolute scores, we find that predicting the PIK3CA target is the most difficult task across the board for all feature extractors and
downstream models, while the LN status and MSI status targets are the easiest. However, we emphasise that the normalised differential AUROC score is the more meaningful metric for comparing feature extractors, since it is independent of the task difficulty and accounts for the variance across seeds (see Sec. 4.1).

F. Training and implementation details

Training For downstream model training, we use the AdamW [55] optimiser with an initial learning rate of 10^{-3}, weight decay of 10^{-2}, and a batch size of one. The learning rate is decayed using a cosine annealing schedule [54] over 30 epochs, but we halt training when the validation loss does not improve for ten epochs.

In MIL terminology, we refer to the patient as the bag, and the patches as the instances. Note that some datasets have multiple WSIs per patient; in these cases, we simply mix the patches from all WSIs into a single bag. An epoch represents a full pass over all patients in the training set. At every step, we sample a maximum of 8,192 patches per patient, though most patients have fewer patches. We found it beneficial to employ a batch size of one: not only does this reduce GPU memory requirements, it also accelerates training. Indeed, we found that padding the bags to the maximum number of patches per patient (8,192) slows down training considerably, but with a batch size of one, we can use a variable number of patches per bag. Nonetheless, we accumulate gradients over four steps before performing a weight update, which effectively increases the batch size to four.
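The following is a minimal PyTorch sketch of this training configuration (AdamW, cosine annealing, bag subsampling, gradient accumulation, and early stopping). The toy projection/head, random bags, and the validation-loss placeholder are stand-ins so the snippet runs on its own; the real pipeline loads cached features (Sec. F.2) and trains one of the aggregation models of Sec. F.1.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

proj = nn.Sequential(nn.Linear(768, 512), nn.ReLU())   # cf. Eq. (3)
head = nn.Linear(512, 2)
bags = [(torch.randn(int(torch.randint(50, 9000, ())), 768),
         torch.randint(0, 2, (1,))) for _ in range(16)]  # random stand-in bags

def forward_bag(feats):                                 # mean pooling, cf. Eq. (4)
    return head(proj(feats).mean(dim=0, keepdim=True))

params = list(proj.parameters()) + list(head.parameters())
optimiser = AdamW(params, lr=1e-3, weight_decay=1e-2)
scheduler = CosineAnnealingLR(optimiser, T_max=30)      # cosine annealing [54]

best_val, epochs_without_improvement = float("inf"), 0
for epoch in range(30):
    for step, (feats, label) in enumerate(bags):        # batch size one: a bag per step
        feats = feats[torch.randperm(len(feats))[:8192]]  # at most 8,192 patches
        loss = nn.functional.cross_entropy(forward_bag(feats), label)
        (loss / 4).backward()                           # accumulate gradients over 4 steps
        if (step + 1) % 4 == 0:
            optimiser.step()
            optimiser.zero_grad()
    scheduler.step()

    val_loss = loss.item()  # placeholder: the real loop evaluates the validation set
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:            # halt after 10 stagnant epochs
            break
```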
F.1. Downstream aggregation models

We describe the three downstream aggregation models in more detail below. In essence, these are different parametrisations of the g_θ function in Eq. (2) that aggregate the patch embeddings into a single slide-level embedding. All three models first pass the patch embeddings through a linear layer with 512 output units and ReLU activation, i.e.

    g_θ({x_i}_{i=1}^n) = ḡ_θ({ max(0, W̄_θ x_i + b̄_θ) }_{i=1}^n),    (3)

with learnable parameters W̄_θ ∈ R^{512×d_x} and b̄_θ ∈ R^{512}. However, the three models differ in how they aggregate the resulting patch embeddings in the ḡ_θ function, which we describe below.

In any case, the classifier h_θ in Eq. (2) is a linear layer with softmax activation over the number of classes, to which we apply a cross-entropy loss. Note that we employ dropout with a probability of 0.5 on the slide-level embedding before passing it to the linear layer.

Mean average pooling As a baseline model, we compute the slide-level embedding as the mean of the patch embeddings, i.e.

    ḡ_θ({x_i}_{i=1}^n) = (1/n) ∑_{i=1}^n x_i.    (4)

AttMIL [42] This model takes a weighted average of the patch embeddings, where the weights are computed independently for each embedding. More formally, the slide-level embedding is given by

    ḡ_θ({x_i}_{i=1}^n) = ∑_{i=1}^n α_i x_i,    (5)

where the attention^6 weights α_i ∈ R are obtained via a two-layer network with 256 tanh-activated hidden units that is applied to each patch embedding x_i independently and then normalised across all patches using a softmax function, i.e.

    e_i = W_2 tanh(W_1 x_i + b_1) + b_2,    (6)
    α_i = exp(e_i) / ∑_{j=1}^n exp(e_j).    (7)

Here, W_1 ∈ R^{256×512}, b_1 ∈ R^{256}, W_2 ∈ R^{1×256}, and b_2 ∈ R are learnable parameters (captured within the set of learnable parameters θ).

^6 Ilse et al. [42]'s use of the term "attention" should not be confused with the scaled dot product attention in the transformer architecture [85]. Here, the attention weight for a particular token is computed solely based on that token alone.

Two-layer transformer We also employ a two-layer transformer [85], closely aligned with the configuration presented by Wagner et al. [88]. This setup differs from the classical transformer architecture [85] in that there is just one branch, i.e. just the decoder (or encoder, depending on perspective). Both layers have 512 hidden units and 8 attention heads, employ a dropout rate of 0.1, use GELU activation [40] in the feedforward layers, and use layer normalisation [3] before the attention layers. We employ no masking. The input tokens are the patch embeddings, and the output tokens are averaged like in Eq. (4) to obtain the slide-level embedding.
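Below is a self-contained PyTorch sketch of the three aggregators under the stated hyperparameters (Eqs. (3)–(7): a 512-unit projection, 0.5 dropout on the slide embedding, 256 hidden attention units, and two pre-norm encoder layers with 8 heads, GELU, and 0.1 dropout). The transformer's feedforward width and the class count are assumptions made for illustration.

```python
import torch
from torch import nn

class Aggregator(nn.Module):
    """Shared skeleton: projection (Eq. 3), pooling (Eq. 4 or 5),
    dropout on the slide-level embedding, and a linear classifier h_θ."""
    def __init__(self, d_x, n_classes, pool):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_x, 512), nn.ReLU())
        self.pool = pool
        self.drop = nn.Dropout(0.5)
        self.head = nn.Linear(512, n_classes)

    def forward(self, x):                      # x: (n_patches, d_x)
        return self.head(self.drop(self.pool(self.proj(x))))

class MeanPool(nn.Module):                     # Eq. (4)
    def forward(self, h):
        return h.mean(dim=0)

class AttMILPool(nn.Module):                   # Eqs. (5)-(7) [42]
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
    def forward(self, h):                      # h: (n_patches, 512)
        alpha = torch.softmax(self.attn(h), dim=0)   # Eq. (7)
        return (alpha * h).sum(dim=0)                # Eq. (5)

class TransformerPool(nn.Module):
    """Two pre-norm encoder layers [85] as in Wagner et al. [88]: 8 heads,
    GELU, dropout 0.1; the feedforward width here is an assumption."""
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           dim_feedforward=512, dropout=0.1,
                                           activation="gelu", norm_first=True,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
    def forward(self, h):                      # average output tokens, as in Eq. (4)
        return self.encoder(h.unsqueeze(0)).squeeze(0).mean(dim=0)

model = Aggregator(d_x=768, n_classes=2, pool=AttMILPool())
print(model(torch.randn(1000, 768)).shape)     # one bag of 1,000 patches -> (2,)
```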
F.2. Overhead and caching

Feature extraction Prior to training, we extract features from all patches in the training and validation sets, and store them on disk. We do this for each of the ten feature extractors. For the training sets, we additionally perform feature extraction for all 27 augmented versions of each patch, and store these on disk as well. For both the training and test sets, we also extract features for the stain-normalised versions of the patches. This way, we effectively have a cache of the a_i ∘ f function in Eq. (2) for all inputs (i.e. patches), all augmentations a_i, and all feature extractors f. During training, we only need to load the features from disk (d_x floating point values per patch, e.g. d_x = 768 in the case of CTransPath), as opposed to loading the patches directly (224 × 224 × 3 byte values) and having to perform augmentation and feature extraction on the fly (very expensive).
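A minimal sketch of such a cache is shown below; the one-file-per-(slide, augmentation, extractor) layout is a hypothetical illustration, not a description of our actual storage format. The arithmetic in the comment follows directly from the numbers above.

```python
import numpy as np
import torch

# Hypothetical cache layout: one .npy file per (slide, augmentation,
# feature extractor), holding an (n_patches, d_x) float32 array.
def cache_features(patches, extractor, path):
    with torch.inference_mode():
        feats = extractor(patches).cpu().numpy().astype(np.float32)
    np.save(path, feats)

def load_features(path):
    # Memory-mapping means only the sampled patches' rows are read into RAM.
    return np.load(path, mmap_mode="r")

# Per-patch storage: a CTransPath feature takes 768 floats x 4 bytes
# = 3,072 bytes, versus 224 * 224 * 3 = 150,528 bytes for the raw patch,
# i.e. roughly a 49x reduction -- before counting the augmentation and
# forward-pass compute saved at training time.
print(224 * 224 * 3 / (768 * 4))  # -> 49.0
```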

Training with augmentation Even though our training runs employed already extracted features, they took 30× longer with all augmentations, or 5× longer with just the rotation augmentations, as compared to employing no augmentations. This approximately linear scaling in the number of augmentations a is the result of slower data loading, as random reads are performed over a times as many features compared to the no-augmentation case. We alleviated some of this bottleneck by implementing additional caches, but even this solution only bore fruit because we ran many experiments with similar dataset configurations. Thus, we emphasise again that augmentations are too expensive to be viable in computational pathology pipelines, due to their significant preprocessing and training overhead, which does not even yield a consistent improvement in downstream performance.

Total training time In total, we trained 6,750 models across the cartesian product of:
• 10 feature extractors,
• 5 augmentation groups,
• 3 downstream aggregation models,
• 9 downstream tasks, and
• 5 random seeds
(10 × 5 × 3 × 9 × 5 = 6,750). We trained these models on NVIDIA Tesla V100 GPUs (one training run per GPU at a time), which cumulatively took 4,648 GPU hours (193.7 days).
Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4 Average
Model Feature extractor
AttMIL Swin 0.07 ± 0.02 0.17 ± 0.03 0.28 ± 0.02 0.07 ± 0.04 0.17 ± 0.08 0.18 ± 0.04 0.14 ± 0.04 0.14 ± 0.07 0.16 ± 0.05 0.15 ± 0.05
CTransPath 0.00 ± 0.00 0.01 ± 0.01 0.01 ± 0.01 0.04 ± 0.03 0.06 ± 0.07 0.08 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.08 ± 0.04 0.11 ± 0.02 0.15 ± 0.03 0.07 ± 0.03 0.17 ± 0.06 0.15 ± 0.03 0.03 ± 0.04 0.18 ± 0.07 0.01 ± 0.01 0.11 ± 0.04
Phikon 0.09 ± 0.02 0.09 ± 0.02 0.09 ± 0.03 0.09 ± 0.03 0.07 ± 0.06 0.06 ± 0.04 0.07 ± 0.04 0.07 ± 0.06 0.17 ± 0.08 0.09 ± 0.05
ViT-S 0.13 ± 0.03 0.08 ± 0.03 0.14 ± 0.05 0.08 ± 0.05 0.19 ± 0.09 0.18 ± 0.04 0.06 ± 0.03 0.19 ± 0.04 0.08 ± 0.08 0.13 ± 0.05
Lunit-DINO 0.08 ± 0.03 0.03 ± 0.03 0.03 ± 0.02 0.02 ± 0.03 0.07 ± 0.04 0.00 ± 0.00 0.06 ± 0.04 0.02 ± 0.02 0.02 ± 0.02 0.04 ± 0.03
ResNet-50 0.15 ± 0.03 0.09 ± 0.04 0.11 ± 0.03 0.01 ± 0.02 0.18 ± 0.08 0.22 ± 0.04 0.11 ± 0.03 0.23 ± 0.07 0.21 ± 0.09 0.15 ± 0.05
RetCCL 0.07 ± 0.03 0.04 ± 0.02 0.04 ± 0.03 0.05 ± 0.03 0.07 ± 0.06 0.08 ± 0.03 0.03 ± 0.02 0.14 ± 0.03 0.06 ± 0.03 0.06 ± 0.03
Lunit-BT 0.13 ± 0.03 0.06 ± 0.04 0.02 ± 0.01 0.13 ± 0.04 0.34 ± 0.15 0.28 ± 0.13 0.03 ± 0.04 0.35 ± 0.13 0.25 ± 0.03 0.18 ± 0.08
Lunit-SwAV 0.06 ± 0.02 0.06 ± 0.03 0.06 ± 0.02 0.13 ± 0.06 0.07 ± 0.05 0.10 ± 0.03 0.13 ± 0.06 0.07 ± 0.07 0.14 ± 0.08 0.09 ± 0.05
Transformer Swin 0.09 ± 0.04 0.11 ± 0.03 0.21 ± 0.04 0.09 ± 0.03 0.16 ± 0.08 0.19 ± 0.07 0.09 ± 0.04 0.17 ± 0.05 0.14 ± 0.05 0.14 ± 0.05
CTransPath 0.01 ± 0.02 0.01 ± 0.02 0.03 ± 0.03 0.08 ± 0.07 0.07 ± 0.07 0.02 ± 0.02 0.04 ± 0.04 0.08 ± 0.06 0.09 ± 0.05 0.05 ± 0.05
ViT-B 0.08 ± 0.03 0.10 ± 0.02 0.17 ± 0.04 0.11 ± 0.02 0.21 ± 0.07 0.18 ± 0.05 0.13 ± 0.05 0.20 ± 0.08 0.06 ± 0.05 0.14 ± 0.05
Phikon 0.13 ± 0.04 0.08 ± 0.05 0.08 ± 0.03 0.05 ± 0.03 0.07 ± 0.05 0.05 ± 0.04 0.05 ± 0.04 0.11 ± 0.07 0.12 ± 0.06 0.08 ± 0.05
ViT-S 0.10 ± 0.02 0.07 ± 0.03 0.22 ± 0.07 0.11 ± 0.06 0.21 ± 0.09 0.16 ± 0.06 0.08 ± 0.04 0.23 ± 0.09 0.03 ± 0.02 0.13 ± 0.06
Lunit-DINO 0.04 ± 0.03 0.06 ± 0.03 0.03 ± 0.02 0.02 ± 0.02 0.05 ± 0.04 0.01 ± 0.01 0.06 ± 0.03 0.02 ± 0.04 0.02 ± 0.03 0.03 ± 0.03
ResNet-50 0.13 ± 0.04 0.10 ± 0.07 0.15 ± 0.03 0.04 ± 0.07 0.19 ± 0.08 0.19 ± 0.07 0.11 ± 0.04 0.19 ± 0.06 0.30 ± 0.11 0.16 ± 0.07
RetCCL 0.09 ± 0.04 0.04 ± 0.04 0.02 ± 0.02 0.09 ± 0.06 0.07 ± 0.06 0.15 ± 0.03 0.12 ± 0.05 0.22 ± 0.11 0.06 ± 0.04 0.10 ± 0.06
Lunit-BT 0.04 ± 0.03 0.05 ± 0.03 0.02 ± 0.02 0.10 ± 0.04 0.07 ± 0.07 0.02 ± 0.02 0.02 ± 0.02 0.13 ± 0.05 0.07 ± 0.02 0.06 ± 0.04
Lunit-SwAV 0.08 ± 0.04 0.04 ± 0.05 0.05 ± 0.03 0.11 ± 0.05 0.07 ± 0.06 0.06 ± 0.03 0.08 ± 0.03 0.07 ± 0.05 0.17 ± 0.07 0.08 ± 0.05
Mean pool Swin 0.08 ± 0.01 0.10 ± 0.04 0.13 ± 0.05 0.05 ± 0.02 0.17 ± 0.12 0.17 ± 0.02 0.02 ± 0.02 0.13 ± 0.03 0.10 ± 0.02 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.02 0.02 ± 0.02 0.00 ± 0.01 0.15 ± 0.11 0.03 ± 0.02 0.11 ± 0.05 0.06 ± 0.03 0.09 ± 0.02 0.06 ± 0.05
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.07 ± 0.02 0.09 ± 0.02 0.15 ± 0.11 0.15 ± 0.02 0.07 ± 0.04 0.18 ± 0.04 0.02 ± 0.02 0.10 ± 0.04
Phikon 0.11 ± 0.02 0.04 ± 0.03 0.13 ± 0.03 0.06 ± 0.03 0.11 ± 0.11 0.07 ± 0.03 0.12 ± 0.03 0.09 ± 0.07 0.11 ± 0.05 0.09 ± 0.05
ViT-S 0.11 ± 0.01 0.03 ± 0.03 0.13 ± 0.02 0.07 ± 0.03 0.15 ± 0.11 0.19 ± 0.03 0.03 ± 0.02 0.21 ± 0.04 0.07 ± 0.03 0.11 ± 0.04
Lunit-DINO 0.08 ± 0.01 0.04 ± 0.02 0.01 ± 0.02 0.05 ± 0.03 0.09 ± 0.09 0.00 ± 0.00 0.09 ± 0.02 0.00 ± 0.00 0.01 ± 0.02 0.04 ± 0.04
ResNet-50 0.08 ± 0.01 0.00 ± 0.01 0.09 ± 0.02 0.03 ± 0.02 0.21 ± 0.09 0.22 ± 0.03 0.03 ± 0.04 0.24 ± 0.02 0.13 ± 0.05 0.11 ± 0.04
RetCCL 0.01 ± 0.00 0.03 ± 0.01 0.06 ± 0.02 0.06 ± 0.02 0.15 ± 0.11 0.10 ± 0.04 0.03 ± 0.03 0.15 ± 0.01 0.06 ± 0.02 0.07 ± 0.04
Lunit-BT 0.06 ± 0.03 0.04 ± 0.01 0.06 ± 0.04 0.07 ± 0.02 0.18 ± 0.11 0.08 ± 0.02 0.03 ± 0.03 0.21 ± 0.09 0.03 ± 0.02 0.08 ± 0.05
Lunit-SwAV 0.07 ± 0.00 0.03 ± 0.02 0.04 ± 0.02 0.11 ± 0.02 0.13 ± 0.13 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.01 0.13 ± 0.04 0.08 ± 0.05

Table 4. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing no augmentations.

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4 Average
Model Feature extractor
AttMIL Swin 0.08 ± 0.04 0.23 ± 0.03 0.27 ± 0.03 0.07 ± 0.05 0.18 ± 0.08 0.19 ± 0.05 0.15 ± 0.02 0.11 ± 0.07 0.16 ± 0.05 0.16 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.04 0.04 ± 0.03 0.03 ± 0.02 0.06 ± 0.08 0.08 ± 0.03 0.08 ± 0.04 0.07 ± 0.06 0.08 ± 0.03 0.05 ± 0.04
ViT-B 0.08 ± 0.04 0.16 ± 0.03 0.12 ± 0.02 0.04 ± 0.03 0.16 ± 0.07 0.15 ± 0.04 0.10 ± 0.04 0.10 ± 0.05 0.01 ± 0.02 0.10 ± 0.04
Phikon 0.09 ± 0.04 0.08 ± 0.04 0.08 ± 0.04 0.10 ± 0.03 0.09 ± 0.08 0.06 ± 0.03 0.13 ± 0.04 0.12 ± 0.05 0.07 ± 0.05 0.09 ± 0.05
ViT-S 0.12 ± 0.05 0.13 ± 0.04 0.11 ± 0.04 0.04 ± 0.03 0.17 ± 0.09 0.20 ± 0.06 0.09 ± 0.04 0.15 ± 0.05 0.10 ± 0.11 0.12 ± 0.06
Lunit-DINO 0.04 ± 0.04 0.03 ± 0.02 0.04 ± 0.03 0.02 ± 0.02 0.06 ± 0.07 0.00 ± 0.01 0.05 ± 0.04 0.01 ± 0.02 0.09 ± 0.06 0.04 ± 0.04
ResNet-50 0.13 ± 0.04 0.20 ± 0.04 0.16 ± 0.03 0.03 ± 0.02 0.16 ± 0.08 0.22 ± 0.04 0.17 ± 0.05 0.15 ± 0.05 0.11 ± 0.07 0.15 ± 0.05
RetCCL 0.07 ± 0.04 0.02 ± 0.02 0.03 ± 0.03 0.05 ± 0.03 0.09 ± 0.06 0.06 ± 0.03 0.01 ± 0.02 0.13 ± 0.05 0.08 ± 0.02 0.06 ± 0.04
Lunit-BT 0.13 ± 0.06 0.04 ± 0.03 0.06 ± 0.08 0.12 ± 0.03 0.27 ± 0.17 0.17 ± 0.15 0.06 ± 0.06 0.34 ± 0.07 0.23 ± 0.07 0.16 ± 0.09
Lunit-SwAV 0.07 ± 0.04 0.02 ± 0.02 0.02 ± 0.02 0.05 ± 0.04 0.07 ± 0.06 0.13 ± 0.04 0.12 ± 0.04 0.05 ± 0.05 0.13 ± 0.06 0.07 ± 0.04
Transformer Swin 0.09 ± 0.03 0.15 ± 0.04 0.20 ± 0.03 0.04 ± 0.03 0.17 ± 0.09 0.21 ± 0.09 0.12 ± 0.06 0.21 ± 0.05 0.16 ± 0.08 0.15 ± 0.06
CTransPath 0.04 ± 0.03 0.01 ± 0.02 0.05 ± 0.05 0.08 ± 0.04 0.05 ± 0.07 0.02 ± 0.02 0.06 ± 0.03 0.03 ± 0.03 0.17 ± 0.09 0.06 ± 0.05
ViT-B 0.12 ± 0.04 0.14 ± 0.03 0.17 ± 0.03 0.02 ± 0.02 0.20 ± 0.08 0.22 ± 0.06 0.11 ± 0.04 0.23 ± 0.11 0.04 ± 0.03 0.14 ± 0.06
Phikon 0.11 ± 0.02 0.08 ± 0.02 0.09 ± 0.04 0.03 ± 0.02 0.09 ± 0.08 0.04 ± 0.03 0.06 ± 0.06 0.09 ± 0.08 0.03 ± 0.03 0.07 ± 0.05
ViT-S 0.09 ± 0.02 0.15 ± 0.04 0.15 ± 0.05 0.05 ± 0.03 0.15 ± 0.09 0.22 ± 0.08 0.10 ± 0.03 0.15 ± 0.04 0.04 ± 0.03 0.12 ± 0.05
Lunit-DINO 0.02 ± 0.03 0.06 ± 0.04 0.02 ± 0.03 0.02 ± 0.02 0.06 ± 0.05 0.01 ± 0.02 0.10 ± 0.05 0.04 ± 0.05 0.07 ± 0.07 0.04 ± 0.04
ResNet-50 0.15 ± 0.03 0.20 ± 0.07 0.16 ± 0.04 0.03 ± 0.02 0.22 ± 0.07 0.21 ± 0.04 0.13 ± 0.03 0.13 ± 0.07 0.20 ± 0.13 0.16 ± 0.06
RetCCL 0.07 ± 0.05 0.06 ± 0.03 0.03 ± 0.02 0.06 ± 0.04 0.10 ± 0.04 0.09 ± 0.03 0.11 ± 0.04 0.21 ± 0.09 0.08 ± 0.04 0.09 ± 0.05
Lunit-BT 0.03 ± 0.02 0.03 ± 0.02 0.02 ± 0.03 0.05 ± 0.02 0.06 ± 0.06 0.04 ± 0.03 0.02 ± 0.02 0.15 ± 0.07 0.05 ± 0.03 0.05 ± 0.04
Lunit-SwAV 0.07 ± 0.03 0.02 ± 0.02 0.04 ± 0.04 0.06 ± 0.05 0.08 ± 0.09 0.13 ± 0.06 0.14 ± 0.05 0.15 ± 0.10 0.18 ± 0.08 0.10 ± 0.06
Mean pool Swin 0.07 ± 0.01 0.14 ± 0.02 0.15 ± 0.04 0.03 ± 0.01 0.20 ± 0.09 0.18 ± 0.03 0.05 ± 0.05 0.13 ± 0.06 0.08 ± 0.03 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.02 ± 0.01 0.03 ± 0.03 0.00 ± 0.00 0.14 ± 0.10 0.03 ± 0.02 0.07 ± 0.05 0.06 ± 0.03 0.08 ± 0.02 0.05 ± 0.04
ViT-B 0.05 ± 0.01 0.08 ± 0.01 0.08 ± 0.03 0.04 ± 0.01 0.14 ± 0.11 0.16 ± 0.02 0.08 ± 0.03 0.13 ± 0.07 0.00 ± 0.01 0.08 ± 0.05
Phikon 0.13 ± 0.01 0.03 ± 0.02 0.11 ± 0.04 0.06 ± 0.02 0.12 ± 0.11 0.01 ± 0.01 0.02 ± 0.02 0.11 ± 0.05 0.07 ± 0.03 0.07 ± 0.05
ViT-S 0.08 ± 0.01 0.09 ± 0.02 0.12 ± 0.04 0.05 ± 0.03 0.18 ± 0.11 0.17 ± 0.06 0.02 ± 0.02 0.21 ± 0.04 0.03 ± 0.03 0.11 ± 0.05
Lunit-DINO 0.06 ± 0.01 0.02 ± 0.02 0.05 ± 0.04 0.05 ± 0.02 0.07 ± 0.08 0.02 ± 0.01 0.05 ± 0.03 0.03 ± 0.04 0.05 ± 0.02 0.04 ± 0.04
ResNet-50 0.08 ± 0.01 0.11 ± 0.04 0.15 ± 0.03 0.03 ± 0.01 0.21 ± 0.10 0.18 ± 0.04 0.07 ± 0.04 0.15 ± 0.03 0.06 ± 0.06 0.12 ± 0.05
RetCCL 0.02 ± 0.00 0.01 ± 0.01 0.07 ± 0.04 0.05 ± 0.01 0.12 ± 0.10 0.05 ± 0.02 0.02 ± 0.02 0.13 ± 0.03 0.05 ± 0.01 0.06 ± 0.04
Lunit-BT 0.09 ± 0.03 0.01 ± 0.00 0.04 ± 0.03 0.07 ± 0.01 0.21 ± 0.10 0.16 ± 0.04 0.06 ± 0.05 0.20 ± 0.08 0.02 ± 0.01 0.10 ± 0.05
Lunit-SwAV 0.08 ± 0.01 0.01 ± 0.01 0.02 ± 0.03 0.15 ± 0.02 0.13 ± 0.11 0.15 ± 0.01 0.17 ± 0.02 0.01 ± 0.02 0.13 ± 0.03 0.10 ± 0.04

Table 5. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing slidewise stain normalisation [58].

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4 Average
Model Feature extractor
AttMIL Swin 0.08 ± 0.03 0.20 ± 0.04 0.24 ± 0.03 0.05 ± 0.03 0.16 ± 0.07 0.14 ± 0.03 0.11 ± 0.04 0.11 ± 0.07 0.20 ± 0.03 0.14 ± 0.04
CTransPath 0.00 ± 0.00 0.02 ± 0.03 0.03 ± 0.02 0.04 ± 0.01 0.04 ± 0.05 0.07 ± 0.06 0.07 ± 0.03 0.06 ± 0.04 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.11 ± 0.04 0.12 ± 0.03 0.12 ± 0.02 0.04 ± 0.04 0.16 ± 0.12 0.14 ± 0.04 0.10 ± 0.04 0.13 ± 0.06 0.02 ± 0.02 0.10 ± 0.05
Phikon 0.11 ± 0.03 0.04 ± 0.01 0.09 ± 0.04 0.06 ± 0.02 0.09 ± 0.09 0.03 ± 0.03 0.05 ± 0.05 0.09 ± 0.05 0.06 ± 0.06 0.07 ± 0.05
ViT-S 0.10 ± 0.03 0.10 ± 0.03 0.10 ± 0.04 0.01 ± 0.02 0.19 ± 0.07 0.16 ± 0.04 0.06 ± 0.05 0.18 ± 0.07 0.07 ± 0.03 0.11 ± 0.05
Lunit-DINO 0.04 ± 0.03 0.01 ± 0.01 0.04 ± 0.03 0.02 ± 0.02 0.06 ± 0.06 0.01 ± 0.02 0.07 ± 0.04 0.02 ± 0.04 0.05 ± 0.03 0.04 ± 0.03
ResNet-50 0.17 ± 0.04 0.18 ± 0.04 0.17 ± 0.05 0.02 ± 0.01 0.17 ± 0.07 0.18 ± 0.03 0.13 ± 0.03 0.17 ± 0.07 0.15 ± 0.07 0.15 ± 0.05
RetCCL 0.09 ± 0.04 0.03 ± 0.02 0.03 ± 0.04 0.03 ± 0.02 0.10 ± 0.09 0.07 ± 0.03 0.02 ± 0.03 0.14 ± 0.04 0.07 ± 0.03 0.06 ± 0.04
Lunit-BT 0.11 ± 0.04 0.04 ± 0.01 0.02 ± 0.02 0.13 ± 0.02 0.25 ± 0.13 0.33 ± 0.07 0.08 ± 0.06 0.28 ± 0.10 0.15 ± 0.08 0.16 ± 0.07
Lunit-SwAV 0.05 ± 0.03 0.03 ± 0.03 0.04 ± 0.02 0.05 ± 0.03 0.08 ± 0.07 0.12 ± 0.04 0.12 ± 0.08 0.07 ± 0.05 0.11 ± 0.05 0.08 ± 0.05
Transformer Swin 0.11 ± 0.04 0.19 ± 0.05 0.20 ± 0.05 0.09 ± 0.04 0.19 ± 0.08 0.19 ± 0.04 0.15 ± 0.04 0.22 ± 0.07 0.09 ± 0.06 0.16 ± 0.05
CTransPath 0.01 ± 0.02 0.05 ± 0.04 0.02 ± 0.02 0.06 ± 0.04 0.06 ± 0.07 0.04 ± 0.04 0.08 ± 0.05 0.07 ± 0.07 0.08 ± 0.05 0.05 ± 0.05
ViT-B 0.11 ± 0.03 0.13 ± 0.03 0.18 ± 0.03 0.08 ± 0.03 0.16 ± 0.10 0.21 ± 0.07 0.13 ± 0.07 0.21 ± 0.04 0.09 ± 0.03 0.14 ± 0.05
Phikon 0.08 ± 0.03 0.09 ± 0.03 0.07 ± 0.03 0.07 ± 0.04 0.07 ± 0.06 0.04 ± 0.02 0.05 ± 0.06 0.08 ± 0.04 0.04 ± 0.03 0.06 ± 0.04
ViT-S 0.11 ± 0.03 0.11 ± 0.06 0.18 ± 0.03 0.07 ± 0.05 0.16 ± 0.09 0.16 ± 0.02 0.04 ± 0.05 0.19 ± 0.04 0.05 ± 0.06 0.12 ± 0.05
Lunit-DINO 0.04 ± 0.03 0.05 ± 0.04 0.02 ± 0.02 0.03 ± 0.03 0.04 ± 0.05 0.02 ± 0.03 0.10 ± 0.04 0.09 ± 0.07 0.06 ± 0.06 0.05 ± 0.04
ResNet-50 0.16 ± 0.05 0.18 ± 0.10 0.23 ± 0.04 0.04 ± 0.05 0.14 ± 0.08 0.21 ± 0.06 0.13 ± 0.05 0.16 ± 0.05 0.30 ± 0.11 0.17 ± 0.07
RetCCL 0.06 ± 0.03 0.06 ± 0.04 0.04 ± 0.04 0.06 ± 0.02 0.08 ± 0.06 0.08 ± 0.04 0.09 ± 0.06 0.15 ± 0.08 0.07 ± 0.04 0.08 ± 0.05
Lunit-BT 0.03 ± 0.03 0.04 ± 0.03 0.05 ± 0.03 0.07 ± 0.04 0.05 ± 0.06 0.05 ± 0.03 0.09 ± 0.06 0.15 ± 0.04 0.07 ± 0.06 0.07 ± 0.04
Lunit-SwAV 0.06 ± 0.03 0.01 ± 0.01 0.03 ± 0.02 0.08 ± 0.04 0.07 ± 0.06 0.08 ± 0.04 0.15 ± 0.05 0.07 ± 0.10 0.12 ± 0.02 0.08 ± 0.05
Mean pool Swin 0.06 ± 0.01 0.12 ± 0.02 0.11 ± 0.04 0.01 ± 0.01 0.20 ± 0.11 0.11 ± 0.03 0.04 ± 0.03 0.15 ± 0.04 0.04 ± 0.01 0.09 ± 0.04
CTransPath 0.00 ± 0.00 0.01 ± 0.01 0.02 ± 0.02 0.01 ± 0.01 0.18 ± 0.10 0.03 ± 0.03 0.09 ± 0.05 0.07 ± 0.04 0.05 ± 0.02 0.05 ± 0.04
ViT-B 0.03 ± 0.00 0.09 ± 0.01 0.07 ± 0.03 0.03 ± 0.01 0.17 ± 0.10 0.17 ± 0.04 0.10 ± 0.05 0.16 ± 0.06 0.02 ± 0.02 0.09 ± 0.05
Phikon 0.11 ± 0.01 0.01 ± 0.01 0.11 ± 0.03 0.08 ± 0.04 0.16 ± 0.15 0.02 ± 0.03 0.05 ± 0.03 0.09 ± 0.03 0.07 ± 0.06 0.08 ± 0.06
ViT-S 0.06 ± 0.01 0.05 ± 0.04 0.09 ± 0.05 0.02 ± 0.02 0.17 ± 0.12 0.17 ± 0.03 0.02 ± 0.01 0.22 ± 0.06 0.07 ± 0.04 0.10 ± 0.05
Lunit-DINO 0.05 ± 0.01 0.02 ± 0.01 0.04 ± 0.04 0.04 ± 0.02 0.11 ± 0.12 0.04 ± 0.04 0.07 ± 0.04 0.00 ± 0.00 0.03 ± 0.02 0.04 ± 0.05
ResNet-50 0.08 ± 0.00 0.11 ± 0.04 0.07 ± 0.03 0.03 ± 0.01 0.22 ± 0.11 0.15 ± 0.05 0.03 ± 0.03 0.21 ± 0.04 0.11 ± 0.10 0.11 ± 0.06
RetCCL 0.01 ± 0.00 0.02 ± 0.01 0.05 ± 0.03 0.03 ± 0.01 0.14 ± 0.10 0.04 ± 0.03 0.05 ± 0.05 0.14 ± 0.05 0.03 ± 0.01 0.06 ± 0.04
Lunit-BT 0.06 ± 0.03 0.02 ± 0.01 0.03 ± 0.03 0.05 ± 0.01 0.18 ± 0.12 0.11 ± 0.04 0.02 ± 0.03 0.18 ± 0.03 0.00 ± 0.01 0.07 ± 0.05
Lunit-SwAV 0.06 ± 0.00 0.02 ± 0.01 0.04 ± 0.02 0.12 ± 0.01 0.12 ± 0.11 0.12 ± 0.03 0.15 ± 0.02 0.04 ± 0.03 0.09 ± 0.02 0.08 ± 0.04

Table 6. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing patchwise stain normalisation [58].

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4 Average
Model Feature extractor
AttMIL Swin 0.05 ± 0.03 0.16 ± 0.05 0.28 ± 0.03 0.07 ± 0.02 0.14 ± 0.08 0.11 ± 0.03 0.13 ± 0.03 0.10 ± 0.04 0.20 ± 0.03 0.14 ± 0.04
CTransPath 0.00 ± 0.00 0.02 ± 0.03 0.03 ± 0.02 0.01 ± 0.01 0.05 ± 0.05 0.04 ± 0.03 0.07 ± 0.05 0.07 ± 0.02 0.06 ± 0.03 0.04 ± 0.03
ViT-B 0.07 ± 0.05 0.10 ± 0.02 0.15 ± 0.03 0.08 ± 0.03 0.16 ± 0.06 0.13 ± 0.03 0.09 ± 0.08 0.13 ± 0.04 0.01 ± 0.02 0.10 ± 0.04
Phikon 0.07 ± 0.03 0.07 ± 0.03 0.06 ± 0.06 0.11 ± 0.03 0.07 ± 0.06 0.04 ± 0.03 0.07 ± 0.04 0.09 ± 0.08 0.19 ± 0.09 0.08 ± 0.06
ViT-S 0.06 ± 0.03 0.04 ± 0.02 0.14 ± 0.04 0.06 ± 0.04 0.21 ± 0.10 0.19 ± 0.06 0.05 ± 0.04 0.19 ± 0.05 0.07 ± 0.08 0.11 ± 0.05
Lunit-DINO 0.06 ± 0.03 0.04 ± 0.03 0.02 ± 0.02 0.01 ± 0.02 0.05 ± 0.06 0.00 ± 0.00 0.06 ± 0.02 0.01 ± 0.02 0.04 ± 0.03 0.03 ± 0.03
ResNet-50 0.13 ± 0.03 0.10 ± 0.04 0.13 ± 0.04 0.03 ± 0.03 0.15 ± 0.10 0.22 ± 0.05 0.14 ± 0.05 0.22 ± 0.06 0.29 ± 0.08 0.16 ± 0.06
RetCCL 0.05 ± 0.03 0.04 ± 0.03 0.03 ± 0.03 0.04 ± 0.03 0.07 ± 0.07 0.06 ± 0.03 0.03 ± 0.04 0.16 ± 0.03 0.06 ± 0.03 0.06 ± 0.04
Lunit-BT 0.08 ± 0.04 0.03 ± 0.03 0.04 ± 0.05 0.12 ± 0.03 0.29 ± 0.20 0.25 ± 0.12 0.08 ± 0.08 0.34 ± 0.11 0.21 ± 0.05 0.16 ± 0.09
Lunit-SwAV 0.06 ± 0.03 0.06 ± 0.03 0.07 ± 0.04 0.10 ± 0.05 0.07 ± 0.06 0.06 ± 0.03 0.08 ± 0.05 0.05 ± 0.05 0.11 ± 0.04 0.07 ± 0.04
Transformer Swin 0.06 ± 0.03 0.10 ± 0.04 0.24 ± 0.04 0.04 ± 0.03 0.15 ± 0.09 0.10 ± 0.03 0.05 ± 0.05 0.19 ± 0.08 0.13 ± 0.06 0.12 ± 0.05
CTransPath 0.01 ± 0.01 0.02 ± 0.02 0.04 ± 0.04 0.08 ± 0.03 0.05 ± 0.05 0.08 ± 0.05 0.07 ± 0.05 0.13 ± 0.08 0.06 ± 0.03 0.06 ± 0.05
ViT-B 0.07 ± 0.03 0.08 ± 0.04 0.15 ± 0.04 0.08 ± 0.04 0.17 ± 0.06 0.20 ± 0.03 0.11 ± 0.02 0.14 ± 0.06 0.05 ± 0.04 0.12 ± 0.04
Phikon 0.06 ± 0.04 0.09 ± 0.03 0.08 ± 0.02 0.08 ± 0.03 0.06 ± 0.04 0.05 ± 0.04 0.04 ± 0.03 0.05 ± 0.04 0.15 ± 0.05 0.07 ± 0.04
ViT-S 0.06 ± 0.03 0.04 ± 0.03 0.17 ± 0.04 0.10 ± 0.06 0.18 ± 0.07 0.19 ± 0.02 0.10 ± 0.04 0.16 ± 0.05 0.04 ± 0.04 0.12 ± 0.04
Lunit-DINO 0.03 ± 0.03 0.08 ± 0.03 0.03 ± 0.02 0.02 ± 0.02 0.04 ± 0.04 0.00 ± 0.01 0.07 ± 0.03 0.02 ± 0.02 0.06 ± 0.06 0.04 ± 0.03
ResNet-50 0.09 ± 0.02 0.09 ± 0.05 0.18 ± 0.04 0.04 ± 0.05 0.18 ± 0.06 0.24 ± 0.05 0.09 ± 0.04 0.17 ± 0.07 0.32 ± 0.05 0.16 ± 0.05
RetCCL 0.07 ± 0.05 0.06 ± 0.04 0.02 ± 0.02 0.09 ± 0.04 0.06 ± 0.05 0.19 ± 0.05 0.12 ± 0.07 0.16 ± 0.06 0.11 ± 0.08 0.10 ± 0.06
Lunit-BT 0.02 ± 0.02 0.05 ± 0.04 0.05 ± 0.04 0.06 ± 0.03 0.07 ± 0.06 0.04 ± 0.04 0.02 ± 0.03 0.12 ± 0.04 0.05 ± 0.02 0.05 ± 0.04
Lunit-SwAV 0.06 ± 0.04 0.04 ± 0.03 0.05 ± 0.02 0.12 ± 0.04 0.07 ± 0.05 0.08 ± 0.05 0.10 ± 0.03 0.05 ± 0.06 0.18 ± 0.06 0.08 ± 0.05
Mean pool Swin 0.08 ± 0.01 0.10 ± 0.03 0.17 ± 0.04 0.06 ± 0.03 0.15 ± 0.11 0.14 ± 0.02 0.05 ± 0.05 0.13 ± 0.02 0.13 ± 0.03 0.11 ± 0.05
CTransPath 0.00 ± 0.00 0.04 ± 0.02 0.04 ± 0.03 0.01 ± 0.02 0.16 ± 0.11 0.03 ± 0.02 0.10 ± 0.03 0.04 ± 0.02 0.06 ± 0.03 0.05 ± 0.04
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.10 ± 0.02 0.09 ± 0.02 0.17 ± 0.09 0.13 ± 0.03 0.09 ± 0.05 0.16 ± 0.03 0.01 ± 0.01 0.10 ± 0.04
Phikon 0.08 ± 0.01 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.03 0.12 ± 0.12 0.01 ± 0.01 0.13 ± 0.04 0.08 ± 0.08 0.09 ± 0.02 0.08 ± 0.05
ViT-S 0.08 ± 0.01 0.03 ± 0.02 0.15 ± 0.03 0.09 ± 0.02 0.14 ± 0.08 0.15 ± 0.02 0.02 ± 0.02 0.21 ± 0.05 0.07 ± 0.03 0.10 ± 0.03
Lunit-DINO 0.09 ± 0.01 0.04 ± 0.02 0.00 ± 0.01 0.05 ± 0.03 0.09 ± 0.09 0.01 ± 0.02 0.10 ± 0.04 0.00 ± 0.01 0.01 ± 0.02 0.04 ± 0.03
ResNet-50 0.08 ± 0.01 0.00 ± 0.01 0.12 ± 0.02 0.04 ± 0.03 0.19 ± 0.10 0.21 ± 0.05 0.04 ± 0.03 0.23 ± 0.04 0.12 ± 0.04 0.12 ± 0.04
RetCCL 0.01 ± 0.00 0.04 ± 0.01 0.08 ± 0.03 0.07 ± 0.03 0.14 ± 0.12 0.11 ± 0.04 0.07 ± 0.05 0.14 ± 0.01 0.05 ± 0.01 0.08 ± 0.05
Lunit-BT 0.06 ± 0.02 0.04 ± 0.01 0.06 ± 0.04 0.08 ± 0.02 0.22 ± 0.08 0.09 ± 0.05 0.02 ± 0.02 0.16 ± 0.01 0.02 ± 0.01 0.08 ± 0.04
Lunit-SwAV 0.07 ± 0.00 0.04 ± 0.02 0.08 ± 0.04 0.11 ± 0.03 0.13 ± 0.12 0.05 ± 0.02 0.13 ± 0.03 0.03 ± 0.02 0.11 ± 0.04 0.08 ± 0.05

Table 7. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models when employing rotation/flipping augmentations.

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4 Average
Model Feature extractor
AttMIL Swin 0.04 ± 0.03 0.14 ± 0.02 0.21 ± 0.02 0.07 ± 0.04 0.13 ± 0.08 0.15 ± 0.04 0.10 ± 0.05 0.16 ± 0.08 0.17 ± 0.05 0.13 ± 0.05
CTransPath 0.00 ± 0.00 0.01 ± 0.02 0.00 ± 0.01 0.04 ± 0.03 0.03 ± 0.03 0.10 ± 0.04 0.06 ± 0.03 0.09 ± 0.07 0.07 ± 0.03 0.04 ± 0.03
ViT-B 0.04 ± 0.03 0.10 ± 0.04 0.12 ± 0.03 0.08 ± 0.04 0.14 ± 0.06 0.13 ± 0.04 0.06 ± 0.03 0.16 ± 0.05 0.02 ± 0.02 0.09 ± 0.04
Phikon 0.13 ± 0.05 0.09 ± 0.03 0.10 ± 0.05 0.12 ± 0.05 0.07 ± 0.07 0.06 ± 0.03 0.10 ± 0.05 0.18 ± 0.08 0.13 ± 0.06 0.11 ± 0.05
ViT-S 0.08 ± 0.03 0.07 ± 0.02 0.14 ± 0.02 0.08 ± 0.04 0.17 ± 0.09 0.15 ± 0.03 0.04 ± 0.03 0.19 ± 0.06 0.06 ± 0.05 0.11 ± 0.05
Lunit-DINO 0.05 ± 0.04 0.03 ± 0.03 0.04 ± 0.02 0.04 ± 0.04 0.06 ± 0.06 0.00 ± 0.01 0.07 ± 0.04 0.01 ± 0.02 0.05 ± 0.05 0.04 ± 0.04
ResNet-50 0.09 ± 0.03 0.07 ± 0.03 0.14 ± 0.03 0.01 ± 0.01 0.16 ± 0.08 0.24 ± 0.05 0.14 ± 0.03 0.24 ± 0.08 0.30 ± 0.11 0.15 ± 0.06
RetCCL 0.06 ± 0.04 0.03 ± 0.03 0.03 ± 0.02 0.07 ± 0.03 0.06 ± 0.05 0.11 ± 0.06 0.04 ± 0.05 0.18 ± 0.05 0.06 ± 0.02 0.07 ± 0.04
Lunit-BT 0.17 ± 0.05 0.11 ± 0.07 0.20 ± 0.20 0.17 ± 0.03 0.40 ± 0.07 0.21 ± 0.10 0.12 ± 0.05 0.24 ± 0.09 0.20 ± 0.05 0.20 ± 0.09
Lunit-SwAV 0.07 ± 0.03 0.03 ± 0.02 0.07 ± 0.04 0.10 ± 0.04 0.07 ± 0.06 0.08 ± 0.03 0.07 ± 0.05 0.13 ± 0.07 0.11 ± 0.05 0.08 ± 0.04
Transformer Swin 0.07 ± 0.02 0.13 ± 0.06 0.21 ± 0.03 0.03 ± 0.03 0.13 ± 0.09 0.13 ± 0.03 0.06 ± 0.06 0.07 ± 0.04 0.11 ± 0.03 0.10 ± 0.05
CTransPath 0.02 ± 0.02 0.06 ± 0.02 0.03 ± 0.02 0.04 ± 0.03 0.04 ± 0.04 0.06 ± 0.04 0.08 ± 0.03 0.07 ± 0.08 0.13 ± 0.06 0.06 ± 0.04
ViT-B 0.04 ± 0.03 0.11 ± 0.04 0.15 ± 0.03 0.09 ± 0.02 0.18 ± 0.13 0.15 ± 0.02 0.16 ± 0.05 0.23 ± 0.07 0.03 ± 0.03 0.13 ± 0.06
Phikon 0.12 ± 0.03 0.10 ± 0.04 0.06 ± 0.03 0.11 ± 0.03 0.08 ± 0.05 0.05 ± 0.04 0.04 ± 0.03 0.01 ± 0.02 0.15 ± 0.05 0.08 ± 0.04
ViT-S 0.06 ± 0.03 0.06 ± 0.03 0.14 ± 0.05 0.08 ± 0.03 0.19 ± 0.05 0.17 ± 0.05 0.06 ± 0.04 0.20 ± 0.04 0.02 ± 0.02 0.11 ± 0.04
Lunit-DINO 0.04 ± 0.03 0.05 ± 0.03 0.02 ± 0.01 0.04 ± 0.03 0.06 ± 0.06 0.01 ± 0.01 0.09 ± 0.05 0.06 ± 0.04 0.02 ± 0.02 0.04 ± 0.04
ResNet-50 0.09 ± 0.03 0.12 ± 0.04 0.10 ± 0.03 0.01 ± 0.02 0.18 ± 0.08 0.18 ± 0.03 0.04 ± 0.03 0.18 ± 0.05 0.27 ± 0.07 0.13 ± 0.05
RetCCL 0.03 ± 0.03 0.06 ± 0.04 0.01 ± 0.01 0.09 ± 0.04 0.08 ± 0.07 0.12 ± 0.07 0.13 ± 0.05 0.24 ± 0.08 0.13 ± 0.07 0.10 ± 0.06
Lunit-BT 0.03 ± 0.03 0.03 ± 0.03 0.04 ± 0.03 0.10 ± 0.03 0.09 ± 0.08 0.07 ± 0.06 0.03 ± 0.02 0.13 ± 0.04 0.05 ± 0.02 0.06 ± 0.04
Lunit-SwAV 0.08 ± 0.03 0.02 ± 0.03 0.03 ± 0.03 0.10 ± 0.03 0.07 ± 0.06 0.10 ± 0.04 0.10 ± 0.04 0.06 ± 0.04 0.16 ± 0.06 0.08 ± 0.04
Mean pool Swin 0.06 ± 0.01 0.10 ± 0.03 0.16 ± 0.03 0.04 ± 0.01 0.19 ± 0.12 0.15 ± 0.02 0.03 ± 0.04 0.18 ± 0.05 0.13 ± 0.04 0.12 ± 0.05
CTransPath 0.00 ± 0.00 0.03 ± 0.02 0.04 ± 0.02 0.00 ± 0.00 0.15 ± 0.11 0.04 ± 0.03 0.08 ± 0.03 0.04 ± 0.02 0.09 ± 0.03 0.05 ± 0.04
ViT-B 0.07 ± 0.01 0.08 ± 0.01 0.10 ± 0.02 0.08 ± 0.01 0.18 ± 0.08 0.17 ± 0.02 0.11 ± 0.05 0.20 ± 0.02 0.02 ± 0.02 0.11 ± 0.04
Phikon 0.11 ± 0.01 0.02 ± 0.02 0.13 ± 0.03 0.07 ± 0.04 0.12 ± 0.11 0.02 ± 0.02 0.11 ± 0.05 0.09 ± 0.07 0.12 ± 0.03 0.09 ± 0.05
ViT-S 0.11 ± 0.01 0.03 ± 0.02 0.16 ± 0.02 0.06 ± 0.01 0.16 ± 0.11 0.20 ± 0.03 0.04 ± 0.02 0.23 ± 0.03 0.06 ± 0.03 0.12 ± 0.04
Lunit-DINO 0.09 ± 0.01 0.02 ± 0.02 0.01 ± 0.02 0.04 ± 0.03 0.08 ± 0.09 0.01 ± 0.02 0.09 ± 0.02 0.00 ± 0.00 0.00 ± 0.01 0.04 ± 0.03
ResNet-50 0.08 ± 0.01 0.01 ± 0.01 0.11 ± 0.02 0.02 ± 0.01 0.23 ± 0.10 0.22 ± 0.03 0.01 ± 0.01 0.27 ± 0.05 0.15 ± 0.06 0.12 ± 0.04
RetCCL 0.01 ± 0.01 0.03 ± 0.01 0.07 ± 0.02 0.06 ± 0.01 0.14 ± 0.11 0.10 ± 0.05 0.08 ± 0.07 0.16 ± 0.03 0.06 ± 0.02 0.08 ± 0.05
Lunit-BT 0.08 ± 0.04 0.04 ± 0.01 0.10 ± 0.05 0.09 ± 0.02 0.29 ± 0.09 0.12 ± 0.07 0.03 ± 0.02 0.19 ± 0.02 0.09 ± 0.14 0.11 ± 0.07
Lunit-SwAV 0.07 ± 0.00 0.02 ± 0.01 0.03 ± 0.02 0.10 ± 0.02 0.15 ± 0.13 0.05 ± 0.02 0.13 ± 0.04 0.11 ± 0.05 0.13 ± 0.05 0.09 ± 0.05

Table 8. Normalised differential AUROC scores for all tasks, feature extractors, and downstream models, when employing all augmentations.

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4
Model Feature extractor
AttMIL Swin 0.75 ± 0.01 0.65 ± 0.02 0.54 ± 0.02 0.60 ± 0.02 0.74 ± 0.09 0.72 ± 0.04 0.51 ± 0.05 0.63 ± 0.07 0.55 ± 0.05
CTransPath 0.82 ± 0.02 0.81 ± 0.02 0.80 ± 0.02 0.62 ± 0.01 0.86 ± 0.08 0.82 ± 0.03 0.60 ± 0.03 0.71 ± 0.01 0.65 ± 0.02
ViT-B 0.74 ± 0.04 0.70 ± 0.01 0.66 ± 0.03 0.59 ± 0.01 0.74 ± 0.06 0.75 ± 0.03 0.62 ± 0.05 0.59 ± 0.08 0.70 ± 0.03
Phikon 0.73 ± 0.01 0.73 ± 0.02 0.72 ± 0.03 0.57 ± 0.02 0.85 ± 0.08 0.84 ± 0.05 0.59 ± 0.05 0.70 ± 0.06 0.54 ± 0.08
ViT-S 0.69 ± 0.02 0.73 ± 0.02 0.68 ± 0.06 0.58 ± 0.04 0.73 ± 0.10 0.72 ± 0.04 0.59 ± 0.03 0.58 ± 0.03 0.63 ± 0.08
Lunit-DINO 0.74 ± 0.02 0.78 ± 0.04 0.79 ± 0.03 0.64 ± 0.02 0.85 ± 0.03 0.90 ± 0.02 0.59 ± 0.04 0.76 ± 0.04 0.69 ± 0.02
ResNet-50 0.67 ± 0.02 0.73 ± 0.04 0.70 ± 0.03 0.65 ± 0.04 0.74 ± 0.09 0.68 ± 0.04 0.54 ± 0.04 0.55 ± 0.07 0.50 ± 0.10
RetCCL 0.76 ± 0.03 0.78 ± 0.01 0.78 ± 0.03 0.62 ± 0.01 0.85 ± 0.07 0.82 ± 0.03 0.63 ± 0.03 0.63 ± 0.02 0.66 ± 0.02
Lunit-BT 0.69 ± 0.03 0.75 ± 0.04 0.80 ± 0.00 0.54 ± 0.03 0.58 ± 0.17 0.62 ± 0.15 0.62 ± 0.05 0.43 ± 0.15 0.46 ± 0.03
Lunit-SwAV 0.76 ± 0.01 0.75 ± 0.03 0.76 ± 0.02 0.54 ± 0.06 0.84 ± 0.06 0.80 ± 0.03 0.53 ± 0.06 0.70 ± 0.08 0.58 ± 0.09
Transformer Swin 0.74 ± 0.04 0.70 ± 0.02 0.61 ± 0.03 0.54 ± 0.04 0.76 ± 0.09 0.69 ± 0.08 0.56 ± 0.03 0.60 ± 0.04 0.57 ± 0.05
CTransPath 0.81 ± 0.03 0.80 ± 0.01 0.80 ± 0.03 0.55 ± 0.08 0.85 ± 0.09 0.86 ± 0.02 0.60 ± 0.04 0.68 ± 0.07 0.62 ± 0.05
ViT-B 0.74 ± 0.03 0.71 ± 0.02 0.65 ± 0.03 0.52 ± 0.01 0.71 ± 0.07 0.70 ± 0.06 0.51 ± 0.05 0.56 ± 0.08 0.65 ± 0.06
Phikon 0.69 ± 0.04 0.73 ± 0.05 0.75 ± 0.02 0.59 ± 0.03 0.85 ± 0.06 0.83 ± 0.04 0.60 ± 0.04 0.65 ± 0.07 0.59 ± 0.06
ViT-S 0.72 ± 0.01 0.74 ± 0.03 0.60 ± 0.08 0.52 ± 0.07 0.71 ± 0.10 0.72 ± 0.07 0.57 ± 0.04 0.53 ± 0.10 0.68 ± 0.03
Lunit-DINO 0.78 ± 0.04 0.75 ± 0.03 0.79 ± 0.01 0.62 ± 0.02 0.87 ± 0.05 0.87 ± 0.02 0.59 ± 0.02 0.74 ± 0.05 0.69 ± 0.03
ResNet-50 0.69 ± 0.04 0.71 ± 0.08 0.67 ± 0.02 0.59 ± 0.08 0.73 ± 0.09 0.69 ± 0.07 0.54 ± 0.03 0.57 ± 0.06 0.41 ± 0.12
RetCCL 0.73 ± 0.03 0.77 ± 0.05 0.80 ± 0.04 0.55 ± 0.06 0.85 ± 0.07 0.73 ± 0.03 0.53 ± 0.05 0.55 ± 0.11 0.65 ± 0.06
Lunit-BT 0.78 ± 0.03 0.76 ± 0.03 0.80 ± 0.01 0.53 ± 0.05 0.85 ± 0.08 0.86 ± 0.02 0.63 ± 0.03 0.63 ± 0.04 0.65 ± 0.02
Lunit-SwAV 0.74 ± 0.05 0.77 ± 0.06 0.77 ± 0.02 0.53 ± 0.06 0.85 ± 0.06 0.82 ± 0.03 0.57 ± 0.03 0.69 ± 0.05 0.54 ± 0.07
Mean pool Swin 0.73 ± 0.01 0.68 ± 0.04 0.62 ± 0.05 0.59 ± 0.02 0.67 ± 0.13 0.72 ± 0.02 0.66 ± 0.02 0.67 ± 0.03 0.61 ± 0.02
CTransPath 0.82 ± 0.00 0.74 ± 0.02 0.72 ± 0.02 0.64 ± 0.02 0.69 ± 0.12 0.86 ± 0.02 0.58 ± 0.06 0.73 ± 0.04 0.62 ± 0.02
ViT-B 0.75 ± 0.01 0.71 ± 0.01 0.68 ± 0.02 0.56 ± 0.01 0.69 ± 0.11 0.74 ± 0.02 0.61 ± 0.04 0.61 ± 0.04 0.69 ± 0.02
Phikon 0.71 ± 0.02 0.74 ± 0.03 0.61 ± 0.03 0.59 ± 0.03 0.73 ± 0.12 0.82 ± 0.04 0.57 ± 0.03 0.70 ± 0.07 0.60 ± 0.05
ViT-S 0.71 ± 0.01 0.76 ± 0.04 0.61 ± 0.01 0.57 ± 0.02 0.69 ± 0.11 0.70 ± 0.04 0.65 ± 0.03 0.58 ± 0.05 0.64 ± 0.02
Lunit-DINO 0.74 ± 0.01 0.74 ± 0.02 0.73 ± 0.02 0.60 ± 0.03 0.75 ± 0.12 0.89 ± 0.02 0.60 ± 0.01 0.79 ± 0.01 0.70 ± 0.03
ResNet-50 0.74 ± 0.01 0.78 ± 0.01 0.65 ± 0.01 0.61 ± 0.01 0.63 ± 0.09 0.67 ± 0.03 0.66 ± 0.04 0.56 ± 0.03 0.58 ± 0.05
RetCCL 0.81 ± 0.00 0.75 ± 0.01 0.68 ± 0.02 0.58 ± 0.01 0.69 ± 0.12 0.79 ± 0.05 0.66 ± 0.03 0.64 ± 0.01 0.65 ± 0.00
Lunit-BT 0.76 ± 0.03 0.75 ± 0.00 0.69 ± 0.05 0.57 ± 0.01 0.66 ± 0.12 0.81 ± 0.02 0.66 ± 0.03 0.58 ± 0.10 0.68 ± 0.01
Lunit-SwAV 0.75 ± 0.00 0.75 ± 0.02 0.70 ± 0.02 0.53 ± 0.01 0.71 ± 0.15 0.84 ± 0.01 0.56 ± 0.03 0.76 ± 0.01 0.58 ± 0.05

Table 9. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing no
augmentations.
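Each entry in Tables 9 to 13 is a mean ± standard deviation over five training seeds. A minimal sketch of this aggregation, using hypothetical slide-level labels and per-seed prediction scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical slide-level labels and per-seed predicted scores for one task;
# in the paper, each seed corresponds to an independently trained aggregator.
y_true = rng.integers(0, 2, size=200)
per_seed_scores = [rng.random(200) for _ in range(5)]

aurocs = np.array([roc_auc_score(y_true, s) for s in per_seed_scores])
print(f"test AUROC: {aurocs.mean():.2f} ± {aurocs.std():.2f}")
```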

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4
Model Feature extractor
AttMIL Swin 0.74 ± 0.02 0.58 ± 0.03 0.55 ± 0.03 0.58 ± 0.05 0.73 ± 0.09 0.72 ± 0.04 0.55 ± 0.01 0.66 ± 0.05 0.55 ± 0.05
CTransPath 0.82 ± 0.04 0.77 ± 0.05 0.78 ± 0.03 0.62 ± 0.01 0.84 ± 0.10 0.84 ± 0.00 0.61 ± 0.04 0.69 ± 0.05 0.63 ± 0.03
ViT-B 0.74 ± 0.02 0.65 ± 0.03 0.69 ± 0.02 0.61 ± 0.04 0.75 ± 0.08 0.77 ± 0.03 0.60 ± 0.03 0.67 ± 0.02 0.70 ± 0.04
Phikon 0.73 ± 0.02 0.72 ± 0.04 0.73 ± 0.04 0.55 ± 0.03 0.82 ± 0.08 0.85 ± 0.03 0.57 ± 0.04 0.65 ± 0.01 0.65 ± 0.05
ViT-S 0.70 ± 0.04 0.68 ± 0.04 0.70 ± 0.04 0.61 ± 0.03 0.74 ± 0.10 0.72 ± 0.06 0.61 ± 0.04 0.62 ± 0.03 0.62 ± 0.13
Lunit-DINO 0.78 ± 0.02 0.77 ± 0.02 0.78 ± 0.03 0.63 ± 0.01 0.84 ± 0.08 0.91 ± 0.04 0.65 ± 0.04 0.76 ± 0.06 0.62 ± 0.07
ResNet-50 0.69 ± 0.03 0.61 ± 0.05 0.66 ± 0.04 0.62 ± 0.02 0.74 ± 0.08 0.69 ± 0.03 0.53 ± 0.05 0.62 ± 0.02 0.60 ± 0.07
RetCCL 0.76 ± 0.03 0.78 ± 0.03 0.78 ± 0.03 0.60 ± 0.03 0.82 ± 0.06 0.85 ± 0.03 0.69 ± 0.01 0.63 ± 0.02 0.64 ± 0.01
Lunit-BT 0.69 ± 0.05 0.76 ± 0.04 0.75 ± 0.10 0.53 ± 0.02 0.64 ± 0.19 0.75 ± 0.17 0.63 ± 0.08 0.42 ± 0.07 0.49 ± 0.07
Lunit-SwAV 0.75 ± 0.03 0.78 ± 0.02 0.79 ± 0.02 0.60 ± 0.05 0.83 ± 0.06 0.79 ± 0.04 0.58 ± 0.03 0.71 ± 0.04 0.58 ± 0.07
Transformer Swin 0.73 ± 0.03 0.66 ± 0.04 0.61 ± 0.03 0.58 ± 0.03 0.74 ± 0.10 0.69 ± 0.10 0.57 ± 0.06 0.53 ± 0.04 0.55 ± 0.09
CTransPath 0.79 ± 0.03 0.79 ± 0.03 0.76 ± 0.05 0.54 ± 0.05 0.87 ± 0.08 0.88 ± 0.02 0.63 ± 0.03 0.71 ± 0.05 0.54 ± 0.09
ViT-B 0.70 ± 0.04 0.67 ± 0.03 0.64 ± 0.03 0.60 ± 0.03 0.71 ± 0.09 0.68 ± 0.07 0.58 ± 0.04 0.52 ± 0.11 0.67 ± 0.04
Phikon 0.72 ± 0.01 0.73 ± 0.01 0.72 ± 0.04 0.59 ± 0.01 0.82 ± 0.09 0.86 ± 0.03 0.63 ± 0.07 0.66 ± 0.08 0.68 ± 0.04
ViT-S 0.73 ± 0.01 0.65 ± 0.05 0.66 ± 0.06 0.57 ± 0.03 0.76 ± 0.10 0.68 ± 0.09 0.59 ± 0.03 0.60 ± 0.02 0.67 ± 0.03
Lunit-DINO 0.81 ± 0.03 0.74 ± 0.04 0.79 ± 0.03 0.60 ± 0.03 0.86 ± 0.06 0.89 ± 0.03 0.59 ± 0.07 0.71 ± 0.06 0.64 ± 0.07
ResNet-50 0.68 ± 0.03 0.61 ± 0.07 0.64 ± 0.04 0.59 ± 0.02 0.70 ± 0.08 0.69 ± 0.04 0.56 ± 0.03 0.62 ± 0.06 0.51 ± 0.14
RetCCL 0.76 ± 0.05 0.75 ± 0.04 0.78 ± 0.02 0.56 ± 0.05 0.81 ± 0.04 0.81 ± 0.02 0.58 ± 0.04 0.54 ± 0.09 0.63 ± 0.03
Lunit-BT 0.80 ± 0.03 0.78 ± 0.02 0.78 ± 0.03 0.57 ± 0.01 0.85 ± 0.06 0.86 ± 0.02 0.67 ± 0.02 0.60 ± 0.07 0.66 ± 0.01
Lunit-SwAV 0.76 ± 0.03 0.79 ± 0.01 0.77 ± 0.04 0.56 ± 0.06 0.83 ± 0.10 0.78 ± 0.06 0.55 ± 0.05 0.59 ± 0.11 0.53 ± 0.09
Mean pool Swin 0.76 ± 0.01 0.62 ± 0.02 0.60 ± 0.04 0.61 ± 0.01 0.62 ± 0.09 0.73 ± 0.03 0.63 ± 0.05 0.67 ± 0.07 0.63 ± 0.03
CTransPath 0.83 ± 0.00 0.74 ± 0.00 0.71 ± 0.01 0.64 ± 0.01 0.67 ± 0.09 0.89 ± 0.01 0.60 ± 0.05 0.74 ± 0.03 0.62 ± 0.02
ViT-B 0.78 ± 0.01 0.68 ± 0.01 0.67 ± 0.02 0.60 ± 0.01 0.67 ± 0.12 0.75 ± 0.02 0.60 ± 0.04 0.67 ± 0.07 0.70 ± 0.01
Phikon 0.70 ± 0.01 0.73 ± 0.02 0.64 ± 0.03 0.58 ± 0.02 0.69 ± 0.12 0.91 ± 0.02 0.65 ± 0.03 0.69 ± 0.06 0.63 ± 0.03
ViT-S 0.75 ± 0.01 0.68 ± 0.02 0.63 ± 0.03 0.59 ± 0.03 0.63 ± 0.11 0.74 ± 0.06 0.65 ± 0.03 0.59 ± 0.04 0.67 ± 0.03
Lunit-DINO 0.76 ± 0.01 0.74 ± 0.03 0.70 ± 0.05 0.60 ± 0.01 0.75 ± 0.12 0.89 ± 0.01 0.63 ± 0.03 0.77 ± 0.05 0.65 ± 0.02
ResNet-50 0.74 ± 0.01 0.65 ± 0.05 0.60 ± 0.02 0.61 ± 0.01 0.61 ± 0.10 0.73 ± 0.04 0.61 ± 0.04 0.65 ± 0.02 0.65 ± 0.06
RetCCL 0.80 ± 0.00 0.76 ± 0.01 0.68 ± 0.03 0.59 ± 0.00 0.69 ± 0.10 0.86 ± 0.01 0.65 ± 0.02 0.67 ± 0.03 0.66 ± 0.00
Lunit-BT 0.73 ± 0.03 0.75 ± 0.00 0.71 ± 0.04 0.57 ± 0.00 0.60 ± 0.10 0.76 ± 0.04 0.61 ± 0.05 0.60 ± 0.08 0.68 ± 0.01
Lunit-SwAV 0.74 ± 0.01 0.75 ± 0.01 0.72 ± 0.02 0.49 ± 0.02 0.69 ± 0.11 0.76 ± 0.01 0.51 ± 0.02 0.78 ± 0.02 0.57 ± 0.04

Table 10. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
slidewise stain normalisation [58].

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4
Model Feature extractor
AttMIL Swin 0.73 ± 0.02 0.61 ± 0.05 0.57 ± 0.03 0.60 ± 0.03 0.75 ± 0.08 0.76 ± 0.02 0.57 ± 0.04 0.65 ± 0.08 0.51 ± 0.02
CTransPath 0.81 ± 0.03 0.78 ± 0.04 0.78 ± 0.02 0.60 ± 0.01 0.88 ± 0.07 0.83 ± 0.06 0.61 ± 0.03 0.70 ± 0.02 0.65 ± 0.02
ViT-B 0.71 ± 0.03 0.69 ± 0.03 0.69 ± 0.01 0.60 ± 0.05 0.75 ± 0.13 0.76 ± 0.04 0.58 ± 0.04 0.63 ± 0.06 0.69 ± 0.02
Phikon 0.70 ± 0.02 0.76 ± 0.01 0.72 ± 0.04 0.59 ± 0.02 0.82 ± 0.10 0.87 ± 0.03 0.62 ± 0.05 0.66 ± 0.03 0.65 ± 0.06
ViT-S 0.72 ± 0.02 0.70 ± 0.03 0.71 ± 0.04 0.63 ± 0.02 0.72 ± 0.07 0.75 ± 0.04 0.62 ± 0.06 0.58 ± 0.07 0.64 ± 0.03
Lunit-DINO 0.77 ± 0.02 0.79 ± 0.01 0.77 ± 0.03 0.62 ± 0.02 0.85 ± 0.07 0.89 ± 0.03 0.61 ± 0.04 0.73 ± 0.07 0.66 ± 0.03
ResNet-50 0.64 ± 0.03 0.62 ± 0.04 0.64 ± 0.06 0.63 ± 0.01 0.75 ± 0.07 0.72 ± 0.02 0.55 ± 0.03 0.59 ± 0.07 0.57 ± 0.07
RetCCL 0.73 ± 0.03 0.77 ± 0.02 0.78 ± 0.05 0.62 ± 0.02 0.82 ± 0.10 0.83 ± 0.03 0.66 ± 0.04 0.62 ± 0.02 0.64 ± 0.03
Lunit-BT 0.70 ± 0.03 0.76 ± 0.01 0.79 ± 0.03 0.51 ± 0.02 0.66 ± 0.14 0.57 ± 0.08 0.60 ± 0.06 0.48 ± 0.10 0.56 ± 0.11
Lunit-SwAV 0.76 ± 0.01 0.78 ± 0.03 0.77 ± 0.01 0.59 ± 0.03 0.83 ± 0.08 0.78 ± 0.04 0.55 ± 0.08 0.69 ± 0.05 0.60 ± 0.05
Transformer Swin 0.71 ± 0.04 0.63 ± 0.05 0.61 ± 0.05 0.56 ± 0.03 0.72 ± 0.09 0.71 ± 0.04 0.53 ± 0.02 0.55 ± 0.07 0.61 ± 0.07
CTransPath 0.80 ± 0.02 0.76 ± 0.04 0.80 ± 0.02 0.59 ± 0.04 0.85 ± 0.08 0.86 ± 0.05 0.60 ± 0.04 0.69 ± 0.08 0.62 ± 0.06
ViT-B 0.70 ± 0.03 0.69 ± 0.02 0.64 ± 0.03 0.57 ± 0.02 0.75 ± 0.11 0.69 ± 0.08 0.54 ± 0.07 0.55 ± 0.03 0.61 ± 0.03
Phikon 0.74 ± 0.03 0.73 ± 0.03 0.74 ± 0.03 0.58 ± 0.03 0.84 ± 0.07 0.86 ± 0.02 0.62 ± 0.06 0.69 ± 0.03 0.67 ± 0.04
ViT-S 0.71 ± 0.03 0.70 ± 0.06 0.63 ± 0.02 0.59 ± 0.05 0.75 ± 0.10 0.74 ± 0.02 0.63 ± 0.08 0.57 ± 0.03 0.65 ± 0.07
Lunit-DINO 0.78 ± 0.03 0.77 ± 0.04 0.79 ± 0.01 0.62 ± 0.04 0.87 ± 0.06 0.88 ± 0.04 0.58 ± 0.03 0.68 ± 0.09 0.64 ± 0.07
ResNet-50 0.66 ± 0.05 0.64 ± 0.11 0.58 ± 0.04 0.61 ± 0.07 0.77 ± 0.09 0.69 ± 0.06 0.54 ± 0.04 0.61 ± 0.04 0.40 ± 0.12
RetCCL 0.76 ± 0.03 0.76 ± 0.05 0.77 ± 0.04 0.59 ± 0.01 0.83 ± 0.07 0.82 ± 0.05 0.58 ± 0.05 0.62 ± 0.08 0.64 ± 0.05
Lunit-BT 0.78 ± 0.03 0.77 ± 0.03 0.77 ± 0.03 0.58 ± 0.04 0.86 ± 0.07 0.85 ± 0.03 0.59 ± 0.06 0.62 ± 0.02 0.63 ± 0.07
Lunit-SwAV 0.75 ± 0.03 0.80 ± 0.02 0.78 ± 0.04 0.57 ± 0.04 0.84 ± 0.06 0.82 ± 0.04 0.52 ± 0.04 0.69 ± 0.13 0.59 ± 0.01
Mean pool Swin 0.74 ± 0.01 0.65 ± 0.02 0.61 ± 0.04 0.61 ± 0.01 0.65 ± 0.11 0.78 ± 0.02 0.64 ± 0.04 0.65 ± 0.03 0.64 ± 0.01
CTransPath 0.80 ± 0.00 0.77 ± 0.01 0.70 ± 0.02 0.62 ± 0.02 0.67 ± 0.11 0.87 ± 0.02 0.59 ± 0.06 0.72 ± 0.03 0.64 ± 0.02
ViT-B 0.77 ± 0.00 0.68 ± 0.01 0.65 ± 0.02 0.60 ± 0.01 0.68 ± 0.11 0.73 ± 0.03 0.58 ± 0.06 0.63 ± 0.06 0.66 ± 0.03
Phikon 0.69 ± 0.01 0.76 ± 0.01 0.61 ± 0.02 0.55 ± 0.04 0.68 ± 0.16 0.88 ± 0.05 0.63 ± 0.03 0.70 ± 0.03 0.62 ± 0.07
ViT-S 0.74 ± 0.01 0.72 ± 0.04 0.63 ± 0.05 0.61 ± 0.02 0.67 ± 0.13 0.73 ± 0.02 0.67 ± 0.02 0.58 ± 0.06 0.61 ± 0.04
Lunit-DINO 0.76 ± 0.01 0.75 ± 0.02 0.68 ± 0.05 0.59 ± 0.02 0.73 ± 0.15 0.85 ± 0.03 0.61 ± 0.04 0.79 ± 0.03 0.65 ± 0.03
ResNet-50 0.73 ± 0.00 0.66 ± 0.05 0.65 ± 0.01 0.60 ± 0.01 0.63 ± 0.11 0.75 ± 0.05 0.66 ± 0.03 0.58 ± 0.04 0.58 ± 0.11
RetCCL 0.79 ± 0.00 0.75 ± 0.01 0.67 ± 0.02 0.60 ± 0.01 0.71 ± 0.10 0.85 ± 0.01 0.63 ± 0.05 0.66 ± 0.05 0.65 ± 0.01
Lunit-BT 0.75 ± 0.04 0.75 ± 0.01 0.69 ± 0.05 0.57 ± 0.01 0.67 ± 0.12 0.79 ± 0.03 0.66 ± 0.03 0.61 ± 0.01 0.68 ± 0.01
Lunit-SwAV 0.74 ± 0.00 0.75 ± 0.01 0.68 ± 0.01 0.51 ± 0.01 0.73 ± 0.14 0.78 ± 0.02 0.53 ± 0.01 0.75 ± 0.02 0.60 ± 0.02

Table 11. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
patchwise stain normalisation [58].
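Tables 10 and 11 differ only in the level at which the stain normalisation of [58] is fitted: once per slide, or separately for every patch. The sketch below illustrates that distinction with a deliberately simple stand-in normaliser (mean/std matching in optical-density space), not the actual method of [58]; the target statistics and patches are hypothetical.

```python
import numpy as np

def fit_od_stats(patches):
    """Mean/std of optical density over RGB patches (toy stand-in for [58])."""
    od = -np.log((np.concatenate([p.reshape(-1, 3) for p in patches]) + 1.0) / 256.0)
    return od.mean(axis=0), od.std(axis=0)

def normalise(patch, src_stats, tgt_stats):
    """Match a patch's optical-density statistics to a reference target."""
    od = -np.log((patch.reshape(-1, 3) + 1.0) / 256.0)
    od = (od - src_stats[0]) / (src_stats[1] + 1e-8) * tgt_stats[1] + tgt_stats[0]
    rgb = np.clip(256.0 * np.exp(-od) - 1.0, 0, 255).astype(np.uint8)
    return rgb.reshape(patch.shape)

rng = np.random.default_rng(0)
slide = [rng.integers(0, 256, (224, 224, 3)).astype(np.float64) for _ in range(4)]
target = (np.array([0.5, 0.6, 0.4]), np.array([0.2, 0.2, 0.2]))  # reference stats

slide_stats = fit_od_stats(slide)                      # slidewise: fit once
slidewise = [normalise(p, slide_stats, target) for p in slide]
patchwise = [normalise(p, fit_od_stats([p]), target)   # patchwise: refit per patch
             for p in slide]
```

Under this toy normaliser, the slidewise variant preserves the relative colour differences between patches of the same slide, whereas the patchwise variant forces every patch to the same target statistics.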

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4
Model Feature extractor
AttMIL Swin 0.76 ± 0.02 0.66 ± 0.06 0.54 ± 0.03 0.59 ± 0.01 0.77 ± 0.08 0.77 ± 0.03 0.53 ± 0.02 0.68 ± 0.04 0.52 ± 0.03
CTransPath 0.81 ± 0.04 0.79 ± 0.03 0.80 ± 0.01 0.65 ± 0.03 0.86 ± 0.06 0.85 ± 0.03 0.59 ± 0.05 0.71 ± 0.01 0.65 ± 0.02
ViT-B 0.74 ± 0.04 0.71 ± 0.01 0.67 ± 0.02 0.58 ± 0.03 0.76 ± 0.06 0.76 ± 0.03 0.57 ± 0.09 0.66 ± 0.03 0.70 ± 0.04
Phikon 0.74 ± 0.02 0.74 ± 0.02 0.76 ± 0.08 0.55 ± 0.03 0.84 ± 0.07 0.85 ± 0.03 0.59 ± 0.04 0.70 ± 0.09 0.52 ± 0.10
ViT-S 0.75 ± 0.01 0.77 ± 0.02 0.69 ± 0.04 0.59 ± 0.03 0.71 ± 0.11 0.70 ± 0.06 0.61 ± 0.04 0.60 ± 0.05 0.64 ± 0.09
Lunit-DINO 0.76 ± 0.02 0.77 ± 0.03 0.80 ± 0.01 0.64 ± 0.01 0.86 ± 0.07 0.88 ± 0.02 0.59 ± 0.02 0.77 ± 0.04 0.68 ± 0.03
ResNet-50 0.68 ± 0.02 0.71 ± 0.04 0.69 ± 0.04 0.63 ± 0.02 0.76 ± 0.11 0.66 ± 0.05 0.52 ± 0.06 0.57 ± 0.06 0.43 ± 0.08
RetCCL 0.77 ± 0.02 0.77 ± 0.03 0.80 ± 0.02 0.61 ± 0.02 0.84 ± 0.08 0.82 ± 0.03 0.62 ± 0.05 0.63 ± 0.02 0.65 ± 0.01
Lunit-BT 0.73 ± 0.02 0.78 ± 0.02 0.78 ± 0.05 0.53 ± 0.02 0.62 ± 0.22 0.64 ± 0.14 0.57 ± 0.10 0.44 ± 0.12 0.50 ± 0.05
Lunit-SwAV 0.75 ± 0.01 0.75 ± 0.03 0.75 ± 0.04 0.56 ± 0.05 0.84 ± 0.07 0.82 ± 0.02 0.57 ± 0.06 0.73 ± 0.05 0.60 ± 0.04
Transformer Swin 0.74 ± 0.04 0.70 ± 0.03 0.58 ± 0.03 0.60 ± 0.02 0.76 ± 0.09 0.79 ± 0.04 0.61 ± 0.06 0.56 ± 0.09 0.59 ± 0.07
CTransPath 0.79 ± 0.02 0.78 ± 0.02 0.78 ± 0.05 0.57 ± 0.02 0.87 ± 0.06 0.82 ± 0.06 0.59 ± 0.06 0.62 ± 0.09 0.66 ± 0.01
ViT-B 0.74 ± 0.04 0.72 ± 0.03 0.67 ± 0.04 0.57 ± 0.04 0.74 ± 0.06 0.70 ± 0.04 0.54 ± 0.01 0.61 ± 0.07 0.67 ± 0.05
Phikon 0.75 ± 0.05 0.71 ± 0.02 0.74 ± 0.02 0.57 ± 0.02 0.86 ± 0.04 0.84 ± 0.04 0.61 ± 0.02 0.70 ± 0.05 0.57 ± 0.04
ViT-S 0.75 ± 0.03 0.76 ± 0.01 0.65 ± 0.04 0.55 ± 0.06 0.74 ± 0.08 0.71 ± 0.01 0.55 ± 0.04 0.59 ± 0.05 0.68 ± 0.04
Lunit-DINO 0.78 ± 0.03 0.72 ± 0.03 0.79 ± 0.02 0.63 ± 0.03 0.87 ± 0.04 0.89 ± 0.02 0.59 ± 0.03 0.73 ± 0.03 0.66 ± 0.07
ResNet-50 0.72 ± 0.01 0.71 ± 0.05 0.64 ± 0.04 0.61 ± 0.07 0.74 ± 0.07 0.65 ± 0.05 0.57 ± 0.03 0.58 ± 0.07 0.39 ± 0.05
RetCCL 0.74 ± 0.06 0.74 ± 0.04 0.80 ± 0.04 0.55 ± 0.04 0.86 ± 0.07 0.71 ± 0.06 0.54 ± 0.08 0.59 ± 0.06 0.61 ± 0.09
Lunit-BT 0.79 ± 0.02 0.75 ± 0.04 0.77 ± 0.04 0.58 ± 0.02 0.84 ± 0.06 0.86 ± 0.04 0.63 ± 0.04 0.63 ± 0.03 0.67 ± 0.01
Lunit-SwAV 0.74 ± 0.05 0.76 ± 0.05 0.77 ± 0.01 0.53 ± 0.04 0.84 ± 0.05 0.82 ± 0.05 0.56 ± 0.03 0.70 ± 0.08 0.54 ± 0.06
Mean pool Swin 0.75 ± 0.01 0.69 ± 0.03 0.60 ± 0.04 0.59 ± 0.02 0.69 ± 0.12 0.74 ± 0.02 0.63 ± 0.06 0.65 ± 0.01 0.57 ± 0.03
CTransPath 0.82 ± 0.00 0.75 ± 0.02 0.73 ± 0.02 0.64 ± 0.03 0.69 ± 0.12 0.85 ± 0.02 0.59 ± 0.03 0.75 ± 0.02 0.64 ± 0.03
ViT-B 0.76 ± 0.01 0.71 ± 0.01 0.67 ± 0.01 0.56 ± 0.01 0.68 ± 0.09 0.75 ± 0.03 0.59 ± 0.06 0.63 ± 0.03 0.69 ± 0.01
Phikon 0.74 ± 0.01 0.74 ± 0.02 0.64 ± 0.02 0.61 ± 0.03 0.73 ± 0.13 0.87 ± 0.01 0.56 ± 0.04 0.71 ± 0.09 0.61 ± 0.02
ViT-S 0.74 ± 0.01 0.77 ± 0.03 0.62 ± 0.02 0.56 ± 0.01 0.70 ± 0.08 0.73 ± 0.01 0.66 ± 0.03 0.57 ± 0.05 0.63 ± 0.03
Lunit-DINO 0.73 ± 0.01 0.75 ± 0.02 0.77 ± 0.02 0.60 ± 0.02 0.76 ± 0.11 0.87 ± 0.02 0.58 ± 0.04 0.78 ± 0.02 0.69 ± 0.02
ResNet-50 0.74 ± 0.01 0.79 ± 0.02 0.65 ± 0.01 0.61 ± 0.03 0.66 ± 0.10 0.67 ± 0.05 0.64 ± 0.03 0.55 ± 0.04 0.58 ± 0.04
RetCCL 0.81 ± 0.00 0.75 ± 0.00 0.69 ± 0.02 0.58 ± 0.02 0.70 ± 0.13 0.77 ± 0.04 0.61 ± 0.05 0.65 ± 0.01 0.65 ± 0.00
Lunit-BT 0.76 ± 0.02 0.75 ± 0.00 0.71 ± 0.05 0.57 ± 0.01 0.63 ± 0.08 0.80 ± 0.05 0.66 ± 0.01 0.62 ± 0.00 0.68 ± 0.00
Lunit-SwAV 0.75 ± 0.00 0.75 ± 0.01 0.69 ± 0.04 0.53 ± 0.01 0.71 ± 0.15 0.83 ± 0.02 0.55 ± 0.03 0.76 ± 0.02 0.59 ± 0.05

Table 12. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
rotation/flipping augmentations.
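The rotation/flipping setting augments each patch with one of the eight symmetries of a square (rotations by multiples of 90° combined with mirroring) before feature extraction. A minimal sketch, assuming a uniformly random transform is drawn per patch:

```python
import numpy as np

def random_rot_flip(patch, rng):
    """Rotate a patch by a random multiple of 90 degrees, then optionally flip."""
    patch = np.rot90(patch, k=rng.integers(0, 4))  # 0, 90, 180 or 270 degrees
    if rng.integers(0, 2):                         # flip with probability 1/2
        patch = np.fliplr(patch)
    return patch

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = random_rot_flip(patch, rng)  # then fed to the frozen feature extractor
```

Sampling the rotation and the flip independently yields each of the eight dihedral transforms with equal probability.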

Target Subtype CDH1 TP53 PIK3CA LN status MSI KRAS BRAF SMAD4
Model Feature extractor
AttMIL Swin 0.77 ± 0.01 0.66 ± 0.02 0.61 ± 0.02 0.59 ± 0.03 0.79 ± 0.09 0.74 ± 0.04 0.56 ± 0.06 0.63 ± 0.06 0.54 ± 0.04
CTransPath 0.81 ± 0.03 0.79 ± 0.02 0.82 ± 0.01 0.62 ± 0.02 0.89 ± 0.05 0.79 ± 0.03 0.60 ± 0.03 0.70 ± 0.05 0.65 ± 0.02
ViT-B 0.77 ± 0.01 0.70 ± 0.05 0.70 ± 0.03 0.58 ± 0.03 0.78 ± 0.06 0.76 ± 0.04 0.60 ± 0.02 0.63 ± 0.02 0.70 ± 0.04
Phikon 0.68 ± 0.05 0.71 ± 0.03 0.72 ± 0.06 0.54 ± 0.04 0.84 ± 0.07 0.84 ± 0.03 0.56 ± 0.08 0.61 ± 0.06 0.59 ± 0.07
ViT-S 0.73 ± 0.02 0.73 ± 0.02 0.68 ± 0.02 0.58 ± 0.04 0.74 ± 0.10 0.75 ± 0.02 0.61 ± 0.03 0.60 ± 0.03 0.65 ± 0.06
Lunit-DINO 0.76 ± 0.03 0.77 ± 0.03 0.78 ± 0.03 0.62 ± 0.03 0.86 ± 0.06 0.89 ± 0.03 0.59 ± 0.03 0.78 ± 0.07 0.67 ± 0.06
ResNet-50 0.72 ± 0.01 0.74 ± 0.03 0.68 ± 0.03 0.65 ± 0.04 0.76 ± 0.09 0.65 ± 0.04 0.52 ± 0.02 0.55 ± 0.06 0.41 ± 0.13
RetCCL 0.75 ± 0.03 0.77 ± 0.04 0.79 ± 0.03 0.59 ± 0.01 0.85 ± 0.05 0.79 ± 0.07 0.62 ± 0.06 0.61 ± 0.02 0.65 ± 0.01
Lunit-BT 0.64 ± 0.05 0.69 ± 0.07 0.62 ± 0.22 0.49 ± 0.01 0.51 ± 0.07 0.68 ± 0.11 0.54 ± 0.05 0.55 ± 0.08 0.52 ± 0.06
Lunit-SwAV 0.74 ± 0.01 0.77 ± 0.01 0.75 ± 0.04 0.56 ± 0.04 0.84 ± 0.06 0.82 ± 0.02 0.58 ± 0.05 0.66 ± 0.05 0.61 ± 0.05
Transformer Swin 0.73 ± 0.01 0.67 ± 0.06 0.60 ± 0.03 0.62 ± 0.03 0.80 ± 0.10 0.76 ± 0.03 0.60 ± 0.08 0.69 ± 0.03 0.60 ± 0.03
CTransPath 0.79 ± 0.04 0.74 ± 0.01 0.78 ± 0.03 0.61 ± 0.03 0.89 ± 0.04 0.83 ± 0.04 0.58 ± 0.03 0.69 ± 0.08 0.58 ± 0.07
ViT-B 0.76 ± 0.02 0.70 ± 0.03 0.66 ± 0.03 0.56 ± 0.02 0.75 ± 0.14 0.74 ± 0.01 0.50 ± 0.06 0.53 ± 0.08 0.68 ± 0.04
Phikon 0.69 ± 0.03 0.71 ± 0.03 0.75 ± 0.03 0.54 ± 0.03 0.85 ± 0.06 0.84 ± 0.05 0.63 ± 0.04 0.75 ± 0.04 0.56 ± 0.05
ViT-S 0.75 ± 0.02 0.74 ± 0.02 0.66 ± 0.05 0.56 ± 0.03 0.74 ± 0.05 0.72 ± 0.05 0.60 ± 0.05 0.56 ± 0.03 0.69 ± 0.01
Lunit-DINO 0.77 ± 0.02 0.75 ± 0.02 0.79 ± 0.01 0.61 ± 0.03 0.87 ± 0.07 0.88 ± 0.02 0.58 ± 0.05 0.71 ± 0.04 0.69 ± 0.04
ResNet-50 0.71 ± 0.03 0.69 ± 0.04 0.71 ± 0.03 0.64 ± 0.02 0.75 ± 0.08 0.71 ± 0.02 0.63 ± 0.03 0.59 ± 0.05 0.44 ± 0.07
RetCCL 0.78 ± 0.02 0.75 ± 0.06 0.79 ± 0.01 0.56 ± 0.04 0.85 ± 0.08 0.77 ± 0.08 0.54 ± 0.05 0.53 ± 0.09 0.58 ± 0.08
Lunit-BT 0.77 ± 0.03 0.78 ± 0.04 0.76 ± 0.03 0.55 ± 0.03 0.84 ± 0.09 0.82 ± 0.06 0.64 ± 0.03 0.63 ± 0.03 0.65 ± 0.02
Lunit-SwAV 0.72 ± 0.02 0.78 ± 0.02 0.77 ± 0.03 0.55 ± 0.03 0.86 ± 0.07 0.79 ± 0.05 0.57 ± 0.05 0.70 ± 0.03 0.55 ± 0.07
Mean pool Swin 0.77 ± 0.01 0.68 ± 0.04 0.62 ± 0.03 0.60 ± 0.02 0.66 ± 0.12 0.75 ± 0.02 0.65 ± 0.04 0.61 ± 0.05 0.58 ± 0.04
CTransPath 0.83 ± 0.00 0.75 ± 0.02 0.73 ± 0.01 0.64 ± 0.01 0.70 ± 0.12 0.86 ± 0.03 0.61 ± 0.03 0.75 ± 0.02 0.61 ± 0.02
ViT-B 0.76 ± 0.01 0.70 ± 0.01 0.68 ± 0.01 0.56 ± 0.01 0.68 ± 0.08 0.72 ± 0.02 0.58 ± 0.05 0.59 ± 0.01 0.69 ± 0.01
Phikon 0.71 ± 0.01 0.76 ± 0.03 0.65 ± 0.03 0.56 ± 0.04 0.73 ± 0.12 0.88 ± 0.02 0.57 ± 0.05 0.70 ± 0.07 0.59 ± 0.02
ViT-S 0.72 ± 0.02 0.75 ± 0.02 0.62 ± 0.01 0.57 ± 0.00 0.69 ± 0.11 0.69 ± 0.03 0.65 ± 0.04 0.56 ± 0.03 0.65 ± 0.02
Lunit-DINO 0.74 ± 0.01 0.76 ± 0.02 0.77 ± 0.03 0.59 ± 0.03 0.77 ± 0.12 0.88 ± 0.03 0.59 ± 0.02 0.79 ± 0.01 0.70 ± 0.03
ResNet-50 0.75 ± 0.01 0.77 ± 0.01 0.67 ± 0.02 0.61 ± 0.01 0.62 ± 0.10 0.67 ± 0.03 0.68 ± 0.01 0.52 ± 0.05 0.55 ± 0.06
RetCCL 0.82 ± 0.00 0.75 ± 0.01 0.71 ± 0.01 0.57 ± 0.01 0.71 ± 0.12 0.79 ± 0.05 0.61 ± 0.07 0.63 ± 0.03 0.65 ± 0.00
Lunit-BT 0.74 ± 0.04 0.74 ± 0.00 0.68 ± 0.06 0.55 ± 0.02 0.57 ± 0.09 0.77 ± 0.07 0.66 ± 0.01 0.60 ± 0.01 0.61 ± 0.16
Lunit-SwAV 0.76 ± 0.00 0.76 ± 0.02 0.75 ± 0.01 0.54 ± 0.02 0.70 ± 0.15 0.85 ± 0.01 0.55 ± 0.04 0.68 ± 0.05 0.58 ± 0.05

Table 13. Test AUROC scores (averaged across the five seeds) for all tasks, feature extractors, and downstream models, when employing
all augmentations.
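The "all augmentations" setting composes the full augmentation set specified in the main text. The sketch below illustrates only the composition pattern; the two stand-in transforms used here (rotation/flip and a small brightness jitter) are assumptions for illustration, not the paper's exact augmentation list.

```python
import numpy as np

def rot_flip(patch, rng):
    patch = np.rot90(patch, k=rng.integers(0, 4))
    return np.fliplr(patch) if rng.integers(0, 2) else patch

def brightness_jitter(patch, rng, strength=0.1):
    factor = 1.0 + rng.uniform(-strength, strength)
    return np.clip(patch.astype(np.float64) * factor, 0, 255).astype(np.uint8)

def augment(patch, rng):
    """Apply the (illustrative) augmentation pipeline before feature extraction."""
    for transform in (rot_flip, brightness_jitter):
        patch = transform(patch, rng)
    return patch

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = augment(patch, rng)
```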

