Computer Vision News - October 2021
In this issue:

Computer Vision News
- Computer Vision Research: Removing Diffraction Image Artifacts in Under-Display Camera via Dynamic Skip Connection Network
- Coding Workshop: Creating a multi-object tracking model using a pre-trained RNN
- Spleenlab: AI Systems for Autonomous Mobility
- RePAIR Pompeii: AI for Archeology

Medical Imaging News
- Best of MICCAI 2021
- Transformers in Medical Imaging, by NVIDIA and MONAI
- ICCV Workshop Preview: Computer Vision for Automated Medical Diagnosis

From the Best of MICCAI 2021 cover feature on UNETR: “…encoder to increase the model’s capability for learning long-range dependencies and effectively capturing global contextual representation at multiple scales. For instance, in the multi-organ segmentation task, UNETR can accurately segment organs with complex shapes (e.g. adrenal glands) and low contrast (e.g. portal veins)…” See the feature inside for qualitative comparisons between UNETR and other CNN-based and transformer-based segmentation models, and for its results on volumetric tasks such as multi-organ segmentation using the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset.
Computer Vision Research
Removing Diffraction Image Artifacts in Under-Display
Camera via Dynamic Skip Connection Network
by Marica Muffoletto
Dear readers, welcome back to a new issue of Computer Vision
News full of great research!
This month we are reviewing a paper from CVPR 2021, Removing
Diffraction Image Artifacts in Under-Display Camera via
Dynamic Skip Connection Network, written by Ruicheng Feng,
Chongyi Li, Huaijin Chen, Shuai Li, Chen Change Loy, Jinwei Gu.
We are indebted to the authors for allowing us to use their
images to illustrate this review. Their paper can be found here.
If you like to play around with the quality of cameras in your tech devices, this
might become your favourite CVPR paper of the year. The subject of this research
is indeed how to remove image artifacts in the newly defined imaging system
called Under-Display Camera (UDC), which is employed in some smartphones,
TVs for videoconferencing, laptops, and tablets. UDC introduces a new class of
complex image degradation problems (strong flare, haze, blur, and noise), which
still need to be satisfactorily dealt with by the computer vision community. A typical
UDC system consists of a camera module placed underneath and closely
attached to the semi-transparent Organic Light-Emitting Diode (OLED) display.
The first contribution consists in the formulation of an image formation model for
UDC systems which considers dynamic range and saturation and could simulate
more complex and realistic degradation compared to the State-of-the-Art.
ŷ = φ[C(x ∗ k + n)]

where x represents the real scene irradiance that has a high dynamic range (HDR), k
is the known convolution kernel (PSF), ∗ denotes the 2D convolution operator, and
n models the camera noise. C(·) is a clipping operation with a set threshold and φ(·)
is a non-linear tone mapping. These two elements add a saturation effect derived
from the limited dynamic range of digital sensors and make the model closer to the
human perception of the scene.
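As a minimal sketch of this formation model in code (assuming a single-channel HDR image, Gaussian noise, and a simple gamma curve standing in for φ, none of which are necessarily the paper's exact choices):

import numpy as np
from scipy.signal import fftconvolve

def simulate_udc(x, psf, noise_std=0.01, sat_level=1.0, gamma=2.2):
    # y_hat = phi[C(x * k + n)]: convolve the HDR scene with the PSF,
    # add sensor noise, clip at the saturation threshold, then tone-map.
    blurred = fftconvolve(x, psf, mode="same")                   # x * k
    noisy = blurred + np.random.normal(0.0, noise_std, x.shape)  # + n
    clipped = np.clip(noisy, 0.0, sat_level)                     # C(.)
    return (clipped / sat_level) ** (1.0 / gamma)                # phi(.)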
The second element of the paper is the definition of the PSF. This can be simulated,
but the authors found that the real, measured PSF differs slightly in colour and
contrast due to model approximations and manufacturing imperfections.
Hence, the authors measure the real-world PSF by placing a white point light
source 1 meter away from the OLED display. The PSF is used as part of a
model-based data synthesis pipeline to generate realistic degraded images.
To do this, the objects considered are real scenes with high dynamic range. This is
essential because 1) the spike-shaped sidelobes (typical of the PSF) can be amplified
to be visible (flares) in the degraded image, and 2) due to the high dynamic range
of the input scene, the digital sensor (usually 10-bit) will inevitably get saturated in
real applications, resulting in an information loss.
Hence, images captured by UDC systems in real scenes will exhibit structured flares
near strong light sources. The previous imaging system, however, cannot model
this degradation, because it captures images displayed on an LCD monitor, which
commonly has limited dynamic range.
This is shown below, where it is demonstrated that the real HDR scene captured by the
UDC device (b) shows flare effects near strong light sources, while for the monitor-
generated LDR scene (c) with a limited dynamic range, these are no longer visible.
The restoration itself happens within a main restoration branch, which builds upon
an encoder-decoder architecture with skip connections to restore the degraded
images. The
extracted features from the encoder are fed into DISCNet which transforms them
into R1, R2, R3. These are then reconstructed back to the final tone-mapped sharp
images.
The network is fed with condition maps of size H × W × (b + C), where b stands for the
kernel code (a b-dimensional vector of the PSF, dimensionally reduced by Principal
Component Analysis) and H, W, and C represent the height, width, and number of
channels of the degraded images, respectively.
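A hypothetical sketch of assembling such a condition map in PyTorch (channels-first, so (b + C) becomes the channel dimension; the function and variable names are ours, not the paper's):

import torch

def build_condition_map(degraded, kernel_code):
    # degraded: (N, C, H, W) degraded image batch
    # kernel_code: (b,) PCA-reduced PSF vector
    n, c, h, w = degraded.shape
    b = kernel_code.numel()
    # Tile the b-dimensional kernel code over the spatial grid: (N, b, H, W)
    code_map = kernel_code.view(1, b, 1, 1).expand(n, b, h, w)
    # Concatenate along the channel axis -> (N, b + C, H, W)
    return torch.cat([code_map, degraded], dim=1)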
Given the condition maps as input, the condition encoder extracts scale-specific
feature maps H1, H2, H3 using 3 blocks like the encoder of the restoration branch.
This manages to recover saturated information from nearby low-light regions in the
degraded images with spatial variability. Then, the extracted features at different
scales are fed into their corresponding filter generators, where each comprises a
3 × 3 convolution layer, two residual blocks, and a 1 × 1 convolution layer to expand
feature dimension. The predicted filters are then passed into a dynamic
convolution element which finally refines the features and casts them into the
main restoration branch.
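A hypothetical PyTorch sketch of one such filter generator (channel widths and module names are assumptions; only the 3 × 3 conv, two residual blocks, and 1 × 1 conv structure comes from the paper):

import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FilterGenerator(nn.Module):
    # 3x3 conv -> two residual blocks -> 1x1 conv that expands the
    # feature dimension to one k x k filter per spatial location.
    def __init__(self, in_ch, mid_ch=64, k=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            ResBlock(mid_ch),
            ResBlock(mid_ch),
            nn.Conv2d(mid_ch, k * k, 1),
        )

    def forward(self, cond_feat):
        return self.net(cond_feat)  # (N, k*k, H, W): per-pixel dynamic filters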
Similarly, comparisons with representative methods are shown for the real images
below. The network proposed by the authors manages to remove diffraction image
effects, while introducing the fewest camera artifacts. Since the ground-
truth images are inaccessible, another comparison is made with the camera output
of a ZTE phone.
PyTorch code and data from this paper are available on GitHub. Best of luck with
making your images look great!
See you all next month!
Coding Workshop: Creating a multi-object tracking
model using a pre-trained RNN
This month’s article is about using a pre-trained network (you can find the
implementation on the MOT Challenge GitHub).
MOT Challenge
There has been remarkable progress in recent years on object detection and
association, the two core components of multi-object tracking. The main focus,
though, has been on improving each network individually. In this example (see the
original paper to read more about the topic), the proposed architecture (a
pre-trained network) combines the two tasks in a single network to improve the
inference speed.
Older research has shown degraded results in this combined network, mainly
because the association branch is not appropriately learned. Here, after discovering
the reasons behind the failure, a simple baseline was presented to address the
problems. It was shown to remarkably outperform the state of the art on
the MOT challenge datasets at 30 FPS. Hopefully, this baseline can inspire and
help evaluate new ideas in this field. Now let’s see the implementation of the
pre-trained network!
import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# Modules from the root of the Mask R-CNN repository that the
# rest of this walkthrough relies on
import coco
import utils
import model as modellib
import visualize

%matplotlib inline
Configuration files
We'll be using a model trained on the MS-COCO dataset. The configurations of this
model are in the CocoConfig class in coco.py.
For inference, modify the configurations a bit to fit the task. To do so, sub-class
the CocoConfig class and override the attributes you need to change.
class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()
Configurations:
BACKBONE_SHAPES [[256 256]
[128 128]
[ 64 64]
[ 32 32]
[ 16 16]]
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [ 0.1 0.1 0.2 0.2]
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.5
DETECTION_NMS_THRESHOLD 0.3
GPU_COUNT 1
IMAGES_PER_GPU 1
IMAGE_MAX_DIM 1024
IMAGE_MIN_DIM 800
IMAGE_PADDING True
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.002
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [ 123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME coco
NUM_CLASSES 81
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 2
RPN_BBOX_STD_DEV [ 0.1 0.1 0.2 0.2]
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 1000
Create Model and Load Trained Weights
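A minimal sketch of this step, following the conventions of the Mask R-CNN demo this walkthrough is based on (ROOT_DIR, MODEL_DIR, and the mask_rcnn_coco.h5 weights path are assumptions you may need to adapt):

# Root directory of the repository, log directory, and weights path
ROOT_DIR = os.getcwd()
MODEL_DIR = os.path.join(ROOT_DIR, "logs")
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")

# Download the COCO-trained weights if they are not present yet
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

# Create a model object in inference mode and load the trained weights
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)
model.load_weights(COCO_MODEL_PATH, by_name=True)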
Class Names
The model classifies objects and returns class IDs, which are integer values that
identify each class. Some datasets assign integer values to their classes and some
don't. For example, in the MS-COCO dataset, the 'person' class is 1 and 'teddy bear'
is 88. The IDs are often sequential, but not always. The COCO dataset, for example,
has classes associated with class IDs 70 and 72, but not 71.
To get the list of class names, you'd load the dataset and then use the
class_names property, like this:
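A hedged sketch of that lookup (COCO_DIR, the path to a local copy of the dataset, is an assumption):

# Load the COCO dataset just to read its class names
dataset = coco.CocoDataset()
dataset.load_coco(COCO_DIR, "train")
dataset.prepare()
print(dataset.class_names)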
You won't need to download the COCO dataset just to run this demo! We are including
the list of class names below. The index of the class name in the list represents its ID
(first class is 0, second is 1, third is 2, ...etc.)
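Here is that list, as it appears in the Mask R-CNN demo for the MS-COCO classes ('BG' is the background class at index 0):

class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
               'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
               'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
               'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
               'kite', 'baseball bat', 'baseball glove', 'skateboard',
               'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
               'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
               'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog',
               'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
               'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
               'remote', 'keyboard', 'cell phone', 'microwave', 'oven',
               'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
               'scissors', 'teddy bear', 'hair drier', 'toothbrush']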
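Before running detection we need an image; as in the Mask R-CNN demo, we can pick a random one from the repository's images folder (IMAGE_DIR is an assumption):

# Directory of images to run detection on
IMAGE_DIR = os.path.join(ROOT_DIR, "images")

# Load a random image from the images folder
file_names = next(os.walk(IMAGE_DIR))[2]
image = skimage.io.imread(os.path.join(IMAGE_DIR, random.choice(file_names)))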
# Run detection
results = model.detect([image], verbose=1)

# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
                            class_names, r['scores'])
Processing 1 images
image          shape: (476, 640, 3)       min:    0.00000  max:  255.00000
molded_images  shape: (1, 1024, 1024, 3)  min: -123.70000  max:  120.30000
image_metas    shape: (1, 89)             min:    0.00000  max: 1024.00000
Wrapping up!
I hope that I inspired you to look more into the world of pre-trained networks and
discover how easy it is to implement a new network, using a pre-trained model!
As always, please let me know if you have any questions or suggestions for the
article! It would be great to hear more from you and what tools you would like to
see presented; feel free to reach out to me on any of our social media channels!
Spleenlab: AI Systems for Autonomous Mobility
Spleenlab.ai is a pure AI software company, specializing in safe and intelligent AI systems for autonomous mobility, including unmanned aerial vehicles (UAVs) and autonomous driving. Their vision is to be an AI software supplier for air taxis, which are a composition of both UAVs and autonomous vehicles. Stefan Milz is Head of R&D, Managing Director and Founder of Spleenlab GmbH. He received his PhD in Physics from the Technical University of Munich and has a strong history in professional software development and automotive. Stefan tells us more about Spleenlab.

To date, AI research has tended to focus on achieving higher accuracy, better models, and more trustworthy or explainable AI. But for Spleenlab, working in the field of autonomous mobility, safety is king.

The domains of UAVs, autonomous driving, and airborne systems are all regulated by entities such as EASA in Europe and the FAA in the United States. There are important industrial standards such as ISO 26262 in the automotive domain and DO-178 for software in airborne systems, as well as new standards emerging in the UAV domain. Systems are expected to have a fully deterministic behaviour.

“We believe that people need extreme safety,” Stefan tells us. “It’s important to foresee what state your system can achieve in an automated scene. We want to have a transparent AI,” but Stefan says a new standard that explains its deepest behaviors is still many years away, while models are being sold and deployed in the real world now. That is why Spleenlab is focused first and foremost on safety by design.

Landing prediction

Uncertainty is bad. People want to know they are safe in their cars and that drones will not hit other aircraft. But why is there uncertainty in the system? The models are trained on data and validated on data, with millions of parameters in the system that are deterministic, but some methods at inference time, like dropout and pruning, make the system statistical. A statistical system always has uncertainty.

“Take autonomous driving, for example,” Stefan says.

Take the example of an automated drone on a package delivery which encounters an emergency. Its battery has dropped to a critical level. The system has a rest range of 100 meters. There is no known emergency landing point in the area. What does it do?
“A computer vision model can predict some good landing spots,” Stefan explains.

“The best method to choose is an AI model. Semantic segmentation can semantically parse the scene and find the best spots. We know the model is 99 per cent accurate, but we still don’t want to hit anyone on the head, so we take those landing spots and put a safety goal around them. We use a second sensor from either the thermal imager or a LiDAR sensor and we look at that predicted spot and validate if it is really safe. With the thermal imager we can validate it in the sense of, are there people, animals, or vehicles there? Those are the most critical things we want to avoid. With the LiDAR sensor, we can estimate if the scene is geometrically flat and if there is any dynamic object in sight. With two deterministic passes to validate it, we can then say, okay, this spot is safe.”

This idea is not fully new in the automated domain, but it has not often been combined with AI. The model must be validated with a large amount of labeled data, which is expensive. Deploying a model in a different domain, a different country for example, requires domain adaptation, which significantly lowers the cost.

“We have a simulation engine where all the flights and the labels can be generated,” Stefan tells us.

“We also have a lot of labeled data from companies and are working on how we can transfer this data to different domains. We call it cross-domain AI. That is our core product. We have been working on a pure software product called Visionairy perception software, which has several features for UAVs and aviation.”
Other generic products the team are working on for the UAV market include a detect and avoid function for detecting manned aircraft, which heavily needs a vision component, so it is not solved yet. They also have emergency landing functionality and tracking functions in the pipeline, and are working on future architecture together with big air taxi manufacturers like Volocopter and DLR.

There have been some high-profile concerns about the use of AI in everyday life, including fears that the technology could be exploited by bad actors. Does Stefan think this could be an issue for Spleenlab?

“At the moment, we are in a very early stage. We don’t see any potential issues yet, but it’s an important question that we have to keep asking ourselves.”

Stefan tells us the most difficult part of this work is the airborne certification and the proof of safety in terms of flight hours or driving hours. Even with safety by design or a deterministic approach, that is still necessary.

Many companies claim to have the best drone or UAV on the market, and that they will solve problems like package delivery in the near future, but there is one very important thing that is not solved yet.

“It’s the positioning system,” Stefan points out.

“You need a GPS redundant positioning system, which is safe, but if you want simply to fly near a population then you have to validate your system with a single point of failure, which has the probability of one divided by 10,000. This is the number of flight hours you have to do, and you have to do a strong validation of your system. This is a big opportunity for us because we know there are many companies who want to do that, but it’s not solved yet. I believe there is a long way to go before we see package delivery outside of a test field, so we want to solve an easier approach first – pipeline inspection where no population is nearby. You still have to show that your system is safe if you want to fly far with the pilot out of the loop. That is called BVLOS – beyond visual line of sight. This is not yet solved for certification for most of the use cases in North America and Europe.”

Spleenlab have some very exciting plans for next year and beyond. As a software company, they collaborate with manufacturers to bring AI to their products, and at the beginning of next year will be launching their simple follow me functions up in the air with drone manufacturer Quantum-Systems.

By the end of next year, Stefan says he hopes to see Spleenlab’s detect and avoid system in drones. They are also working on automatic inspection of cell towers, with the AI looking for the cell tower, flying a drone around it while collecting inspection data, and then bringing it back. This will save money for customers who want to automate the inspection process.

Spleenlab are currently 15 people, and they are hiring.

“We are looking for computer vision engineers, deep learning engineers, and PhD applicants with a focus on SLAM perception and sensor processing. Come join us!”
RePAIR Pompeii: AI for Archeology

Marcello and his team are not the first to attempt to reconstruct frescoes, but the key differences between this and past approaches are the sheer size of the frescoes, and the fact that ultimately, they are going to physically reconstruct them.

“We will build a robot and use soft robotics,” Marcello explains.

“Archaeologists were initially terrified by the idea of a robot holding these pieces because if they break one, it risks losing something very precious. So, once the puzzle has been solved, the information is given as input to the robotic platform and the robot uses soft-hand technologies to take and manipulate the pieces very delicately. You can control the pressure and make sure that it is just right for the piece. We’re going to use the experts in a kind of interactive way, to tell us whether the solution proposed by the robot is plausible or not.”

Finally, the robot will put the pieces together, but just next to each other. Expert restorers at Pompeii will take charge from there. It is a very delicate process to put the pieces physically back together, using techniques that cannot be incorporated into a simple robotic system.
The team will soon be launching a dedicated website that describes the work. It is a Horizon 2020 project, under the Future and Emerging Technologies (FET) Open program, which is an extremely selective call that gives very few proposals the green light. Marcello and the University of Venice are the project coordinators, with other partners including the Ben-Gurion University of the Negev in Israel, the Italian Institute of Technology in Italy, the University of Bonn in Germany, Instituto Superior Técnico in Portugal, and the Archaeological Park of Pompeii, which is one of the biggest archaeological parks in the world.

“We are introducing a revolutionary technology in archaeology,” Marcello says proudly.

“My archaeologist friends tell me that if we succeed, as we hope, then this will be a huge breakthrough in their field. When you have an object that has broken into thousands or even tens of thousands of pieces, it’s just hopeless to think that any human team can solve it. Actually, in Pompeii they did try, but had to give up in the end.”

Marcello hopes the technology they propose will be able to be used by other museums with broken frescoes, as well as exported to other domains.

“There are other problems, such as reconstructing papyri, vases, or other kinds of broken artifacts,” he adds. “We hope our technology will turn out to be useful when the scale of the problem is unmanageable by humans!”
Hidden Stories of the Heart
To all readers of the magazine who live in London, this is a unique call if
you are looking for interesting plans for the second weekend of October!
Come join research students Marica (@maricaS8, King’s College London), Sophie (www.richtersophie.com, Royal College of Arts) and Elizabeth (@elizabetho157, Royal College of Arts), at the Science Museum on the 9th-10th of October to explore the new installation Hidden Stories of the Heart.
SUBSCRIBE!

Did you enjoy reading Computer Vision News? Would you like to receive it every month? Fill the Subscription Form - it takes less than 1 minute! Join thousands of AI professionals who receive Computer Vision News as soon as we publish it. FREE SUBSCRIPTION (click here, it's free). You can also visit our archive to find new and old issues as well. We hate SPAM and promise to keep your email address safe, always!

Upcoming events:
- WSCG 2021 - Plzen, Czech Republic (Virtual) - May 18-22
- TCT - Orlando, FL and online - 4-6 Nov
- SIPAIM 2020 - Campinas, Brazil - 17-19 Nov
- AM Medical Days - Berlin (Germany) and Virtual - 22-23 Nov
- BMVC 2021 - Online - 22-25 Nov
Point cloud of key points on a bladder model and the video camera's route
This unique tool can significantly shorten the cystoscopy, as well as increase levels of confidence in the procedural findings. It makes the repetitive bladder scans redundant and provides an image with better accuracy which makes lesion detection easier, as it increases the field of view compared to the narrow cystoscopy view.

Another benefit of this module is that it does not require long run-time, allowing fast and easy implementation in the clinic.

This module was designed to suit the cystoscopy procedure. It can be tailored to any other procedure which uses a scope to scan inner lumens in the body, e.g. gastroscopy, with some alterations. It can also be combined with other computer vision tools such as automatic lesion detection to improve procedural outcomes.

Overall, this is a tool which utilizes advanced computer vision techniques to significantly improve bladder cancer healthcare. More projects in AI for urology here.
Best of MICCAI 2021
Example MRI scan of a CRLM patient who had multifocal cancer. Existing
biomarkers only use information from the largest lesion (marked in red)
while ignoring information from other lesions (marked in orange).
The multiple instance learning technique used in this work comes from
digital pathology, where you have very large whole-slide images that are
too big for the computer to deal with, so you crop them into smaller tiles
and learn the features. Then at the end you aggregate the information from
all tiles to get an understanding of the whole-slide image. This perfectly fits
the problem here, which is to predict a patient’s survival based on a
number of tumors.
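To make the aggregation idea concrete, here is a minimal, hypothetical sketch of a MIL head that pools tile-level features into one bag-level prediction; the actual paper may well use a different (e.g. attention-based) aggregation:

import torch
import torch.nn as nn

class MeanPoolMIL(nn.Module):
    # Embed each tile (instance), average the embeddings into one bag
    # embedding, and predict a single bag-level output, e.g. a risk score.
    def __init__(self, in_dim=512, hidden=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, tiles):    # tiles: (num_tiles, in_dim) features
        h = self.embed(tiles)    # per-instance embeddings
        bag = h.mean(dim=0)      # aggregate over all tiles / lesions
        return self.head(bag)    # one prediction for the whole bag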
Jianan tells us one of the biggest challenges has been collecting data.
“When we look at survival of patients, we collect patient data that has had
some kind of treatment,” he tells us. “You might be surprised, but 80 per
cent of colorectal cancer liver metastasis patients, which I studied for my
paper, can’t receive liver resection and that is the only curative treatment
for them. That’s why we don’t have a lot of data on them. Existing datasets
include mostly unifocal patients because we know how to treat them, but
it’s difficult to collect a large database with multifocal cancer patients. We
don’t know how multiple tumors affect patient survival or how aggressive a
tumor is.”
Thinking about next steps, Jianan says there are many ways he hopes to
take this work forward, including by validating its findings using
independent datasets, and extending it to other diseases and imaging
modalities.
Jianan received his Bachelor’s and Master’s degree in communications
engineering and artificial intelligence, respectively, before switching to
medical imaging with machine learning for his PhD.
“I wanted to do something meaningful with my research that could affect
real people,” he reveals. “I think cancer research is very important.”
Finally, we are keen to know how Jianan has found working with Anne
Martel, a good friend of this magazine.
Teodora Popordanoska is a PhD student at KU Leuven in Belgium, under the supervision of Matthew Blaschko. MICCAI this year is an extra special occasion for Teodora as she has had her first accepted conference paper! Her work investigates the relationship between model calibration and unbiased volume estimation. She spoke to us ahead of her poster session.
The second part of the title refers to volume estimation. In medical image analysis, the segmentation of an image is usually performed with a neural network and is mostly used to calculate certain biomarkers. In the medical domain, the volume of a tumor, organ, or lesion are important biomarkers. From a segmentation, one can obtain the volume by summing up the probability scores for each voxel. An important quantity in this case is the bias of the volume estimate. Ideally, to have the true volume, there would be zero bias.
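As a minimal sketch, the soft volume estimate described above can be computed directly from the network's per-voxel probabilities (the function name and the voxel-volume parameter are illustrative):

import torch

def expected_volume(probs, voxel_volume_mm3):
    # probs: (D, H, W) per-voxel foreground probabilities from the network.
    # The soft volume estimate is the sum of probabilities times the
    # physical volume of a single voxel.
    return probs.sum() * voxel_volume_mm3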
“The main theoretical result from this paper is that the absolute value of the bias is upper bounded by the calibration error,” Teodora clarifies.

“If we’re optimizing calibration, we’re also simultaneously reducing the bias of the volume estimate. If the calibration error is zero, then it means that
we had an unbiased volume estimate, and we get the true volume. In this
sense, the result is not specific to the architecture of the model or the type
of organ or tumor or whatever we’re measuring the volume of.”
The result is a fundamental mathematical result that has been empirically
validated on two datasets and 18 models, trained with several loss
functions and calibrated with multiple calibration strategies.
Designing a calibrated model by itself is not a trivial task, so the team are
currently working on developing a new method of training well-calibrated
models.
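For readers who want to experiment, here is a standard binned expected calibration error (ECE) estimator for voxel-wise predictions; the exact calibration metric and estimator used in the paper may differ:

import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    # Binned ECE for binary (foreground/background) voxel predictions:
    # in each confidence bin, compare the mean predicted probability with
    # the empirical foreground frequency, weighted by the bin size.
    probs, labels = probs.ravel(), labels.ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece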
Teodora is the third Macedonian that we have interviewed over the years,
following her friend Ivona Najdenkoska just yesterday and Jelena Frtunikj
at CVPR19, as part of our Women in Science series.
“Macedonia is a very small but beautiful country,” she tells us. “The people
are warm and nice. I’m always glad to go back there!”
Teodora reveals the most challenging part of this work for her was
discussing the clinical relevance of the theoretical results.
“This is my first time working with medical data, but thankfully all my co-
authors have plenty of experience in this area and in the medical domain, so
it worked out great,” she adds.
“In medical applications, we want the models to be trustworthy and we
want to have unbiased volume estimates. With our work, we are showing
that focusing on model calibration is sufficient and calibration error is in
fact a superior model selection criterion also with respect to volume bias.”
Interview: Mert Sabuncu (Best of MICCAI 2021)

Mert, can you tell us about what you do in the Sabuncu Lab at Cornell?

Our research group focuses on developing computational methods, i.e., algorithms, for medical imaging problems, including anything from MR acquisition to downstream applications using medical imaging data in the context of clinical workflows.

How did you come to work in the field of medical imaging?

I did my PhD at Princeton University in electrical engineering. Princeton doesn’t have a medical school and when I first started my PhD, I wasn’t sure what I wanted to work on. I knew that I wanted to do something that involved image processing, and in the early 2000s, there was some excitement around machine learning, so I started learning about that. While kicking around and thinking about it, I ran into a bunch of neuroscientists at Princeton who did functional MRI research. One of the group, James Haxby – who is now at Dartmouth College – introduced me to some image analysis problems in fMRI. That was my first experience with medical imaging. Then I had the opportunity during grad school to intern at Siemens Corporate Research - Siemens Healthineers now - which is also located in Princeton, New Jersey. I worked there over the summer and during the year I would collaborate with them. They have a heavy focus on medical imaging.

What convinced you that this focus was the right one for you?

I’m a strong believer that the research we do should have a real-world impact. I’m very pragmatic in
that sense. I view the healthcare and biomedicine domain as a very interesting area where we can make a strong impact and hopefully have a positive influence on many people’s lives. In my family, I have been surrounded by doctors and medical researchers from a very young age, and I guess that’s what first attracted me to biomedical research.

Do you feel that the work that you and your team are doing, and the work that the MICCAI community is doing, is having that impact?

I think that as a community what we do at MICCAI has a big impact. I also think that the research that comes out of my group has an impact in sometimes very non-obvious ways. It’s important to remind oneself that what we do can feel a little bit like basic research. A little bit removed from real-world applications. There are still some open questions that we need to work on before we can take our technologies and apply them to real-world problems. That said, as I move along my career trajectory, I’m hoping I will increase my emphasis on real-world translation. That’s why nowadays I’m focusing on real clinical collaborations and understanding clinical workflows. I’m hoping to put more effort into translating our technologies into the clinic, either by commercialization efforts, or at least trying to implement things in an academic hospital setting.

Why does it take so long sometimes to translate MICCAI research into the clinic?

There is always a gap. Whatever field you’re in, especially in the academic setting, there’s going to be a gap between the research and the way it impacts people’s lives in the real world. The healthcare world has a lot of stakeholders, including patients, insurance companies, regulatory bodies, hospitals, doctors, and researchers. Having them all aligned and getting through all their needs and requirements is a big challenge. Taking an idea from a concept to a real product that can be used in a clinical workflow on patients, there are a lot of obstacles along the way.

The other challenge is as researchers on the algorithmic side, we often focus on toy problems. That is a good starting point but can be distracting in terms of what matters in the real world. We spend a lot of our bandwidth on these “artificial problems” and make good progress, but to take those breakthroughs and translate them into the real world, there are other challenges that we aren’t focusing on. There are obviously exceptions, but there’s not a lot of incentive, at least in the academic world, to move along those steps. A lot of those incentives are in the more basic research.

Do you think today’s CLINICCAI, MICCAI’s first clinical day, is a step in the right direction?

I think these types of ideas and especially communities like MICCAI make a difference. They enable different groups of people to communicate with each other to…
Women in Science (Best of MICCAI 2021)
Doctors say you should go during the morning and avoid it from noon to 4 PM, but most people don’t do that. Here, people spend the entire day at the beach.

We like the beach!

We do! We are Portuguese!

I love Albufeira. What brought you to this field?

That’s a very interesting question. I have a biomedical engineering degree. I studied many engineering topics, from chemistry and physics to mathematics. I didn’t know what I wanted to do, I just wanted to go into research. In the last year of my degree, I did a course on machine learning, and I fell in love with it. It was just perfect for me! Then I was looking for a Master’s thesis on this topic. This thesis about skin cancer just popped out. It was perfect! I combined the biomedical part with machine learning. Let’s give it a try! Then I realized that this is very challenging. Skin imaging is in between real imaging, that you find in different computer vision tasks, and medical imaging, which are 2-D images. You have challenges from the two sides. That combined with machine learning just convinced me to keep working on this!

Did you meet with some doctors before you started?

Yes, the co-supervisor of my Master’s thesis was a dermatologist. I met doctors and actually I still work with doctors. We need them to keep working on these things, not only for the data but for everything, for feedback, to understand if we are going in the right direction, or if we need to change a little bit.
Sometimes we have different ideas.

You told us only half of the story - the half of the story about the research you are doing. I know that you are also teaching. Can you tell us a little bit about the second part of the story?

It’s funny because I just came out of a meeting to prepare for the next course on machine learning that starts in two weeks. At the moment, we have more than 300 students in this course. As you can imagine, it has changed from 50 students ten years ago to 300 now. A lot of people are interested in these topics. We have to let them know, for example, about these recent deep learning models that everyone is hearing about. We have to contextualize these for the students. That’s mainly what I do. I teach machine learning at the university. I also supervise students. I supervise several Master’s students in different computer vision topics and in robotics as well, mostly using machine learning. We try to challenge them to use recent methods.

You have supervised more female students than male students. How did this come about?

To be honest, I don’t know. I don’t know if it’s because I am a female supervisor. Sometimes I know that it plays a role, although my PhD and Master’s supervisor was male. We share the students sometimes. What is changing is the way women approach engineering. When I took the machine learning course 10 years ago… no 11 years ago, there were six female students out of about 60 students. It’s less than one tenth of the students. Now, we have…

Let’s say that I’m your little brother, and I want to start teaching. What advice would you give me?

The first thing you need to pay attention to is first of all, you need to prepare the classes. It’s a mistake to imagine that you just go in there. If you have a background in machine learning, you can’t just go there and rely on that background. You have to prepare. The second thing that you definitely need to do is guide the students. That will make a big difference in the way you teach. As I try to engage with the students, to make them talk to me, tell me their difficulties and if they understand. That allows me to conduct the class.

How do you engage them? It’s difficult to engage people!

Yes! Especially in this Covid time. Everyone is just taking classes online.

Share your secret with us please!

[laughs] In these 30-student classrooms, at a certain point, you start to know their names. You can try to call them by their names. That’s something I did for the Zoom classes. I remember some students were more talkative than others. I remember saying: “I want to hear a different voice.” I asked them to try to talk to me. Just looking at their faces, I don’t understand if they are following what I’m saying or not. If you are solving a problem, try to make them solve a problem with you. It’s hard. Like I said, it depends on the shift. Some students are very proactive, some aren’t. Some are afraid to fail.
Potential of Transformers for 3D Medical Image Segmentation (by NVIDIA and MONAI)

Introduction
Recently, transformer-based models have gained a lot of traction in natural
language processing and computer vision due to their capability of learning
pre-text tasks, their scalability, and their better modeling of long-range
dependencies in sequences of input data. In computer vision, vision transformers
and their variants have achieved state-of-the-art performance by large-scale
pretraining and fine-tuning on downstream tasks such as classification, detection
and segmentation. Specifically, input images are encoded as a sequence of 1D
patch embeddings, and self-attention modules learn a weighted sum of values
that are calculated from hidden layers. As a result, this flexible formulation
allows us to effectively learn long-range information. This warrants the question:
what is the potential of transformer-based networks in medical imaging for 3D
segmentation?
Novel proposed methodologies that leverage transformer-based or hybrid
(CNN+transformer) approaches have demonstrated promising results in medical
image segmentation for different applications. In this article, we will deep dive
into one such network architecture (UNETR) and will also evaluate other
transformer-based approaches in medical imaging (TransUNet & CoTr).
1. UNETR
NVIDIA researchers have proposed to leverage the power of transformers for
volumetric (3D) medical image segmentation and introduce a novel architecture
dubbed UNEt TRansformers (UNETR). UNETR employs a pure vision
transformer as the encoder to learn sequence representations of the input
volume and effectively capture the global multi-scale information, while also
following the successful U-shaped network design for the encoder and decoder.
Why UNETR: Although Convolutional Neural Network (CNN)-based approaches
have powerful representation learning capabilities, their performance in learning
long-range dependencies is limited by their localized receptive fields. As a result,
such a deficiency in capturing multi-scale contextual information leads to sub-
optimal segmentation of structures with various shapes and scales.
Qualitative comparison of different baselines. UNETR has a significantly better segmentation accuracy for left and right adrenal glands, and UNETR is the only model to correctly detect branches of the adrenal glands.

Comparison of number of parameters, FLOPs and averaged inference time for various models in BTCV using a sliding window approach.

Overview of UNETR architecture. A 3D input volume is divided into a sequence of uniform non-overlapping patches and projected into an embedding space using a linear layer. The sequence is added with a position embedding and used as an input to a transformer model. The encoded representations of different layers in the transformer are extracted and merged with a decoder via skip connections to predict the final segmentation.
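A minimal sketch of the patch-embedding step the caption describes, assuming a 96³ input and 16³ patches (a strided Conv3d is a common way to implement the per-patch linear projection; the official UNETR code may differ):

import torch
import torch.nn as nn

class PatchEmbedding3D(nn.Module):
    # Split a 3D volume into non-overlapping patches, linearly project each
    # flattened patch into an embedding, and add a learnable position
    # embedding. A strided Conv3d is equivalent to the per-patch linear layer.
    def __init__(self, patch=16, in_ch=1, dim=768, img_size=96):
        super().__init__()
        n_patches = (img_size // patch) ** 3
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, x):                  # x: (N, C, D, H, W)
        z = self.proj(x)                   # (N, dim, D/p, H/p, W/p)
        z = z.flatten(2).transpose(1, 2)   # (N, n_patches, dim)
        return z + self.pos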
In the spirit of open innovation and to accelerate research in this emerging
field, NVIDIA has open-sourced UNETR via the MONAI GitHub public repository. In
addition, a standalone UNETR implementation is available in the MONAI research
contributions repository. Furthermore, two UNETR tutorials (pure MONAI and
MONAI + PyTorch Lightning) for multi-organ segmentation using the BTCV dataset
are available in the MONAI tutorials repository for researchers to further explore
this methodology in practice.
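A minimal usage sketch with the MONAI implementation (assuming MONAI >= 0.7; argument names and defaults may differ slightly across versions, so check the MONAI documentation):

import torch
from monai.networks.nets import UNETR

model = UNETR(
    in_channels=1,
    out_channels=14,        # e.g. 13 BTCV organs + background
    img_size=(96, 96, 96),  # size of the input sub-volume
    feature_size=16,
)
logits = model(torch.randn(1, 1, 96, 96, 96))  # -> (1, 14, 96, 96, 96)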
Two notable approaches that have leveraged transformers for medical image
segmentation are TransUNet and CoTr. These approaches will be discussed in
detail in the following sections.
2. TransUNet
TransUNet is a 2D hybrid CNN-Transformer segmentation model that leverages a
vision transformer (ViT) as a standalone layer in the encoder of a UNet
architecture. Specifically, TransUNet uses a CNN as a feature extractor to
generate feature maps as input to the ViT model in the bottleneck of the
architecture. The ViT model uses self-attention layers to effectively process the
extracted feature maps, which are then fed into the decoder for computing the final
segmentation output. TransUNet has achieved comparable performance on the
tasks of multi-organ segmentation using the BTCV dataset as well as the Automated
Cardiac Diagnosis Challenge (ACDC) for automated cardiac segmentation.
Here is the paper explaining the architecture and the approach in further detail,
while the code and models are available here.
Unlike the 2D TransUNet model, CoTr and UNETR utilize volumetric inputs and
hence can benefit from the spatial context of data. UNETR and TransUNet both use the
transformer layers of the ViT model whereas CoTr leverages a deformable
transformer layer that narrows down the self-attention to a small set of key
positions in an input sequence. In addition, each of these models utilizes the
transformer layers differently in their architecture. TransUNet uses the
transformer layers in the bottleneck of a UNet, while CoTr utilizes them in
between the CNN encoder and decoder by connecting them in different
scales via skip connections. On the other hand, UNETR uses the transformer
layers as the encoder of a U-shaped architecture and generates input
sequences by directly utilizing the tokenized patches. The transformer
layers of UNETR are connected to a CNN decoder via skip connections in
multiple scales.
Conclusion
Convolutional neural networks (CNNs) have been the de facto standard for
3D medical image segmentation so far. However, Transformers have the
potential to bring a fundamental paradigm shift with their strong innate
self-attention mechanisms and hold the potential to serve as strong
encoders for medical image segmentation tasks. The pre-trained
embedding can then be adapted for various downstream tasks (e.g.,
segmentation, classification, and detection). In the years to come, we will see
new breakthroughs powered by Transformers for medical imaging - the
future is exciting, so we should brace ourselves.
Startup Village (Best of MICCAI 2021)
The results of his work show the feasibility of using mobile devices to improve
neurosurgical processes. Augmented reality enables surgeons to focus on the
surgical field while getting intuitive guidance information. Mobile devices
also allow for easy interaction with the neuronavigation system thus enabling
surgeons to directly interact with systems in the operating room to improve
accuracy and streamline procedures. To encourage further research and
accelerate the pace of innovation, Étienne released the developed application
under an open source license, making it accessible to others to reuse and
keep improving upon.
On the left: screenshot of the system when used in standard IGNS mode.
On the right: screenshot of the system when used in augmented reality mode.
ICCV Workshop Preview: Computer Vision for Automated Medical Diagnosis
Yuyin Zhou is a postdoctoral researcher
at Stanford University, working with
Matthew Lungren and Lei Xing on
medical image analysis and other
related machine learning problems.
She is also co-organizing the very
first Computer Vision for Automated
Medical Diagnosis workshop at this
year’s ICCV. With the conference only
a few weeks away, Yuyin is here to tell
us what to expect.
Machine learning has been a helpful tool for doctors in dealing with different medical imaging problems for some time now. It can also support disease diagnosis and treatment planning. Over the past few years, there has been a great deal of progress made in this area because of huge advancements in computer vision and artificial intelligence techniques. Problems such as medical image registration, structure detection, and tissue and organ segmentation have achieved state-of-the-art performance, while many new medical devices have been developed in conjunction with industry.

In spite of this, the safe and reliable adoption of such technologies in hospitals in the real world remains limited, and many problems, such as cancer diagnosis, are still not solved.

“This is the key reason why we have created this workshop,” Yuyin reveals.
Their first meeting at ICCV later this month already has a stellar line-up of speakers on board, including Russell Taylor from Johns Hopkins University, who will be discussing Autonomy and Semi-Autonomous Behavior in Surgical Robot Systems.

“I think this topic is one of the most important and has not been addressed enough in the computer vision community,” Yuyin tells us.

“We’ve also got Demetri Terzopoulos…”

Other confirmed speakers include Yizhou Yu – How should machines analyze medical images to aid diagnosis? – and Lena Maier-Hein – Statistics meets machine learning in biomedical image analysis. There will also be 10 engaging oral talks from UCLA, University of Oxford, Google, and more.
This is not the first medical computer vision workshop – CVPR has had a similar event for a number of years now. The community recognizes and understands the importance of the topic, so the foundations have been laid for this new meeting at ICCV to be a success this year and for many more to come. However, the team do not intend to create a carbon copy of another event. They plan to discuss topics which haven’t been covered before. But with so much on the menu at ICCV this year, including another medical workshop, how should attendees choose between this one and everything else on offer?

“What really sets our workshop apart from others is that we are focusing on the general challenges in the medical computer vision arena,” Yuyin points out.

“We’re going to give a more holistic and complete view of this field and we want to discuss things from broader perspectives – not just medical computer vision like CT and MRI, but also NLP, medical robots, surgical planning, and how to better adapt existing computer vision and machine learning expertise into all of these different medical problems.”

The workshop is very close to Yuyin’s own body of work. During her PhD career at Johns Hopkins University, the team have been working on the Felix Project, which aims to detect pancreatic cancer earlier.

“We started from pancreas segmentation and went deeper into pancreatic tumor segmentation and detection problems,” Yuyin explains.