Recognition in Ultrasound Videos: Where am I?

Roland Kwitt¹, Nuno Vasconcelos², Sharif Razzaque³, and Stephen Aylward¹

¹ Kitware Inc., Carrboro, NC, USA
² Dept. of Electrical and Computer Engineering, UC San Diego, USA
³ Dept. of Computer Science, UNC, Chapel Hill, USA
Abstract. A novel approach to the problem of locating and recognizing
anatomical structures of interest in ultrasound (US) video is proposed.
While addressing this challenge may be beneficial to US examinations in
general, it is particularly useful in situations where portable US probes
are used by less experienced personnel. The proposed solution is based
on the hypothesis that, rather than their appearance in a single image,
anatomical structures are most distinctively characterized by the variation of their appearance as the transducer moves. By drawing on recent advances in the non-linear modeling of video appearance and motion, using an extension of dynamic textures, successful localization and recognition is demonstrated on two phantoms. We further analyze computational demands and take a preliminary look at insensitivity to anatomic variations.
1 Motivation
Many developing countries, as well as rural areas of developed nations, do not
have immediate access to expensive medical imaging equipment such as MR or
CT. As highlighted in a recent study [8], ultrasound (US) imaging is particularly well-suited for those underserved areas, since it is low-cost, versatile and
non-invasive. Additionally, medical care in emergency vehicles and military field
operations may benefit from US when delivering first care. However, as emphasized in [8], personnel need training to unlock the full potential of US imaging; yet this training requirement can be problematic in rural and emergency situations due to cost and circumstance. In addition to US interpretation skills, high US acquisition quality is essential but often difficult to achieve.
A cost-effective and practical solution to the training challenge is to embed expertise into the US system and/or a connected mobile device. This has
led other groups to attempt to embed computer-aided diagnosis (CAD) systems into imaging devices. We instead seek to help an inexperienced operator
acquire clinically significant images that can then be transferred to a central
location for reading by experts. This approach should be easier to achieve and
more broadly useful than an embedded CAD system. The technical challenge
reduces to developing algorithms for recognizing key anatomic structures as the
US videos are acquired. Based on those localizations, the system can then convey
to the operator how to acquire additional images, relative to those key locations,
for transmission to expert readers, or can indicate when US video must be reacquired to meet quality requirements. Apart from being helpful to novice users,
Fig. 1. (a) Assembly of the data matrix Y from a video sequence S; (b) Generative model for the KDT.
we argue that automated recognition of anatomical structures might also be
beneficial to experienced physicians, since it could help to minimize the risk of
misidentifications.
The objective of this work is to promote a new pathway for locating anatomical structures while moving a US transducer. The key idea is to avoid image-to-image comparisons against an atlas and instead to exploit the full spatio-temporal
information of the US video sequences. It is argued that the appearance changes
of anatomical structures, due to probe motion, are particularly distinctive for
their localization. Technically, we draw on recent advances in video modeling in
computer vision. The varying appearance of an anatomical structure is represented by a generative video model, known as the kernel dynamic texture [2].
Similarity between video sequences is then defined as similarity in the parameter space of this model. Since we propose storing a database of key-location US
sequences on portable devices and performing real-time analysis of US videos
as they are acquired, generative models are particularly useful. In our case, we
only need to augment the database with the KDT model parameters (which have a small memory footprint), and distances can be computed very efficiently.
While classification of US images has been previously studied (e.g., [7]), to
the best of our knowledge, this is the first work to tackle localization on the basis
of dynamic US sequence information. This paper presents 1) our application of
the kernel dynamic texture algorithm, 2) a preliminary study on sensitivity and
specificity using phantoms (admittedly for a limited range of the relevant problem space) and 3) a study on robustness towards simulated anatomic variations
between the modeled structures to be localized and the actual observations.
2 Recognition with Kernel Dynamic Textures
We selected dynamic texture (DT) [5] models as an appropriate class of generative models for capturing video appearance changes. DT models arose from
computer vision and were selected for US modeling because of the prominent role
texture plays in US images, e.g., compared to edges or intensity. In particular,
we exploited a recent non-linear extension of the DT family, denoted the kernel
dynamic texture (KDT) [2], to capture non-linear appearance changes that will
occur as structures move into and out of the ultrasound imaging plane.
Consider a US sequence S as an ordered sequence of T video frames, i.e., S = (y_0, . . . , y_{T−1}), where y_t ∈ R^d is the frame observed at time t. Under the DT framework of [5], these observations are modeled as samples of a linear dynamical system (LDS). At time t, a vector of state coefficients x_t ∈ R^T is first sampled from a first-order Gauss-Markov process, and the state coefficients are then linearly combined into the observed video frame y_t, according to

x_t = A x_{t−1} + w_t,   (1)
y_t = C x_t + v_t,   (2)

where A ∈ R^{T×T} is the state-transition matrix and C ∈ R^{d×T} is the generative matrix that governs how the state determines the observation. Further, w_t and v_t denote state and observation noise, with w_t ∼ N(0, I) and v_t ∼ N(0, R), respectively. Assuming that the observations are centered⁴ and following the system identification strategy of [5], C is estimated from the SVD of the data matrix Y = [y_0 · · · y_{T−1}] = U Σ V^⊤ by setting C = U. The state matrix X = [x_0 · · · x_{T−1}] is estimated as X = Σ V^⊤, and A can be computed by least squares as A = [x_1 · · · x_{T−1}][x_0 · · · x_{T−2}]^†, where † denotes the pseudoinverse. When the DT model is restricted to N states, C is restricted to the N left singular vectors corresponding to the N largest singular values; the rest follows accordingly. Due to space limitations, we refer to [5] for details on noise parameter estimation. In the non-linear DT extension of [2], the generative matrix C is replaced by a non-linear observation function C : R^T → R^d, i.e.,

y_t = C(x_t) + v_t,   (3)
while keeping the state evolution linear. The corresponding dynamical system is denoted a kernel dynamic texture (KDT), shown in Fig. 1(b). The non-linearity of C requires a different, although conceptually equivalent, set of parameter estimates. The idea is to use kernel PCA (KPCA) to learn the inverse mapping D : R^d → R^T from observation to state space; the KPCA coefficients then represent the state variables.⁵ We note that KDTs are not necessarily restricted to intensity observation matrices; they work with any kind of feature for which a suitable kernel can be defined, cf. [3].
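To make the identification step above concrete, the following Python sketch estimates the linear-DT parameters (A, C, X) from a data matrix via SVD and least squares. This is our own hedged reconstruction of the procedure of [5] (the authors' code was MATLAB), not their implementation:

```python
import numpy as np

def estimate_dt(Y, n_states):
    """Estimate linear-DT parameters (A, C, X) from a d x T data matrix Y
    whose columns are video frames, via SVD and least squares."""
    Y = Y - Y.mean(axis=1, keepdims=True)          # center the observations
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                            # generative matrix
    X = np.diag(s[:n_states]) @ Vt[:n_states, :]   # state matrix
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])       # state-transition (least squares)
    return A, C, X
```

Restricting to `n_states` columns of U corresponds to keeping the N largest singular values, as described above.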
Additionally, we adopt the distance measure from [2] for measuring the similarity of two video sequences, S_a and S_b; this approach was chosen for its speed. It is based on an adaptation of the Martin distance [6] between the corresponding DTs D_a = (A_a, C_a) and D_b = (A_b, C_b) with N states each.
The (squared) Martin distance, given by [6,4]

d²(S_a, S_b) = − log ∏_{i=1}^{N} cos²(φ_i),   (4)

is based on the subspace angles φ_i between the infinite observability matrices O_a and O_b, defined as [4] O_a := [C_a^⊤ (C_a A_a)^⊤ (C_a A_a²)^⊤ · · ·]^⊤. In fact, the cos(φ_i)
⁴ Centering is straightforward by subtracting the column-wise means of Y.
⁵ See supp. material to [2] for centering in the feature space induced by the kernel.
Fig. 2. Illustration of the noodle phantom, made of gelatine and Soba noodles (left three images) and an abdominal CIRS phantom mounted in a water tank (right).
correspond to the N largest eigenvalues λ_i of the generalized eigenvalue problem

  [ 0     O_ab ] [x]       [ O_aa  0    ] [x]
  [ O_ba  0    ] [y]  = λ  [ 0     O_bb ] [y],   with O_ab = O_a^⊤ O_b,   (5)

subject to x^⊤ O_aa x = 1 and y^⊤ O_bb y = 1. For DTs, computation of O_ab is straightforward, since the terms C_a^⊤ C_b can be evaluated. For KDTs, it can be shown that computation of the (no longer available) products C_a^⊤ C_b boils down to computing the inner products between the principal components of the kernel matrices K^a_{ij} = k(y^a_i, y^a_j) and K^b_{ij} = k(y^b_i, y^b_j), i.e.,

O_ab = ∑_{n=0}^{∞} (A_a^n)^⊤ C_a^⊤ C_b A_b^n  (DTs)   →   O_ab = ∑_{n=0}^{∞} (A_a^n)^⊤ α̃^⊤ G β̃ A_b^n  (KDTs),   (6)

where α̃ = [α̃_0 · · · α̃_{T−1}] and β̃ are the (normalized) KPCA weight matrices with α̃_i = α_i − (1/N)(e^⊤ α_i)e, and G is the kernel matrix with entries G_ij = k(y^a_i, y^b_j). In the remainder of the paper, we use (4)-(6) for measuring similarity between US sequences and a standard RBF kernel for all kernel computations.⁶
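For the linear-DT case, the Martin distance in (4) can be approximated by truncating the infinite observability matrices and computing the principal angles between their column spaces (the normalization constraints in (5) make the λ_i exactly these cosines). The sketch below is a simplified illustration under that truncation assumption; the KDT variant in (6) would replace C_a^⊤ C_b by the kernelized inner products:

```python
import numpy as np

def martin_distance_sq(A_a, C_a, A_b, C_b, n_terms=20):
    """Squared Martin distance between two linear DTs, approximated by
    truncating the infinite observability matrix O = [C; CA; CA^2; ...]
    and computing principal angles between the two column spaces."""
    def observability(A, C):
        blocks, M = [], C
        for _ in range(n_terms):
            blocks.append(M)
            M = M @ A
        return np.vstack(blocks)
    Qa, _ = np.linalg.qr(observability(A_a, C_a))
    Qb, _ = np.linalg.qr(observability(A_b, C_b))
    # singular values of Qa^T Qb are the cosines of the subspace angles
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 1e-12, 1.0)
    return -2.0 * np.sum(np.log(cosines))  # = -log prod_i cos^2(phi_i)
```

Identical models yield a distance of zero, since all subspace angles vanish.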
For localization, we follow a sliding-window strategy, measuring how well a key sequence matches a subsequence of a long path of acquisitions (i.e., the search sequence P_n). That is, given Q frames in a key sequence, we move a sliding window W_i of Q frames along a path in p-frame increments. For each W_i, we estimate the KDT parameters and compute the Martin distance to the KDT of the key sequence. A key sequence is indicated in a search sequence when the distance is minimal. At this stage, these minima are illustrative; as more data and specific applications evolve, statistical likelihood methods will be used.
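The sliding-window search can be summarized as follows. This is a schematic sketch, not the authors' implementation; `fit_model` and `distance` are placeholders for KDT estimation and the Martin distance:

```python
import numpy as np

def localize(search_frames, key_model, Q, p, fit_model, distance):
    """Slide a Q-frame window along the search sequence in p-frame steps,
    fit a model per window, and return (best window start, all distances)."""
    starts = list(range(0, len(search_frames) - Q + 1, p))
    dists = [distance(fit_model(search_frames[s:s + Q]), key_model)
             for s in starts]
    best = starts[int(np.argmin(dists))]   # minimal distance flags the key location
    return best, dists
```

With toy stand-ins (window mean as the "model", absolute difference as the "distance"), the minimum lands on the window matching the key, mirroring the distance plots in Fig. 5.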
3 Experimental Protocol
For the studies in this paper, we use two different kinds of phantoms: 1) a
homemade noodle phantom made of gelatine with embedded Soba noodles and
2) a triple modality 3D abdominal phantom (CIRS Model 057) mounted in a
water tank, see Fig. 2. The noodle phantom is particularly useful, since the
⁶ For KPCA, the kernel width is set as σ² = median_{i,j} ‖y_i − y_j‖²; to compute G_ij, it can be shown [2] that y^a_i and y^b_j need to be scaled by σ_a and σ_b first.
Fig. 3. Snapshots of three key structures at two time points on the noodle phantom.
noodles are self-similar at a small scale, have ambiguous patterns of bends at
medium scales, and at large scales and in US sequences present a rich set of
structures that are difficult to casually distinguish.
For imaging we use the Telemed LogicScan 128 INT-1Z kit. The US frequency is set to 5 MHz. Penetration depth is 90 mm on the noodle phantom and 150 mm on the abdominal phantom. Speckle reduction is enabled in the US acquisition software. All images were acquired freehand, without tracking. We learn N = 5 state KDTs and clip each sequence to a central 300 × 300 (noodle phantom) or 200 × 200 (abdominal phantom) pixel window. Using more states did not lead to any improvements in the presented results.
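The median-heuristic kernel width used for KPCA (see the footnote above) can be sketched as follows. This is an illustrative sketch: we assume the convention k(x, y) = exp(−‖x − y‖²/σ²), which may differ from the exact scaling used in [2]:

```python
import numpy as np

def rbf_kernel_median(Y):
    """RBF kernel matrix for row-wise samples Y, with the kernel width set
    by the median heuristic: sigma^2 = median_{i,j} ||y_i - y_j||^2."""
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = np.median(sq_dists)
    return np.exp(-sq_dists / sigma2), sigma2
```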
3.1 Localization of structures within US sequences
The first experiment tested whether it is possible to localize key structures in
the noodle phantom. Two different sets of acquisitions were made. The first set
was composed of short (40 frames) US key sequences Sn , captured by moving
the US transducer over three different key structures to be localized, see Fig. 3.
Key structures were chosen ad hoc by the probe operator. We then estimated
KDT models for each of the three key structures using this first set of data.
The second set was composed of longer US search sequences Pn , acquired along
multiple paths on the noodle phantom; these simulated searches for the key
structures, see Fig. 4. On both sets, we tried to minimize probe tilt and rotation, but some rotation and tilting was inevitable. Note also that the acquisition direction of the key sequences matched the acquisition direction of the search sequences.
To evaluate sensitivity, we performed the localization using key sequences
applied to search sequences that also covered the corresponding key structures.
Distance plots are shown on the left-hand side of Fig. 5. To evaluate specificity,
we repeated this experiment along multiple search paths that did not cover any
of the key structures. Distance plots are on the right-hand side of Fig. 5.
To evaluate the robustness against shifts of the ultrasound imaging plane
(e.g., partial inclusion of a key structure), we performed ten runs with random
displacements δ_x, δ_y of the clipping window in the x and y directions, with δ_x, δ_y ∈ {−5, . . . , 5} pixels. Fig. 5 shows the Martin distance averaged over all clipping
window positions for each sliding window index along three search paths (left).
The enclosing light-blue hull illustrates the standard deviation.
Based on the above three experiments, we make the following observations: 1) when a key structure is encountered along a search sequence, it appears as a global minimum of the Martin distance; 2) the Martin distance decreases as
Fig. 4. Illustration of (a) the acquisition process on the noodle phantom, (b) sliding windows overlapping key structures (yellow) and (c) probe movements. In (b), hand-annotations t+ and t− bracket where the sliding window overlaps the key structure.
the sliding window approaches a key structure and increases as it leaves the key structure; 3) if a key structure is not encountered along a search path, then there is no distinctive minimum in the distance measurements.
3.2 Localizing a hepatic vessel on an abdominal phantom
Our second experiment is more challenging in the sense that we try to locate a
more subtle structure, namely a specific section of a hepatic vessel in an abdominal phantom. The experimental protocol is similar to the previous experiment;
however, US transducer tilting, also known as angulation [1] (see Fig. 4), is used
instead of translation along a path. We attempt to localize the hepatic vessel key
structure within a single search sequence. The key sequence acquisition spans
≈ 40◦ around the angle where the vessel is visible. The search sequence covers
≈ 140◦ around the vessel. Again, all acquisitions were performed freehand, and
the ultrasound probe was repositioned on the phantom between each acquisition. Fig. 6 shows the Martin distances for localization and for localization using
shifted clipping windows.
This experiment highlights two things. First, we can again localize the key
sequence within the longer search sequence, even though the span of the minimal Martin distances that correspond to the true location of the vessel is less
prominent and less persistent than in the previous experiment. Second, variation
in the distance measurements is much higher for small vessels than for the more
distinct, larger structures from the noodle phantom.
3.3 Localization in the presence of simulated anatomical variation
Our third experiment focused on the insensitivity of the distance metric and the
localization method to anatomic variations. Specifically, we simulate anatomic
variation by (non-linear) spatial distortion of the search sequences. We admit
that this does not cover the range of variation among individuals, but it does
begin to give an impression of robustness. Fig. 7 shows the distance measurements when the key structure is encountered within the search sequence. The
Fig. 5. Martin distance between the key sequence KDT and the sliding-window KDTs for three different paths, averaged over ten random clipping window positions (left); sliding window positions where the key structure is covered to some extent are marked light gray (from manual annotation); distance measurements when trying to locate a key structure that was not covered by a path (right).
distortions are illustrated on a checkerboard pattern. Note that the underwater distortion changes over time.⁷ As shown, the distortions do not have a negative impact on the localization; however, we observe less distinctive minima.
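A distortion of this kind can be sketched as follows. This is a hypothetical re-creation for illustration; the exact distortion filters used for Fig. 7 are not specified in the text:

```python
import numpy as np

def wave_distort(frame, amplitude=3.0, wavelength=40.0, phase=0.0):
    """Sinusoidally displace image columns as a simple stand-in for
    non-linear anatomic variation; varying `phase` over frames mimics a
    time-varying ('underwater') distortion."""
    h, w = frame.shape
    rows, cols = np.indices((h, w))
    src = cols + amplitude * np.sin(2.0 * np.pi * rows / wavelength + phase)
    src = np.clip(np.rint(src), 0, w - 1).astype(int)  # nearest-neighbor lookup
    return frame[rows, src]
```

Applying this per frame of a search sequence (with a per-frame phase for the time-varying case) yields distorted sequences against which the undistorted key-sequence models can be matched.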
3.4 Discussion and future work
We conclude that the combination of the KDT framework and the Martin distance is effective for localizing short US sequences of key structures. Further, computational requirements are modest, even for our unoptimized (MATLAB) code and use of intensity features. For example, given a key sequence of 40 frames, computing the KDT model and the distance metric per sliding window requires ≈ 0.1 seconds on an Intel Core i7 1.7 GHz CPU with 8 GB of RAM. It is also worth noting that we can perfectly recognize the short US key sequences using a simple nearest-neighbor classifier with the Martin distance as the metric. For future work, we note that using intensity information as our observation space has its limitations. Due to the generic nature of the KDT approach and the KPCA-based system identification, we could easily integrate more specifically tailored US features as long as we can define a suitable kernel.
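The nearest-neighbor recognition mentioned above reduces to a minimum-distance lookup over the stored key-sequence models (schematic; `distance` stands in for the Martin distance between fitted KDTs):

```python
def nearest_neighbor_label(query_model, database, distance):
    """database: iterable of (label, model) pairs; return the label of the
    model closest to query_model under the given distance."""
    return min(database, key=lambda item: distance(query_model, item[1]))[0]
```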
⁷ See supp. video at http://vimeo.com/rkwitt.
Fig. 6. Martin distance measurements for three clipping window positions and distance measurements averaged over all random runs (left); illustration of the hepatic vessel appearing and disappearing in the key sequence (right).
Fig. 7. Distortion experiments on the search sequence for two different types of (non-linear) spatial distortion (illustrated on the checkerboard pattern).
Acknowledgements This work was partially funded by NSF (CCF-0830535)
and NIH/NCI (1R01CA138419-0, 2R44CA143234-02A1).
References

1. Block, B.: The Practice of Ultrasound: A Step-by-Step Guide to Abdominal Scanning. Thieme, 1st edn. (2004)
2. Chan, A., Vasconcelos, N.: Classifying video with kernel dynamic textures. In: CVPR, pp. 1–6 (2007)
3. Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: CVPR, pp. 1932–1939 (2009)
4. De Cock, K., De Moor, B.: Subspace angles between linear stochastic models. In: CDC, pp. 1561–1566 (2000)
5. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. Int. J. Comput. Vision 51(2), 91–109 (2003)
6. Martin, R.: A metric for ARMA processes. IEEE Trans. Signal Process. 48(4), 1164–1170 (2000)
7. Sohail, A., Rahman, M., Bhattacharya, P., Krishnamurthy, S., Mudur, S.: Retrieval and classification of ultrasound images of ovarian cysts combining texture features and histogram moments. In: ISBI, pp. 288–291 (2010)
8. Spencer, J.: Utility of portable ultrasound in a community in Ghana. J. Ultrasound Med. 27(12), 1735–1743 (2008)