Recognition in Ultrasound Videos: Where am I?

Roland Kwitt¹, Nuno Vasconcelos², Sharif Razzaque³, and Stephen Aylward¹

¹ Kitware Inc., Carrboro, NC, USA
² Dept. of Electrical and Computer Engineering, UC San Diego, USA
³ Dept. of Computer Science, UNC, Chapel Hill, USA
Abstract. A novel approach to the problem of locating and recognizing
anatomical structures of interest in ultrasound (US) video is proposed.
While addressing this challenge may be beneficial to US examinations in
general, it is particularly useful in situations where portable US probes
are used by less experienced personnel. The proposed solution is based
on the hypothesis that, rather than their appearance in a single image,
anatomical structures are most distinctively characterized by the variation of their appearance as the transducer moves. By drawing on recent advances in the non-linear modeling of video appearance and motion, using an extension of dynamic textures, successful localization and recognition is demonstrated on two phantoms. We further analyze computational demands and take a preliminary look at insensitivity to anatomic variations.
1 Motivation
Many developing countries, as well as rural areas of developed nations, do not
have immediate access to expensive medical imaging equipment such as MR or
CT. As highlighted in a recent study [8], ultrasound (US) imaging is particularly well-suited for those underserved areas, since it is low-cost, versatile and
non-invasive. Additionally, medical care in emergency vehicles and military field
operations may benefit from US when delivering first care. However, as emphasized in [8], personnel need training to unlock the full potential of US imaging; yet this training requirement can be problematic in rural and emergency situations due to cost and circumstance. In addition to US interpretation skills, high US acquisition quality is essential but often difficult to achieve.
A cost-effective and practical solution to the training challenge is to embed expertise into the US system and/or a connected mobile device. This has
led other groups to attempt to embed computer-aided diagnosis (CAD) systems into imaging devices. We instead seek to help an inexperienced operator
acquire clinically significant images that can then be transferred to a central
location for reading by experts. This approach should be easier to achieve and
more broadly useful than an embedded CAD system. The technical challenge
reduces to developing algorithms for recognizing key anatomic structures as the
US videos are acquired. Based on those localizations, the system can then convey
to the operator how to acquire additional images, relative to those key locations,
for transmission to expert readers, or can indicate when US video must be reacquired to meet quality requirements. Apart from being helpful to novice users,
Fig. 1. (a) Assembly of the data matrix Y from a video sequence S; (b) Generative model for the KDT.
we argue that automated recognition of anatomical structures might also be
beneficial to experienced physicians, since it could help to minimize the risk of
misidentifications.
The objective of this work is to promote a new pathway for locating anatomical structures while moving a US transducer. The key idea is to avoid image-to-image comparisons against an atlas and instead to exploit the full spatio-temporal
information of the US video sequences. It is argued that the appearance changes
of anatomical structures, due to probe motion, are particularly distinctive for
their localization. Technically, we draw on recent advances in video modeling in
computer vision. The varying appearance of an anatomical structure is represented by a generative video model, known as the kernel dynamic texture [2].
Similarity between video sequences is then defined as similarity in the parameter space of this model. Since we propose storing a database of key-location US
sequences on portable devices and performing real-time analysis of US videos
as they are acquired, generative models are particularly useful. In our case, we
only need to augment the database with the KDT model parameters (which have a small memory footprint), and distances can be computed very efficiently.
While classification of US images has been previously studied (e.g., [7]), to
the best of our knowledge, this is the first work to tackle localization on the basis
of dynamic US sequence information. This paper presents 1) our application of
the kernel dynamic texture algorithm, 2) a preliminary study on sensitivity and
specificity using phantoms (admittedly for a limited range of the relevant problem space) and 3) a study on robustness towards simulated anatomic variations
between the modeled structures to be localized and the actual observations.
2 Recognition with Kernel Dynamic Textures
We selected dynamic texture (DT) [5] models as an appropriate class of generative models for capturing video appearance changes. DT models arose from
computer vision and were selected for US modeling because of the prominent role
texture plays in US images, e.g., compared to edges or intensity. In particular,
we exploited a recent non-linear extension of the DT family, denoted the kernel
dynamic texture (KDT) [2], to capture non-linear appearance changes that will
occur as structures move into and out of the ultrasound imaging plane.
Consider a US sequence S as an ordered sequence of T video frames, i.e., S = (y_0, . . . , y_{T−1}), where y_t ∈ R^d is the frame observed at time t. Under the DT framework of [5], these observations are modeled as samples of a linear dynamical system (LDS). At time t, a vector of state coefficients x_t ∈ R^T is first sampled from a first-order Gauss-Markov process, and the state coefficients are then linearly combined into the observed video frame y_t, according to

x_t = A x_{t−1} + w_t,   (1)
y_t = C x_t + v_t,   (2)

where A ∈ R^{T×T} is the state-transition matrix and C ∈ R^{d×T} is the generative matrix that governs how the state determines the observation. Further, w_t and v_t denote state and observation noise, with w_t ∼ N(0, I) and v_t ∼ N(0, R), respectively. Assuming that the observations are centered⁴ and following the system identification strategy of [5], C is estimated from the SVD of the data matrix Y = [y_0 · · · y_{T−1}] = U Σ V^⊤ by setting C = U. The state matrix X = [x_0 · · · x_{T−1}] is estimated as X = Σ V^⊤, and A can be computed by least squares as A = [x_1 · · · x_{T−1}][x_0 · · · x_{T−2}]^†, where † denotes the pseudoinverse. When the DT model is restricted to N states, C is restricted to the N left singular vectors corresponding to the N largest singular values; the rest follows accordingly. Due to space limitations, we refer to [5] for details on noise parameter estimation. In the non-linear DT extension of [2], the generative matrix C is replaced by a non-linear observation function C : R^T → R^d, i.e.,

y_t = C(x_t) + v_t,   (3)
while keeping the state evolution linear. The corresponding dynamical system is denoted a kernel dynamic texture (KDT), shown in Fig. 1(b). The non-linearity of C requires a different, although conceptually equivalent, set of parameter estimates. The idea is to use kernel PCA (KPCA) to learn the inverse mapping D : R^d → R^T from observation to state space; the KPCA coefficients then represent the state variables.⁵ We note that KDTs are not necessarily restricted to intensity observation matrices; they work with any kind of feature for which a suitable kernel can be defined, cf. [3].
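To make the identification step above concrete, the following Python sketch estimates the linear-DT parameters (A, C, X) from a data matrix via SVD and least squares. This is our own hedged reconstruction of the procedure of [5] (the authors' code was MATLAB), not their implementation:

```python
import numpy as np

def estimate_dt(Y, n_states):
    """Estimate linear-DT parameters (A, C, X) from a d x T data matrix Y
    whose columns are video frames, via SVD and least squares."""
    Y = Y - Y.mean(axis=1, keepdims=True)          # center the observations
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                            # generative matrix
    X = np.diag(s[:n_states]) @ Vt[:n_states, :]   # state matrix
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])       # state-transition (least squares)
    return A, C, X
```

Restricting to `n_states` columns of U corresponds to keeping the N largest singular values, as described above.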
Additionally, we adopt the distance measure from [2] for measuring the similarity of two video sequences, S_a and S_b; this approach was chosen for its speed. It is based on an adaptation of the Martin distance [6] between the corresponding DTs D_a = (A_a, C_a) and D_b = (A_b, C_b) with N states each.
The (squared) Martin distance, given by [6,4]

d²(S_a, S_b) = − log ∏_{i=1}^{N} cos²(φ_i),   (4)

is based on the subspace angles φ_i between the infinite observability matrices O_a and O_b, defined as [4] O_a := [C_a^⊤ (C_a A_a)^⊤ (C_a A_a²)^⊤ · · ·]^⊤. In fact, the cos(φ_i)
⁴ Centering is straightforward by subtracting the column-wise means of Y.
⁵ See supp. material to [2] for centering in the feature space induced by the kernel.
Fig. 2. Illustration of the noodle phantom, made of gelatine and Soba noodles (left three images) and an abdominal CIRS phantom mounted in a water tank (right).
correspond to the N largest eigenvalues λ_i of the generalized eigenvalue problem

  [ 0     O_ab ] [x]       [ O_aa  0    ] [x]
  [ O_ba  0    ] [y]  = λ  [ 0     O_bb ] [y],   with O_ab = O_a^⊤ O_b,   (5)

subject to x^⊤ O_aa x = 1 and y^⊤ O_bb y = 1. For DTs, computation of O_ab is straightforward, since the terms C_a^⊤ C_b can be evaluated. For KDTs, it can be shown that computation of the (no longer available) products C_a^⊤ C_b boils down to computing the inner products between the principal components of the kernel matrices K^a_{ij} = k(y^a_i, y^a_j) and K^b_{ij} = k(y^b_i, y^b_j), i.e.,

O_ab = ∑_{n=0}^{∞} (A_a^n)^⊤ C_a^⊤ C_b A_b^n  (DTs)   →   O_ab = ∑_{n=0}^{∞} (A_a^n)^⊤ α̃^⊤ G β̃ A_b^n  (KDTs),   (6)

where α̃ = [α̃_0 · · · α̃_{T−1}] and β̃ are the (normalized) KPCA weight matrices with α̃_i = α_i − (1/N)(e^⊤ α_i)e, and G is the kernel matrix with entries G_ij = k(y^a_i, y^b_j). In the remainder of the paper, we use (4)-(6) for measuring similarity between US sequences and a standard RBF kernel for all kernel computations.⁶
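For the linear-DT case, the Martin distance in (4) can be approximated by truncating the infinite observability matrices and computing the principal angles between their column spaces (the normalization constraints in (5) make the λ_i exactly these cosines). The sketch below is a simplified illustration under that truncation assumption; the KDT variant in (6) would replace C_a^⊤ C_b by the kernelized inner products:

```python
import numpy as np

def martin_distance_sq(A_a, C_a, A_b, C_b, n_terms=20):
    """Squared Martin distance between two linear DTs, approximated by
    truncating the infinite observability matrix O = [C; CA; CA^2; ...]
    and computing principal angles between the two column spaces."""
    def observability(A, C):
        blocks, M = [], C
        for _ in range(n_terms):
            blocks.append(M)
            M = M @ A
        return np.vstack(blocks)
    Qa, _ = np.linalg.qr(observability(A_a, C_a))
    Qb, _ = np.linalg.qr(observability(A_b, C_b))
    # singular values of Qa^T Qb are the cosines of the subspace angles
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 1e-12, 1.0)
    return -2.0 * np.sum(np.log(cosines))  # = -log prod_i cos^2(phi_i)
```

Identical models yield a distance of zero, since all subspace angles vanish.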
For localization, we follow a sliding-window strategy, measuring how well a key sequence matches a subsequence of a long path of acquisitions (i.e., the search sequence P_n). That is, given Q frames in a key sequence, we move a sliding window W_i of Q frames along a path in p-frame increments. For each W_i, we estimate the KDT parameters and compute the Martin distance to the KDT of the key sequence. A key sequence is indicated in a search sequence when the distance is minimal. At this stage, these minima are illustrative; as more data and specific applications evolve, statistical likelihood methods will be used.
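The sliding-window search can be summarized as follows. This is a schematic sketch, not the authors' implementation; `fit_model` and `distance` are placeholders for KDT estimation and the Martin distance:

```python
import numpy as np

def localize(search_frames, key_model, Q, p, fit_model, distance):
    """Slide a Q-frame window along the search sequence in p-frame steps,
    fit a model per window, and return (best window start, all distances)."""
    starts = list(range(0, len(search_frames) - Q + 1, p))
    dists = [distance(fit_model(search_frames[s:s + Q]), key_model)
             for s in starts]
    best = starts[int(np.argmin(dists))]   # minimal distance flags the key location
    return best, dists
```

With toy stand-ins (window mean as the "model", absolute difference as the "distance"), the minimum lands on the window matching the key, mirroring the distance plots in Fig. 5.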
3 Experimental Protocol
For the studies in this paper, we use two different kinds of phantoms: 1) a
homemade noodle phantom made of gelatine with embedded Soba noodles and
2) a triple modality 3D abdominal phantom (CIRS Model 057) mounted in a
water tank, see Fig. 2. The noodle phantom is particularly useful, since the
⁶ For KPCA, the kernel width is set as σ² = median_{i,j} ‖y_i − y_j‖²; to compute G_ij, it can be shown [2] that y^a_i and y^b_j need to be scaled by σ_a and σ_b first.
Fig. 3. Snapshots of three key structures at two time points on the noodle phantom.
noodles are self-similar at a small scale, have ambiguous patterns of bends at
medium scales, and at large scales and in US sequences present a rich set of
structures that are difficult to casually distinguish.
For imaging we use the Telemed LogicScan 128 INT-1Z kit. The US frequency is set to 5 MHz. Penetration depth is 90 mm on the noodle phantom and 150 mm on the abdominal phantom. Speckle reduction is enabled in the US acquisition software. All images were acquired freehand, without tracking. We learn N = 5 state KDTs and clip each sequence to a central 300 × 300 (noodle phantom) or 200 × 200 (abdominal phantom) pixel window. Using more states did not lead to any improvements in the presented results.
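The median-heuristic kernel width used for KPCA (see the footnote above) can be sketched as follows. This is an illustrative sketch: we assume the convention k(x, y) = exp(−‖x − y‖²/σ²), which may differ from the exact scaling used in [2]:

```python
import numpy as np

def rbf_kernel_median(Y):
    """RBF kernel matrix for row-wise samples Y, with the kernel width set
    by the median heuristic: sigma^2 = median_{i,j} ||y_i - y_j||^2."""
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = np.median(sq_dists)
    return np.exp(-sq_dists / sigma2), sigma2
```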
3.1 Localization of structures within US sequences
The first experiment tested whether it is possible to localize key structures in
the noodle phantom. Two different sets of acquisitions were made. The first set
was composed of short (40 frames) US key sequences Sn , captured by moving
the US transducer over three different key structures to be localized, see Fig. 3.
Key structures were chosen ad hoc by the probe operator. We then estimated
KDT models for each of the three key structures using this first set of data.
The second set was composed of longer US search sequences Pn , acquired along
multiple paths on the noodle phantom; these simulated searches for the key
structures, see Fig. 4. On both sets, we tried to minimize probe tilt and rotation, but some rotation and tilting was inevitable. Note also that the acquisition direction of the key sequences matched the acquisition direction of the search sequences.
To evaluate sensitivity, we performed the localization using key sequences
applied to search sequences that also covered the corresponding key structures.
Distance plots are shown on the left-hand side of Fig. 5. To evaluate specificity,
we repeated this experiment along multiple search paths that did not cover any
of the key structures. Distance plots are on the right-hand side of Fig. 5.
To evaluate the robustness against shifts of the ultrasound imaging plane
(e.g., partial inclusion of a key structure), we performed ten runs with random
displacements δ_x, δ_y of the clipping window in the x and y directions, with δ_x, δ_y ∈ {−5, . . . , 5} pixels. Fig. 5 shows the Martin distance averaged over all clipping
window positions for each sliding window index along three search paths (left).
The enclosing light-blue hull illustrates the standard deviation.
Based on the above three experiments, we make the following observations: 1) when a key structure is encountered along a search sequence, it appears as a global minimum of the Martin distance; 2) the Martin distance decreases as
Fig. 4. Illustration of (a) the acquisition process on the noodle phantom, (b) sliding windows overlapping key structures (yellow) and (c) probe movements. In (b), hand-annotations t+ and t− bracket where the sliding window overlaps the key structure.
the sliding window approaches a key structure and increases as it leaves the key structure; 3) if a key structure is not encountered along a search path, then there is no distinctive minimum in the distance measurements.
3.2 Localizing a hepatic vessel on an abdominal phantom
Our second experiment is more challenging in the sense that we try to locate a
more subtle structure, namely a specific section of a hepatic vessel in an abdominal phantom. The experimental protocol is similar to the previous experiment;
however, US transducer tilting, also known as angulation [1] (see Fig. 4), is used
instead of translation along a path. We attempt to localize the hepatic vessel key
structure within a single search sequence. The key sequence acquisition spans
≈ 40◦ around the angle where the vessel is visible. The search sequence covers
≈ 140◦ around the vessel. Again, all acquisitions were performed freehand, and
the ultrasound probe was repositioned on the phantom between each acquisition. Fig. 6 shows the Martin distances for localization and for localization using
shifted clipping windows.
This experiment highlights two things. First, we can again localize the key
sequence within the longer search sequence, even though the span of the minimal Martin distances that correspond to the true location of the vessel is less
prominent and less persistent than in the previous experiment. Second, variation
in the distance measurements is much higher for small vessels than for the more
distinct, larger structures from the noodle phantom.
3.3 Localization in the presence of simulated anatomical variation
Our third experiment focused on the insensitivity of the distance metric and the
localization method to anatomic variations. Specifically, we simulate anatomic
variation by (non-linear) spatial distortion of the search sequences. We admit
that this does not cover the range of variation among individuals, but it does
begin to give an impression of robustness. Fig. 7 shows the distance measurements when the key structure is encountered within the search sequence. The
Fig. 5. Martin distance between the key sequence KDT and the sliding-window KDTs for three different paths, averaged over ten random clipping window positions (left); sliding window positions where the key structure is covered to some extent are marked light gray (from manual annotation); distance measurements when trying to locate a key structure that was not covered by a path (right).
distortions are illustrated on a checkerboard pattern. Note that the underwater distortion changes over time.⁷ As shown, the distortions do not have a negative impact on the localization; however, we observe less distinctive minima.
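A distortion of this kind can be sketched as follows. This is a hypothetical re-creation for illustration; the exact distortion filters used for Fig. 7 are not specified in the text:

```python
import numpy as np

def wave_distort(frame, amplitude=3.0, wavelength=40.0, phase=0.0):
    """Sinusoidally displace image columns as a simple stand-in for
    non-linear anatomic variation; varying `phase` over frames mimics a
    time-varying ('underwater') distortion."""
    h, w = frame.shape
    rows, cols = np.indices((h, w))
    src = cols + amplitude * np.sin(2.0 * np.pi * rows / wavelength + phase)
    src = np.clip(np.rint(src), 0, w - 1).astype(int)  # nearest-neighbor lookup
    return frame[rows, src]
```

Applying this per frame of a search sequence (with a per-frame phase for the time-varying case) yields distorted sequences against which the undistorted key-sequence models can be matched.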
3.4 Discussion and future work
We conclude that the combination of the KDT framework and the Martin distance is effective for localizing short US sequences of key structures. Further, computational requirements are modest, even for our unoptimized (MATLAB) code and use of intensity features. For example, given a key sequence of 40 frames, computing the KDT model and the distance metric per sliding window requires ≈ 0.1 seconds on an Intel Core i7 1.7 GHz CPU with 8 GB of RAM. It is also worth noting that we can perfectly recognize the short US key sequences using a simple nearest-neighbor classifier with the Martin distance as the metric. For future work, we note that using intensity information as our observation space has its limitations. Due to the generic nature of the KDT approach and the KPCA-based system identification, we could easily integrate more specifically tailored US features as long as we can define a suitable kernel.
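The nearest-neighbor recognition mentioned above reduces to a minimum-distance lookup over the stored key-sequence models (schematic; `distance` stands in for the Martin distance between fitted KDTs):

```python
def nearest_neighbor_label(query_model, database, distance):
    """database: iterable of (label, model) pairs; return the label of the
    model closest to query_model under the given distance."""
    return min(database, key=lambda item: distance(query_model, item[1]))[0]
```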
⁷ See supp. video at http://vimeo.com/rkwitt.
Fig. 6. Martin distance measurements for three clipping window positions and distance measurements averaged over all random runs (left); illustration of the hepatic vessel appearing and disappearing in the key sequence (right).
Fig. 7. Distortion experiments on the search sequence for two different types of (non-linear) spatial distortion (illustrated on the checkerboard pattern).
Acknowledgements This work was partially funded by NSF (CCF-0830535)
and NIH/NCI (1R01CA138419-0, 2R44CA143234-02A1).
References

1. Block, B.: The Practice of Ultrasound: A Step-by-Step Guide to Abdominal Scanning. Thieme, 1st edn. (2004)
2. Chan, A., Vasconcelos, N.: Classifying video with kernel dynamic textures. In: CVPR, pp. 1–6 (2007)
3. Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: CVPR, pp. 1932–1939 (2009)
4. De Cock, K., De Moor, B.: Subspace angles between linear stochastic models. In: CDC, pp. 1561–1566 (2000)
5. Doretto, G., Chiuso, A., Wu, Y., Soatto, S.: Dynamic textures. Int. J. Comput. Vision 51(2), 91–109 (2003)
6. Martin, R.: A metric for ARMA processes. IEEE Trans. Signal Process. 48(4), 1164–1170 (2000)
7. Sohail, A., Rahman, M., Bhattacharya, P., Krishnamurthy, S., Mudur, S.: Retrieval and classification of ultrasound images of ovarian cysts combining texture features and histogram moments. In: ISBI, pp. 288–291 (2010)
8. Spencer, J.: Utility of portable ultrasound in a community in Ghana. J. Ultrasound Med. 27(12), 1735–1743 (2008)