Evaluation of Texture Features For Content-Based Image Retrieval
Abstract. We have carried out a detailed evaluation of the use of texture features in a query-by-example approach to image retrieval. We used 3 radically different texture feature types motivated by i) statistical, ii) psychological and iii) signal processing points of view. The features were evaluated and tested on retrieval tasks from the Corel and TRECVID2003 image collections. For the latter we also looked at the effects of combining texture features with a colour feature.
1 Introduction
Texture is a key component of human visual perception. As with colour, this makes it an essential feature to consider when querying image databases. Everyone can recognise texture, but it is more difficult to define. Unlike colour, texture occurs over a region rather than at a point. It is normally defined purely by grey levels and as such is orthogonal to colour. Texture has qualities such as periodicity and scale; it can be described in terms of direction, coarseness, contrast and so on [1]. It is this that makes texture a particularly interesting facet of images and results in a plethora of ways of extracting texture features. To enable us to explore a wide range of these methods we chose three very different approaches to computing texture features: the first takes a statistical approach in the form of co-occurrence matrices, the second the psychological view of Tamura's features, and the third signal processing with Gabor wavelets.

Our study is the first to focus an evaluation of texture features on the whole image, and to tailor features for optimum retrieval performance in this context. The majority of original papers devising or evaluating texture features used classification or segmentation tasks to measure performance [2,3,4,5]. Both of these tasks are significantly different from the problems faced in image retrieval, where one looks at generic queries for an entire picture. Real pictures are made up of a patchwork of differing textures rather than the uniform texture images often used in studies, such as the ones taken from Brodatz's photo book [6]. To that effect we suggest encoding texture in terms of joint histograms of low dimensional texture characteristics over the image, in the same way 3D colour histograms are computed; we have called this a Tamura image. Throughout our work we have considered how best to cope with varying image sizes, scales, formats and orientations.
P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 326–334, 2004. © Springer-Verlag Berlin Heidelberg 2004
In the next section we look at the features we have chosen and how they are computed. Sect. 3 then describes the image libraries and similarity measures we used for evaluation. Sect. 4 presents our initial results on a training set and suggests modifications and parameters that we found gave the best retrieval performance. A larger performance comparison is carried out on the TRECVID2003 data set. Finally, Sect. 5 concludes the paper and outlines further work.
2 Texture Features

2.1 Co-occurrence
Statistical features of grey levels were one of the earliest methods used to classify textures. Haralick [7] suggested the use of grey level co-occurrence matrices (GLCM) to extract second order statistics from an image. GLCMs have been used very successfully for texture classification in evaluations [2].
Table 1. Features calculated from the normalised co-occurrence matrix P(i, j). Features: Energy, Entropy, Contrast, Homogeneity; each formula is a sum over the matrix indices.
Haralick defined the GLCM as a matrix of frequencies at which two pixels, separated by a certain vector, occur in the image. The distribution in the matrix will depend on the angular and distance relationship between pixels. Varying the vector used allows different texture characteristics to be captured. Once the GLCM has been created, various features can be computed from it. These have been classified into four groups: visual texture characteristics, statistics, information theory and information measures of correlation [7,3]. We chose the four most commonly used features, listed in Table 1, for our evaluation.

2.2 Tamura
Tamura et al. took the approach of devising texture features that correspond to human visual perception [1]. They defined six textural features (coarseness, contrast, directionality, line-likeness, regularity and roughness) and compared them with psychological measurements for human subjects. The first three attained very successful results and are used in our evaluation, both separately and as joint values. Coarseness has a direct relationship to scale and repetition rates and was seen by Tamura et al. as the most fundamental texture feature. An image will
contain textures at several scales; coarseness aims to identify the largest size at which a texture exists, even where a smaller micro texture exists. Computationally one first takes averages at every point over neighbourhoods whose linear sizes are powers of 2. The average over the neighbourhood of size $2^k \times 2^k$ at the point $(x, y)$ is

$$A_k(x, y) = \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \; \sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} f(i, j) / 2^{2k} .$$
Then at each point one takes differences between pairs of averages corresponding to non-overlapping neighbourhoods on opposite sides of the point in both horizontal and vertical orientations. In the horizontal case this is

$$E_{k,h}(x, y) = |A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y)| .$$

At each point, one then picks the best size, which gives the highest output value, where $k$ maximizes $E$ in either direction. The coarseness measure is then the average of $S_{opt}(x, y) = 2^{k_{opt}}$ over the picture.

Contrast aims to capture the dynamic range of grey levels in an image, together with the polarisation of the distribution of black and white. The first is measured using the standard deviation of grey levels, $\sigma$, and the second using the kurtosis $\alpha_4$. The contrast measure is therefore defined as

$$F_{con} = \sigma / (\alpha_4)^n \quad \text{where} \quad \alpha_4 = \mu_4 / \sigma^4 ,$$

$\mu_4$ is the fourth moment about the mean and $\sigma^2$ is the variance. Experimentally, Tamura found $n = 1/4$ to give the closest agreement to human measurements. This is the value we used in our experiments.

Directionality is a global property over a region. The feature described does not aim to differentiate between different orientations or patterns, but measures the total degree of directionality. Two simple masks are used to detect edges in the image. At each pixel the angle and magnitude are calculated. A histogram, $H_d$, of edge probabilities is then built up by counting all points with magnitude greater than a threshold and quantising by the edge angle. The histogram will reflect the degree of directionality. To extract a measure from $H_d$, the sharpness of the peaks is computed from their second moments.

Tamura Image is a notion where we calculate a value for the three features at each pixel and treat these as a spatial joint coarseness-contrast-directionality (CND) distribution, in the same way as images can be viewed as spatial joint RGB distributions. We extract colour histogram style features from the Tamura CND image, both marginal and 3D histograms. The regional nature of texture meant that the values at each pixel were computed over a window. A similar 3D histogram feature is used by MARS [8].
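To make the coarseness and contrast definitions above concrete, the following is a minimal Python sketch, not the implementation evaluated in this paper. It assumes the image is a list of rows of grey levels indexed as `grey[y][x]`. The function names, the `k_max` and `log_scale` parameters, the rule of keeping the smallest k when several scales tie, and the convention of returning zero contrast for a flat image are all assumptions made for illustration. Setting `log_scale=True` averages k instead of $2^k$, which corresponds to the logarithmic variant discussed in Sect. 4.2.

```python
import math


def tamura_contrast(grey, n=0.25):
    """F_con = sigma / alpha_4**n, with alpha_4 = mu_4 / sigma**4."""
    values = [v for row in grey for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    if var == 0:
        return 0.0  # flat image: zero contrast (convention assumed here)
    mu4 = sum((v - mean) ** 4 for v in values) / len(values)
    alpha4 = mu4 / var ** 2
    return math.sqrt(var) / alpha4 ** n


def _window_mean(grey, x, y, k):
    """A_k(x, y): mean over the 2^k x 2^k neighbourhood of (x, y)."""
    half = 2 ** (k - 1)
    total = 0.0
    for j in range(y - half, y + half):
        for i in range(x - half, x + half):
            total += grey[j][i]
    return total / 2 ** (2 * k)


def tamura_coarseness(grey, k_max=3, log_scale=False):
    """Average of 2**k_opt (or k_opt if log_scale) over all usable points."""
    height, width = len(grey), len(grey[0])
    border = 2 ** k_max  # the largest pair of windows must fit in the image
    total, count = 0.0, 0
    for y in range(border, height - border + 1):
        for x in range(border, width - border + 1):
            best_k, best_e = 1, -1.0
            for k in range(1, k_max + 1):
                half = 2 ** (k - 1)
                e_h = abs(_window_mean(grey, x + half, y, k)
                          - _window_mean(grey, x - half, y, k))
                e_v = abs(_window_mean(grey, x, y + half, k)
                          - _window_mean(grey, x, y - half, k))
                e = max(e_h, e_v)
                if e > best_e:  # strict '>' keeps the smallest k on ties
                    best_k, best_e = k, e
            total += best_k if log_scale else 2 ** best_k
            count += 1
    if count == 0:
        raise ValueError("image too small for this k_max")
    return total / count
```

The border of $2^{k_{max}}$ pixels on each side illustrates why large values of k become unusable on small images or tiles.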
2.3 Gabor
One of the most popular signal processing based approaches for texture feature extraction has been the use of Gabor filters. These enable filtering in the frequency and spatial domain. It has been proposed that Gabor filters can be used to model the responses of the human visual system. Turner [9] first implemented this by using a bank of Gabor filters to analyse texture. A bank of filters at different scales and orientations allows multichannel filtering of an image to extract frequency and orientation information. This can then be used to decompose the image into texture features. Our implementation is based on that of Manjunath et al. [10,11]. The feature is computed by filtering the image with a bank of orientation and scale sensitive filters and computing the mean and standard deviation of the output in the frequency domain. Filtering an image $I(x, y)$ with Gabor filters $g_{mn}$ designed according to [10] results in its Gabor wavelet transform:

$$W_{mn}(x, y) = \int I(x_1, y_1) \, g_{mn}(x - x_1, y - y_1) \, dx_1 \, dy_1$$
The mean and standard deviation of the magnitude $|W_{mn}|$ are used to form the feature vector. The outputs of filters at different scales will be over differing ranges. For this reason each element of the feature vector is normalised using the standard deviation of that element across the entire database.
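The structure of this feature (filter bank, mean and standard deviation of the response magnitude, per-element normalisation over the database) can be sketched in a few lines of Python. This is an illustrative sketch only: the kernel below is a generic complex Gabor function with an isotropic Gaussian envelope, not the filter dictionary designed in [10], and the filtering is done by direct convolution in the spatial domain. The function names, the choice of sigma as half the wavelength, the 3-sigma kernel radius and the use of the population standard deviation are all assumptions.

```python
import cmath
import math
import statistics


def gabor_kernel(wavelength, theta, sigma=None):
    """Complex Gabor kernel as a dict {(u, v): value} plus its radius."""
    sigma = sigma if sigma is not None else 0.5 * wavelength  # assumed width
    radius = int(math.ceil(3 * sigma))
    kernel = {}
    for v in range(-radius, radius + 1):
        for u in range(-radius, radius + 1):
            envelope = math.exp(-(u * u + v * v) / (2 * sigma * sigma))
            along = u * math.cos(theta) + v * math.sin(theta)
            kernel[(u, v)] = envelope * cmath.exp(2j * math.pi * along / wavelength)
    return kernel, radius


def response_magnitudes(image, kernel, radius):
    """|W(x, y)| at every pixel where the kernel fits inside the image."""
    height, width = len(image), len(image[0])
    mags = []
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            w = sum(image[y - v][x - u] * g for (u, v), g in kernel.items())
            mags.append(abs(w))
    return mags


def gabor_features(image, wavelengths, n_orientations):
    """Mean and standard deviation of |W| for each scale and orientation."""
    features = []
    for wavelength in wavelengths:
        for o in range(n_orientations):
            theta = math.pi * o / n_orientations
            kernel, radius = gabor_kernel(wavelength, theta)
            mags = response_magnitudes(image, kernel, radius)
            features += [statistics.fmean(mags), statistics.pstdev(mags)]
    return features


def normalise_by_database_std(vectors):
    """Divide each element by its standard deviation across all vectors."""
    stds = [statistics.pstdev(column) for column in zip(*vectors)]
    return [[v / s if s > 0 else 0.0 for v, s in zip(vec, stds)]
            for vec in vectors]
```

For example, `gabor_features(image, [4, 8], 4)` yields a 16-element vector (2 scales, 4 orientations, 2 statistics each). The image must be larger than the biggest kernel, otherwise no response positions remain and `statistics.fmean` raises an error.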
3 Experimental Set Up
We followed a two-stage approach: initial evaluation and modifications to the features were tested using a carefully selected subset of the Corel image library and the vector space similarity measure. We then ran larger tests on the TRECVID2003 data using the k-nearest neighbour measure (k-nn). We have a baseline for evaluation from previous work with the TREC dataset, for which k-nn has consistently proved the best retrieval method.

Image Collections. We selected 6,192 images from the Corel collection to give 63 categories that were visually similar internally, but different from each other [12]. A set of 630 single-image category queries was executed to test performance across all categories. Relevance judgments on the retrieved images were based on the categorisation. The results shown in Section 4 are the mean average precision (m.a.p.). A second, larger image collection was used to give a more realistic performance comparison. This comprised 32,318 key-frames from the TRECVID2003 collection [13]. The search task specified for TRECVID2003 consisted of 25 topics; for each topic a few example images were given as a query. The published relevance judgments for these topics were used to evaluate the retrieval performance for different features and combinations of features.
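For readers unfamiliar with the measure, mean average precision can be sketched as follows. This is a minimal illustration under the common convention that average precision is the mean of the precision values at the ranks of the relevant items, divided by the total number of relevant items; the exact evaluation procedure used for the experiments is not spelled out here, and the function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank holding a relevant item."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that are never retrieved contribute zero precision.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With a category-based collection such as the Corel subset, `relevant_ids` for a query would be the other images in the query image's category.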
Similarity Measures. Distances between feature vectors were calculated using the Manhattan metric. The resultant distances were then median normalised to give even weighting when combined. The plain vector space model was used for retrieval on the Corel data set as these involved only simple 1-image queries. For querying the TREC data a version of the distance weighted k-nn approach was used [14], with k = 40. Positive examples ($P$) are supplied as the query and negative examples ($N$) are randomly selected from the collection. To rank an image $i$ in the collection we identify those images in $P$ and $N$ that are amongst the k-nearest neighbours of $i$. Using these neighbours we determine the dissimilarity:

$$D(i) = \frac{\sum_{n \in N} d^{-1}(i, n)}{\sum_{p \in P} d^{-1}(i, p)}$$
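A minimal Python sketch of these similarity measures follows. It is an illustration under stated assumptions rather than the code used for the experiments: the k nearest neighbours of the candidate are taken from the labelled examples in $P \cup N$, and a small `eps` guard (not part of the formula above) is added to avoid division by zero when a distance or the positive sum is zero. All function and parameter names are hypothetical.

```python
import statistics


def manhattan(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def median_normalise(distances):
    """Scale distances so that their median is 1, for even weighting."""
    med = statistics.median(distances)
    return [d / med for d in distances] if med > 0 else list(distances)


def knn_dissimilarity(candidate, positives, negatives, k=40, eps=1e-9):
    """D(i): inverse-distance mass of negative over positive neighbours."""
    labelled = [(manhattan(candidate, p), True) for p in positives]
    labelled += [(manhattan(candidate, n), False) for n in negatives]
    nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
    pos = sum(1.0 / (d + eps) for d, is_pos in nearest if is_pos)
    neg = sum(1.0 / (d + eps) for d, is_pos in nearest if not is_pos)
    return neg / (pos + eps)  # eps is an assumed guard, see lead-in
```

Images are then ranked by increasing `knn_dissimilarity`: a candidate close to the positive examples and far from the negative ones receives a small value.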
For each feature we evaluated performance in the configuration described in Sect. 2. Ideas to improve performance were devised and evaluated. The general themes considered were how best to represent an entire image, how to accommodate differing sizes and scales of images, and how to cope with the regional qualities of textures. These evaluations were run on the Corel data. Paired t-tests were carried out to check whether results were statistically significant at $\alpha = 0.05$. The best performing features from the initial evaluation were then tested on the TRECVID2003 data set. Tests were run with each texture feature combined with a high performing colour feature.

4.1 Co-occurrence
The two main variables when creating a GLCM are the number of quantisation levels and the vector. We decided to use four vector angles (0°, 45°, 90° and 135°) and four distances. This could be used to calculate up to sixteen GLCMs. However, as the statistics are not invariant under rotation, we also tried summing the four angles at each distance into a single matrix. GLCMs can be made symmetrical by including the reverse vector; symmetric and asymmetric matrices were tested. The number of quantisation levels dictates the size and density of the matrix. This may become a problem with small images or tiles. The effect of varying quantisation between 4 and 64 levels was tried. Features were calculated for whole and tiled images.

Preliminary results showed that distances between 1 and 4 pixels gave the best performance. There was no significant difference between symmetrical and asymmetric matrices. Tiling of the image gave a large increase in retrieval performance, which flattened out by 9 × 9 tiles. The results in Table 2 are for 7 × 7 tiles. Similarly, increasing quantisation improves performance. The concatenated features (cat) gave better results at all points than the rotationally invariant summed matrices (sum). The best feature was homogeneity with a m.a.p. of 12.2%.
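The variables discussed above (quantisation levels, displacement vector, symmetry, and summing over the four angles) can be illustrated with a short Python sketch. It is not the code used in the evaluation. The feature formulas are the standard textbook forms of energy, entropy, contrast and homogeneity and are assumed here, including the negative sign convention for entropy. The function names, the 8-bit integer grey level range and the pixel offsets chosen for each angle are likewise assumptions.

```python
import math

# Unit offsets (dx, dy) for the four angles, with y increasing downwards.
ANGLE_OFFSETS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}


def quantise(grey, levels, max_value=255):
    """Map integer grey values in 0..max_value onto 0..levels-1."""
    return [[min(levels - 1, v * levels // (max_value + 1)) for v in row]
            for row in grey]


def cooccurrence_counts(q, levels, dx, dy, symmetric=False):
    """Count pixel pairs separated by the vector (dx, dy)."""
    counts = [[0.0] * levels for _ in range(levels)]
    height, width = len(q), len(q[0])
    for y in range(height):
        for x in range(width):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < width and 0 <= y2 < height:
                counts[q[y][x]][q[y2][x2]] += 1
                if symmetric:  # also count the reverse vector
                    counts[q[y2][x2]][q[y][x]] += 1
    return counts


def normalise(counts):
    """Turn counts into the normalised matrix P(i, j)."""
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts] if total else counts


def summed_glcm(q, levels, distance):
    """Sum the four angle matrices at one distance, then normalise."""
    total = [[0.0] * levels for _ in range(levels)]
    for ux, uy in ANGLE_OFFSETS.values():
        c = cooccurrence_counts(q, levels, ux * distance, uy * distance)
        total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, c)]
    return normalise(total)


def glcm_features(p):
    """Energy, entropy, contrast and homogeneity (assumed standard forms)."""
    energy = entropy = contrast = homogeneity = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            energy += pij ** 2
            if pij > 0:
                entropy -= pij * math.log(pij)
            contrast += (i - j) ** 2 * pij
            homogeneity += pij / (1 + abs(i - j))
    return {"energy": energy, "entropy": entropy,
            "contrast": contrast, "homogeneity": homogeneity}
```

A concatenated (cat) feature would compute `glcm_features` separately for each angle and join the results, whereas the summed (sum) variant uses `summed_glcm`; a tiled feature repeats either on each tile of the image.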
Table 2. Co-occurrence features mean average precision retrieval Quantisation 16 32 9.30% 8.85% 10.41% 9.79% 8.35% 7.65% 11.16% 10.39% 9.85% 9.19% 11.09% 10.37% 8.29% 7.59% 11.83% 10.93%
Feature Energy: cat Energy: sum Entropy: cat Entropy: sum Contrast: cat Contrast: sum Homogeneity: cat Homogeneity: sum
4.2 Tamura
When calculating standard Tamura features for whole or tiled images the main variable is the k value for coarseness. The effect of varying this, and the number of tiles, can be seen in Table 3. The dashes in the table are where the image size resulting from tiling meant that the k value was too large to be used because of the border needed.

With the histogram features the main variable to evaluate was the window size. Coarseness can be calculated at a pixel level. However, both the directionality and contrast features operate over a region. A large window would smear the feature and lose resolution; conversely a small window may invalidate the statistical features, particularly if the directionality histogram is too sparsely populated. To evaluate this the features were run over several window sizes, creating a histogram for each feature.

A little surprisingly, initial results showed that increasing the k value for coarseness reduced the performance; the optimum value was 2. This may be due to the large borders necessary for higher values of k. However, it is more likely caused by the nature of textures in images and the way the algorithm averages the $2^k$ values. There are unlikely to be textures with a coarseness of 64 or 32 pixels in a normal image. The algorithm may still detect noise at this dimension, biasing the average value of the feature. A change to the algorithm was made so that it took the values of $k$ rather than $2^k$, effectively introducing a logarithmic scaling of the coarseness and giving less influence to the larger scales. This gave a significant increase in performance for the histogram, from 6.1% to 10.1%, but no improvement when applied to the standard feature.

Performance of the directionality feature was poor. A detailed look at the operation of the algorithm showed that this was largely due to the sparse population of the histogram and the subsequent difficulty in calculating a valid variance of its peaks. Several options for improvement were tried, including calculating the global variance of the histogram and using entropy. The latter gave a substantial improvement, from 6.6% to 9.7%, for the standard feature but had a negligible effect on the histogram.
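The idea of replacing peak sharpness by an entropy measure can be sketched as below. This is an assumption-laden illustration, since the exact entropy measure is not specified in the text: the edge detector uses Prewitt-style 3 × 3 masks, the angle is folded into [0, π), and the number of bins, the magnitude threshold and the function names are hypothetical. A low entropy indicates a strongly directional texture; unlike peak finding, it remains well defined when the histogram is sparsely populated.

```python
import math


def direction_histogram(grey, bins=16, threshold=12.0):
    """Histogram of edge angles for pixels with a strong enough edge."""
    height, width = len(grey), len(grey[0])
    hist = [0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical differences from Prewitt-style masks.
            dh = sum(grey[y + d][x + 1] - grey[y + d][x - 1] for d in (-1, 0, 1))
            dv = sum(grey[y + 1][x + d] - grey[y - 1][x + d] for d in (-1, 0, 1))
            magnitude = (abs(dh) + abs(dv)) / 2
            if magnitude > threshold:
                angle = math.atan2(dv, dh) % math.pi  # fold into [0, pi)
                hist[min(bins - 1, int(angle / math.pi * bins))] += 1
    return hist


def histogram_entropy(hist):
    """Shannon entropy (natural log) of the normalised histogram."""
    total = sum(hist)
    if total == 0:
        return 0.0  # no edges above threshold: convention assumed here
    return -sum(c / total * math.log(c / total) for c in hist if c > 0)
```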
P. Howarth and S. Rüger

Table 3. Tamura features mean average precision retrieval
Standard features Tiling 1x1 3x3 5x5 7x7 9x9 3.24% 2.91% 2.74% 4.42% 3.54% 3.49% 3.25% 2.92% 4.43% 3.91% 3.41% 6.08% 4.16% 5.35% 8.33% 7.57% 7.16% 5.74% 7.96% 7.50% 6.95% 7.20% 5.02% 7.45% 9.48% 8.79% 7.68% 9.32% 8.92% 7.74% 8.07% 5.79% 8.93% 9.87% 9.19% 6.98% 9.57% 9.10% 7.15% 8.03% 6.64% 9.73% 9.91% 9.02% 9.59% 8.94% Histogram features Window size 2 4 8 16 5.96% 5.39% 4.89% 6.90% 6.52% 6.12% 6.44% 5.68% 8.81% 6.71% 5.59% 4.37% 5.99% 5.85% 5.71% 9.98% 10.08% 9.33% 7.01% 5.57% 5.24% 6.09% 5.96% 5.64% 9.83% 9.24% 8.12% 6.92% 4.93% 5.43% 6.01% 5.83% 5.40% 8.22% 7.93% 7.67%
Feature Contrast Directionality: peak finding Directionality: entropy Coarseness-2: 2^k Coarseness-3: 2^k Coarseness-4: 2^k Coarseness-5: 2^k Coarseness-6: 2^k Coarseness-2: k Coarseness-3: k Coarseness-4: k
Finally, the combined marginal and 3D histograms were evaluated using a window size of 8, k of 3 and entropy directionality. In addition, a combined feature vector of the 3 standard features was evaluated. The m.a.p. results were: marginal histogram 12.0%, 3D histogram 13.7% and standard 14.3%. All gave a significant improvement over the single features.

4.3 Gabor
Sect. 2.3 describes the generation of this feature. However, there still remain questions over how to apply it to a heterogeneous set of images. The problems of scale, varying size and so on apply. The evaluation in [10] was applied to fixed tiles extracted from the Brodatz album. In [11] the feature was used successfully with aerial photographs split into a large number of fixed size tiles and then querying to find individual tiles. We decided to evaluate the feature in two configurations across a range of scale and orientation values. The first scaled the filter dictionary to the size of the image. This should scale the response so that the same image at a different size gives a similar value. The second approach was to use a fixed size filter and apply this to a sliding window over the image.

Initial results showed that scaling the filter size gave much superior results to the sliding window approach. Tiling increased performance in a similar manner to the other features. The results shown in Table 4 are for 7 × 7 tiling. The best performance is obtained from just 2 scales and 4 orientations. This was unexpected as most literature recommends 4 scales and 6 orientations. Looking at the filtered images indicated that, as for Tamura, this may be due to noise at coarser scales.

4.4 Evaluation Using TRECVID2003 Video Data
A range of the best performing features were run on the TRECVID2003 data and evaluated using the published relevance judgments. The queries were run singly
Table 4. Gabor wavelets mean average precision retrieval Scale 3 2 3 4 Orientation 4 6
and then combined with a colour histogram feature, HSV [12]. The results are shown in Table 5. For comparison, some features used for previous evaluations [12] gave m.a.p. values of: HSV 1.9%, convolution 2.2% and variance 1.7%; random retrieval would give 0.26%. In this evaluation the texture features performed extremely well in comparison with previous benchmarks. Gabor gave the best results, 3.9% or 15 times better than random retrieval. Of the Tamura features the best performing was the combined standard features. The top 3 performing texture features combined gave a m.a.p. of 4.22%. Combining with the HSV feature improved average retrieval performance in all cases, but at an individual query level the effects were both positive and negative. It is interesting that using a simple combination of features gives varying degrees of improvement; being able to choose the optimum combination based on the query would be beneficial.
Table 5. TREC evaluation mean average precision retrieval

Feature                      Single    Combined with HSV
gabor-2-4                    3.93%     4.31%
co-occurrence homogeneity    2.85%     3.03%
tamura standard all          2.57%     3.43%
tamura CND                   1.65%     2.72%
tamura coarseness-2          0.97%     2.49%
5 Conclusions
We selected 3 different texture features, implemented and evaluated them. Both the evaluation and implementation focussed on query-by-example image retrieval rather than the usual classification task. This led to some novel modifications to the Tamura features. We found that looking for large scale coarseness degraded performance, so we limited the range and used a logarithmic scale. An improvement in directionality performance over small window sizes was achieved by using an entropy measure rather than taking
the second moments of the peaks. We also encoded the features in terms of joint histograms; the overall performance of these was similar to the standard features. To improve the retrieval with Gabor we scaled the filter size to that of the image, rather than using a fixed size filter. Rather unintuitively, we found that fewer scales gave higher retrieval rates. Our tests of co-occurrence matrices showed a solid performance, as expected. Our evaluation with TRECVID2003 data showed that the top 3 texture features performed better than previously used colour features. Combination with a colour feature boosted retrieval performance in all cases. Overall we have demonstrated that we have produced robust texture features for image retrieval. We would like to carry out further evaluations on larger data sets, particularly investigating the interaction of different feature combinations. Finally, texture features have an advantage over colour features in that performance should be the same for monochrome images. It would be interesting to perform an evaluation on a library of black and white pictures.

Acknowledgement. This work was partially supported by the EPSRC, UK.
References
1. Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception. IEEE Trans on Systems, Man and Cybernetics 8 (1978) 460–472
2. Ohanian, P., Dubes, R.: Performance evaluation for four classes of textural features. Pattern Recognition 25 (1992) 819–833
3. Gotlieb, C.C., Kreyszig, H.E.: Texture descriptors based on co-occurrence matrices. Computer Vision, Graphics and Image Processing 51 (1990) 70–86
4. Jain, A.K., Farrokhnia, F.: Unsupervised texture segmentation using Gabor filters. Pattern Recognition 23 (1991) 1167–1186
5. Randen, T., Husøy, J.H.: Filtering for texture classification: A comparative study. IEEE Trans on Pattern Analysis and Machine Intelligence 21 (1999) 291–310
6. Brodatz, P.: Textures: A Photographic Album for Artists & Designers. Dover (1966)
7. Haralick, R.: Statistical and structural approaches to texture. Proceedings of the IEEE 67 (1979) 786–804
8. Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S., Huang, T.S.: Supporting similarity queries in MARS. In: ACM Multimedia. (1997) 403–413
9. Turner, M.: Texture discrimination by Gabor functions. Biological Cybernetics 55 (1986) 71–82
10. Manjunath, B., Ma, W.: Texture features for browsing and retrieval of image data. IEEE Trans on Pattern Analysis and Machine Intelligence 18 (1996) 837–842
11. Manjunath, B., Wu, P., Newsam, S., Shin, H.: A texture descriptor for browsing and similarity retrieval. Journal of Signal Processing: Image Communication 16 (2000) 33–43
12. Pickering, M., Rüger, S.: Evaluation of key-frame based retrieval techniques for video. Computer Vision and Image Understanding 92 (2003) 217–235
13. Smeaton, A., Kraaij, W., Over, P.: TRECVID 2003 – An introduction. In: TRECVID 2003 Workshop. (2003) 1–10
14. Mitchell, T.M.: Machine Learning. McGraw Hill (1997)