
Proceedings of the 2013 ACM Symposium on Document Engineering (DocEng '13), Florence, Italy, September 10-13, 2013.


Evaluating Glyph Binarizations Based on Their Properties

Shira Faigenbaum§, Arie Shaus§, Barak Sober§, Eli Turkel
The Department of Applied Mathematics, Tel Aviv University, Tel Aviv 69978, Israel
+972-3-640-6024
[email protected], [email protected], [email protected], [email protected]

Eli Piasetzky
The Sackler School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel
+972-3-640-9428
[email protected]

§ These authors contributed equally to this work.

ABSTRACT
Document binary images, created by different algorithms, are commonly evaluated based on a pre-existing ground truth. Previous research found several pitfalls in this methodology and suggested various approaches addressing the issue. This article proposes an alternative binarization quality evaluation solution for binarized glyphs, circumventing the ground truth. Our method relies on intrinsic properties of binarized glyphs. The features used for quality assessment are stroke width consistency, presence of small connected components (stains), edge noise, and average edge curvature. Linear and tree-based combinations of these features are also considered. The new methodology is tested and shown to be nearly as sound as human experts' judgments.

Categories and Subject Descriptors
I.7.5 [Document Capture]: Document analysis

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Binarization, glyph, evaluation, quality measure, ground truth.

1. INTRODUCTION
The plethora of available binarization algorithms results in different outputs for the same document image. The ensuing need for comparing binarizations gives rise to the existing ground truth-based (GT) evaluation methodology [1-3]. The evaluation is based on a manual GT creation, and on various GT-versus-binarization measures (e.g., F-measure, PSNR, Distance Reciprocal Distortion, Misclassification Penalty, etc.). Several recent papers [4-6] performed a detailed analysis of this approach, stressing its inherent weaknesses, such as subjectivity and the inherent inconsistency within the GT creation process. Among the alternative solutions suggested are skeleton-based GT variants (maintaining some degree of human intervention) [7-8], automatic GT creation (via another binarization procedure) [9], creation of synthetic document images out of existing GT (applicable if a noise model exists) [10-11], and a goal-directed approach, e.g. assessing OCR results (applicable if an OCR engine is available) [12]. Trier and Taxt [13] proposed a method somewhat reminiscent of the one specified herein, yet it was performed manually, upon visual inspection of binarizations.

This article provides an approach which eliminates the need for GT. The document binarizations are judged automatically, based on the intrinsic properties of their glyphs. Four estimates are introduced: stroke width consistency, proportion of stains, average edge curvature, and proportion of edge noise. In certain scenarios, these may be utilized in their own right. Alternatively, these measures can be combined in order to provide a relative ranking of the binarizations. Producing such a model may involve a train-test procedure, dependent on the task under consideration (human epigraphic analysis, alphabet reconstruction, OCR, etc.).
The purpose of this study is to provide the best available binary image on a glyph scale. The challenging problem of glyph region extraction, along with its related topics of concern such as broken strokes and touching characters, is outside the scope of this article (the papers [14-16] deal with some of these issues).

2. SUGGESTED GLYPH MEASURES

2.1 Measures Definitions
We start by defining independent binarization quality measures, correlating with common human perception. Four measures, pertaining to different aspects of binarized images, are proposed and formalized. We will work on small binarized images, each containing a single glyph. This can be the outcome of any segmentation algorithm, such as [14-16]. The foreground (valued at 0) and the background (valued at 255) will be denoted respectively as $F$ and $B$, with $p = (x, y)$ a pixel coordinate.

2.1.1 Stroke width consistency
The local scale consistency of a character stroke width is closely related to the quality of the binarized character. Indeed, partially erased letters, or the presence of stains, may introduce discontinuities in stroke width. The idea is not simply to measure the width of the stroke at every point, but to assess the smoothness of its change between adjacent pixels. The measure is defined by the following algorithm (though devised independently, our first step is reminiscent of [17], while steps 2 and 3 are original).

Step 1 – Evaluate the stroke width $SW(p)$ for each $p \in F$:
- For each angle $\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$, examine the line segments with inclination $\theta$ passing through $p$ and restricted to $F$. Among these, denote the longest segment as $\mathrm{seg}(p, \theta)$.
- Define $SW(p) = \min_\theta |\mathrm{seg}(p, \theta)|$.

Step 2 – Calculate the stroke width gradient magnitude $G(p)$:
- Calculate the directional derivatives $G_x(p)$ and $G_y(p)$.
- Define the gradient magnitude with respect to the $L_\infty$ norm: $G(p) = \max(|G_x(p)|, |G_y(p)|)$.

Step 3 – Apply the measure:

$M_{SWC} = \mathrm{mean}_{p \in F}\big(G(p)\big)$

Note that given a clean binarization with gradually changing stroke widths, $G(p)$ yields low values, resulting in a small $M_{SWC}$.
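For concreteness, the following is a minimal NumPy sketch of this measure. The run-length realization of Step 1, the forward differences taken only between adjacent foreground pixels, and the function names are our own choices, not prescribed by the algorithm above.

import numpy as np

def stroke_width(F):
    """Step 1: stroke width SW(p) for every foreground pixel. For each of the
    four discrete directions, measure the longest foreground run through p and
    take the minimum over the directions."""
    h, w = F.shape
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]   # 0, 90, 45, 135 degrees
    SW = np.zeros((h, w))
    for y, x in zip(*np.nonzero(F)):
        runs = []
        for dy, dx in directions:
            length = 1                               # the pixel p itself
            for sgn in (1, -1):                      # walk both ways along the line
                cy, cx = y + sgn * dy, x + sgn * dx
                while 0 <= cy < h and 0 <= cx < w and F[cy, cx]:
                    length += 1
                    cy, cx = cy + sgn * dy, cx + sgn * dx
            runs.append(length)
        SW[y, x] = min(runs)
    return SW

def m_swc(F):
    """M_SWC: mean, over the foreground, of the L-infinity magnitude of the
    stroke width derivatives (forward differences between adjacent foreground
    pixels; the treatment of stroke borders is our own simplification)."""
    SW = stroke_width(F)
    h, w = F.shape
    grads = []
    for y, x in zip(*np.nonzero(F)):
        gx = SW[y, x + 1] - SW[y, x] if x + 1 < w and F[y, x + 1] else 0.0
        gy = SW[y + 1, x] - SW[y, x] if y + 1 < h and F[y + 1, x] else 0.0
        grads.append(max(abs(gx), abs(gy)))
    return float(np.mean(grads)) if grads else 0.0

# Usage: F is a boolean mask of a single glyph, True for ink (foreground) pixels,
# e.g. F = (grayscale_glyph < 128).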
2.1.2 Stains proportion
The existence of black spots within a white background, or vice versa, is an indication of either an imperfect binarization or the presence of noise. In what follows, we consider the stains' relative area in pixels, denoted below as $|\cdot|$. While a stain count may be used instead, according to our experiments this measure performs poorly.

The image is partitioned into a set of connected components $CC = \{cc_i\}_{i=1}^{N}$; these belong to either $F$ or $B$. The set of stain CCs is defined as $SCC = \{cc_i \in CC \,:\, |cc_i| \le T\}$. Throughout our experiments, the value of $T$ was set to 0.5% of the glyph image size. The measure definition is:

$M_{SP} = \sum_{cc_j \in SCC} |cc_j| \Big/ \sum_{cc_i \in CC} |cc_i|$

2.1.3 Average edge curvature
The "ideal" letter is expected to possess a smooth edge. This is tightly related to the average edge curvature (herein, we use its absolute value):

$\kappa = \left| \frac{dT}{ds} \right| = \left| \frac{d\varphi}{ds} \right|$

where $T$ is the normalized tangent of the edge curve, $\varphi$ is the tangent angle, and $s$ is the arclength parameter. The computation of the average edge curvature is as follows:

Step 1 – Find the edge via 4-connectivity erosion of $F$:

$E = F \setminus \mathrm{erosion}(F)$

Step 2 – Calculate the local angle: For each pixel $p \in E$, and for each pair of its neighboring pixels $p_1, p_2 \in E$ (assuming 8-connectivity), define the unit vectors $v_k(p) = (p_k - p)/\|p_k - p\|$ for $k = 1, 2$. Next, we find $\beta(p)$, the angle between $v_1(p)$ and $v_2(p)$:

$\beta(p) = \arccos\langle v_1(p), v_2(p) \rangle$

Due to the definition of $\arccos$, $\beta(p) \in [0, \pi]$. The angle $\varphi(p)$, used for the curvature definition, is:

$\varphi(p) = \pi - \beta(p), \quad \varphi(p) \in [0, \pi]$

Step 3 – Approximate the local curvature:

$\kappa(p) \approx \varphi(p)/s(p)$

where $s(p)$ is the local arclength element.

Step 4 – Apply the measure:

$M_{AEC} = \mathrm{mean}_{p \in E}\big(\kappa(p)\big)$

Note that the following also holds:

$M_{AEC} = \pi - \mathrm{mean}_{p \in E}\big(\arccos\langle v_1(p), v_2(p) \rangle\big)$

It should be stated that in certain cases, $p \in E$ might possess more than two neighboring pixels. In such a case, we account for all possible neighboring pairs in Steps 2-4.

2.1.4 Edge noise proportion
Another suggested property is the presence of typical edge noise, which often correlates with the overall quality of the binarization. The paper [18] suggests a procedure involving 12 different convolution kernels, approximating the amount of such noise. Below, we suggest a simplified method, involving 4-connectivity morphological operations.

Step 1 – Find the edge utilizing dilation and erosion of $F$:

$E = \mathrm{dilation}(F) \setminus \mathrm{erosion}(F)$

Step 2 – Calculate a noise estimate (cl = closure, op = opening):

$N = \big( \mathrm{cl}(F) \setminus F \big) \cup \big( F \setminus \mathrm{op}(F) \big) = \mathrm{cl}(F) \setminus \mathrm{op}(F)$

The closure attaches isolated $B$ pixels to $F$, while the opening performs the dual operation. $N$ provides the set of all isolated pixels.

Step 3 – Apply the measure:

$M_{ENP} = |N| \,/\, |E|$

2.1.5 Monochromatic binarizations
In general, undesirable scenarios of an almost completely black or white binarization (e.g. due to illumination conditions) should also be addressed, for all four measures. Accordingly, cases where an insufficient number of either $F$ or $B$ pixels exist were detected and handled in the following fashion. Assuming 4-connectivity, if a double-dilation of $F$ left no $B$ pixels, or if a double-erosion of $F$ left no $F$ pixels, all the measures were set to $\mathrm{Inf} = 32768$.
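To make the remaining measures concrete as well, a minimal sketch of $M_{SP}$, $M_{AEC}$ and $M_{ENP}$ using SciPy's labeling and morphology routines might read as follows. The default cross-shaped structuring element supplies the 4-connectivity, the function names are ours, and the arclength normalization $s(p)$ of Section 2.1.3 is dropped for simplicity (i.e., the mean of $\varphi(p)$ is returned).

import numpy as np
from itertools import combinations
from scipy import ndimage

def m_sp(F, T_frac=0.005):
    """M_SP: total area of small connected components ("stains"), taken over
    both the foreground F and the background B, relative to the total area of
    all components; T is 0.5% of the glyph image size."""
    T = T_frac * F.size
    small, total = 0, 0
    for mask in (F, ~F):
        labels, _ = ndimage.label(mask)              # 4-connected components
        sizes = np.bincount(labels.ravel())[1:]      # component areas (label 0 skipped)
        small += sizes[sizes <= T].sum()
        total += sizes.sum()
    return small / total

def m_aec(F):
    """M_AEC: average turning angle phi = pi - beta over every edge pixel and
    every pair of its 8-connected neighbours on the edge E = F minus erosion(F)."""
    E = F & ~ndimage.binary_erosion(F)
    edge = set(zip(*np.nonzero(E)))
    phis = []
    for y, x in edge:
        nbrs = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy or dx) and (y + dy, x + dx) in edge]
        for p1, p2 in combinations(nbrs, 2):
            v1 = np.array([p1[0] - y, p1[1] - x], float)
            v2 = np.array([p2[0] - y, p2[1] - x], float)
            cos_beta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            phis.append(np.pi - np.arccos(np.clip(cos_beta, -1.0, 1.0)))
    return float(np.mean(phis)) if phis else 0.0

def m_enp(F):
    """M_ENP: isolated pixels N = cl(F) minus op(F), relative to the edge
    E = dilation(F) minus erosion(F)."""
    E = ndimage.binary_dilation(F) & ~ndimage.binary_erosion(F)
    N = ndimage.binary_closing(F) & ~ndimage.binary_opening(F)
    return N.sum() / max(E.sum(), 1)

A monochromatic guard along the lines of Section 2.1.5 can be layered on top of these functions, e.g. returning the Inf sentinel whenever a double dilation or a double erosion leaves no B or F pixels, respectively.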
2.2 Measure Combinations
The measures presented above can be applied in their own right, each assessing a different glyph characteristic. In fact, in certain settings, we have seen some of them (in particular $M_{ENP}$) producing judgments comparable to human appraisals. Conversely, these measures can be combined into a joint score or classifier, depending on the task under consideration. These may vary according to the type of writing in question (printed or handwritten), medium, corpora, noise characteristics, the binarization's end goal (epigraphical research, glyph reconstruction, OCR), etc. Consequently, we do not suggest that the combinations derived below are the ultimate model in all conceivable cases. We do suggest a procedure for deriving models in settings comparable to ours. With certain adjustments, these ideas may also be applicable for training a binarization quality control apparatus for other tasks. The combinations dealt with below are linear and tree models, used due to their simplicity. These models require training and testing phases, based on experts' estimations. Such a procedure is presented in the next section.

3. EXPERIMENTAL SECTION

3.1 Motivation and Data Set
The motivation behind this research was an attempt at ranking binarizations according to their suitability for human and computer-based handwriting analysis. Visually appealing binarizations, faithful to the document images, were preferred. Our database consisted of segmented glyphs, along with their binarizations. We used glyphs originating from two different First Temple Period Hebrew inscriptions: 50 images (glyphs) were taken from Arad #1 [19], while 47 images (glyphs) were obtained from Lachish #3 [20]. The segmentation into individual characters was performed via the algorithm [16]. The state of preservation of these ink-over-clay samples was poor, presenting a challenge for our methodology. The 9 binarizations in use were: Otsu [21], Bernsen [22] with window sizes (in pixels) of w = 50 and w = 200, Niblack [23] with w = 50 and w = 200, Sauvola [24] with w = 50 and w = 200, as well as our own binarization [16] with or without an unspeckle stage.

[Figure 1. Expert's ranking of one glyph, in decreasing quality order: (a) original image, (b) Sauvola w = 200, (c) Shaus et al. [16] incl. unspeckle stage, (d) Shaus et al., (e) Otsu, (f) Niblack w = 200, (g) Niblack w = 50, (h) Sauvola w = 50, (i) Bernsen w = 50, (j) Bernsen w = 200.]

From the 97 original grayscale images, a database of 873 (97 × 9) binary images was constructed. Each set of 9 binarizations, denoted herein as a "binarization block", was judged independently by three different experts. The experts' rankings (from 1 = high to 9 = low) were based on their prior epigraphical knowledge. An example of a single expert's opinion is presented in Fig. 1. Constructing such a data set with manual ranking information for different binarization procedures is a labor-intensive endeavor. This explains the relatively modest size of our database.

3.2 Ranking Prediction
The experiment attempted to create a model matching the three experts' rankings. The model types under consideration were linear and tree-based regressions [25]. These models used the 4 rankings based on the measures $M_{SWC}$, $M_{SP}$, $M_{AEC}$ and $M_{ENP}$. The utilization of rankings, rather than measure values, provides a common scale across different letters. The experiment consisted of model selection and model verification stages. Both necessitate the prerequisites specified in the next sub-section.

3.2.1 Prerequisites
Input data: As stated previously, each binarization block (containing 9 binarizations) for each of the 97 letters had 3 expert rankings. The resulting vectors of length 873, containing the rankings of the binarization blocks in a stacked manner, are denoted as $R_1$, $R_2$, $R_3$ (one for each expert). For training purposes, a combined experts' ranking $R_{experts}$ was derived. First, $R_{mean} = \mathrm{mean}(R_1, R_2, R_3)$ was calculated (coordinate-wise), possibly containing non-integer values. Then, a re-ranking of $R_{mean}$ enforced scores of 1...9 within each binarization block, resulting in $R_{experts}$. Such a process is denoted below as the "re-ranking procedure". In addition, the 4 different measures produced their own rankings for every binarization block, yielding the corresponding vectors $R_{SWC}$, $R_{SP}$, $R_{AEC}$ and $R_{ENP}$.

Model score: A model $m$ is scored in the following fashion. A prediction produced by the model is re-ranked, resulting in $R_m$, which is then compared with the experts' rankings via the standard linear ($\mathrm{cor}$) or Kendall ($\tau$) [26] correlations:

$c_m = \min_{i=1..3} \mathrm{cor}(R_i, R_m), \qquad \tau_m = \min_{i=1..3} \tau(R_i, R_m)$
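These prerequisites translate into a few lines of code. The sketch below assumes the stacked 873-long ranking vectors described above and uses SciPy's rankdata and kendalltau; the tie handling within a block (average ranks) is our own choice, not specified by the procedure.

import numpy as np
from scipy.stats import kendalltau, rankdata

BLOCK = 9                     # binarizations per glyph ("binarization block")

def rerank(scores):
    """Re-ranking procedure: within every block of 9 entries, replace the raw
    values by their ranks 1..9 (ties receive average ranks here)."""
    scores = np.asarray(scores, float)
    out = np.empty_like(scores)
    for start in range(0, len(scores), BLOCK):
        out[start:start + BLOCK] = rankdata(scores[start:start + BLOCK])
    return out

def combined_experts(R1, R2, R3):
    """R_experts: coordinate-wise mean of the three expert rankings, re-ranked."""
    return rerank(np.mean([R1, R2, R3], axis=0))

def model_score(prediction, expert_rankings):
    """c_m and tau_m: worst-case Pearson and Kendall correlations between the
    re-ranked model prediction and the individual experts' rankings."""
    Rm = rerank(prediction)
    c_m = min(np.corrcoef(Ri, Rm)[0, 1] for Ri in expert_rankings)
    tau_m = min(kendalltau(Ri, Rm)[0] for Ri in expert_rankings)
    return c_m, tau_m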
3.2.2 Model selection stage
Model specifications: Both linear and tree-based regression models were considered. The independent variables were $R_{SWC}$, $R_{SP}$, $R_{AEC}$ and $R_{ENP}$, while the dependent variable was $R_{experts}$. The linear regression models differed from each other by the presence or absence of independent variables (15 possible combinations). The tree regression models differed from each other by the presence or absence of independent variables, as well as by their depths (2 configurations were attempted: the default setting of [25], as well as a "forced" tree with 9 leaves). This resulted in a total of 30 tree models under consideration.

Selection procedure: The model corresponding to the highest $c_m$ and $\tau_m$ scores was selected. As will be seen, in this experiment both scores resulted in the same selected model.

Success criteria: Since even human experts differ in their judgments, we do not expect the best model to perform flawlessly, but rather in a "human-like" fashion. Our golden standards are the minimal correlations between pairs of human experts, denoted as $c_{expert}$ and $\tau_{expert}$. Hence, our optimal model is expected to adhere to:

$c_m \ge 0.8 \cdot \min_{1 \le i < j \le 3} \mathrm{cor}(R_i, R_j) = 0.8 \cdot c_{expert}$

$\tau_m \ge 0.8 \cdot \min_{1 \le i < j \le 3} \tau(R_i, R_j) = 0.8 \cdot \tau_{expert}$

Selected model: The selected model, for both the $c_m$ and $\tau_m$ scores, was a tree with 9 leaves, of depth 6. The tree used the rankings of all 4 measures, with the most important one (used for the upper splits) being $R_{ENP}$, with $c_m = 0.678$ and $\tau_m = 0.543$. Since $c_{expert} = 0.768$ and $\tau_{expert} = 0.634$, the criteria were met.

3.2.3 Model verification stage
The selected model type (a tree with 9 leaves and all independent variables) was bootstrapped in order to check its robustness. Each iteration performed a 50-50 test/train separation at the binarization block level (thus, all the binarizations of a single glyph were assigned either to the train or to the test data, avoiding possible bias). Subsequently, a new model was trained and tested. The bootstrap included 1000 iterations, resulting in p-value = 0.05 confidence intervals of [0.582, 0.74] for $c_m$, and [0.454, 0.610] for $\tau_m$. These indicate the robustness of our model.
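The models themselves were fit with the R tree package [25]. Purely as an illustration of the verification protocol, the sketch below substitutes scikit-learn's DecisionTreeRegressor for the 9-leaf tree and, for brevity, scores each split with a plain Pearson correlation against the combined expert ranking rather than the full re-ranked, worst-case-over-experts score used above; it is a stand-in, not the original implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bootstrap_tree(X, y, n_glyphs, n_iter=1000, seed=0):
    """Repeated 50-50 train/test splits at the glyph (binarization block) level;
    a 9-leaf regression tree is trained per split. X holds the four measure
    rankings (columns R_SWC, R_SP, R_AEC, R_ENP), y the combined expert ranking,
    both stacked block by block (9 rows per glyph)."""
    rng = np.random.default_rng(seed)
    glyph_of_row = np.repeat(np.arange(n_glyphs), 9)
    scores = []
    for _ in range(n_iter):
        train_glyphs = set(rng.permutation(n_glyphs)[: n_glyphs // 2])
        train = np.array([g in train_glyphs for g in glyph_of_row])
        tree = DecisionTreeRegressor(max_leaf_nodes=9).fit(X[train], y[train])
        pred = tree.predict(X[~train])
        scores.append(np.corrcoef(pred, y[~train])[0, 1])
    return np.percentile(scores, [2.5, 97.5])    # a 95% interval of the test score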
4. SUMMARY AND FUTURE RESEARCH DIRECTIONS
Following the inherent obstacles in GT-based quality evaluation of binary images, we proposed a solution based on several intrinsic properties of binary glyphs. Four binarization quality measures were introduced: stroke width consistency, proportion of stains, proportion of edge noise, and average edge curvature. In certain scenarios, these may suffice in their own right. Alternatively, a combination of these scores can be trained for specific purposes, such as paleographical analysis, glyph reconstruction or OCR. For our uses, a tree-based model produced adequate and robust results.

Some shortcomings and potential enhancements can be pointed out:
- The results of different binarization algorithms, as well as a comparison with other methodologies, can be elaborated upon.
- The approach is not limited to the glyph level. If an extraction of words, sentences, or text areas is given, the measures remain applicable. However, this might involve issues such as illumination equalization and text size normalization.
- The size of our training/testing set is limited, due to the reasons stated above. A further enlargement of our database is planned in the near future. In particular, testing in different settings (e.g. printed characters) may provide interesting insights related to our methodology. Moreover, if a labeled database is available, an individual combination of measures can be trained for every character, taking their different features into account.
- A potential hazard is an undesired "tailoring" of binarization algorithms to the evaluation methodologies employed (e.g. post-processing via a median filter). Indeed, any quality measure can result in a binarization algorithm trained (in fact, over-fitted) to target the measure.

5. ACKNOWLEDGMENTS
The research was partially funded by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 229418. This study was also supported by a generous donation of Mr. Jacques Chahine, made through the French Friends of Tel Aviv University. Arie Shaus is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.

6. REFERENCES
[1] Gatos, B., Ntirogiannis, K. and Pratikakis, I. 2009. ICDAR 2009 document image binarization contest (DIBCO 2009). In Proc. of ICDAR '09, 1375-1382.
[2] Pratikakis, I., Gatos, B. and Ntirogiannis, K. 2010. H-DIBCO 2010 – Handwritten document image binarization competition. In Proc. of ICFHR '10, 727-732.
[3] Pratikakis, I., Gatos, B. and Ntirogiannis, K. 2011. ICDAR 2011 document image binarization contest (DIBCO 2011). In Proc. of ICDAR '11, 1506-1510.
[4] Barney Smith, E. H. 2010. An analysis of binarization ground truthing. In Proc. of DAS '10, 27-33.
[5] Barney Smith, E. H. and An, C. 2012. Effect of "Ground Truth" on image binarization. In Proc. of DAS '12, 250-254.
[6] Shaus, A., Turkel, E. and Piasetzky, E. 2012. Quality evaluation of facsimiles of Hebrew First Temple period inscriptions. In Proc. of DAS '12, 170-174.
[7] Ntirogiannis, K., Gatos, B. and Pratikakis, I. 2008. An objective evaluation methodology for document image binarization techniques. In Proc. of DAS '08, 217-224.
[8] Ntirogiannis, K., Gatos, B. and Pratikakis, I. 2013. Performance evaluation methodology for historical document image binarization. IEEE Transactions on Image Processing, Vol. 22(2).
[9] Ben Messaoud, I., El Abed, H., Amiri, H. and Märgner, V. 2011. A design of a preprocessing framework for large database of historical documents. In Proc. of HIP '11, 177-183.
[10] Stathis, P., Kavallieratou, E. and Papamarkos, N. 2009. An evaluation technique for binarization algorithms. J. of Universal Computer Science 14, No. 18, 3011-3030.
[11] Paredes, R. and Kavallieratou, E. 2010. ICFHR 2010 contest: Quantitative evaluation of binarization algorithms. In Proc. of ICFHR '10, 733-736.
[12] Trier, Ø. D. and Jain, A. K. 1995. Goal-directed evaluation of binarization methods. IEEE PAMI 17, No. 12, 1191-1201.
[13] Trier, Ø. D. and Taxt, T. 1995. Evaluation of binarization methods for document images. IEEE PAMI 17, No. 3, 312-315.
[14] Breuel, T. M. 2001. Segmentation of handprinted letter strings using a dynamic programming algorithm. In Proc. of DAS 2001, 821-826.
[15] Casey, R. G. and Lecolinet, E. 1996. A survey of methods and strategies in character segmentation. IEEE PAMI 18, No. 7, 690-706.
[16] Shaus, A., Turkel, E. and Piasetzky, E. 2012. Binarization of First Temple Period inscriptions - performance of existing algorithms and a new registration based scheme. In Proc. of ICFHR '12, 641-646.
[17] Epshtein, B., Ofek, E. and Wexler, Y. 2010. Detecting text in natural scenes with stroke width transform. In Proc. of CVPR '10.
[18] McGillivary, C., Hale, C. and Barney Smith, E. H. 2009. Edge noise in document images. In Proc. of AND '09, 17-24.
[19] Aharoni, Y. 1981. Arad Inscriptions. Israel Exploration Society.
[20] Torczyner, H. et al. 1938. Lachish I: The Lachish Letters. London.
[21] Otsu, N. 1979. A threshold selection method from gray-level histograms. IEEE Trans. Systems Man Cybernet. Vol. 9 (1), 62-66.
[22] Bernsen, J. 1986. Dynamic thresholding of grey-level images. In Proc. of ICPR '86, 1251-1255.
[23] Niblack, W. 1986. An Introduction to Digital Image Processing. Prentice-Hall, 115-116.
[24] Sauvola, J. and Pietikainen, M. 2000. Adaptive document image binarization. Pattern Recognition, Vol. 33, 225-236.
[25] Tree model, R version 2.12.2. http://www.r-project.org
[26] Kendall, M. 1938. A new measure of rank correlation. Biometrika 30 (1-2), 81-93.