
An Automated Approach for Interpretation of Statistical Graphics

2014, 2014 Sixth International Conference on Intelligent Human-Machine Systems and Cybernetics

Text plays a vital role in the analysis of quantitative data, because in statistics data are represented through graphical tools such as bar charts, pie charts, line charts, scatter diagrams, and histograms. Statistical graphics are a valuable tool for representing information visually in multimodal documents. However, the communicative goal of a statistical graphic is often not captured by the document's accompanying text, and understanding the information a graphic represents is a difficult task for novice readers. This paper presents an approach that automates chart image classification and information extraction. The study focuses on area charts, an important type of statistical graphic used for probability distributions and hypothesis testing. We first classify area charts into different classes and then design an architecture for chart image classification and information extraction for each class of area chart. The extracted information is presented as natural language summaries using a template-based approach.

Aqsa Mahmood (1), Imran Sarwar Bajwa (2), Kiran Qazi (3)
(1), (3) Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan
(2) School of Computer Science, University of Birmingham, UK

Keywords---Statistical graphics, area chart classification and detection, text detection and extraction, optical character recognition, natural language processing.

I. INTRODUCTION

Graphical representation of information is becoming increasingly valuable because it helps readers comprehend the concepts and ideas being conveyed.
Numerous multimodal documents and web pages present valuable information using statistical graphics such as bar charts, pie charts, line charts, scatter diagrams, and histograms. Text plays a crucial role in the analysis of quantitative data, because in statistics data are presented through different graphical tools. It has been observed that the text accompanying a graphic generally does not explain the image completely, while the key concepts discussed in the text can be fully understood only with the help of the graphic. Graphs carry important data and information. Readers and analysts analyze and interpret graphs on the basis of their own knowledge and try to uncover the message encapsulated in them. Interpreting a graph requires attention to detail, and the probability of misinterpretation rises when the analyst or reader is a novice or is not well aware of the graph's key features and attributes. The rapidly growing field of information technology demands a better strategy for graph understanding and for information retrieval from the graphical tools used in information sources such as documents and web pages, in place of conventional methods of graph interpretation. In a typical manual procedure, the analyst or reader first obtains a graph from some source, then recognizes its category from background knowledge of that graph type, and finally interprets the graph from its attributes according to his or her own perception.

The aim of this study is to identify the key features of area charts, extract textual information automatically, and present the extracted information as natural language summaries. The main contribution of this study is the classification of area charts into different classes on the basis of their input, their attributes, and the output they generate.
We designed an architecture for text extraction and graph classification, and then designed natural language templates for the final interpretation of each class of area chart.

II. RELATED WORK

Numerous efforts have been made on text detection and extraction from images and on graph classification. In this section we review some of this earlier work. The detection and examination of graphical features (such as titles and grid lines) is treated under graph classification, for the extraction of supplementary graph information, as it is an essential element of digital image processing [2][8][9]. Hidden Markov models and neural network models were used for graph type recognition by correlating a syntactical rule library with the main alternative attribute vectors [7]. An approach was proposed for retrieving textual information from documents and real-scene images, which is a challenging task because of low image contrast, complex backgrounds, diverse text styles, etc. [6]. Another approach was proposed for automatic text extraction and binarization in unstable real-world conditions [5]. A C-means method for region segmentation and edge partitioning was used for text detection on signboards. A bottom-up approach based on a graphical smoothing algorithm was used for line segmentation of handwritten text, which suffers from distance variability, skew variability, and component overlap [4]. A framework was proposed for extracting and recognizing gray-scale text in images and videos, based on a machine-learning text verification step followed by a conventional OCR algorithm [3]. Images captured by digital cameras can have limitations such as noise, uneven lighting, and imperfect image orientation. An algorithm embedded in typical OCR software was proposed to handle these limitations and improve the output generated by OCR [1].
Hidden Markov models (HMM) and neural network models (NNM) were introduced for diagram type recognition, and for extracting additional diagram information the main alternative attribute vectors were correlated with a syntactical rule library [2][7][9]. The automated extraction of textual information from images underlines the significance of information retrieval from different statistical graphics.

III. GRAPH CLASSES

An expert image classifier module often uses precisely defined classes based on the attributes of the image. Area graphs or charts are normally of three types:

1. Positively skewed curve
2. Negatively skewed curve
3. Symmetrical curve

Figure 1. Area Chart Types

In this study we have used symmetrical area charts, as these are highly important for probability distributions and hypothesis testing. The curve of an area chart approaches the X-axis asymptotically as we move away from the mean (µ) in either direction. The value of z can be calculated with the formula given below:

$$z = \frac{X - \mu}{\sigma}$$

where X is the given value to be converted into z, µ is the mean of the distribution, and σ is the standard deviation of the distribution.

Symmetrical area charts are further divided into two types based on their acceptance regions.

One-tailed charts: In one-tailed area charts the acceptance region lies on only one side of the distribution.

Figure 2. One-tailed area charts (acceptance region µ ≥ µo, with axis marks −∞, −z, µ = 0, +∞; or acceptance region µ ≤ µo, with axis marks −∞, µ = 0, +z, +∞)

Two-tailed charts: In two-tailed area charts the acceptance region lies on both sides of the distribution, i.e. on the left side and on the right side.

Figure 3. Two-tailed area charts (acceptance region µ = µo, with axis marks −∞, z1, µ = 0, z2, +∞)

Brief descriptions of all classes of symmetrical area charts, with the template designed for each class, are given below.

A. Class I
A two-tailed area chart in which two z-values are given and the acceptance region lies on both sides of the distribution.
Acceptance region = (area between z1 and 0) + (area between 0 and z2)

B. Class II
A two-tailed area chart in which two z-values are given and the acceptance region lies on both outer edges of the distribution.
Acceptance region = {(area between −∞ and 0) − (area between z1 and 0)} + {(area between 0 and +∞) − (area between 0 and z2)}

C. Class III
A one-tailed area chart in which two z-values are given and both lie on the right side of the distribution (0 < z1 < z2).
Acceptance region = (area between 0 and z2) − (area between 0 and z1)

D. Class IV
A one-tailed area chart in which two z-values are given and both lie on the left side of the distribution (z1 < z2 < 0).
Acceptance region = (area between z1 and 0) − (area between z2 and 0)

E. Class V
A one-tailed area chart in which only one z-value is given and the critical value lies on the right side of the distribution. The acceptance region lies below the critical value and also includes the entire first half (−∞ to 0) of the distribution.
Acceptance region = (area between −∞ and 0) + (area between 0 and +z)

F. Class VI
A one-tailed area chart in which only one z-value is given and the critical value lies on the left side of the distribution. The acceptance region lies above the critical value and also includes the entire second half (0 to +∞) of the distribution.
Acceptance region = (area between −z and 0) + (area between 0 and +∞)

G. Class VII
A one-tailed area chart in which the only z-value lies on the right side of the distribution and the acceptance region lies between the mean and the critical value.
Acceptance region = area between 0 and +z

H. Class VIII
A one-tailed area chart in which the only z-value lies on the left side of the distribution and the acceptance region lies between the critical value and the mean.
Acceptance region = area between −z and 0

I. Class IX
A one-tailed area chart in which the z-value lies on the right side of the distribution and the acceptance region lies at the tail end of the distribution, above the critical value.
Acceptance region = (area between 0 and +∞) − (area between 0 and +z)

J. Class X
A one-tailed area chart in which the z-value lies on the left side of the distribution and the acceptance region lies at the tail end of the distribution, below the critical value.
Acceptance region = (area between −∞ and 0) − (area between −z and 0)

IV. PROPOSED ARCHITECTURE

Many multimodal documents contain statistical graphics as a quantitative data representation tool. Statistical graphs are difficult to understand without prior knowledge of their attributes and the relations between those attributes. Our proposed architecture automates graph classification and information retrieval in order to generate natural language summaries that interpret the graph.

Figure 4. Proposed Architecture of NL Interpreter of Area Charts (blocks: Image acquisition; Preprocessing; Graph classification; Knowledge based model; Text detection & extraction; Recognition OCR; Graph detection; Text; Graph & text correlation; Syntactic rule library; NL interpretation of graph)

I. Image acquisition
In this phase source images are captured using different image-capturing techniques for further processing. The input to this phase is a raw image that needs to be processed and manipulated.

II. Preprocessing
This stage removes reflections, noise, bad illumination effects, and low- or high-frequency artifacts from the acquired image. These defects mostly result from wrong or skewed placement of the document, poor focusing, inadequate lighting, and poor-quality image-capturing devices.

III. Graph classification
Images received from the preprocessing stage are classified into different classes by measuring the probability of occurrence on the basis of features, arrangement of symbols, frame outline, shapes, etc.

IV. Graph detection
The goal of the graph detection stage is to make the final decision about the chart type by relating each chart feature and text item extracted from the input image to the features and text of the model class images, with the help of knowledge-based models.

V. Text detection and extraction
In this stage textual data is identified and extracted from the input image received from the preprocessing stage. The stage involves four sub-stages: text detection, text localization, text extraction, and enhancement. Text tracking techniques are used for text localization, and the localized text is then segmented to produce a binary image.

VI. Recognition through OCR
This stage uses optical character recognition technology to identify each character. The segmented text received from the text detection and extraction stage is parsed character by character using separate frames.

VII. Text generation
In this phase the text recognized through OCR is received and finally displayed in text regions without the graphical features of the background. The resulting text is produced in binarized form.

VIII. Image structure understanding
The input to this phase is the identified graph features and the extracted text. This phase correlates the graphical features with the extracted text using a syntactic rule-based approach.

IX. NL interpretation of graph
Here a natural language interpretation of the statistical area chart is generated by combining the template designed for its class with the information extracted from the image.

V. RESULTS AND DISCUSSION

The performance of our proposed architecture was evaluated on different area chart images taken from several multimodal books and web pages. We observed how accurately the area charts were interpreted into natural language by the proposed architecture.
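The class definitions of Section III and the final template step of Section IV can be illustrated with a short sketch. It is not the implementation used in this study: the function names, the signed z-value convention, and the summary wording are hypothetical, and the area under the curve is computed from the closed-form standard normal distribution function rather than looked up in an area table.

```python
import math

INF = math.inf


def z_score(x, mu, sigma):
    """Convert a given value x into z, using z = (x - mu) / sigma."""
    return (x - mu) / sigma


def phi(z):
    """Area under the standard normal curve from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area(a, b):
    """Area under the standard normal curve between a and b (a <= b)."""
    return phi(b) - phi(a)


# One formula per chart class, following the Section III definitions.
# Assumption: z-values are signed, so values left of the mean are negative.
CLASS_FORMULAS = {
    "I": lambda z1, z2: area(z1, 0) + area(0, z2),
    "II": lambda z1, z2: (area(-INF, 0) - area(z1, 0))
    + (area(0, INF) - area(0, z2)),
    "III": lambda z1, z2: area(0, z2) - area(0, z1),
    "IV": lambda z1, z2: area(z1, 0) - area(z2, 0),
    "V": lambda z, _: area(-INF, 0) + area(0, z),
    "VI": lambda z, _: area(z, 0) + area(0, INF),
    "VII": lambda z, _: area(0, z),
    "VIII": lambda z, _: area(z, 0),
    "IX": lambda z, _: area(0, INF) - area(0, z),
    "X": lambda z, _: area(-INF, 0) - area(z, 0),
}

# Hypothetical summary template; the templates of the study are not reproduced.
SUMMARY_TEMPLATE = (
    "Class {cls} area chart with critical value(s) {zs}: "
    "the acceptance region covers {region:.4f} of the area under the curve."
)


def acceptance_region(chart_class, z1, z2=None):
    """Evaluate the acceptance-region formula of the given chart class."""
    return CLASS_FORMULAS[chart_class](z1, z2)


def summarize(chart_class, z1, z2=None):
    """Fill the summary template with the extracted z-values."""
    zs = ", ".join(f"{z:+.2f}" for z in (z1, z2) if z is not None)
    region = acceptance_region(chart_class, z1, z2)
    return SUMMARY_TEMPLATE.format(cls=chart_class, zs=zs, region=region)


if __name__ == "__main__":
    print(summarize("I", -1.96, 1.96))
    print(summarize("IX", 1.645))
```

For a Class I chart with critical values −1.96 and +1.96 the computed acceptance region is about 0.95, and for the same values a Class II chart gives the complementary 0.05. Keeping one formula per class in a lookup table mirrors the per-class templates, so supporting a further class only requires a new table entry.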
Different case studies were used to assess the performance of our NL interpreter under this assessment methodology.

Figure 5. Input Area Chart

Figure 6. Graph Feature extraction (extracted features: bell shaped; rejection region; 1st z-value; acceptance region; 2nd z-value; mean value)

In the given area chart, two z-values are given: z1 = −1.96 lies in the first half and z2 = +1.96 lies in the second half. They are obtained using the formula $z = \frac{X - \mu}{\sigma}$, where X is the given value, µ is the mean, and σ is the standard deviation of the distribution. Looking up z1 and z2 in the area table gives the tail areas A1 = 0.025 and A2 = 0.025, so the acceptance region = 1 − (A1 + A2) = 0.95. The null hypothesis is accepted if the computed value of z lies between z1 = −1.96 and z2 = +1.96; otherwise it is rejected and the alternative hypothesis is accepted.

Figure 7. Final interpretation of input area chart.

The output of the area chart NL interpreter was evaluated for each class. According to our assessment methodology, of 52 sample elements, 42 were correct, 7 were incorrect, and 3 were missing from the output of the NL interpreter.

VI. CONCLUSION AND LIMITATIONS

To address the primary objective of this research, we designed and used an architecture for the interpretation of area charts used in hypothesis testing. The presented NLP-based tool for NL interpretation of area charts is able not only to detect and extract textual information from area charts but also, at the same time, to perform chart image classification and identification. Natural language summaries of the input area charts are generated using a template-based approach. Different case studies were evaluated and interpreted successfully, but the approach has some limitations: it deals only with symmetrical area charts and considers them only for hypothesis testing.

VII. FUTURE WORK

Looking toward future versions of the area chart NL interpreter, some aspects remain to be explored:

• Area charts can be used for probability distributions, as they play a vital role in representing statistical information.
• Our tool can be extended to the interpretation of skewed (i.e. positive and negative) area charts.
• A specialized OCR system could be developed for handling images with a large amount of textual information, as the performance of the Tesseract OCR system that we used degrades in such situations.

VIII. REFERENCE

[1] Bieniecki, W., Grabowski, S., & Rozenberg, W. (2007). Image preprocessing for improving OCR accuracy. In Perspective Technologies and Methods in MEMS Design, 2007. MEMSTECH 2007. International Conference on (pp. 75-80). IEEE.
[2] Zhou, Y. P., & Tan, C. L. (2001). Learning-based scientific chart recognition. In 4th IAPR International Workshop on Graphics Recognition, GREC (pp. 482-492).
[3] Chen, D., Odobez, J. M., & Bourlard, H. (2004). Text detection and recognition in images and video frames. Pattern Recognition, 37(3), 595-608.
[4] Sarkar, D., & Ghosh, R. (2009). A Bottom-Up Approach of Line Segmentation from Handwritten Text.
[5] Park, J., Dinh, T. N., & Lee, G. (2008). Binarization of text region based on fuzzy clustering and histogram distribution in signboards. Proceedings of World Academy of Science, Engineering and Technology, 33, 85-90.
[6] Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: a survey. Pattern Recognition, 37(5), 977-997.
[7] Zhou, Y. P., & Tan, C. L. (2000). Hough technique for bar charts detection and recognition in document images. In Image Processing, 2000. Proceedings. 2000 International Conference on (Vol. 2, pp. 605-608). IEEE.
[8] Prasad, V. S. N., Siddiquie, B., Golbeck, J., & Davis, L. (2007, June). Classifying computer generated charts. In Content-Based Multimedia Indexing, 2007. CBMI'07. International Workshop on (pp. 85-92). IEEE.
[9] Huang, W., Zong, S., & Tan, C. L. (2007). Chart image classification using multiple-instance learning. In Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on (pp. 27-27). IEEE.
[10] Leon, M., Vilaplana, V., Gasull, A., & Marques, F. (2009). Caption text extraction for indexing purposes using a hierarchical region-based image model. In Image Processing (ICIP), 2009 16th IEEE International Conference on (pp. 1869-1872). IEEE.