An Automated Approach for Interpretation of Statistical Graphics
Aqsa Mahmood 1, Imran Sarwar Bajwa 2, Kiran Qazi 3
1 Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan, [email protected]
2 School of Computer Science, University of Birmingham, UK, [email protected]
3 Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur, Punjab, Pakistan, [email protected]
Abstract---Text plays a vital role in the analysis of quantitative data, as in statistics data is represented through different graphical tools such as bar charts, pie charts, line charts, scatter diagrams and histograms. Statistical graphics are a valuable tool for visual information representation in multimodal documents. It is often observed that the communicative goal of a statistical graphic is not captured by the document's accompanying text, and perceiving the information represented by statistical graphics is a demanding job for novice readers. An approach to automate the process of chart image classification and information extraction is presented in this paper. This study focuses on area charts, an important type of statistical graphic used in probability distribution and in the process of testing hypotheses. Firstly, we classify area charts into different classes; we then design an architecture for chart image classification and information extraction from each class of area chart. The extracted information is represented in the form of natural language summaries using a template based approach.
Keywords---Statistical graphics, area chart classification and detection, text detection and extraction, optical character recognition, natural language processing.
I. INTRODUCTION
Graphical representation of information is becoming increasingly valuable, as it helps the reader comprehend the conveyed concepts and ideas. Numerous multimodal documents and web pages present valuable information using different statistical graphics such as bar charts, pie charts, line charts, scatter diagrams and histograms. Text plays a crucial role in the analysis of quantitative data, as in statistics data is presented through different graphical tools. It is generally observed that the text accompanying a graphic does not explain the image completely; conversely, the key concepts discussed in the text can often be understood completely only with the help of the graphics. Graphs carry important data and information with them. Readers or analysts, on the basis of their own knowledge, analyze and interpret graphs and try to explore the message encapsulated in them. Interpretation of a graph requires attention to detail, and the probability of misinterpretation rises when the analyst or reader is a novice or is not well aware of the key features and attributes of the graph. The rapidly growing field of information technology demands a better and improved strategy for graph understanding and information retrieval from the graphical tools used in sources of information such as documents and web pages, instead of conventional methods of graph interpretation. A typical graph classification and detection procedure first acquires graphs from a source to be analyzed for supplementary information; the image analyst or reader then recognizes the category of the graph using background knowledge of that type of graph, and finally interprets the graph on the basis of its attributes according to his or her own perception. The aim of this study is to identify the key features of area charts, extract textual information automatically, and represent the extracted information in the form of natural language based summaries. The main contribution of this study is the classification of area charts into different classes on the basis of their input, attributes, and the output generated by them. We designed an architecture for text extraction and graph classification, and we then designed natural language based templates for the final interpretation of each class of area chart.
II. RELATED WORK
Numerous efforts have been made on text detection and extraction from images and on graph classification. In this section we review some previous efforts.
Detection and examination of graphical features (such as titles, grid lines, etc.) is dealt with in studies of graph classification for the retrieval of supplementary graph information, as it is an indispensable element of digital image processing [2][8][9]. Hidden Markov models and neural network models were used for graph type recognition by correlating a syntactic rule library with main alternative attribute vectors [7]. An approach was proposed for textual information retrieval from documents and real-scene images, which is a challenging task due to low image contrast, complex backgrounds, diversified text styles, etc. [6]. Another approach was proposed for automated text extraction and binarization under varying real-world conditions [5]; a c-means method for region fragmentation and edge partition was used for text detection from signboards. A bottom-up approach based on a graphical smoothing algorithm was used for line segmentation of handwritten content, which suffers from distance variability, skew variability and component overlapping issues [4]. A framework was proposed for extraction and recognition of text in gray-scale images and videos, based on a machine learning text verification step followed by a conventional OCR algorithm [3]. Images captured by digital cameras can have different limitations such as noise, uneven lighting and imperfect image orientation; an algorithm embedded in typical OCR software was proposed to handle these limitations and to improve the output generated by OCR [1]. Hidden Markov models (HMM) and neural network models (NNM) were introduced for diagram type recognition, and for additional diagram information extraction main alternative attribute vectors were correlated with a syntactic rule library [2][7][9]. Automated extraction of textual information from images underlines the significance of information retrieval from different statistical graphics.
III. GRAPH CLASSES
An expert classifier module for images often uses precisely defined classes based on the attributes of the image. Normally, area graphs or charts are of three types:
1. Positively skewed curve
2. Negatively skewed curve
3. Symmetrical curve

Figure 1. Area Chart Types

In this study we have used symmetrical area charts, as these are highly important in probability distribution and in testing of hypotheses. The area chart approaches the X-axis asymptotically as we move away from the mean (µ) in either direction. The value of z can be calculated with the help of the formula given below:

z = (X − µ) / σ

where X is the given value to be converted into z, µ is the mean of the distribution, and σ is the standard deviation of the distribution.

Symmetrical area charts are further divided into two types based on the acceptance region.

One-tailed charts: In one-tailed area charts the acceptance region lies on only one side of the distribution. Acceptance region: µ ≥ µo (critical value −z to the left of the mean) or µ ≤ µo (critical value +z to the right of the mean).

Figure 2. One-tailed area charts

Two-tailed charts: In two-tailed area charts the acceptance region lies on both ends of the distribution, i.e. on the left side and on the right side. Acceptance region: µ = µo, with critical values z1 and z2.

Figure 3. Two-tailed area charts

A brief description of each class of symmetrical area chart and the designed template for each class are given below.

A. Class I
It is a two-tailed area chart in which two z-values are given and the acceptance region lies on both sides of the distribution, between the two critical values.
Acceptance region = (area between z1 and 0) + (area between 0 and z2)

B. Class II
It is a two-tailed area chart in which two z-values are given and the acceptance region lies on both outer edges of the distribution.
Acceptance region = {(area between −∞ and 0) − (area between z1 and 0)} + {(area between 0 and +∞) − (area between 0 and z2)}

C. Class III
It is a one-tailed area chart in which two z-values are given, both lying on the right side of the distribution.
Acceptance region = (area between 0 and z2) − (area between 0 and z1)

D. Class IV
It is a one-tailed area chart in which two z-values are given, both lying on the left side of the distribution.
Acceptance region = (area between 0 and z2) − (area between 0 and z1)
E. Class V
It is a one-tailed area chart in which only one z-value is given and the critical value lies on the right side of the distribution. The acceptance region lies below the critical value and also includes the entire first half (−∞ to 0) of the distribution.
Acceptance region = (area between −∞ and 0) + (area between 0 and +z)

F. Class VI
It is a one-tailed area chart in which only one z-value is given and the critical value lies on the left side of the distribution. The acceptance region lies above the critical value and also includes the entire second half (0 to +∞) of the distribution.
Acceptance region = (area between −z and 0) + (area between 0 and +∞)

G. Class VII
It is a one-tailed area chart in which the only z-value lies on the right side of the distribution and the acceptance region lies between the mean and the critical value.
Acceptance region = area between 0 and +z

H. Class VIII
It is a one-tailed area chart in which the only z-value lies on the left side of the distribution and the acceptance region lies between the critical value and the mean.
Acceptance region = area between −z and 0

I. Class IX
It is a one-tailed area chart in which the z-value lies on the right side of the distribution and the acceptance region lies at the tail end of the distribution, above the critical value.
Acceptance region = (area between 0 and +∞) − (area between 0 and +z)

J. Class X
It is a one-tailed area chart in which the z-value lies on the left side of the distribution and the acceptance region lies at the tail end of the distribution, below the critical value.
Acceptance region = (area between −∞ and 0) − (area between −z and 0)
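To make the class definitions above concrete, the following is a minimal sketch (ours, not part of the paper's system) that computes the acceptance-region area for a few representative classes using the standard normal CDF in place of the area table; the function names and the class subset covered are illustrative assumptions.

# Illustrative only: acceptance-region areas for some of the classes above,
# computed from the standard normal CDF.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: area under the curve from -infinity to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def acceptance_area(chart_class, z1, z2=None):
    """Acceptance-region area for a subset of the classes defined above."""
    if chart_class == "I":    # two-tailed, acceptance between z1 (< 0) and z2 (> 0)
        return (phi(0.0) - phi(z1)) + (phi(z2) - phi(0.0))
    if chart_class == "II":   # two-tailed, acceptance on both outer tails
        return phi(z1) + (1.0 - phi(z2))
    if chart_class == "V":    # one-tailed, acceptance from -infinity up to +z
        return phi(z1)
    if chart_class == "IX":   # one-tailed, acceptance in the right tail above +z
        return 1.0 - phi(z1)
    raise ValueError("class not covered in this sketch")

# A typical two-tailed test at the 5% significance level:
print(round(acceptance_area("I", -1.96, 1.96), 3))   # -> 0.95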
IV. PROPOSED ARCHITECTURE
Many multimodal documents contain statistical graphics as a quantitative data representation tool. Statistical graphs are difficult to understand without prior knowledge of their attributes and of the relations between these attributes. Our proposed architecture automates the process of graph classification and information retrieval in order to generate natural language based summaries.
(Figure 4 shows the pipeline: image acquisition, preprocessing, graph classification, graph detection supported by a knowledge based model, text detection and extraction, recognition through OCR, graph and text correlation using a syntactic rule library, and NL interpretation of the graph.)
Figure 4. Proposed Architecture of NL Interpreter of Area Charts
I. Image acquisition
In this phase source images are captured using different image capturing techniques for further processing. The product of this phase is a raw image that needs to be processed and manipulated.
II. Preprocessing
This stage involve the steps of removing reflection,
noise, bad illumination effects, low or high frequency
conditions that are mostly occur due to the wrong or skewed
placement of document, poor focusing techniques,
inadequate lighting conditions and poor quality of image
capturing devices from the acquired image.
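As a rough illustration of such a preprocessing stage, the sketch below (our assumption, using OpenCV and NumPy rather than any toolchain named in the paper) denoises, binarizes and deskews an acquired chart image; the deskew angle convention may vary across OpenCV versions.

import cv2
import numpy as np

def preprocess(image_path):
    """Denoise, binarize (Otsu) and deskew a scanned chart image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.fastNlMeansDenoising(gray, h=10)               # suppress sensor noise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Estimate skew from the minimum-area rectangle around the ink pixels.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary, rotation, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)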
III. Graph classification
Image received from preprocessing stage are classified
by measuring probability of occurrence into different classes
on the basis of features, arrangement of symbols, frame
outline and shapes etcetera.
IV. Graph detection
The goal of graph detection stage is to make final
decision about the chart type by identifying the relation
between each extracted chart feature and text of input image
with the feature and text of the model class images with the
help of knowledge based models.
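A simplified view of this matching step is sketched below; the attribute names and the model-class entries are hypothetical stand-ins for the knowledge based models described above, chosen only to show how extracted features can be compared against stored class descriptions.

# Hypothetical knowledge base: each model class is described by a small
# attribute vector (number of z-values, number of tails, acceptance position).
MODEL_CLASSES = {
    "I":   {"z_values": 2, "tails": 2, "acceptance": "centre"},
    "II":  {"z_values": 2, "tails": 2, "acceptance": "outer"},
    "VII": {"z_values": 1, "tails": 1, "acceptance": "mean_to_critical"},
    "IX":  {"z_values": 1, "tails": 1, "acceptance": "right_tail"},
}

def detect_class(extracted_features):
    """Pick the model class whose attributes agree most with the extracted ones."""
    def matches(model):
        return sum(extracted_features.get(key) == value for key, value in model.items())
    return max(MODEL_CLASSES, key=lambda name: matches(MODEL_CLASSES[name]))

print(detect_class({"z_values": 2, "tails": 2, "acceptance": "centre"}))   # -> "I"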
V. Text detection and extraction
It is the stage where the textual data and information is
identified and extracted from the input image received from
preprocessing stage. This stage involves four sub stages i.e.
text detection, text localization, text extraction and
enhancement. Text tracking techniques is used for text
localization then localized text is segmented for making it
binary image.
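One common way to localize such text regions is a morphological grouping of ink into word-level blobs; the sketch below (an assumption on our part, using OpenCV, not necessarily the paper's exact method) returns candidate bounding boxes from a binarized chart image.

import cv2

def locate_text_regions(binary_image):
    """Return bounding boxes (x, y, w, h) of candidate text regions."""
    # A wide rectangular kernel merges neighbouring characters into word blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    joined = cv2.dilate(binary_image, kernel, iterations=1)
    contours, _ = cv2.findContours(joined, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Keep short, reasonably wide components that look like labels rather than the curve.
    return [(x, y, w, h) for (x, y, w, h) in boxes if h < 60 and w > 10]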
VI. Recognition through OCR
This stage involves optical character recognition
technology to identify each character. In this phase
segmented and recognized text from the image is received
from text detection and extraction stage and each character
is parsed individually by using separate frames.
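Since the future-work section mentions Tesseract, a minimal recognition call could look like the sketch below, assuming the pytesseract wrapper and Pillow are available; cropping to a localized region before recognition generally improves accuracy. The function name and the choice of page-segmentation mode are our own.

import pytesseract
from PIL import Image

def recognise_region(image_path, box):
    """Crop one localized text region and run Tesseract OCR on it."""
    x, y, w, h = box
    region = Image.open(image_path).crop((x, y, x + w, y + h))
    # PSM 7 treats the crop as a single text line (e.g. an axis label or a z-value).
    return pytesseract.image_to_string(region, config="--psm 7").strip()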
VII. Text generation
In this phase text recognized through OCR technology is
received and finally displayed in text regions without
showing graphical features of background and resultant text
is produced in binarized form.
VIII. Image structure understanding
The input of this phase is the identified graph features
and extracted text. This phase correlates the graphics
features and extracted text by using syntactic rule based
approach.
IX. NL interpretation of graph
Here elucidation of statistical area charts based on
natural language is generated by combining the template
premeditated for that class with the information extracted
from the image.
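As an illustration of the template based generation, the sketch below fills a Class I template with extracted z-values; the wording loosely follows the interpretation shown later in Figure 7, while the function and variable names are our own illustrative choices.

from math import erf, sqrt

CLASS_I_TEMPLATE = (
    "In the given area chart, two z-values are given, i.e. z1 = {z1} in the first half "
    "and z2 = {z2} in the second half. By checking z1 and z2 in the area table, areas "
    "A1 = {a1} and A2 = {a2} are obtained, so the acceptance region = {acc}. The null "
    "hypothesis is accepted if the computed value of z lies between {z1} and {z2}; "
    "otherwise it is rejected and the alternative hypothesis is accepted."
)

def interpret_class_i(z1, z2):
    """Fill the Class I template with tail areas looked up from the normal CDF."""
    phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))     # standard normal CDF
    a1, a2 = phi(z1), 1.0 - phi(z2)                      # left and right tail areas
    acc = 1.0 - (a1 + a2)                                # area between z1 and z2
    return CLASS_I_TEMPLATE.format(z1=z1, z2=z2, a1=round(a1, 3),
                                   a2=round(a2, 3), acc=round(acc, 2))

print(interpret_class_i(-1.96, 1.96))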
V. RESULTS AND DISCUSSION
The performance of the proposed architecture was evaluated by taking different area chart images from several multimodal books and web pages, and we observed how accurately these area charts are interpreted into NL by the proposed architecture. Different case studies were used to assess the performance of the presented NL interpreter in the assessment methodology.
Figure 5. Input area chart

Figure 6. Graph feature extraction (labelled features: bell shape, rejection region, acceptance region, first and second z-values, mean value)

In the given area chart, two z-values are given, i.e. z1 = −1.96 lies in the first half and z2 = +1.96 lies in the second half; these are obtained using the formula z = (X − µ)/σ, where µ is the mean and σ is the standard deviation of the distribution. By checking the values of z1 and z2 in the area table, areas A1 = 0.025 and A2 = 0.025 are obtained, so the acceptance region = 0.95. The null hypothesis is accepted if the computed value of z lies between z1 = −1.96 and z2 = +1.96; otherwise it is rejected and the alternative hypothesis is accepted.

Figure 7. Final interpretation of the input area chart

The output of the area chart NL interpreter was evaluated for each class. According to our assessment methodology there were 52 model elements, of which 42 were approved, 7 were faulty and 3 were omitted in the output of the NL interpreter for area charts.

VI. CONCLUSION AND LIMITATIONS

To address the primary objective of this research study, we have designed and used an architecture for the interpretation of area charts used for hypothesis testing. The offered tool for NL interpretation of area charts, based on NLP, is not only able to detect and extract the textual information from area charts but also performs chart image classification and identification at the same time. Natural language based summaries of the input area charts are generated using a template based approach. Different case studies were evaluated and interpreted successfully, but the approach has some limitations: it only deals with symmetrical area charts and considers them only in the context of hypothesis testing.

VII. FUTURE WORK

Looking towards the future of the area chart NL interpreter, some aspects still need to be explored. These aspects are listed below.
• Area charts can be used for probability distribution, as it plays a vital role in statistical information representation.
• Our tool can be extended towards the interpretation of skewed (positively and negatively) area charts.
• A specialized OCR system should be developed for handling images with a large amount of textual information, as the Tesseract OCR system that we have used scales down its performance in such situations.

VIII. REFERENCES
[1] Bieniecki, W., Grabowski, S., & Rozenberg, W. (2007). Image preprocessing for improving OCR accuracy. In Perspective Technologies and Methods in MEMS Design, 2007. MEMSTECH 2007. International Conference on (pp. 75-80). IEEE.
[2] Zhou, Y. P., & Tan, C. L. (2001). Learning-based scientific chart recognition. In 4th IAPR International Workshop on Graphics Recognition, GREC (pp. 482-492).
[3] Chen, D., Odobez, J. M., & Bourlard, H. (2004).
Text detection and recognition in images and video
frames. Pattern Recognition, 37(3), 595-608.
[4] Sarkar, D., & Ghosh, R. (2009). A Bottom-Up
Approach of Line Segmentation from Handwritten Text.
[5] Park, J., Dinh, T. N., & Lee, G. (2008). Binarization of text region based on fuzzy clustering and histogram distribution in signboards. Proceedings of World Academy of Science, Engineering and Technology, 33, 85-90.
[6] Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: a survey. Pattern Recognition, 37(5), 977-997.
[7] Zhou, Y. P., & Tan, C. L. (2000). Hough technique
for bar charts detection and recognition in document images.
In Image Processing, 2000. Proceedings. 2000 International
Conference on (Vol. 2, pp. 605-608). IEEE.
[8] Prasad, V. S. N., Siddiquie, B., Golbeck, J., &
Davis, L. (2007, June). Classifying computer generated
charts. In Content-Based Multimedia Indexing, 2007.
CBMI'07. International Workshop on (pp. 85-92). IEEE.
[9] Huang, W., Zong, S., & Tan, C. L. (2007). Chart
image classification using multiple-instance learning.
In Applications of Computer Vision, 2007. WACV'07. IEEE
Workshop on (pp. 27-27). IEEE.
[10] Leon, M., Vilaplana, V., Gasull, A., & Marques, F. (2009). Caption text extraction for indexing purposes using a hierarchical region-based image model. In Image Processing (ICIP), 2009 16th IEEE International Conference on (pp. 1869-1872). IEEE.