Associating Text and Graphics for Scientific Chart Understanding
Weihua Huang, Chew Lim Tan and Wee Kheng Leow
School of Computing, National University of Singapore
{huangwh, tancl, leowwk}@comp.nus.edu.sg
Abstract
This paper presents our recent work on associating
the recognition results of textual and graphical
information contained in scientific chart images. Text
components are first located in the input image and
then recognized using OCR. In parallel, graphical
objects are segmented and assembled into high-level
symbols. Both logical and semantic correspondences
between text and graphical symbols are identified. The
association of text and graphics allows us to capture
the semantic meaning carried by scientific chart
images more completely. The result of scientific chart
image understanding is presented as XML documents.
1. Introduction
Research activities in the document image analysis
field can be broadly classified into two categories: text
processing, which deals with the text components of a
document image, and graphics processing, which deals
with the line and symbol components that make up
diagrams, maps, engineering drawings, etc.
Traditionally, research in these two categories has
proceeded independently. OCR systems, a typical text
processing application, recognize the text in document
images without touching the graphics on the same
page. Conversely, most graphics recognition systems
concentrate on graphics segmentation, symbol
construction, classification and so on, without making
use of the textual information contained in the images.
We believe that to achieve full document
understanding, textual and graphical information
should be combined, both to capture the semantic
meaning of the document image more completely and
to enhance the performance of the recognition system.
Indeed, Tombre and Lamiroy have already pointed out
that associating textual information with the graphics is
an important step in the semantic labeling of graphics
[1].
The work reported here does not aim to provide a
general solution to all document image understanding
problems. It is an attempt to combine textual
information with graphical information at both the
logical and the semantic level. We choose to focus on
the diagrams that frequently appear in scientific papers
and web pages. The main reason for choosing this kind
of document is that the context of the graphical
information is relatively easy to model and the role of
the text components can be identified in a fairly
standard way. As a starting point, we deal with
commonly used scientific charts such as bar charts and
pie charts. The results of our work can be applied to
content-based web image retrieval, automatic
conversion of raster images into structured documents,
and similar tasks.
The remainder of this paper is organized as follows:
Section 2 surveys previous work related to our study.
Section 3 introduces the detailed design and
implementation issues, focusing on textual information
retrieval and the association of text and graphics.
Section 4 presents experimental results. Section 5
concludes the paper.
2. Related work
In the past, although there has been work dealing
with various entities in graphics, including text, very
little of it actually attempted to extract textual
information and associate it with the graphics. Kasturi
et al. developed a system to interpret various
components in a line drawing, including text strings
[2]; however, the text components were not
recognized. Joseph and Pridmore presented an
experimental system for interpreting mechanical
engineering drawings [3]. Like Kasturi's, their system
had no provision for text recognition. Lamiroy et al.
conducted experiments to analyze the role of text
components in cutaway diagrams [4, 5]. The results
reported were limited to identifying the relationship
between drawings, their indices and the legends.
Moreover, in their work the graphics layer was
completely discarded, so the association of text and
graphics remained preliminary.
In recent years, work on chart recognition has been
reported continuously. Futrelle et al. presented a
diagram understanding system based on graphics
constraint grammars to recognize x-y data graphs and
gene diagrams [6], under the major assumption of
proper graphics segmentation, which is difficult to
achieve. Yokokura and Watanabe proposed a schema-based
framework that graphically describes the layout
relationships in bar charts [7] based on vertical and
horizontal projections. The bar chart styles that can be
recognized are restricted due to the simplicity of the
method. Zhou and Tan applied Hough-based
techniques to bar chart detection and segmentation [8].
Later they also proposed a learning-based chart
recognition paradigm using Hidden Markov Models
[9]. In both cases the main focus was on low-level
features of the input image, without touching the
semantic meaning of the chart. In none of the works
mentioned above was textual information retrieved, so
it was never combined with the graphical information.
3. Design and implementation issues
The system proposed here handles both textual and
graphical information. There are four main modules:
text/graphics separation module, text recognition
module, graphics recognition module and text/graphics
association module. Figure 1 shows the flow of control
in the system. The basic scheme here is to recognize
the text and graphics in the input image separately, and
then combine the two kinds of information to achieve
full understanding of the input image. Text/graphics
association is performed in the chart understanding
module. The combined recognition result is captured
using XML description for future interpretation. The
details of the graphics recognition module can be
found in our previous paper [10]. The outcome of the
module includes major chart components and
information about the chart type. In this paper, we will
focus on the design and implementation of the text
retrieval module and the text/graphics association
module.
3.1. Retrieving textual information
[Figure 1. Overview of the proposed system: after text/graphics separation, the text image goes through text block identification and OCR, yielding strings/numbers with x-y coordinates, while the graphics image goes through graphics segmentation, symbol construction and chart model matching, yielding the chart components; the two streams are then combined for associating text with chart components, chart data extraction and XML generation, producing the XML descriptions.]
Text/graphics separation is done through connected
component analysis. A series of filters is applied to all
the components in the image to classify them as text or
graphics. The classified text and graphics are stored as
two separate images, which are then processed by
different modules of the system.
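The paper does not detail the individual filters, so the following Python sketch is only a stand-in: it classifies a connected component as text using simple size, aspect-ratio and fill-density heuristics, with thresholds that are our own illustrative choices.

def is_text_component(width, height, pixel_count,
                      max_dim=40, max_aspect=8.0, min_fill=0.1):
    # Hypothetical heuristics: text components tend to be small, not overly
    # elongated, and reasonably dense. Long thin components (lines) and very
    # large regions are treated as graphics. All thresholds are illustrative.
    if max(width, height) > max_dim:
        return False
    if max(width, height) / max(1, min(width, height)) > max_aspect:
        return False
    return pixel_count / float(width * height) >= min_fill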
Text groups are formed through the calculation of a
"gravity" value G based on Newton's formula [11]:

G = C_1 * C_2 / r^2
where C_1 and C_2 are the sizes of the two components
and r is the distance between their centers. If G is
greater than a threshold value, the two components
belong to the same text group. The advantage of this
method is that text groups in different orientations can
all be handled efficiently.
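A minimal sketch of this grouping rule, assuming each component arrives as a (center_x, center_y, size) tuple from the connected component analysis; the function names and the threshold value are illustrative rather than taken from the paper.

from itertools import combinations

def gravity(c1, c2):
    # G = C_1 * C_2 / r^2, with sizes as pixel counts and r the distance
    # between the component centers.
    (x1, y1, s1), (x2, y2, s2) = c1, c2
    r_squared = (x1 - x2) ** 2 + (y1 - y2) ** 2
    return float("inf") if r_squared == 0 else s1 * s2 / r_squared

def group_components(components, threshold=50.0):
    # Union-find grouping: any pair whose gravity exceeds the threshold is
    # merged into the same text group, regardless of orientation.
    parent = list(range(len(components)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(components)), 2):
        if gravity(components[i], components[j]) > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(components[i])
    return list(groups.values())

Because gravity depends only on sizes and center distance, the rule is orientation-free, which is what allows vertical and slanted text groups to be formed as easily as horizontal ones.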
OCR is then applied to each text group to recognize
its content. The result of text recognition is further
classified as a string or a number (integer or floating
point). The value of a number is obtained by parsing.
Each piece of text information is recorded as:
string/number + x-y coordinates.
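As an illustration of this classification step, the sketch below parses one OCR result into a typed record; the stripping of thousands separators is our assumption, not something specified in the paper.

def classify_ocr_result(text):
    # Try integer first, then floating point; anything unparsable remains
    # a string. E.g. "1930" -> ("number", 1930), "Time of day" -> string.
    stripped = text.strip().replace(",", "")  # assumed separator handling
    try:
        return ("number", int(stripped))
    except ValueError:
        pass
    try:
        return ("number", float(stripped))
    except ValueError:
        return ("string", text.strip())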
During textual information retrieval, there are two
main sources of error: errors in text/graphics
separation and errors in character recognition.
Text/graphics separation errors come from text
touching graphics, which connected component
analysis fails to handle. Techniques for separating
touching text and graphics have been proposed and can
be applied to reduce these errors [12]. The OCR error
rate depends heavily on the quality of the document.
Since the OCR implementation is not the main focus of
our project, we choose to correct these errors
manually. To guarantee the performance of the whole
system, the errors here should be minimized.
Figure 2 gives an example of textual information
retrieval.
Figure 2. Obtaining the textual information: (a) the original image; (b) the text extracted and grouped; (c) the text locations and OCR results, listed below:
[9, 22] to [36, 264], content "Rate of foraging"
[48, 30] to [64, 51], content "8"
[100, 44] to [359, 70], content "Egg caring females"
[48, 75] to [64, 95], content "6"
[48, 117] to [65, 137], content "4"
……
[533, 225] to [561, 247], content "15"
[623, 226] to [647, 246], content "17"
[721, 226] to [770, 247], content "1930"
[283, 255] to [504, 292], content "Time of day"
3.2. Text-graphics association on the logical level
There are two levels of text-graphics association:
the logical level and the semantic level. On the logical
level, we are interested in the question "what is the
role of a text block in the chart?" In the case of
scientific charts, the logical association between text
and graphics is obtained by examining the spatial
relationship between the text blocks and the chart
components.
In a chart image, text plays a number of logical
roles. We summarize them into six main categories:
• Caption, including the title of the chart and sometimes additional descriptions.
• Axis_title, the name of an axis.
• Axis_label, defining the scope/range of an axis as well as the data type (string or number).
• Legend, distinguishing multiple data series. There is a small graphical object in front of the text in a legend.
• Data_value. Sometimes the value of the data is shown directly inside or near the data components. A Data_value is required to be a number.
• Others, any other supplementary description of the chart content.
If we treat these logical roles as a set of tags, then
finding the logical association becomes a task of
tagging the text blocks. Let us denote the tagging
process as φ(C), where C is the input chart image. We
define P(R, A, T) as a joint probability distribution
associated with each chart type, where each joint
probability P(r_i, t_i, a_k) gives the probability that
the i-th text block, of type t_i, has logical role r_i in
the k-th area a_k of C. The type t_i is either string or
number. The available set of logical roles R depends
on the type of the chart (which is already determined
by the graphics recognition module). To obtain
P(R, A, T), a set of training images for each chart type
is used to estimate the individual probability values. In
the understanding phase, given A and T, our task is to
find the tagging of the text that satisfies:

φ(C) = arg max_R P(R, A, T)
Now the remaining question is how to divide the
chart image into a set of areas A. First, the area inside
the data components is separated from the area outside
them. For a chart type with x-y axes, the area outside
the data components is further divided based on the
plot area. We then have:
1) Area above the plot area;
2) Area below the plot area;
3) Area on the left hand side of the plot area;
4) Area on the right hand side of the plot area;
5) Area within the plot area.
For chart types without x-y axes, such as a pie
chart, the division is done similarly, but based on the
position of the data components themselves instead of
the plot area.
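As an illustration of this division, the sketch below assigns a text block's center to one of the five areas of a chart with x-y axes; the area names, the rectangle convention (y growing downward, as in the pixel coordinates of Figure 2) and the test order are our own choices.

def area_of(cx, cy, plot):
    # plot = (left, top, right, bottom) is the plot-area rectangle;
    # (cx, cy) is the center of the text block.
    left, top, right, bottom = plot
    if left <= cx <= right and top <= cy <= bottom:
        return "within_plot"
    if cy < top:
        return "above_plot"
    if cy > bottom:
        return "below_plot"
    return "left_of_plot" if cx < left else "right_of_plot"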
On the logical level, graphical information provides
evidence about the chart type that further determines
the available text roles. It also affects the detailed area
division for text block tagging. Without graphical
information, the logical role of a text block can hardly
be determined.
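A minimal sketch of the tagging step, under the simplifying assumption that the joint distribution factorizes over individual text blocks so each block can be tagged independently; the probability table and the area labels (matching the sketch above) are illustrative.

ROLES = ["Caption", "Axis_title", "Axis_label", "Legend", "Data_value", "Others"]

def tag_text_blocks(blocks, prob):
    # blocks: list of (block_type, area) pairs, e.g. ("number", "left_of_plot").
    # prob: dict mapping (role, block_type, area) -> probability estimated
    # from the training images for the detected chart type.
    return [max(ROLES, key=lambda role: prob.get((role, btype, area), 0.0))
            for btype, area in blocks]

Under such a table, a number found to the left of the plot area would typically receive the Axis_label tag, while a string above the plot area would tend toward Caption.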
3.3. Text-graphics association on the semantic level
The next step is to achieve semantic association
between text and graphics, which is more challenging.
The goal here is to extract the absolute data values
from the chart components obtained by the graphics
recognition module together with the text blocks
whose logical roles have been determined. The steps
involved are:
(a) Estimate the absolute data value. Without loss of
generality, assume the x axis represents the index
of a data component and the y axis determines its
value. Let l_m and l_n be two neighboring labels
along the y axis; define step_value as |value of l_m
- value of l_n| and step_dist as Distance(center of
l_m, center of l_n). The absolute value of data
component c_i is then calculated as Distance(A_ci,
y-axis) * step_value / step_dist, where A_ci is the
attribute of c_i that correlates with the data value,
such as the height of a bar in a bar chart.
(b) If a data_value is associated with the data
component, there are two cases:
i. If the data_value agrees with the estimated
data value obtained in step (a), the data_value
is picked.
ii. If the difference between the data_value and
the estimated data value is too large, the
data_value is treated as a false value (caused
by an error on the logical level) and the
estimated data value is picked.
Note that step (a) above only works for charts with
x-y axes. At the moment, the pie chart is the only
supported chart type without x-y axes. In a pie chart
the original data values are relative, so step (a) is not
necessary. The detection of the chart type is done in
the graphics recognition module [10].
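The sketch below illustrates steps (a) and (b), assuming the axis labels arrive as (numeric value, center coordinate) pairs from the logical level; the function names and the 5% agreement tolerance are assumptions made for the illustration.

def estimate_value(attr_extent, label_m, label_n):
    # Step (a): attr_extent is Distance(A_ci, axis), e.g. a bar's height in
    # pixels; label_m and label_n are two neighboring axis labels given as
    # (value, center_coordinate) pairs.
    step_value = abs(label_m[0] - label_n[0])
    step_dist = abs(label_m[1] - label_n[1])
    return attr_extent * step_value / step_dist

def resolve_value(estimated, data_value, tolerance=0.05):
    # Step (b): prefer the printed data_value when it agrees with the
    # estimate; otherwise treat it as a false value caused by an error on
    # the logical level and fall back to the estimate.
    if data_value is not None and \
       abs(data_value - estimated) <= tolerance * max(abs(estimated), 1.0):
        return data_value
    return estimated

For the chart of Figure 2, for instance, the neighboring y-axis labels "4" and "6" give step_value = 2 over their pixel separation, and each bar's extent is scaled by that ratio.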
The association between text and graphics on the
semantic level allows us to extract absolute data
values, which is not achievable without textual
information.
3.4. Generating XML description
Based on the text recognized and the data values
extracted, we can generate an XML description of the
chart image. The document <chart> contains the
following parts:
• <caption>, the title and description of the chart.
• <x_axis> and <y_axis>. The existence of x-y axes depends on the type of the chart. If they exist, <x_axis_title> and <y_axis_title> give the titles of the axes and <labels> contains a set of <label> elements that define the scope/range of each axis.
• <data_set>, the data values extracted from the chart image. Each data entry has an <index>, which is automatically generated, and a <value>.
• <legend>, <mark> and <description>. Generally speaking, a chart with a single data series does not need legends. <mark> is an attribute, such as a small sample of color or texture, that represents a unique data series. <description> contains the text string that describes one legend.
To transform an XML description file into
something that can be viewed in a browser, an XML
style sheet is needed. Our current choice is to render
the XML descriptions as an HTML table. Due to space
limitations, the XML style sheet is not discussed in
detail here. Figure 3 shows the XML description of the
chart image in Figure 2(a).

<?xml version="1.0"?><!--chart_recognized.xml-->
<?xml-stylesheet type="text/xsl" href="chartXML.xsl"?>
<!DOCTYPE chart [
<!ELEMENT chart ( caption, x_axis, y_axis, data_set )>
<!ELEMENT caption ( #PCDATA )>
<!ELEMENT x_axis ( x_axis_title, labels )>
<!ELEMENT y_axis ( y_axis_title, labels )>
<!ELEMENT labels ( label+ )>
<!ELEMENT label ( #PCDATA )>
<!ELEMENT x_axis_title ( #PCDATA )>
<!ELEMENT y_axis_title ( #PCDATA )>
<!ELEMENT data_set ( data+ )>
<!ELEMENT data ( index, value )>
<!ELEMENT index ( #PCDATA )>
<!ELEMENT value ( #PCDATA )>
]>
<chart>
<x_axis><x_axis_title>Time of day</x_axis_title>
<labels><label>5</label>
……
<label>1930</label></labels></x_axis>
<y_axis><y_axis_title>Rate of foraging</y_axis_title>
<labels><label>0</label>
……
<label>8</label></labels></y_axis>
<data_set>
<data><index>1</index><value>0.819</value></data>
<data><index>2</index><value>3.273</value></data>
……
<data><index>7</index><value>0.864</value></data>
</data_set></chart>

Figure 3. Example of an XML description
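As a minimal sketch of this generation step, the following code builds such a description with Python's standard xml.etree.ElementTree; the function signature is our own, and the DOCTYPE and style-sheet headers of Figure 3 are omitted for brevity.

import xml.etree.ElementTree as ET

def chart_to_xml(caption, x_title, x_labels, y_title, y_labels, values):
    # Assemble the <chart> document described above from already-extracted
    # fields; data indices are generated automatically, as in the paper.
    chart = ET.Element("chart")
    ET.SubElement(chart, "caption").text = caption
    for name, title, labels in (("x_axis", x_title, x_labels),
                                ("y_axis", y_title, y_labels)):
        axis = ET.SubElement(chart, name)
        ET.SubElement(axis, name + "_title").text = title
        labels_elem = ET.SubElement(axis, "labels")
        for label in labels:
            ET.SubElement(labels_elem, "label").text = str(label)
    data_set = ET.SubElement(chart, "data_set")
    for i, v in enumerate(values, start=1):
        data = ET.SubElement(data_set, "data")
        ET.SubElement(data, "index").text = str(i)
        ET.SubElement(data, "value").text = "%.3f" % v
    return ET.tostring(chart, encoding="unicode")

Feeding in the recognized axis titles, labels and extracted values from Figure 2 reproduces the body of the description shown in Figure 3.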
4. Experimental results
For testing purposes, we collected 53 chart images
that were either produced by a scanner or downloaded
from the Internet. As mentioned above, OCR errors
arising during text recognition were corrected
manually. We count the number of text blocks
correctly identified in each category; the results are
presented in Table 1. For graphics recognition, we
count the number of symbols recognized by the system
and calculate the recall and precision, shown in Table
2. For chart understanding, it is hard to evaluate the
data values extracted by the system, mainly due to the
lack of ground truth for the test data (except for charts
with known data values).
Table 1. Result of textual information retrieval

Category      Total number   Extracted   Accuracy (%)
Caption             32           28          87.50
Axis_title          39           30          76.92
Axis_label         641          616          96.10
Legend              46           30          65.22
Data_value          37           30          81.08
Others              25           14          56.00
Total              820          748          91.22
Table 2. Result of graphics recognition

No. of components in the chart images    296
No. of symbols recognized                263
No. of symbols correct                   245
Recall (%)                             82.77
Precision (%)                          93.15
5. Conclusion
This paper presents our work on associating textual
and graphical information for the understanding of
scientific chart images. Textual components are
extracted from the input image and recognized through
OCR. Graphics are segmented, and high-level chart
components are obtained through a graphics
recognition module. Understanding of the chart image
is achieved by associating the textual and graphical
information on both the logical and the semantic level.
The overall recognition result is presented as an XML
document. In the future, more effort should be put into
minimizing the errors that occur in the individual
modules of the system and into extending the system
to handle more complex diagrams.
Acknowledgement: This research is supported in part
by A*STAR under grant R252-000-206-305 and NUS
URC under grant R252-000-202-112.
6. References
[1] K. Tombre and B. Lamiroy, "Graphics recognition - from re-engineering to retrieval", Proc. 7th ICDAR, Edinburgh, UK, pp. 148-155, Aug. 2003.
[2] R. Kasturi, S. T. Bow, W. El-Masri, Y. Shah, J. R. Gattiker and U. B. Mokate, "A System for Interpretation of Line Drawings", IEEE Trans. PAMI, vol. 12, no. 10, pp. 978-992, Oct. 1990.
[3] S. H. Joseph and T. P. Pridmore, "Knowledge-Directed Interpretation of Mechanical Engineering Drawings", IEEE Trans. PAMI, vol. 14, no. 9, pp. 928-940, Sept. 1992.
[4] B. Lamiroy, L. Najman, R. Ehrard, C. Louis, F. Quelain, N. Rouyer and N. Zeghache, "Scan-to-XML for Vector Graphics: an Experimental Setup for Intelligent Browsable Document Generation", Proc. 4th IAPR International Workshop on Graphics Recognition, Kingston, Ontario, Canada, pp. 312-325, Sept. 2001.
[5] E. Valveny and B. Lamiroy, "Scan-to-XML: Automatic Generation of Browsable Technical Documents", Proc. 16th ICPR, Quebec, Canada, pp. 188-191, Aug. 2002.
[6] R. P. Futrelle et al., "Understanding diagrams in technical documents", IEEE Computer, vol. 25, no. 7, pp. 75-78, 1992.
[7] N. Yokokura and T. Watanabe, "Layout-based approach for extracting constructive elements of bar-charts", Graphics Recognition: Algorithms and Systems, GREC'97, pp. 163-174.
[8] Y. P. Zhou and C. L. Tan, "Hough-based Model for Recognizing Bar Charts in Document Images", SPIE Conference on Document Recognition and Retrieval, 2001.
[9] Y. P. Zhou and C. L. Tan, "Learning-based scientific chart recognition", 4th IAPR International Workshop on Graphics Recognition, GREC2001, pp. 482-492, 2001.
[10] W. H. Huang, C. L. Tan and W. K. Leow, "Model based chart image recognition", 5th IAPR International Workshop on Graphics Recognition, GREC2003, Barcelona, Spain, 30-31 July 2003.
[11] C. L. Tan, B. Yuan and C. H. Ang, "Agent-based text extraction from pyramid images", Int. Conf. on Advances in Pattern Recognition, Plymouth, UK, pp. 344-352, 1998.
[12] K. Tombre, S. Tabbone, L. Pélissier, B. Lamiroy and P. Dosch, "Text/Graphics Separation Revisited", 5th International Workshop on Document Analysis Systems, DAS 2002, pp. 200-211, 2002.