Knowledge Representation For Multilingual Text Categorization



Loukachevitch Natalia V.

Institute of the USA and Canada Studies


Khlebny per. 2/3
Moscow, Russia, 121814
[email protected]

Abstract

The described approach to text categorization is based on a thematic representation of a text. The thematic representation consists of nodes of thematically related terms that simulate the topics of the text, and each node is assigned a class reflecting its importance for the text. The thematic representation is built from a detailed description of the domain. It makes it possible to process different types of texts, to use different systems of categories (in various languages) for text categorization, to adapt the system quickly to other formats and types of texts and/or other systems of categories, and to categorize texts using several systems of categories simultaneously. Most of the algorithm is not language-dependent.

Introduction

Text categorization is an important task in networks. A great deal of information arrives every day from multilingual sources and has to be divided thematically to satisfy the needs of various users.

Today there are two primary approaches to text categorization: the knowledge engineering approach and the machine learning approach.

A variety of machine learning approaches have been tested in text categorization (Goldberg 1996; Lewis & Ringuette 1992). They make it possible to construct text categorizers automatically by means of inductive learning, using texts pre-categorized by humans as examples. The highest known performance of these systems is close to 74% recall and precision (Goldberg 1996).

The knowledge engineering approach obtains better results. The performance of CONSTRUE (Hayes 1992) is evaluated as around 90%. Riloff and Lehnert (1994) report a high-precision approach reaching 100% precision with over 60% recall. The higher performance rests on the manual creation of knowledge bases, rules and dictionaries describing the domains (Goodman 1991; Vledutz-Stokolov 1987). This requires a considerable amount of human labor and development time. Once a system has been created, changes in the types and formats of texts, modification of categories, or substitution of a whole system of categories result in significant additional labor and time costs. Several tools have been constructed to reduce the problem, such as instrumental tools (Hayes 1992) or the automatic dictionary generator AUTOSLOG (Riloff 1993).

In our approach we describe knowledge about a very broad domain as a model of the world, without fixing any system of pre-defined categories. The knowledge base is represented as the Thesaurus. The Thesaurus was specially created as a tool for automatic processing of texts in the broad domain of the sociopolitical life of Russia and is now being developed as a bilingual Russian-English Thesaurus. Various systems of categories (in Russian or English) can be flexibly attached to Thesaurus units.

Our technique of text categorization is based on constructing a thematic representation of a text. This includes recognition of terms, incorporation of thematically related terms into thematic nodes, and determination of the importance of the topics represented by thematic nodes in the text. Once term recognition has been carried out, the technology is not language-dependent.

This technology makes it possible to process different types of documents (Russian or English), such as official documents or news reports by information agencies, to use different systems of categories for text categorization, and to adapt the system quickly to new types of documents and/or systems of categories.

1. Thesaurus

Creators of conventional thesauri (LIV 1994; UNBIS Thesaurus 1976; Subject Headings 1991) rely on the domain, commonsense, and grammatical knowledge of indexers. Thesauri created for manual indexing are therefore hard to use in an automatic indexing environment (Salton 1989): important terms of texts are not found, less important terms are revealed, and some terms are identified incorrectly because of their ambiguity. We created our Thesaurus as a tool for automatic indexing (AI-Thesaurus): the Thesaurus on Contemporary Life in Russia. The Thesaurus contains a wide scope of terms from general to very specific ones, has means for representing ambiguous terms, and comprises a developed system of relations between terms.

The Thesaurus has been created in semi-automatic
mode using automatic processing of more than 70 Mb of Russian official texts (Lukashevich 1995). This procedure consists of two main stages. At the first stage the system automatically processes new texts and reveals new term-like language expressions. Such expressions are identified on the basis of their syntactic and lexical structure. A special dictionary containing more than 30 thousand words directs this process. Lexical control prevents expressions such as large volume, new approach, new way, or better results from being treated as terms.

At the second stage our specialists manually choose terms from the gathered term-like expressions.

This everyday procedure constantly adds new terms to the Thesaurus. While the first megabytes of texts could give up to 1000 new terms per megabyte, every megabyte of texts now gives about 10 terms on average.

Carefully gathered terms form rows of quasi-synonyms (UF references), sometimes of up to 20 elements. Adjectives and verbs that are derivatives of a descriptor can also be its quasi-synonyms.

Ambiguous terms can be described in two ways in the Thesaurus. First, an ambiguous term can be a quasi-synonym of two or more descriptors that represent different meanings of this term. For example (hereinafter we give fragments from the Thesaurus in English translation), the term capital is described as a synonym of two descriptors, CAPITAL (City) and CAPITAL (Finance). Second, if only one meaning of an ambiguous term is represented in the Thesaurus, the term is marked with a special sign of ambiguity.

The relationships between descriptors in the Thesaurus are: broader term (BT) and narrower term (NT), associative term (RT), whole-term (WT) and part-term (PT). The latter relationship is used to describe physical parts, elements or actants of a concept. For example, this relationship connects such descriptors as AVIATION and AIRCRAFT, AGRICULTURE and FARMER, and others.

Using these relations we developed our Thesaurus as a thesaurus inheritance system in which more specific concepts inherit information from more general concepts. In our system this means that the relationship "associative term" is inherited from a descriptor by its narrower descriptors and by its parts. The relationship "part-term" is inherited from a descriptor by its narrower descriptors. The relationships “broader term - narrower term” and “whole-term - part-term” are transitive.

Thus every descriptor of the AI-Thesaurus is related to a wide scope of terms. For most descriptors the number of related descriptors is much larger than the number of directly indicated relationships. For example, descriptor FINANCE has 13 direct relations with other descriptors, but through inheritance and transitivity it is related to more than 400 descriptors.

This extended set of related terms in the AI-Thesaurus makes it possible to determine which terms of a document are related to each other and to disambiguate terms during automatic indexing.

For example, the description of the concept INSURANCE is as follows:

INSURANCE
  BT FINANCIAL ACTIVITY
  NT COINSURANCE
  NT PERSONAL INSURANCE
  NT PROPERTY INSURANCE
  NT RE-INSURANCE
  PT INSURANCE CONTRACT
  PT INSURANCE COVERAGE
  PT INSURANCE ORGANIZATION
  PT INSURANCE PREMIUM
  PT INSURANCE RISK
  PT INSURANCE TARIFF
  PT INSURANT
  PT INSURED RISK
  PT FRANCHISE
  RT INSURANCE LEGISLATION
  RT INSURANCE SUPERVISION
  RT INSURANCE MARKET

Currently the Thesaurus contains more than 18 thousand terms and 7 thousand geographic names.

Russian descriptors and their quasi-synonyms were translated into English and formed the English sub-system of the Thesaurus, consisting of English descriptors and synonyms. Synonymic rows were supplemented with terms from other thesauri (LIV 1994; UNBIS Thesaurus 1976; Miller et al. 1990). We plan to organize a procedure of processing English texts to enrich the synonymic rows.

English ambiguous terms are described in the Thesaurus in the same way as Russian ambiguous terms.

2. Relations between the Thesaurus and Categories

Our technique makes it possible to carry out text categorization using different systems of categories.

We consider any category as a user-defined query that has to be represented by descriptors of the Thesaurus. The hierarchical structure of the Thesaurus makes it possible to choose a subtree of the Thesaurus corresponding to the category and to connect the category with the upper descriptor of this subtree. We call such a descriptor the “supporting descriptor” of the category.

A category can be represented by several descriptors. At present we use two types of category representation by a set of supporting descriptors.

The first type of representation is a disjunction of supporting descriptors:

D1 ∨ D2 ∨ ... ∨ Dn.

For example, category “Taxes and Budget” can be represented with the expression TAX ∨ BUDGET SYSTEM.

The other type of representation is a conjunction of disjunctions of supporting descriptors:
(D11 ∨ D12 ∨ ... ∨ D1n) & (D21 ∨ D22 ∨ ... ∨ D2m) & ... & (Dk1 ∨ Dk2 ∨ ... ∨ Dkr).

For example, category “Taxes and Budget of the Russian Federation” is represented with the following expression over supporting descriptors: (TAX ∨ BUDGET SYSTEM) & RUSSIAN FEDERATION.

After the relations between categories and supporting descriptors are fixed, the categories corresponding to the other descriptors of the Thesaurus are established automatically using the following algorithm:

Step 1. Check whether the given descriptor is a supporting descriptor. If it is, the corresponding category is found; otherwise go to Step 2.

Step 2. Look through the descriptors related to the given descriptor by relationships BT, WT, RT. If some of these descriptors are supporting ones, add the corresponding categories to the list of categories of the given descriptor. If some descriptors are not supporting ones and are related to the initial descriptor by relationships BT or WT, add them to a buffer for further search of categories.

Step 3. If the buffer is not empty, process every descriptor of the buffer as in Step 2.

As a result, most descriptors of the Thesaurus are connected with some categories, with an indication of the disjunction each descriptor belongs to. A descriptor can have no category.

Establishing such flexible relationships between categories and descriptors of the Thesaurus makes it possible to take into account specific features of documents and categories without changing thesaurus relationships. For example, in the Thesaurus descriptor GOVERNMENT COMMISSION is related to descriptor GOVERNMENT. But if it is known that all documents of a collection are decrees of the government of the Russian Federation, then descriptor GOVERNMENT COMMISSION has to correspond to category “Government of the Russian Federation”. To achieve this, we can make descriptor GOVERNMENT COMMISSION a supporting descriptor of this category.

In order to reflect properly the specific features of a document collection and its categories, we add a special “empty category” to any system of categories. We use it when the thesaurus description of concepts is not appropriate for a given document collection. For example, USA is described as a foreign country from the point of view of our Thesaurus. But if we process such documents as treaties between the Russian Federation and the US, then USA is a participant of every document. For this document collection category “Foreign country” has to correspond to any country except the Russian Federation and USA. In this case descriptor FOREIGN COUNTRY is a supporting descriptor for category “Foreign country”, but descriptors USA and RUSSIAN FEDERATION are supporting descriptors for the “empty category”.

To provide convenient Internet access to Russian official documents (Yudina & Dorsey 1995) for users accustomed to one of the well-known thesauri (LIV 1994; UNBIS Thesaurus 1976), we took the top categories (top terms, subject headings) from these thesauri and created relations between these categories and our Thesaurus. Every such thesaurus has a systematic part describing the correspondence between its descriptors and top categories. Thus these systematic parts determine the interpretation of each top category. For example, the Legislative Indexing Vocabulary (LIV 1994) has 89 top terms that were connected with 250 supporting descriptors of our Thesaurus. In particular, top term “Medicine”, containing 400 descriptors in LIV, was connected with 7 supporting descriptors, and now 460 descriptors of our Thesaurus correspond to this top term.

3. Text Categorization Using Thematic Representation of Text

Text units are compared with terms of the Thesaurus using a morphological representation of the text and of the terms. If the same fragment of a text corresponds to different descriptors of the Thesaurus, ambiguity of the text unit is indicated.

After comparison with the Thesaurus the text is represented as a sequence of descriptors, and the following steps of the algorithm are not language-dependent. All quasi-synonyms of any descriptor are represented by that descriptor and are not differentiated further.

Now it is necessary to determine which descriptors of the text are related to each other. We can do it using thesaurus relationships and the properties of inheritance and transitivity. The set of text descriptors, together with the relationships between them obtained using the properties of thesaurus relationships, is called the “thesaurus projection”.

Descriptors corresponding to different meanings of ambiguous terms also participate in the construction of the thesaurus projection for a text. Using the thesaurus projection, a proper meaning of an ambiguous term is chosen. Term disambiguation chooses the correct descriptor in more than 75 percent of cases.

At the next stage it is necessary to identify the topics of the text and describe them by constructing thematic nodes.

Every topic discussed in a text is usually expressed with a set of related terms. For example, discussion of scientific problems can be expressed in a text by means of the following terms: mathematics, physics, fundamental research, applied research, academic institute, and so on. The term that characterizes the topic is usually stressed in a text. It can be used in the title or at the beginning of the text, or it can have the highest frequency among the terms of the topic.

Any term of the Thesaurus (either a general or a specific one) can become the main term of a topic. For example, the term mathematics can become the main term of a topic if the text is devoted to the development of mathematics, or the term scientist can become the main term of a topic if a text is about “brain drain” to foreign countries.

Thematic relations between terms in a text are represented by relationships between the corresponding descriptors in the thesaurus projection. The thesaurus
projection usually consists of several separate fragments. A fragment of the thesaurus projection can have a complex structure and contain descriptors that are not really related to each other in the thesaurus. Thus it is necessary to subdivide these fragments of the thesaurus projection further.

Our experiments show that the most effective division of the thesaurus projection is obtained using the notion of a “thematic node”. A set of descriptors from a text that have thesaurus relationships with one and the same descriptor D0 in the thesaurus projection of the text is called a “thematic node”. Descriptor D0 is called the “main descriptor” of this thematic node.

Here is a fragment of the thematic node with main descriptor CUSTOMS FORMALITY that was constructed during automatic processing of the Customs Code of the Russian Federation (the right column gives the descriptor frequency in the text).

CUSTOMS FORMALITY     520
CUSTOMS DUTY          165
CUSTOMS CONTROL       153
CUSTOMS DECLARATION    47
CUSTOMS BENEFITS       21
IMPORT TAX             12
EXPORT TAX              8

During automatic processing of the Customs Code more than 140 thematic nodes were constructed. (The size of the document is more than 500 Kb.)

At the next stage it is necessary to evaluate the importance of the topics and of the thematic nodes representing these topics in the text. First we have to determine the main topics of the text, that is, to choose the main thematic nodes.

In our approach we assume that in normal, conventional texts the main topics pass through the whole text and are discussed in combination with each other. It means that descriptors of different main thematic nodes are usually located together all over the text. To find out how descriptors of thematic nodes are distributed in the text we use the notion of a “textual relation”: a given descriptor has textual relations with those descriptors of the text that are located not further than N descriptors from the given descriptor (the order of location is not important).

As a result we obtain a set of textual relations for every descriptor of a text. For example, here is a fragment of the set of textual relations of descriptor CUSTOMS BORDER obtained during processing of the Customs Code (the right column gives the frequency of the textual relation):

CUSTOMS BORDER
GOODS                   8
MEANS OF TRANSPORT      5
CUSTOMS TERRITORY       3
FREE CUSTOMS ZONE       1
CUSTOMS DUTY            1

Textual relations between descriptors are determined at the stage of comparison of the text with the Thesaurus. After thematic nodes have been constructed, the frequencies of the textual relations of the descriptors in each thematic node are summed up, which yields textual relations between thematic nodes.

In our approach we assume that main thematic nodes are those that
- have textual relations with all other main thematic nodes
and
- have a sum of frequencies of textual relations between these nodes greater than the sum of frequencies for the same number of other thematic nodes of this text.

The main thematic nodes determined in this way define a threshold that distinguishes main thematic nodes among all other thematic nodes of the text. The threshold is the average frequency of descriptors in the determined main thematic nodes. The initial set of main thematic nodes is supplemented with those thematic nodes whose frequency is greater than the threshold.

In our example of the thematic representation for the Customs Code, the main thematic nodes were the thematic nodes with main descriptors GOODS, CUSTOMS FORMALITY, CUSTOMS COMMITTEE, LAW.

Besides main thematic nodes there are specific thematic nodes and mentioned descriptors. Specific thematic nodes represent primary characteristics of the main topics discussed in the text. Specific nodes are those thematic nodes that have textual relations with at least two different main thematic nodes. Descriptors that are not elements of main or specific thematic nodes are called mentioned descriptors.

The set of thematic nodes constructed for the text, together with the evaluated status of these thematic nodes, is called the “thematic representation” of the text.

Thus all descriptors of the text are divided into five classes of different importance for the text:
• main descriptors of main thematic nodes,
• other descriptors of main thematic nodes,
• main descriptors of specific thematic nodes,
• other descriptors of specific thematic nodes,
• mentioned descriptors.

The division of descriptors into classes of importance is used for text categorization. A category represented as a disjunction of supporting descriptors becomes a category of the text if one of the main descriptors of the main thematic nodes corresponds to this category. If a category is a conjunction of two disjunctions, a special function f(k1,k2,r) evaluates whether this category is a category of the text, where k1 is the highest class of descriptors corresponding to the first disjunction of the category, k2 is the highest class of descriptors corresponding to the second disjunction, and r is the frequency of textual relations between descriptors corresponding to different disjunctions of the category.

Conclusion

At present we have an evaluation of the performance of our system
only for Russian texts.

Our evaluation of performance was obtained as a result of the following procedure. The system processed texts. We looked through the categories obtained for each text i and determined
• how many categories were obtained: Si;
• how many of the obtained categories correspond to the contents of the text: Pi;
• how many categories this text has in our opinion: Wi.

Precision of the whole process was estimated as Σ(Pi)/Σ(Si), and recall as Σ(Pi)/Σ(Wi).

Text categorization of official documents of the Russian Federation is carried out for the information system RUSSIA (Yudina & Dorsey 1995). The system of categories consists of 180 categories that are connected with 210 supporting descriptors of the Thesaurus. Categories are represented as disjunctions of supporting descriptors. The performance of text categorization, 91.2% precision and 94.2% recall, was tested on 700 documents that were not used for construction of the Thesaurus.

Text categorization for news reports uses 35 categories that are connected with 145 supporting descriptors of the Thesaurus. Most categories are represented as conjunctions of two disjunctions of supporting descriptors. The evaluation of text categorization obtained by analysing 1200 reports of the IMA-PRESS information agency is as follows: precision 91.1%, recall 93.8%.

We have shown that it is possible to provide effective text categorization of various text collections using a description of the domain in the Thesaurus, created as a special tool for automatic text processing, and constructing thematic representations of texts. Categories are connected with the Thesaurus by flexible relationships. The system can be quickly adapted to other types of texts and other systems of categories in various languages; it can process texts using different systems of categories simultaneously.

We plan to develop text categorization of English texts using Russian categories and to provide access to Russian official documents using the top categories of well-known thesauri (LIV 1994; UNBIS Thesaurus 1976).

Acknowledgments

This research was funded in part by the John and Catherine MacArthur Foundation, N. 95-30332A-FSU.

Bibliography

Goldberg J.L. 1996. A machine learning method for text categorization. Ph.D. diss., Texas A&M University.

Goodman M. 1991. Prism: A Case-Based Telex Classifier. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence. AAAI Press. 25-37.

Hayes Ph. 1992. Intelligent High-Volume Processing Using Shallow, Domain-Specific Techniques. Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. New Jersey. 227-242.

Lewis D.D. and Ringuette M. 1992. Text Categorization by Inductive Learning. Proceedings of AAAI-92.

LIV 1994. Legislative Indexing Vocabulary. 21st Edition. Washington: The Library of Congress.

Lukashevich N. 1995. Automated Formation of an Information-Retrieval Thesaurus on the Contemporary Sociopolitical Life of Russia. Automatic Documentation and Mathematical Linguistics 29(2): 29-35.

Miller G.; Beckwith R.; Fellbaum C.; Gross D.; and Miller K. 1990. Five papers on WordNet. CSL Report 43. Cognitive Science Laboratory, Princeton University.

Riloff E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI Press / The MIT Press. 811-816.

Riloff E. and Lehnert W. 1994. Information Extraction as a Basis for High Precision Text Classification. ACM Transactions on Information Systems 12(3): 296-333.

Salton G. 1989. Automatic Text Processing - The Analysis, Transformation and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.

UNBIS Thesaurus 1976. English Edition. Dag Hammarskjold Library of United Nations, New York.

Subject Headings 1991. Subject Headings. 14th Edition. Cataloging Distribution Service, Library of Congress, Washington, D.C.

Vledutz-Stokolov N. 1987. Concept Recognition in an Automatic Text-Processing System for the Life Sciences. J. of the American Society for Information Science 38: 269-287.

Yudina T. and Dorsey P. 1995. IS RUSSIA: An Artificial Intelligence-Based Document Retrieval System. Oracle Select 2(2): 12-17.
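
As an illustration of Steps 1-3 of Section 2 (establishing categories for descriptors that are not supporting descriptors themselves), the search can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the system's implementation: the dictionary layout, the function name categories_for, and most of the toy descriptors (TAX INSPECTOR, TAX SERVICE, TAX LEGISLATION) are hypothetical; only TAX, BUDGET SYSTEM, FINANCIAL ACTIVITY and category "Taxes and Budget" come from the text. The sketch also assumes that the search continues until the buffer is empty and that already visited descriptors are skipped, which the paper does not state.

```python
from collections import deque

# Hypothetical toy thesaurus: descriptor -> {relation: [related descriptors]}.
# Only the relations consulted in Step 2 (BT, WT, RT) are stored.
THESAURUS = {
    "INCOME TAX": {"BT": ["TAX"]},
    "TAX": {"BT": ["FINANCIAL ACTIVITY"], "RT": ["TAX LEGISLATION"]},
    "TAX INSPECTOR": {"WT": ["TAX SERVICE"]},
    "TAX SERVICE": {"RT": ["TAX"]},
    "FINANCIAL ACTIVITY": {},
    "TAX LEGISLATION": {},
}

# Supporting descriptors and the categories attached to them.
SUPPORTING = {
    "TAX": ["Taxes and Budget"],
    "BUDGET SYSTEM": ["Taxes and Budget"],
}


def categories_for(descriptor, thesaurus, supporting):
    """Return the categories established for one descriptor."""
    # Step 1: a supporting descriptor already has its category.
    if descriptor in supporting:
        return list(supporting[descriptor])

    found = []
    buffer = deque([descriptor])
    seen = {descriptor}  # assumption: guard against revisiting descriptors
    # Step 3: keep processing while the buffer is not empty.
    while buffer:
        current = buffer.popleft()
        relations = thesaurus.get(current, {})
        # Step 2: inspect descriptors related by BT, WT, RT.
        for relation in ("BT", "WT", "RT"):
            for related in relations.get(relation, []):
                if related in supporting:
                    for category in supporting[related]:
                        if category not in found:
                            found.append(category)
                elif relation in ("BT", "WT") and related not in seen:
                    # Non-supporting broader terms and wholes are searched further.
                    seen.add(related)
                    buffer.append(related)
    return found


print(categories_for("INCOME TAX", THESAURUS, SUPPORTING))
print(categories_for("TAX INSPECTOR", THESAURUS, SUPPORTING))
print(categories_for("TAX LEGISLATION", THESAURUS, SUPPORTING))
```

In this toy data INCOME TAX reaches the category directly through its broader term, TAX INSPECTOR reaches it through the associative relation of its whole, and TAX LEGISLATION receives no category, which corresponds to the remark that a descriptor can have no category.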

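
The evaluation measures used in the Conclusion, precision Σ(Pi)/Σ(Si) and recall Σ(Pi)/Σ(Wi), can likewise be sketched in a few lines of Python. The function name, the tuple layout and the example counts below are hypothetical and serve only to show how the sums are pooled over all texts before dividing; they are not data from the experiments, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def pooled_precision_recall(per_text_counts):
    """Compute precision and recall from per-text counts.

    per_text_counts is a list of (S_i, P_i, W_i) tuples:
    S_i - number of categories obtained for text i,
    P_i - number of obtained categories that fit the text,
    W_i - number of categories the text should have.
    """
    total_obtained = sum(s for s, _, _ in per_text_counts)
    total_correct = sum(p for _, p, _ in per_text_counts)
    total_expected = sum(w for _, _, w in per_text_counts)
    # Assumption: report 0.0 when a denominator is zero.
    precision = total_correct / total_obtained if total_obtained else 0.0
    recall = total_correct / total_expected if total_expected else 0.0
    return precision, recall


# Hypothetical counts for three texts: (S_i, P_i, W_i).
counts = [(3, 3, 3), (2, 1, 2), (4, 3, 3)]
precision, recall = pooled_precision_recall(counts)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Because the counts are summed over the whole collection first, texts with many categories weigh more than texts with few, in contrast to averaging per-text precision and recall.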