Semi-supervised Single-label Text Categorization
using Centroid-based Classifiers
Ana Cardoso-Cachopo
Arlindo L. Oliveira
IST — TULisbon / INESC-ID
Av. Rovisco Pais, 1
1049-001 Lisboa — Portugal
INESC-ID / IST — TULisbon
Rua Alves Redol, 9
1000-029 Lisboa — Portugal
[email protected]
[email protected]
ABSTRACT
In this paper we study the effect of using unlabeled data in
conjunction with a small portion of labeled data on the accuracy of a centroid-based classifier used to perform single-label text categorization. We chose to use centroid-based
methods because they are very fast when compared with
other classification methods, but still present an accuracy
close to that of the state-of-the-art methods. Efficiency is
particularly important for very large domains, like regular
news feeds, or the web.
We propose the combination of Expectation-Maximization
with a centroid-based method to incorporate information
about the unlabeled data during the training phase. We
also propose an alternative to EM, based on the incremental update of a centroid-based method with the unlabeled
documents during the training phase.
We show that these approaches can greatly improve accuracy relative to a simple centroid-based method, in particular when there are very small amounts of labeled data
available (as few as one single document per class).
Using one synthetic and three real-world datasets, we
show that, if the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the
other hand, using unlabeled data degrades performance if
the initial model is not precise enough.
1. INTRODUCTION
Text Categorization (TC) is concerned with finding methods that, given a document, can automatically classify it into
one or more of a predefined set of categories (or classes) [20].
When there is no overlap between classes, that is, when each
document belongs to a single class, it is called single-label
TC. In this paper, we are concerned with single-label TC.
A very efficient class of methods for TC is that of centroid-based methods [5, 10, 11, 8, 22, 13]. These methods are very
efficient during the classification phase, because time and
memory are proportional to the number of classes, rather
than to the number of training documents. Despite their
computational simplicity, these methods are very effective,
even when compared with state-of-the-art methods like Support Vector Machines (SVM) [12, 3].
A characteristic common to many TC applications is that
it is expensive to manually classify data for the training phase, while
it is relatively inexpensive to find unlabeled data. In other
situations, only a small portion of the document space is
available initially, and new documents arrive incrementally.
The web is a good example of a large, changing environment,
with both these characteristics.
It has been shown that the use of large amounts of unlabeled data in conjunction with small amounts of labeled
data can greatly improve the performance of some TC methods [17]. This has been done using a well known iterative
algorithm called Expectation-Maximization (EM) [7].
In this paper, we propose the combination of EM with
a centroid-based method to incorporate information about
the unlabeled data during the training phase. We show that
this approach can greatly improve accuracy relative to a
simple centroid-based method, in particular when there are
very small amounts of labeled data.
For the situations where only a small portion of the document space is available initially, we incrementally update a
centroid-based method with the unlabeled documents during the training phase.
It is particularly interesting that, by using a centroid-based method as the underlying classification method, we are able to compare the accuracy of semi-supervised and incremental learning directly, and choose the most appropriate approach for each situation.
Using one synthetic and three real-world datasets, we
show that, if the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the
other hand, using unlabeled data degrades performance if
the initial model is not precise enough.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—Retrieval models
General Terms
Algorithms, Experimentation, Performance
Keywords
Single-label Text Categorization, Centroid-based Models, Semi-supervised Learning, Online Learning
This paper is structured as follows: Section 2 briefly describes centroid-based methods and puts in context some of the work that has been done in this area. Section 3 describes how EM can be combined with a centroid-based method to incorporate information about the unlabeled data, and refers to other applications of EM to semi-supervised learning. Section 4 describes our proposal for incrementally incorporating information about the unlabeled data, as well as some of the work done on online learning. Section 5 describes the experimental setup that was used for this paper. Section 6 discusses the results that we obtained and compares them to published work. Finally, in Section 7, we conclude and point to some directions for future work.
2. CENTROID-BASED METHODS
Centroid-based methods combine documents represented using a vector-space model [18] to find a representation for a "prototype" document that summarizes all the known documents of a given class; this representation is called the centroid. Given a set of $n$ document vectors $D = \{\vec{d}_1, \ldots, \vec{d}_n\}$, classified along a set $C$ of $m$ classes, $C = \{C_1, \ldots, C_m\}$, we use $D_{C_j}$, for $1 \leq j \leq m$, to represent the set of document vectors belonging to class $C_j$. The centroid of a particular class $C_j$ is represented by a vector $\vec{c}_j$, which is a combination of the document vectors $\vec{d}_i$ belonging to that class, sometimes combined with information about the vectors of documents that are not in that class. There are several ways to calculate this centroid during the training phase, and several proposals for centroid-based methods are available in the literature. Each proposal uses one possible way of calculating the centroids; the alternatives are similar, but produce different results. The most common are:
• The Rocchio formula, where each centroid $\vec{c}_j$ is represented by the sum of all the document vectors for the positive training examples of class $C_j$, minus the sum of all the vectors for the negative training examples, weighted by control parameters $\beta$ and $\gamma$, respectively. The application of this method to TC was first proposed by Hull [9], and it has been used in other works where the role of negative examples is de-emphasized by setting $\beta$ to a higher value than $\gamma$ (usually $\beta = 16$ and $\gamma = 4$) [5, 10, 11]:

$$\vec{c}_j = \beta \cdot \frac{1}{|D_{C_j}|} \cdot \sum_{\vec{d}_i \in D_{C_j}} \vec{d}_i \;-\; \gamma \cdot \frac{1}{|D - D_{C_j}|} \cdot \sum_{\vec{d}_i \notin D_{C_j}} \vec{d}_i \quad (1)$$

• The average formula [8, 22], where each centroid $\vec{c}_j$ is represented by the average of all the vectors for the positive training examples of class $C_j$:

$$\vec{c}_j = \frac{1}{|D_{C_j}|} \cdot \sum_{\vec{d}_i \in D_{C_j}} \vec{d}_i \quad (2)$$

• The sum formula [4], where each centroid $\vec{c}_j$ is represented by the sum of all the vectors for the positive training examples of class $C_j$:

$$\vec{c}_j = \sum_{\vec{d}_i \in D_{C_j}} \vec{d}_i \quad (3)$$

• The normalized sum formula [13], where each centroid $\vec{c}_j$ is represented by the sum of all the vectors for the positive training examples of class $C_j$, normalized so that it has unitary length:

$$\vec{c}_j = \frac{1}{\left\| \sum_{\vec{d}_i \in D_{C_j}} \vec{d}_i \right\|} \cdot \sum_{\vec{d}_i \in D_{C_j}} \vec{d}_i \quad (4)$$
It is fairly obvious that these ways of calculating each
class’s centroid make centroid-based methods very efficient
during the training phase, because there is little computation involved, unlike methods based on SVMs, which build
a more sophisticated model of the data. Centroid-based
methods also have the advantage that they are very easy
to modify in order to perform incremental learning during
their training phase, as we shall show in Section 4.
During the classification phase, each test document (or query) is represented by its vector $\vec{d}_i$ and is compared to each of the class centroids $\vec{c}_j$. The document is classified as belonging to the class whose centroid is most similar to it, using the cosine similarity:

$$sim(\vec{d}_i, \vec{c}_j) = \frac{\vec{d}_i \cdot \vec{c}_j}{\|\vec{d}_i\| \times \|\vec{c}_j\|} \quad (5)$$
Centroid-based methods are very efficient during the classification phase because time and memory spent are proportional to the number of classes that exist, rather than to the
number of training documents as is the case for the vector
method and other related methods.
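To make these two phases concrete, the following is a minimal sketch, in Python, of a centroid-based classifier that trains with the normalized sum formula (4) and classifies with the cosine similarity of formula (5). It is only an illustration, not the authors' implementation: documents are assumed to be sparse vectors represented as dictionaries from terms to tf-idf weights, and all function names are made up for this example.

import math
from collections import defaultdict

def add(u, v):
    """Sum of two sparse vectors (dicts mapping term -> weight)."""
    w = defaultdict(float, u)
    for term, weight in v.items():
        w[term] += weight
    return dict(w)

def norm(v):
    return math.sqrt(sum(w * w for w in v.values()))

def normalize(v):
    n = norm(v)
    return {t: w / n for t, w in v.items()} if n > 0 else dict(v)

def train_centroids(labeled_docs):
    """Normalized sum centroids (formula (4)): sum the vectors of each class
    and normalize the result to unit length."""
    sums = {}
    for vector, label in labeled_docs:
        sums[label] = add(sums.get(label, {}), vector)
    return {label: normalize(s) for label, s in sums.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (formula (5))."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def classify(vector, centroids):
    """Assign the document to the class with the most similar centroid."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Toy usage with made-up tf-idf vectors:
labeled = [({"ball": 1.0, "game": 0.5}, "sports"),
           ({"vote": 1.0, "law": 0.7}, "politics")]
centroids = train_centroids(labeled)
print(classify({"game": 0.9, "ball": 0.2}, centroids))  # -> sports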
3. INCORPORATING UNLABELED DATA
WITH EM
It has been shown that the use of large amounts of unlabeled data in conjunction with small amounts of labeled
data can greatly improve the performance of some TC methods [17]. The combination of the information contained in
the labeled and unlabeled data can be done using Expectation-Maximization (EM) [7].
EM is a class of iterative algorithms for maximum-likelihood estimation of hidden parameters in problems with incomplete data. In our case, we consider the labels of the unlabeled documents to be the hidden values, and use EM to estimate them. EM has been used to combine labeled and unlabeled data for classification in conjunction with several different methods: Shahshahani and Landgrebe [21] use a mixture of Gaussians; Miller and Uyar [16] use mixtures of experts; McCallum and Nigam [15] use pool-based active learning; and Nigam [17] uses Naive Bayes. EM has also been used with k-means [14], which can be considered a centroid-based method, but for clustering rather than classification, under the name of constrained k-means [1, 2].
We propose the combination of EM with a centroid-based method for TC. After choosing one of the formulas (1) to (4) to calculate each class's centroid, the method works according to the following algorithm:

Inputs: A set of labeled document vectors, $L$, and a set of unlabeled document vectors, $U$.
Initialization step:
• For each class $C_j$ appearing in $L$, set $D_{C_j}$ to the set of documents in $L$ belonging to class $C_j$.
• For each class $C_j$, calculate the class's centroid $\vec{c}_j$, using one of the formulas (1) to (4).
Estimation step:
• For each class $C_j$ appearing in $L$, set $U_{C_j}$ to the empty set.
• For each document vector $\vec{d}_i \in U$:
  – Let $C_k$ be the class to whose centroid $\vec{d}_i$ has the greatest cosine similarity, calculated using Equation (5).
  – Add $\vec{d}_i$ to the set of document vectors labeled as $C_k$, i.e., set $U_{C_k}$ to $U_{C_k} \cup \{\vec{d}_i\}$.
Maximization step:
• For each class $C_j$, calculate a new centroid $\vec{c}_j^{\,new}$ using the chosen formula, with $D_{C_j} \cup U_{C_j}$ as the set of documents labeled as $C_j$.
Iterate:
• If, for some $j$, $\vec{c}_j \neq \vec{c}_j^{\,new}$, then set each $\vec{c}_j$ to $\vec{c}_j^{\,new}$ and repeat from the Estimation step.
Outputs: For each class $C_j$, the centroid $\vec{c}_j$.
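As an illustration only, the following Python sketch mirrors the algorithm above. It assumes the hypothetical train_centroids and classify helpers from the sketch at the end of Section 2, and stops when the centroids stop changing (up to a small tolerance); it is not the authors' implementation.

def em_centroids(labeled, unlabeled, max_iter=50, tol=1e-9):
    """Semi-supervised training of centroids with EM (sketch).
    labeled: list of (vector, label) pairs; unlabeled: list of vectors."""
    centroids = train_centroids(labeled)                  # initialization step
    for _ in range(max_iter):
        # Estimation step: label each unlabeled document with the class of
        # the most similar centroid (Equation (5)).
        guessed = [(v, classify(v, centroids)) for v in unlabeled]
        # Maximization step: recompute the centroids from labeled + guessed.
        new_centroids = train_centroids(labeled + guessed)
        # Iterate until no centroid changes.
        changed = any(
            max((abs(new_centroids[c].get(t, 0.0) - centroids[c].get(t, 0.0))
                 for t in set(new_centroids[c]) | set(centroids[c])),
                default=0.0) > tol
            for c in centroids)
        centroids = new_centroids
        if not changed:
            break
    return centroids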
4. INCREMENTALLY UPDATING THE
MODEL OF THE DATA
Online methods [20] build a classifier soon after examining the first training document, and incrementally refine it
as they examine new ones. This may be an advantage in applications where the training set is not available in its entirety from the start, or in which the meaning of a category may change over time. We can describe the incremental method as a very generic algorithm:
Given a classification model, M, and a set of documents to classify, D, repeat for each document d ∈ D:
• Classify d according to model M
• Update model M with the new document d classified in the
previous step
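The loop below is a minimal, illustrative rendering of this generic algorithm in Python; the model object, with its classify and update methods, is an assumed interface rather than something defined in the paper.

def incremental_run(model, documents):
    """Generic online loop: classify each incoming document with the current
    model, then fold the newly labeled document back into the model."""
    predictions = []
    for doc in documents:
        label = model.classify(doc)   # classify d according to model M
        model.update(doc, label)      # update model M with d and its label
        predictions.append(label)
    return predictions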
The incremental approach is well suited to tasks that require continuous learning, because the available data changes over time. Centroid-based methods are particularly suitable for this kind of approach because they are very fast, both for training and for testing, and can be applied to very large domains like the web. Moreover, unlike the traditionally used perceptron-based model [19, 23], which needs to train a different classifier for each class in order to consider more than two classes, centroid-based models trivially generalize to multiple classes. In the case of single-label TC, there is no need to fine-tune a threshold for deciding when a document belongs to a class, because each document simply belongs to the class represented by the most similar centroid. In terms of computational efficiency, updating a centroid-based model is cheap, provided that we keep some additional information in the model with each centroid.
The next paragraphs describe how the model is updated for each of the centroid-based methods presented in Section 2. In each case, we want to update the model with a new document $\vec{d}_{new}$, classified as belonging to class $C_j$. The simplest case is that of the sum method (formula (3)), where the model is updated by calculating a new value for the centroid $\vec{c}_j$ with the following assignment:

$$\vec{c}_j \leftarrow \vec{c}_j + \vec{d}_{new} \quad (6)$$
To simplify the incremental update of the average method (formula (2)), we also maintain in the model the number of documents, $n_j$, that were used to calculate each centroid $\vec{c}_j$. With this information, the model is updated according to the following assignments:

$$\vec{c}_j \leftarrow \frac{(\vec{c}_j \cdot n_j) + \vec{d}_{new}}{n_j + 1} \quad \text{and then} \quad n_j \leftarrow n_j + 1 \quad (7)$$
For the normalized sum method (formula (4)), we maintain, with each normalized centroid $\vec{c}_j$, the non-normalized centroid $\vec{nnc}_j$, so that updating the model can be performed by the following assignments:

$$\vec{nnc}_j \leftarrow \vec{nnc}_j + \vec{d}_{new} \quad \text{and then} \quad \vec{c}_j \leftarrow \frac{\vec{nnc}_j}{\|\vec{nnc}_j\|} \quad (8)$$
Finally, the most complex case is that of the Rocchio method (formula (1)), because each new document forces the update of every centroid in the model. In this case, we maintain, for each centroid $\vec{c}_j$, two vectors: the sum of the positive examples, $\vec{pos}_j$, and the sum of the negative examples, $\vec{neg}_j$. Using these two vectors, formula (1) can be rewritten as

$$\vec{c}_j = \beta \cdot \vec{pos}_j - \gamma \cdot \vec{neg}_j \quad (9)$$

Updating the model can be achieved by first updating the appropriate vectors,

$$\vec{pos}_j \leftarrow \vec{pos}_j + \vec{d}_{new} \quad \text{and, for each } i \neq j, \quad \vec{neg}_i \leftarrow \vec{neg}_i + \vec{d}_{new} \quad (10)$$

and then calculating all the new centroids according to formula (9).
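To make one of these update rules concrete, here is an illustrative sketch of an incrementally updated classifier for the normalized sum method (formulas (4) and (8)). It reuses the hypothetical add, normalize, and cosine helpers from the sketch in Section 2 and is not the authors' code; keeping the non-normalized sum for each class means that adding a document touches only that class's centroid.

class IncrementalNormalizedSumCentroids:
    """Normalized sum centroids maintained incrementally (formula (8))."""

    def __init__(self):
        self.sums = {}       # class -> non-normalized sum vector nnc_j
        self.centroids = {}  # class -> normalized centroid c_j

    def update(self, vector, label):
        # nnc_j <- nnc_j + d_new, then c_j <- nnc_j / ||nnc_j||
        self.sums[label] = add(self.sums.get(label, {}), vector)
        self.centroids[label] = normalize(self.sums[label])

    def classify(self, vector):
        # Most similar centroid under cosine similarity (formula (5)).
        return max(self.centroids,
                   key=lambda label: cosine(vector, self.centroids[label]))

An instance of this class also fits the generic online loop sketched earlier in this section.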
In this paper we show how incrementally updating the
model of the data (that is, the centroids) with the unlabeled
documents during the training phase influences the accuracy
of a centroid-based method.
5. EXPERIMENTAL SETUP
In this section we present the experimental setup used in this work, namely the datasets and the evaluation measures. In order to show that using unlabeled data improves performance when the initial model of
the data is sufficiently precise, while hurting performance if
the initial model is not precise enough, we used one synthetic
and three real-world datasets.
5.1 Synthetic Dataset
The synthetic dataset corresponds to four different mixtures of Gaussians, in one dimension. The data points belonging to each Gaussian distribution are randomly generated according to the Gaussian probability density function:

$$g(x) = \frac{e^{-(x-\mu)^2 / 2\sigma^2}}{\sigma \sqrt{2\pi}} \quad (11)$$

In each combination, each Gaussian distribution corresponds to a different class, and we used different ratios between the parameters $\mu$ and $\sigma$ to simulate problems of differing difficulty. Figure 1 depicts the Gaussian distributions that we used. It is easy to see that, as the ratio $\mu/\sigma$ decreases, the problem of deciding which distribution originated a randomly generated point becomes more difficult, because the overlap between the distributions increases. In particular, the limit for the accuracy of the optimal classifier can be obtained as the value of the cumulative distribution function of the Gaussian at the point of intersection.
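As a small worked illustration of this setup (not the authors' exact generator), the snippet below samples points from one pair of such Gaussians and computes the accuracy limit of the optimal classifier. With symmetric means $\pm\mu$ and equal $\sigma$, the two densities intersect at $x = 0$, so the limit is the probability that a point falls on the correct side of zero, i.e. the standard Gaussian CDF evaluated at $\mu/\sigma$.

import random
from statistics import NormalDist

def make_dataset(n_per_class, mu=1.0, sigma=0.5, seed=0):
    """Sample n_per_class points from each of N(+mu, sigma) and N(-mu, sigma);
    returns a list of (x, label) pairs."""
    rng = random.Random(seed)
    data = [(rng.gauss(+mu, sigma), +1) for _ in range(n_per_class)]
    data += [(rng.gauss(-mu, sigma), -1) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

def optimal_accuracy(mu, sigma):
    """Accuracy limit of the optimal classifier: the two densities intersect
    at x = 0, so the limit is P(correct side of 0) = Phi(mu / sigma)."""
    return NormalDist().cdf(mu / sigma)

for sigma in (0.5, 1.0, 2.0, 4.0):
    print(sigma, round(optimal_accuracy(1.0, sigma), 4))
# sigma = 0.5 gives about 0.977, while sigma = 4.0 gives only about 0.599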
Figure 1: Combinations of Gaussians that were used (four pairs of distributions with µ = ±1.0 and σ = 0.5, 1.0, 2.0, and 4.0).

5.2 Real-world Datasets
To allow the comparison of our work with previously published results, we used two standard TC benchmarks in our
evaluation, downloaded from a publicly available repository
of datasets for single-label text categorization.1 On this website there is also a description of the datasets, their standard train/test splits, how they were processed to become single-labeled, and the pre-processing techniques that were applied
to each dataset, namely character clean-up, removal of short
words, removal of stopwords, and stemming. We also used
a set of classified web pages extracted from the CADÊ Web
Directory.2
20 Newsgroups — The 20ng dataset is a collection
of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. We used its
standard “ByDate” split, where documents are ordered by
date and the first two thirds are used for training and the
remaining third for testing. For this dataset, we used the
files 20ng-train-stemmed and 20ng-test-stemmed, available from that website.
Reuters 21578 — The documents in Reuters-21578 appeared on the Reuters newswire in 1987 and were manually
classified by personnel from Reuters Ltd. We used the standard "modApté" train/test split. Because the
class distribution for these documents is very skewed, two
sub-collections are usually considered for text categorization
tasks [6]: R10, the set of the 10 classes with the highest number of positive training examples; and R90, the set of the
90 classes with at least one positive training and testing example. Because we are concerned with single-label TC, we
used r8, which corresponds to the documents with a single
topic and the classes which still have at least one training and one test example after removing documents with
more than one topic from R10. For this dataset, we used
the files r8-train-stemmed and r8-test-stemmed, available
from that website.
CADE — The documents in the Cade12 dataset correspond to web pages extracted from the CADÊ Web Directory, which points to Brazilian web pages classified by
human experts. This dataset corresponds to a subset of the
pages that are available through that directory. We randomly chose two thirds of the documents for training and
the remaining third for testing.
Table 1 shows some information about these datasets,
namely the number of classes and the numbers of documents
in the train and test sets.
5.3 Evaluation Measure
TC methods are usually evaluated in terms of measures
based on Precision and Recall, like F1 or PRBP [20]. However, to evaluate single-label TC tasks, these measures are
not adequate, because Recall does not make sense in this
setting. So, accuracy, which is the percentage of correctly
classified test documents (or queries), is used to evaluate
this kind of task. We will therefore use accuracy as the criterion for evaluating the performance of the algorithms.

$$Accuracy = \frac{\#\text{ correctly classified test documents}}{\#\text{ total test documents}} \quad (12)$$

1 Available at http://www.gia.ist.utl.pt/~acardoso/datasets/
2 Available at http://www.cade.com.br, in Brazilian Portuguese.
20ng (20 classes)            Train    Test   Total
alt.atheism                    480     319     799
comp.graphics                  584     389     973
comp.os.ms-windows.misc        572     394     966
comp.sys.ibm.pc.hardware       590     392     982
comp.sys.mac.hardware          578     385     963
comp.windows.x                 593     392     985
misc.forsale                   585     390     975
rec.autos                      594     395     989
rec.motorcycles                598     398     996
rec.sport.baseball             597     397     994
rec.sport.hockey               600     399     999
sci.crypt                      595     396     991
sci.electronics                591     393     984
sci.med                        594     396     990
sci.space                      593     394     987
soc.religion.christian         598     398     996
talk.politics.guns             545     364     909
talk.politics.mideast          564     376     940
talk.politics.misc             465     310     775
talk.religion.misc             377     251     628
Total                        11293    7528   18821

r8 (8 classes)               Train    Test   Total
acq                           1596     696    2292
crude                          253     121     374
earn                          2840    1083    3923
grain                           41      10      51
interest                       190      81     271
money-fx                       206      87     293
ship                           108      36     144
trade                          251      75     326
Total                         5485    2189    7674

Cade12 (12 classes)          Train    Test   Total
01–servicos                   5627    2846    8473
02–sociedade                  4935    2428    7363
03–lazer                      3698    1892    5590
04–informatica                2983    1536    4519
05–saude                      2118    1053    3171
06–educacao                   1912     944    2856
07–internet                   1585     796    2381
08–cultura                    1494     643    2137
09–esportes                   1277     630    1907
10–noticias                    701     381    1082
11–ciencias                    569     310     879
12–compras-online              423     202     625
Total                        27322   13661   40983
Table 1: List of classes and number of documents
for each dataset.
It can be shown that, in single-label classification tasks,

$$Accuracy = \text{microaveraged } F_1 = \text{microaveraged Precision} = \text{microaveraged Recall} \quad (13)$$

because each document is either correctly classified or not, so the total number of false positives in the per-class contingency tables is the same as the total number of false negatives.
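A quick numerical check of this identity on a toy single-label prediction vector (purely illustrative, not part of the paper's evaluation code): every misclassified document contributes exactly one false positive (for the predicted class) and one false negative (for the true class), so the micro-averaged scores all collapse to accuracy.

y_true = ["acq", "earn", "earn", "crude", "trade", "earn"]
y_pred = ["acq", "earn", "crude", "crude", "earn", "earn"]

tp = sum(t == p for t, p in zip(y_true, y_pred))   # true positives over all classes
fp = sum(t != p for t, p in zip(y_true, y_pred))   # one FP per error (predicted class)
fn = sum(t != p for t, p in zip(y_true, y_pred))   # one FN per error (true class)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = tp / len(y_true)
print(accuracy, micro_precision, micro_recall, micro_f1)  # all equal to 0.666...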
5.4 Preliminary Testing
In order to be able to compare our work with some other
TC methods, we have determined the accuracy that some
of the most common methods achieve on our datasets. As usual, tfidf term weighting was used to represent document vectors, which were normalized to unit length. It has
already been shown that, of the several centroid-based methods proposed in the literature, Centroid-NormalizedSum (the
one that uses formula (4) to calculate the centroids) was the
best performing one [3], and so we used it in our experiments.
For comparison purposes, Table 2 shows the accuracy obtained with some well-known TC methods using our framework, with all the training documents treated as labeled documents. The "dumb classifier" ignores the contents of the test document and always predicts the most frequent class in the training set.
Method                     r8       20ng     Cade12
Dumb classifier            0.4947   0.0530   0.2083
Vector                     0.7889   0.7240   0.4142
k-NN (k = 10)              0.8524   0.7593   0.5120
Centroid-NormalizedSum     0.9543   0.7885   0.5147
SVM (linear kernel)        0.9698   0.8278   0.5283
Table 2: Accuracy achieved by some TC methods
using our framework, considering all training documents as labeled.
Note that, because r8 is very skewed, the dumb classifier
has a “reasonable” performance for this dataset. Also, it is
worth noting that, while for r8 and 20ng we can find good
classifiers, that is, classifiers that achieve a high accuracy,
for Cade12 the best we can get does not reach 53% accuracy,
even with one of the best classifiers available.
6. EXPERIMENTAL RESULTS
In this section we provide empirical evidence that, if the
initial model of the data is sufficiently precise, using unlabeled data improves performance, and that, on the other
hand, using unlabeled data degrades performance if the initial model is not precise enough. We do this, first by using
one synthetic dataset, and then confirming the results obtained using three real-world datasets.
6.1 Using Unlabeled Data with the Synthetic Dataset
The synthetic dataset was created with several goals in
mind. First, we wanted a dataset that was simple, and
whose properties were well known. Additionally, we wanted
to be able to generate as many “documents” as necessary
for our experiments. Ultimately, our goal was to show that
the effect of using unlabeled data depends not only on the
classification method that is used, but also on the quality of
the dataset.
We randomly generated four different two-class datasets,
each according to two different one-dimensional Gaussian
distributions, so that each dataset posed a different difficulty
level, known in advance. With each dataset, we used the
same classification methods, and in the end we compared
the results.
So that our experiments would not depend on one particular ordering of the dataset, we repeated the following steps
500 times for each dataset:
1. Randomly generate from 1 to 20 labeled training documents per class.
2. Based on the training documents alone, calculate each
class’s centroid, that is, the average of the numbers
corresponding to the training documents.
3. Randomly generate 5000 test documents. These should
be approximately half from each class.
4. Determine accuracy for the centroid-based method.
5. Randomly generate 5000 unlabeled documents.
6. Using EM / incremental, update each class’s centroid.
7. Determine accuracy for EM / incremental, using the
same test documents as for the centroid-based method
alone.
This can be summed up in the following algorithm:

For each dataset, repeat 500 times
  For i = 1 to 20
    Randomly generate i training docs per class
    Calculate each class's centroid
    Randomly generate 5000 test docs
    Determine accuracy for the centroid-based method
    Randomly generate 5000 unlabeled docs
    Using EM / incremental, update centroids
    Determine accuracy for EM / incremental

The mean accuracy values for the 500 runs for each dataset, as a function of the number of labeled documents per class that were used, are presented in Figure 2.
We can observe that there are some features that are common, independently of the difficulty level of the dataset:
• As expected, more labeled training documents per class improve results, because the curves go up as the number of available labeled documents increases. However, this improvement diminishes as the number of labeled documents grows.
• As the number of labeled training documents per class increases, the effect of using unlabeled documents decreases. This means that having unlabeled data is always good, and that it helps more when less labeled data is available.
• Both methods (EM and incremental) of updating the class centroids give the same results: the two lines coincide. In this setting, it is better to incrementally update the centroids, because the results are the same and this method is computationally more efficient.
As for the features that depend on the difficulty level of the dataset:
• As the difficulty level of the dataset increases, the accuracy that can be achieved decreases (note the different ranges in the Y axis).
• When only one labeled document per class is available, 5000 unlabeled documents allow us to improve accuracy from 0.9196 to 0.9771 for the easiest dataset, while for the most difficult dataset the improvement is only from 0.5252 to 0.5331.
• As a general rule, the effect of using 5000 unlabeled documents to update our model of the data decreases as the difficulty level of the dataset increases.
All these observations allow us to conclude that the effect of using unlabeled data depends not only on the classification method that is used, but also on the quality of the dataset that we are considering. If our initial model of the labeled data is able to achieve a high accuracy, using unlabeled data will help improve the results, and it will help more if the initial model is better.

Figure 2: Accuracy for the Gaussians dataset, as a function of the number of labeled documents per class that were used (one panel per pair of Gaussians, µ = ±1.0 and σ = 0.5, 1.0, 2.0, 4.0, with curves for centroid, EM, and Inc).

6.2 Using Unlabeled Data with the Real World Datasets
To confirm the results obtained in the previous section, we used two standard TC benchmarks and a set of classified web pages extracted from the CADÊ Web Directory. As we already saw in Section 5.4, we were able to find good classifiers for the first two datasets, but not for the third.
The steps we followed for testing these datasets were similar to the ones followed in the previous section, and can be
summed up in the following algorithm3:

For each dataset, consider its train/test split
For each dataset, repeat 5 times
  For i in {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70}
    Randomly select i labeled training docs per class
    Calculate each class's centroid using formula (4)
    Determine accuracy for the centroid-based method
    Randomly select 5000 unlabeled docs from the remaining training docs
    Using EM / incremental, update centroids
    Determine accuracy for EM / incremental
3 Due to its reduced size, for the r8 dataset we could only select up to 40 labeled documents per class and 1000 unlabeled documents.

The mean accuracy values for the 5 runs for each dataset are presented in Figure 3.

Figure 3: Accuracy for the three real world datasets, as a function of the number of labeled documents per class that were used (one panel each for r8, 20ng, and Cade12, with curves for centroid, EM, and Inc).

Once more, we can observe that, independently of the dataset that is used, more labeled training documents per class improve results, and that this improvement decreases as the number of labeled documents increases.
The rest of the observations depend on the dataset that is used:
• For r8, where it was possible to achieve a high accuracy using all 5485 training documents with SVM and Centroid-NormalizedSum (see Table 2), accuracy varies from 0.6492 with one labeled document per class to 0.9321 with 40 labeled documents per class. For 20ng, where it was also possible to achieve high accuracy values with Centroid-NormalizedSum and SVM using all 11293 training documents, accuracy varies from 0.2627 with one labeled document per class to 0.7424 with 70. For Cade12, which already had poor accuracy values using all 27322 training documents with Centroid-NormalizedSum and SVM, accuracy varies from 0.1212 with one labeled document per class to 0.3782 with 70 (note the different ranges in the Y axis).
• For r8 and 20ng, using unlabeled data improves results, and the improvement is larger for smaller numbers of labeled documents per class. For Cade12, using unlabeled data worsens results. This observation experimentally confirms our initial intuition that it is only worth using unlabeled data if the initial model for the labeled data already provides good results.
• For r8 and 20ng, as the number of labeled training documents per class increases, the effect of using unlabeled documents decreases. As for the synthetic dataset, having unlabeled data is good, and it is better when we have less labeled data.
• For all datasets, the two methods (EM and incremental) of updating the class centroids now give different results, because the way the centroids are updated differs. Moreover, the difference between the results decreases as the number of labeled documents increases. This happens because, when more labeled documents are available, the initial model is better, and therefore the centroids are moved less by either of the updating techniques. Generally, using EM to update the centroids yields better results.
All these observations allow us to confirm our previous
conclusion that the effect of using unlabeled data depends
not only on the classification method that is used, but also
on the quality of the dataset that we are considering. For the
datasets for which we could come up with good classification
accuracy (r8 and 20ng), using unlabeled data helped improve
results, while for the dataset for which the initial results were
not so good (Cade12), using unlabeled data actually made
our results even worse.
7. CONCLUSIONS AND FUTURE WORK
In this paper we addressed single-label text categorization. We proposed the combination of EM with a
centroid-based classifier that uses information from small
amounts of labeled documents together with information
from larger amounts of unlabeled documents. We also showed
how a centroid-based method can be used to incrementally
update the model of the data, based on new evidence from
the unlabeled data. Using one synthetic dataset and three
real-world datasets, we provided empirical evidence that, if
the initial model of the data is sufficiently precise, using unlabeled data improves performance. On the other hand, using unlabeled data degrades performance if the initial model
is not precise enough. As future work, we plan to extend this
approach to multi-label datasets.
8. ACKNOWLEDGMENTS
We thank the anonymous reviewers for their helpful comments on this work. This research was sponsored in part by
FCT project POSC/EIA/58194/2004.
9. REFERENCES
[1] A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and
R. Mooney. Model-based overlapping clustering. In
KDD ’05: Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in
data mining, pages 532–537. ACM Press, 2005.
[2] M. Bilenko, S. Basu, and R. Mooney. Integrating
constraints and metric learning in semi-supervised
clustering. In ICML ’04: Proceedings of the
twenty-first international conference on Machine
learning, page 11. ACM Press, 2004.
[3] A. Cardoso-Cachopo and A. Oliveira. Empirical
evaluation of centroid-based models for single-label
text categorization. Technical Report 7/2006,
INESC-ID, June 2006.
[4] W. Chuang, A. Tiyyagura, J. Yang, and G. Giuffrida.
A fast algorithm for hierarchical text classification. In
Proceedings of DaWaK-00, 2nd International
Conference on Data Warehousing and Knowledge
Discovery, pages 409–418. Springer Verlag, 2000.
[5] W. Cohen and Y. Singer. Context-sensitive learning
methods for text categorization. ACM Transactions
on Information Systems, 17(2):141–173, 1999.
[6] F. Debole and F. Sebastiani. An analysis of the
relative hardness of reuters-21578 subsets. Journal of
the American Society for Information Science and
Technology, 56(6):584–596, 2004.
[7] A. Dempster, N. Laird, and D. Rubin. Maximum
likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society,
Series B, 39:1–38, 1977.
[8] E.-H. Han and G. Karypis. Centroid-based document classification: Analysis and experimental results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 424–431, 2000.
[9] D. Hull. Improving text retrieval for the routing
problem using latent semantic indexing. In Proceedings
of SIGIR-94, 17th ACM International Conference on
Research and Development in Information Retrieval,
pages 282–289. Springer Verlag, 1994.
[10] D. Ittner, D. Lewis, and D. Ahn. Text categorization
of low quality images. In Proceedings of SDAIR-95,
4th Annual Symposium on Document Analysis and
Information Retrieval, pages 301–315, 1995.
[11] T. Joachims. A probabilistic analysis of the Rocchio
algorithm with TFIDF for text categorization. In
Proceedings of ICML-97, 14th International
Conference on Machine Learning, pages 143–151.
Morgan Kaufmann Publishers, 1997.
[12] T. Joachims. Text categorization with support vector
machines: learning with many relevant features. In
Proceedings of ECML-98, 10th European Conference
on Machine Learning, pages 137–142. Springer Verlag,
1998.
[13] V. Lertnattee and T. Theeramunkong. Effect of term
distributions on centroid-based text categorization.
Information Sciences, 158(1):89–115, 2004.
[14] J. MacQueen. Some methods for classification and
analysis of multivariate observations. In 5th Berkeley
Symposium on Mathematical Statistics and
Probability, pages 281–297, 1967.
[15] A. McCallum and K. Nigam. Employing EM in
pool-based active learning for text classification. In
Proceedings of ICML-98, 15th International
Conference on Machine Learning, pages 350–358.
Morgan Kaufmann Publishers, 1998.
[16] D. Miller and H. Uyar. A mixture of experts classifier
with learning based on both labelled and unlabelled
data. In Advances in Neural Information Processing
Systems, volume 9, pages 571–577. MIT Press, 1997.
[17] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell.
Text classification from labeled and unlabeled
documents using EM. Machine Learning,
39(2/3):103–134, 2000.
[18] G. Salton. Automatic Text Processing: The
Transformation Analysis and Retrieval of Information
by Computer. Addison-Wesley, 1989.
[19] H. Schütze, D. Hull, and J. Pedersen. A comparison of
classifiers and document representations for the
routing problem. In Proceedings of SIGIR-95, 18th
ACM International Conference on Research and
Development in Information Retrieval, pages 229–237,
1995.
[20] F. Sebastiani. Machine learning in automated text
categorization. ACM Computing Surveys, 34(1):1–47,
2002.
[21] B. Shahshahani and D. Landgrebe. The effect of
unlabeled samples in reducing the small sample size
problem and mitigating the Hughes Phenomenon.
IEEE Transactions on Geoscience and Remote
Sensing, 32(5):1087–1095, 1994.
[22] S. Shankar and G. Karypis. Weight adjustment
schemes for a centroid based classifier, 2000.
Computer Science Technical Report TR00-035,
Department of Computer Science, University of
Minnesota, Minneapolis, Minnesota.
[23] E. Wiener, J. Pedersen, and A. Weigend. A neural
network approach to topic spotting. In Proceedings of
SDAIR-95, 4th Annual Symposium on Document
Analysis and Information Retrieval, pages 317–332,
1995.