Task 3
1 Introduction
Since the advent of digital documents, automatic text categorization has been a significant application and
area of study. Because we often need to examine vast quantities of text, text categorization is crucial.
Topic-based and genre-based text categorization are two of the specialized areas that fall under the umbrella
of text classification. Thematic (topic-based) text classification organizes texts into collections according
to their subject matter. Texts are also written in many genres, such as scientific papers, news stories, movie
reviews, and advertisements. A text's genre may be inferred from its style of composition, level of editing,
choice of language, and intended audience. Previous research has shown that genres can be classified
independently of topics. Most of the data used for genre classification comes from online sources such as
message boards, weblogs, magazines, and television programs. Because of the diversity of sources involved,
presentation and word choice may vary widely even within a single genre. In short, the data are heterogeneous.
Intuitively, text classification is the task of assigning a document to one of several predefined categories.
More formally, if $D$ is the whole collection of documents and $\{c_1, c_2, \ldots, c_n\}$ is the set of
categories, text classification assigns a category $c_j$ to each document in $D$. This study considers only
hard categorization (where each document is assigned to a single category). In addition, no methods are covered
that account for factors beyond the text itself, such as the hierarchy of the text or when it was published.
The core of this research is approaches that extract value from documents solely on the basis of their contents.
Sebastiani gives an excellent, concise overview of the area of text classification. We therefore provide a brief
review of text categorization and point to publications and more recent investigations that were not included
in Sebastiani's work. Figure 1 provides a graphical depiction of the text categorization process.
For every supervised machine learning task, the first step is to collect data. A peculiarity of the text
classification problem is that, without much effort, hundreds or thousands of features (unique words or phrases)
can be formed. This is why there are strategies for dealing with high dimensionality. One approach is to choose
a subset of the features (feature selection), while another is to compute new features as functions of the
existing ones (feature transformation). Sections 3 and 4 of this research analyze each approach in turn. Once
this is done, a machine learning algorithm can be applied. Algorithms such as Support Vector Machines are widely
adopted because they have been shown to perform well in natural language processing tasks. Section 5 takes a
high-level look at how learning algorithms have been developed and applied to text classification in recent
years. The effectiveness of learning algorithms on a classification task may be measured in a variety of ways;
Section 6 covers most of these approaches. The last section briefly discusses some of the questions that remain.
The Project is funded in equal parts by the European Social Fund and the Public Resources.
3 Feature Selection
By excluding features that are deemed irrelevant for the classification, feature-selection algorithms
attempt to reduce the dimensionality [6] of a given dataset. Text classification algorithms (particularly those
that do not scale well with the size of the feature set) benefit from the smaller search space and reduced
storage requirements. The goal is to reduce the number of dimensions in the data in order to improve
classification accuracy. As an added benefit, feature selection often reduces overfitting, the phenomenon in
which a classifier is tuned not only to the constitutive features of the categories but also to the contingent
properties of the training data, and so leads to improved generalization. Methods for selecting feature subsets
in text classification often make use of single-word evaluation functions. Document frequency, term frequency,
mutual information, Gini index, odds ratio, the $\chi^2$ statistic, and term strength are some of the metrics
that may be used to evaluate and rank individual words (Best Individual Features). Each of these approaches
scores features individually and ranks them by score, and the highest-scoring features are selected. The most
common metrics are shown in Table 1. Table 2 explains the quantities shown in Table 1. Sequential forward
selection (SFS) methods, in contrast to Best Individual Features (BIF) methods, first select the best single
word and then add words one at a time until the number of selected words reaches the target k. SFS methods do
not necessarily produce the optimal subset of words, but unlike BIF methods they take dependencies between
words into account. For this reason, SFS should perform better than BIF. However, SFS methods are seldom used
in text classification because of the heavy computational overhead that the enormous vocabulary size imposes.
As a benchmark, Forman compares 12 metrics on widely used datasets. Forman found that Bi-Normal Separation
(BNS) performed best with between 500 and 1000 features, whereas Information Gain performed best with just
20-50 features. Accuracy 2 and Term Frequency performed equally well. Chi-square consistently underperformed
Information Gain. Since no single metric consistently outperforms the others, researchers sometimes combine
two metrics to boost their effectiveness. The SFS method used by Novovicova et al. considered not only the
mutual information between a class and a word, but also the mutual information between a class and two words.
The results improved somewhat. Although machine learning-based text classification performs well, it is
inefficient because of the large size of the training corpus. Feature selection alone is therefore often not
enough; instance selection is frequently required as well.
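As a minimal sketch of Best Individual Features ranking, the following Python scores each word with information gain against one target category and keeps the k highest-scoring words. The toy documents, the labels, and the function names `information_gain` and `best_individual_features` are illustrative assumptions and do not come from any of the studies cited here; breaking ties alphabetically is also an assumption of the sketch.

```python
import math
from collections import Counter


def information_gain(docs, labels, term, target):
    """Information gain (in bits) between presence of `term` and membership in `target`.

    `docs` is a list of token sets, `labels` the category of each document.
    """
    n = len(docs)
    joint = Counter()
    for tokens, label in zip(docs, labels):
        joint[(term in tokens, label == target)] += 1

    gain = 0.0
    for has_term in (True, False):
        for in_class in (True, False):
            p_tc = joint[(has_term, in_class)] / n
            if p_tc == 0.0:
                continue  # the term 0 * log 0 is taken as 0
            p_t = (joint[(has_term, True)] + joint[(has_term, False)]) / n
            p_c = (joint[(True, in_class)] + joint[(False, in_class)]) / n
            gain += p_tc * math.log2(p_tc / (p_t * p_c))
    return gain


def best_individual_features(docs, labels, target, k):
    """Score every word on its own and return the k best (BIF selection)."""
    vocabulary = sorted({token for tokens in docs for token in tokens})
    scored = [(information_gain(docs, labels, term, target), term)
              for term in vocabulary]
    # Highest score first; alphabetical order breaks ties (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Hypothetical toy corpus: two sport documents, two politics documents.
docs = [
    {"goal", "match", "team"},
    {"team", "score", "goal"},
    {"election", "vote", "team"},
    {"vote", "party", "election"},
]
labels = ["sport", "sport", "politics", "politics"]

top_terms = best_individual_features(docs, labels, "sport", k=2)
print(top_terms)
```

Because every word is scored in isolation, redundant words such as "election" and "vote" can both rank highly; this is the dependency between words that SFS-style methods are meant to account for.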
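Table 1. Feature selection metrics. An overbar denotes the complement: $\bar{c}$ means "not in category c", and
$\bar{t}$ or $\bar{f}$ means "term absent".

Gain Ratio:
$GR(t_k, c_i) = \dfrac{\sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t \in \{t_k, \bar{t}_k\}} P(t, c) \log \frac{P(t, c)}{P(t)\,P(c)}}{-\sum_{c \in \{c_i, \bar{c}_i\}} P(c) \log P(c)}$

Conditional Mutual Information:
$CMI(C \mid S) = H(C) - H(C \mid S_1, S_2, \ldots, S_n)$

Document Frequency:
$DF(t_k) = P(t_k)$

Term Frequency:
$tf(f_i, d_j) = \dfrac{freq_{ij}}{\max_k freq_{kj}}$

Inverse Document Frequency:
$idf_i = \log \dfrac{|D|}{\#(f_i)}$

$\chi^2$ statistic:
$\chi^2(f_i, c_j) = \dfrac{|D| \, [\#(c_j, f_i)\,\#(\bar{c}_j, \bar{f}_i) - \#(c_j, \bar{f}_i)\,\#(\bar{c}_j, f_i)]^2}{[\#(c_j, f_i) + \#(c_j, \bar{f}_i)]\,[\#(\bar{c}_j, f_i) + \#(\bar{c}_j, \bar{f}_i)]\,[\#(c_j, f_i) + \#(\bar{c}_j, f_i)]\,[\#(c_j, \bar{f}_i) + \#(\bar{c}_j, \bar{f}_i)]}$

Term Strength:
$s(t) = P(t \in y \mid t \in x)$

Odds Ratio:
$OddsRatio(f_i, c_j) = \log \dfrac{P(f_i \mid c_j)\,(1 - P(f_i \mid \bar{c}_j))}{(1 - P(f_i \mid c_j))\,P(f_i \mid \bar{c}_j)}$

Logarithmic Probability Ratio:
$LogProbRatio(w) = \log \dfrac{P(w \mid c)}{P(w \mid \bar{c})}$

Pointwise Mutual Information:
$I(x, y) = \log \dfrac{P(x, y)}{P(x)\,P(y)}$

Category Relevance Factor:
$CRF(f_i, c_j) = \log \dfrac{\#(f_i, c_j) / \#c_j}{\#(f_i, \bar{c}_j) / \#\bar{c}_j}$

Odds Numerator:
$OddsNum(w, c) = P(w \mid c)\,(1 - P(w \mid \bar{c}))$

Probability Ratio:
$PrR(w \mid c) = \dfrac{P(w \mid c)}{P(w \mid \bar{c})}$

Bi-Normal Separation:
$F^{-1}(P(w \mid c)) - F^{-1}(P(w \mid \bar{c}))$

Pow:
$Pow = (1 - P(w \mid \bar{c}))^k - (1 - P(w \mid c))^k$

Relevance:
$M(w, c)$, a logarithmic score over $DF_n(w, c)$ and $db$ (the full expression is not legible in the source)

A final row contains the term $P(c_i)\,(1 - P(c_i \mid w))$; its name and the rest of its expression are not
legible in the source.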
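Several metrics of the kind listed in Table 1 can be computed from the four cells of a term/category contingency table. The sketch below is a hedged illustration in Python: the argument names (`both`, `term_only`, `cat_only`, `neither`) and the additive smoothing constant `eps` in `odds_ratio` are assumptions introduced here so the logarithm stays finite when a count is zero; they are not taken from the cited studies.

```python
import math


def chi_square(both, term_only, cat_only, neither):
    """Chi-square statistic for one term and one category.

    both      = documents in the category that contain the term
    term_only = documents outside the category that contain the term
    cat_only  = documents in the category that lack the term
    neither   = documents outside the category that lack the term
    """
    n = both + term_only + cat_only + neither
    denominator = ((both + term_only) * (cat_only + neither)
                   * (both + cat_only) * (term_only + neither))
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence
    return n * (both * neither - term_only * cat_only) ** 2 / denominator


def odds_ratio(both, term_only, cat_only, neither, eps=0.5):
    """Log odds ratio of the term inside versus outside the category.

    `eps` is additive smoothing (an assumption of this sketch).
    """
    p_in = (both + eps) / (both + cat_only + 2 * eps)
    p_out = (term_only + eps) / (term_only + neither + 2 * eps)
    return math.log(p_in * (1 - p_out) / ((1 - p_in) * p_out))


def pointwise_mi(both, term_only, cat_only, neither):
    """Pointwise mutual information between term presence and category."""
    n = both + term_only + cat_only + neither
    p_joint = both / n
    p_term = (both + term_only) / n
    p_cat = (both + cat_only) / n
    if p_joint == 0.0:
        return float("-inf")  # the term never occurs in the category
    return math.log(p_joint / (p_term * p_cat))


# Hypothetical counts: the term appears in 40 of 50 in-category documents
# and in 10 of 50 out-of-category documents.
print(chi_square(40, 10, 10, 40))
print(odds_ratio(40, 10, 10, 40))
print(pointwise_mi(40, 10, 10, 40))
```

All three scores are zero (up to rounding) when the term is distributed identically inside and outside the category, which is a quick sanity check for any implementation.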
4 Feature Transformation
While feature selection and feature transformation share the goal of reducing the number of features in a dataset,
the two techniques take quite different approaches. Instead of relying on word weights to filter out unimportant
terms, feature transformation compacts the vocabulary on the basis of feature co-occurrences. Principal component
analysis (PCA) is widely acknowledged as one of the most established methods for transforming features. The
objective is to learn a discriminative transformation matrix that maps the high-dimensional feature space into a
reduced feature space without compromising classification accuracy. The transformation is built from the
eigenvectors of the data covariance matrix. In PCA applied to text, the covariance matrix of the data corresponds
to the document-term matrix multiplied by its transpose. The entries of the covariance matrix indicate how often
words occur together in the texts. For this reason, the eigenvectors associated with the largest eigenvalues of
the matrix can be described as "themes" or "semantic concepts". A transformation matrix constructed from these
eigenvectors projects a document onto the "latent semantic concepts" that the text was meant to communicate, and
the magnitudes of these projections constitute the new low-dimensional representation. The eigenanalysis can be
computed more quickly with a sparse variant of the singular value decomposition of the document-term matrix. In
the information retrieval field this approach is known as Latent Semantic Indexing (LSI). The approach performs
well even though the resulting features are not intuitively interpretable by a human. Qiang et al. performed
experiments using k-NN LSI, a combination of the conventional k-NN technique on top of LSI, together with
Semi-Discrete Matrix Decomposition, a new matrix decomposition algorithm. The experiments showed that text
categorization was more accurate and used less processing time in this setting. Another study examines the
performance of several text categorization methods on two separate data sets and comments thoroughly on the
findings. It evaluates the Vector and LSI methods, a classifier based on Support Vector Machines (SVM), and
k-Nearest Neighbor variations of the Vector and LSI models. The results indicate that SVM and k-NN LSI perform
better than the other methods.
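The projection onto a latent concept can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a production LSI implementation: it forms the term co-occurrence matrix as the transpose of the document-term matrix times the matrix itself, does not mean-center the data, extracts only the single dominant eigenvector by power iteration, and uses a hypothetical four-document corpus. A real system would normally use a sparse singular value decomposition from a numerical library instead.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cooccurrence(doc_term):
    """Term-by-term co-occurrence matrix X^T X (rows of X are documents)."""
    n_terms = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        vector = matvec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in vector))
        vector = [x / norm for x in vector]
    return vector


# Hypothetical document-term counts; columns are: goal, team, vote, party.
doc_term = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 1],
]

concept = dominant_eigenvector(cooccurrence(doc_term))
# One number per document: its coordinate along the strongest latent concept.
projections = matvec(doc_term, concept)
print([round(x, 3) for x in concept])
print([round(x, 3) for x in projections])
```

In this toy corpus the dominant concept loads on "goal" and "team", so the first two documents receive large coordinates and the last two receive coordinates near zero. Further concepts could be obtained by deflating the matrix and repeating the iteration.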
Due to its simplicity and effectiveness, Naive Bayes is commonly used in text classification applications and
experiments. However, its performance is often limited by its poor modelling of text. Schneider examined these
problems and showed that they can be solved with some simple corrections. Klopotek and Woch reported the
empirical evaluation of a Bayesian multinet classifier built with a new method for learning very large
tree-like Bayesian networks. The study shows that tree-like Bayesian networks are accurate enough to cope with
a text classification task containing 100,000 variables. Support vector machines (SVM) achieve high precision
but poor recall when employed for text classification. One way to tune an SVM for higher recall is to adjust
its decision threshold. Shanahan and Roma described an automated method of adjusting thresholds to improve the
performance of generic SVMs. Lim proposed a method that improves the performance of kNN-based text
classification by using well-estimated parameters, whereas Johnson et al. described a fast decision tree
construction algorithm that takes advantage of the sparsity of text data. Several variants of the kNN method,
differing in decision function, k value, and feature set, were constructed and compared in order to find
suitable parameters. The corner classification (CC) network is a feed-forward neural network that can be used
for fast document categorization; a training algorithm named TextCC has been proposed for it. The difficulty
of text classification tasks varies naturally. As the number of distinct classes grows, so does the difficulty,
and with it the size of the training set required. When a text must be classified into many categories, some
categories will be harder to learn than others. Causes might include (1) a lack of sufficient training data, or
(2) a lack of strongly predictive features for the relevant class. In text categorization, a base classifier
can be trained per category by treating all documents in the training sample that belong to that category as
positive training data and all documents belonging to any other category as negative training data. When there
is a large set of categories but only a small number of documents in each, an "imbalanced data problem" occurs:
the negative training documents overwhelm the positive ones. This is a particular challenge for classifiers,
which may achieve high accuracy by simply labelling all cases as negative. To address this problem,
cost-sensitive learning is needed. The scalability of a number of text classifiers has also been analyzed.
Vinciarelli presents the results of text categorization experiments performed on noisy text. Noisy texts are
those obtained by extraction from media other than digital text (e.g. transcriptions of speech recordings
produced by a recognition system). The performance of the classification system is measured on both clean and
noisy (Word Error Rate of 10% and 50%) versions of the same texts. The noisy texts are produced through
Handwriting Recognition and simulated Optical Character Recognition. The results show that the drop in
performance is acceptable. Other authors have argued that text classification should also be parallelized and
distributed. This strategy has the potential to improve classifiers' accuracy and efficiency. Recently, in the
area of Machine Learning, the concept of combining classifiers has been proposed as a new direction for
improving the performance of individual classifiers. Several strategies have been presented for creating an
ensemble of classifiers. The following
techniques are used to generate ensembles of classifiers: i) using different subsets of the training data with
a single learning method; ii) using different training parameters with a single training method (for instance,
giving each neural network in an ensemble its own set of initial weights); iii) using different learning methods.
A number of studies have found that combining several classifiers for text categorization increases
classification accuracy [1, 29].
When the combined technique's performance is compared with that of the best individual classifier, the combined
method performs better [2]. The "boosting"-based learners described by Nardiello et al. [21] have also shown
promising results in automated text classification tasks.
6 Evaluation
There are various methods to determine effectiveness; precision, recall, and accuracy are the most often
employed. To compute these, the first step is to determine whether the categorization of a document was a true
positive (TP), false positive (FP), true negative (TN), or false negative (FN) (see Table 3).
Precision $\pi_i$ is the probability that a random document $d_x$ classified under $c_i$ does belong to that
category. It measures the classifier's ability to place a document under the correct category, as opposed to
all documents placed in that category, whether correctly or not:
$\pi_i = \dfrac{TP_i}{TP_i + FP_i}$
Recall $\rho_i$ is the probability that a document $d_x$ that ought to be classified under $c_i$ is in fact
assigned to it:
$\rho_i = \dfrac{TP_i}{TP_i + FN_i}$
Accuracy is also commonly used to evaluate classification methods:
$A_i = \dfrac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$
In contrast to precision and recall, however, accuracy is much less sensitive to variations in the number of
correct decisions.
In classification problems it is common that only a few instances are of interest (positive). Since the
negative class tends to dominate in information retrieval problems, the accuracy score is often misleading.
For skewed datasets accuracy is not a helpful indicator, so recall and precision are used instead to judge the
performance of classification systems. It is also common practice to combine precision and recall to give a
fuller picture of the classifier's performance. For this purpose, the following formula may be used:
$F_\beta = \dfrac{(\beta^2 + 1)\,\pi\,\rho}{\beta^2\,\pi + \rho}$
where $\pi$ and $\rho$ denote precision and recall, respectively, and $\beta$ is a positive parameter that
represents the goal of the evaluation. When precision is given more weight than recall, $\beta$ approaches
zero. Conversely, $\beta$ tends to infinity when recall is much more important than precision. Setting $\beta$
to 1 gives precision and recall the same importance.
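The following Python sketch illustrates why accuracy can mislead on a skewed collection. It assumes the usual definitions of precision, recall, and accuracy in terms of TP, FP, TN, and FN, and the usual F-measure with a positive weighting parameter beta; the 100-document collection and the two sets of predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one category."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f_beta(p, r, beta=1.0):
    """(beta^2 + 1) * p * r / (beta^2 * p + r); beta = 1 weighs both equally."""
    denominator = beta * beta * p + r
    return (beta * beta + 1) * p * r / denominator if denominator else 0.0


# Hypothetical skewed collection: 5 positive documents out of 100.
y_true = ["pos"] * 5 + ["neg"] * 95
lazy = ["neg"] * 100                                        # always says "neg"
model = ["pos"] * 4 + ["neg"] + ["pos"] * 6 + ["neg"] * 89  # 4 hits, 6 false alarms

for name, y_pred in (("lazy", lazy), ("model", model)):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, "pos")
    p, r = precision(tp, fp), recall(tp, fn)
    print(name, accuracy(tp, fp, tn, fn), p, r, f_beta(p, r))
```

The classifier that labels everything negative reaches 95% accuracy with zero recall, while the second classifier has slightly lower accuracy (93%) but finds four of the five positive documents, which only precision, recall, and the F-measure reveal.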
To facilitate research, Reuters, Ltd. has made available a corpus of more than 800,000 manually annotated
newswire articles (referred to as Reuters Corpus Volume I, or RCV1), which can be used to compare research
methods. Although studies have suggested for years that the training corpus could influence categorization
performance, the underlying causes have not been well investigated. One line of work aims to build
high-quality training corpora for better classification performance by first studying the properties of
training corpora and then proposing an algorithm for constructing them semi-automatically.
7 Conclusion
Since a plethora of textual information is available online in websites, emails, forum postings, and other
digital documents, the problem of classifying this information has become an important area of research in
Artificial Intelligence.
It has been shown that, even for a given classification approach, classifiers trained on different text corpora
may differ in performance, and in certain cases these differences are considerable. Given these results, it
seems reasonable to assume that (a) the quality of the training corpus affects the performance of the
classifier it yields, and (b) better-quality training corpora may yield better-performing classifiers. How to
exploit training text corpora to improve classifier accuracy has not been explored much in the literature.
The question of which feature selection methods are both computationally efficient and high-performing across
classifiers and collections remains unanswered. Is there a method that can successfully cope with the many text
collections available today? Is it possible to improve performance by combining methods that each show promise
when used alone? Other open directions are to replace the representation of documents in a space of individual
words with one of concepts, to investigate whether concept-level feature selection can enhance text
categorization, and to make dimensionality reduction more efficient for massive datasets. Additional problems
in text mining include polysemy and synonymy. Polysemy refers to a word's capacity to be understood in several
ways; learning the context in which a word is used is essential for word sense disambiguation. Synonymy means
that different words may have the same meaning.
[1] Bao Y., Ishii N., “Combining Multiple kNN Classifiers for Text Categorization by Reducts”, LNCS
2534, 2002, pp. 340-347.
[2] Bi Y., Bell D., Wang H., Guo G., Greer K., “Combining Multiple Classifiers Using Dempster's Rule of
Combination for Text Categorization”, MDAI, 2004, pp. 127-138.
[3] Brank J., Grobelnik M., Milic-Frayling N., Mladenic D., “Interaction of Feature Selection Methods and
Linear Classification Models”, Proc. of the 19th International Conference on Machine Learning, Australia,
[4] Cardoso-Cachopo A., Oliveira A. L., “An Empirical Comparison of Text Categorization Methods”,
Lecture Notes in Computer Science, Volume 2857, Jan 2003, pp. 183-196.
[5] Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P., “SMOTE: Synthetic Minority
Over-sampling Technique”, Journal of AI Research, 16, 2002, pp. 321-357.
[6] Forman G., “An Experimental Study of Feature Selection Metrics for Text Categorization”, Journal of
Machine Learning Research, 3, 2003, pp. 1289-1305.
[7] Fragoudis D., Meretakis D., Likothanassis S., “Integrating Feature and Instance Selection for Text
Classification”, SIGKDD ’02, July 23-26, 2002, Edmonton, Alberta, Canada.
[8] Guan J., Zhou S., “Pruning Training Corpus to Speedup Text Classification”, DEXA 2002, pp. 831-840.
[9] Johnson D. E., Oles F. J., Zhang T., Goetz T., “A decision-tree-based symbolic rule induction system for
text categorization”, IBM Systems Journal, September 2002.
[10] Han X., Zu G., Ohyama W., Wakabayashi T., Kimura F., “Accuracy Improvement of Automatic Text
Classification Based on Feature Transformation and Multi-classifier Combination”, LNCS, Volume 3309,
Jan 2004, pp. 463-468.
[11] Ke H., Shaoping M., “Text categorization based on Concept indexing and principal component
analysis”, Proc. TENCON 2002 Conference on Computers, Communications, Control and Power
Engineering, 2002, pp. 51-56.
[12] Kehagias A., Petridis V., Kaburlasos V., Fragkou P., “A Comparison of Word- and Sense-Based Text
Categorization Using Several Classification Algorithms”, JIIS, Volume 21, Issue 3, 2003, pp. 227-247.
[13] Kessler B., Nunberg G., Schutze H.