Task 3

Assignment 1
Task 3
Assignment 1 task 3 :Text Classiﬁcation Using Machine Learning Techniques

Abstract:
In order to manage and handle the ever-growing volume of digital documents, an increasing number of
individuals have come to see automated text categorization as essential. Information extraction, summarization,
text retrieval, and question answering all rely heavily on accurate text categorization. This study demonstrates
the feasibility of using automation for precise text categorization. The books listed provide solutions to the
researcher's most urgent theoretical issues and suggestions for future lines of inquiry. You should be
comfortable with concepts like text mining, learning algorithms, feature selection, and text representation...
1 Introduction
Since the advent of digital documents, automatic text categorization has been a significant application and
area of study. Given that we often need to examine a vast quantity of text material, text categorization is
crucial. Content summarization and genre-based text categorization are only two examples of the many
specialized areas that fall under the umbrella of text classification. Texts may be organized into distinct
collections using a process called thematic text classification. Scientific papers, news stories, movie reviews,
and advertising are just a few of the many different types of written content that may be produced. A text's
genre may be inferred from its style of composition, level of editing, choice of language, and intended
audience. Previous research has shown that classifying genres independently of classifying topics is possible.
Most of the information used to classify media into genres comes from online sources such as message boards,
weblogs, magazines, and television programs. Because of the diversity of voices involved, even within a single
genre, there may be a vast variety in presentation and word choice. To put it bluntly, information is dispersed.
Intuitively An article is classified when it is assigned to one of many predetermined groups. This study only
considers Hard Categorization research if D is the whole collection of documents for some scheme c, c,..., c.
(where each document is assigned to a single category). In addition, no methods are offered that account for
factors beyond the text itself, such as the hierarchy of the text or when it was published. The meat of this
research is a methodology for extracting value from papers only on the basis of their contents. Sebastiani does a
great job of providing a concise overview of the area of text classification. Therefore, we provide a concise
analysis of the text categorization and point to a few more publications and more current investigations that
were not included in Sebastiani's work. Figure 1 provides a graphical depiction of the text categorization
procedure...
Read Tokenize Text Stemming

Document
Vector Representation of Delete

Text Stopwords
Feature Selection and/or Feature Learning

Transformation algorithm
.
Fig. 1. Method for Categorizing Text
text categorization places documents into one of the sets of categories.
For every supervised machine learning assignment, the first step is to collect data. The Project is funded in equal parts by the
European Social Fund and the Public Resources, highlighting a peculiarity of the text summarization challenge: the possibility,
without much effort, of forming hundreds or thousands of features (unique phrases or phrases). This is why there are strategies for
dealing with problems of high dimensionality. One approach would be to choose a subset of the characteristics, while another
would be to compute new features as functions of the existing ones. Sections 3 and 4 of this research include in-depth analyses of
each subcomponent individually. When finished, a Machine Learning technique may be used. Algorithms like Support Vector
Machines are being adopted because they have been shown to be better in Natural Language processing jobs. In the fifth part, we
take a high-level look at how learning processes have been developed and used to Text Summarization in recent years.
Effectiveness of learning algorithms for Classification Task may be measured in a variety of ways. Section 6 covers all but one of
these approaches. A brief look at some of the questions that remain is provided in the last paragraph..
2 Vector space document representations

Words strung together together up a document. That's why a collection of words may stand in for an essay. The
words or phrases that make up a dataset used for training are known as the vocabulary, or feature set. A binary
vector may be used to represent each document, with 1 representing the presence of the feature-word on the page
and 0 representing its absence. Putting this into practice, you would save a document in Certain have cast doubt on
the utility of some highly aggressive stemming approaches like Porter's, but most professionals in the area of Text
Classification agree that stemming may improve classifier performance. Compared to other design considerations,
the representation of the feature's worth is rather unimportant. Boolean AND/OR operator indicating the term's
existence or absence in the text is generally enough. The ratio of the length of the document to the number of times
the phrase occurs may be used, or it could just be a basic count. When there is a wide variation in document length,
it may be useful to normalize the data. Also, since shorter works have a lower likelihood of repetition, Boolean
word indicators are almost as valuable as counts. This aids both the search capabilities and the learning possibilities
of the induction application. Alternately, one might try to find the best segment size and attribute criteria. As a
general rule, text classification algorithms in the literature treat texts like keywords. Indirect modification is a
strategy that has not been investigated to its full potential (also known as meanings). To compare the efficacy of
sense-based classifiers with those based on words, Kehagias et al. utilized a variety of techniques. Our study only
looked at a selection of texts similar to those in the tagged Brown Corpus to ensure semantic consistency. Extensive
research has shown that increasing the number of senses involved does not significantly enhance classification
accuracy..
RV parking, with V denoting vehicle length.
language use Vocabulary V.

All of the words in a book cannot be used to train the classifier [19]. Useless, useless, auxiliary, and article are all
examples of words that fit within this category. This class of words is sometimes referred to as "stopwords." There
is a standard procedure for filtering out duplicate words from many databases. This use stems from the phrases'
frequent occurrences in historical documents.
Another common preprocessing technique is removing stems from words. We can quickly reduce the size of the
first feature set by filtering out words with common spelling errors or stems. Words that share a stem are removed
by a stemmer (the algorithm responsible for stemming) and the stalk or most frequent of the remaining words is
kept as a feature. One word, "train," may mean "trainer," "training," and "trains."
3 Feature Selection
By excluding characteristics that are deemed unnecessary for the classification, feature-selection algorithms
attempt to reduce the dimensionality [6] of a given dataset. Text classification algorithms (particularly those
that don't scale well with the size of the feature set) benefit from a smaller search space and less data storage
space when using this method. The goal of this effort is to minimize the number of dimensions in the data in
order to enhance the accuracy of categorization. As an added bonus, feature selection often reduces training
error, the phenomenon in which a classification is fine-tuned not only to the physical features of the groups but
also to the conditional properties of the learning method, leading to improved generalization. Methods for
selecting feature subsets in the context of text classification issues often make use of single-word evaluation
functions. Document frequency, phrase occurrence, mutual information, gini index, coefficient of
determination, 2 statistic, and phrase strength are only some of the metrics that may be used to evaluate and
rank individual words (Best Individual Features). Each of these approaches to rating features ultimately results
in a ranking of those features based on the ratings given to them individually, with the most popular features
being chosen being those that have the average rating. Its most common dimensions are shown in Table 1.
Table 2 gives explanations for the data shown in Table 1. Sequential forward variety (SFS) strategies, in
contrast to Best Adult Features (BIF) projects, identify the best individual portions and then build on them
incrementally until the number of rehearsed lines reaches the target k. SFS processes, in contrast to BIF
methods, analyze relationships between words and may not necessarily provide the best subset of words. In this
case, SFS is preferable than BIF. Participants are seldom used in text classification because of the heavy
computational overhead associated with SFS's enormous vocabulary size. As an example, Forman evaluates 12
distinct measures using widely-used datasets in an attempt to establish a standard. Forman found the greatest
performance from BNS with between 500 and 1000 features, whereas the best performance from Information
Gain was shown with just 20-50 features. Performance-wise, there was zero difference between Accuracy 2 and
Term Frequency. Compared to Information Gain, chi-square repeatedly underperformed. Since no one statistic
consistently outperforms the others, researchers sometimes combine two metrics to boost their effectiveness.
The SFS used by Novovicova et al. considered not only the degree of similarity between a class and a word, but
also the degree of mutual information between a class and two words. The outcomes improved somewhat. Even
though machine learning-based text classification performs well, it is inefficient due to the large size of the
training corpus. This means that picking the right features isn't enough; picking the right instances is often
required as well.
Metrics mathematical forms

m m m
Information
Gain IG t    P ci log P ci   P t   P ci | t log P ci | t   P  t
  P ci | t log P ci | t 

i1 i1 i1
P t, c
  P t, clog
Gain Ratio
cci ,ci  ttk ,tk  P t  P  c 
GR t , c 
k i
  P  c  log P  c 
cci ,ci 
Conditional
CMI C | S   H C   H C | S1, S2 ,..., Sn 
Mutual
Information
Document DF tk   P tk 
Frequency
freqij
Term tf  f ,d  
Frequency
i j
max freq
kj
k
Inversed D
idf  log
Document
Frequency i
# f 
i
2 f ,c
D  # c , f   # c , f   # c , f   # c
Chi-square

2
 ,f
i j
i j i j i j i j
 #  c , f   #  c , f    #  c , f   #  c , f    #  c ,
i j i j i j i j i
f j  #  c , f    #  c ,
i j i
Term s t   P t  y | t  x
Metrics mathematical forms

Strength
Weighted WOddsRatio  w   P  w   OddsRatio  w 
Ratio
OddsRatio
P  fi | c j  1 P  f | c 
i j
OddsRatio  fi , c j   log
1 P  f | c P f | c
i j    ij
Logarithmic P  w | c
LogProbRatio  w   log
Probability
Ratio
P  w | c
Pointwise P  x, y 
I  x, y   log
Mutual
Information
P  x P  y 
Category #  f i  /#  c j 
Relevance
Factor CRF  fi , c j   log
(CRF)
#  f , c  /#  c 
i j j
Odds
OddsNum w, c   P  w | c   1 P  w | c
Numerator
Probability P  w | c
Pr R  w | c 
Ratio
P  w | c
Bi-Normal
F 1  P  w | c    F 1  P  w | c
Separation
k k
Pow 1 P  w | c    1 P  w | c  
Topic
Relevance DFn  w, c  db
using
M DFn  w, c   log  log
Relative DFn w, db c

Word
Position
Topic
Relevance DF  w, c  db
M DFn  w, c   log  log
using
Document DF w, db c
Frequenc
y
Topic
Relevance
~
using
DF  w, c  db
Modified M DFn  w, c   log  log
Documen
~
t c
Frequency
DF w, db
Topic
TF  w, c  db
Relevance M TFn  w, c   log  log
using
TF w, db c
Term
Frequency
Weight of
P  ci | w 1
evidence
for Text P  ci   Weight  w  i P  ci   P  w 
log
P  c  1 P  c | w 
i i
Table 1. Feature Selection metrics

To speed things up, Guan and Wu proposed a training-corpus reduction based approach. Their results show that
using this method significantly reduces the size of the trained corpus while still achieving the same level of
classifier performance as when no training documents were omitted.
Fragoudis et al. successfully combined feature selection with example selection to improve the performance of
text summarization. There are two stages to their procedure. Their starting point is a careful selection of
characteristics that have a history of success in defining the target class. Publications lacking even a single such
property have a negative impact on the learning process as a whole. Part two of their process involves picking
characteristics from the first subset of the dataset that have a track record of wrongly predicting the opposite class
from the target class. Using the documents chosen in the first step as a test dataset, a new feature set is created by
combining the features picked on the first and second world wars..
4 Feature Transformation
While reducing the number of features in a dataset is a goal shared by Segmentation And feature Transformation,
these two techniques take quite different approaches to getting the job done. Instead of relying on word weights to
filter out unimportant terms, this technique streamlines the vocab on the grounds of feature concurrencies.
Principal component analysis is widely acknowledged as one of the most powerful methods for transforming
features. The objective is to find a way to translate the high-dimensional classification model into a reduce feature
space without compromising classification accuracy by developing a discriminative transformation matrix. The
foundation for the transformation is found in the corresponding eigenvectors. With principal component analysis,
the covariance matrix of the data is just the average of the homogeneous transformation matrix and its
transposition. The rows of the covariance matrix indicate the occurrences of words together in the texts. For this
reason, we may use the terms "themes" or "semantic concepts" to describe the unidirectional components of the
most significant eigenvalues of the matrix. A transform matrix constructed from these eigenvectors projects onto
the "latent semantic concepts" that the text was meant to communicate, and the magnitudes of these projections
constitute the new low-dimensional representation. A sparse variant of principal component analysis of the
manuscript matrix may be utilised to do the eigenanalysis more quickly. Latent Semantic Indexing (LSI) is a name
created by the information retrieval field to define this approach. This tactic works effectively despite being
invisible to the human brain. Qiang et al. performed experiments using k- NN LSI, an innovative combination of
the conventional k-NN technique on top of LSI, and Semi-Discrete Matrix Decomposition, a revolutionary matrix
decomposition methodology. The results of the experiments showed that text categorization worked better and
used less processing power in this setting. In this article, the authors examine the performance of several text
categorization methods on two separate data sets and offer thorough commentary on the findings. Their focus is on
classifiers based on Support Vector Machines (SVM), such as Vector and LSI, and on k-Nearest Neighbor
variations of these techniques. Their results reveal that SVMs like k-NN LSI have superior statistical performance
compared to the other methods...
5 Machine learning algorithms

After features have been selected and transformed, expressing the document in a form that can be used by an
Optimization technique is a breeze. Various text classifiers based on machine learning, probabilistic models,
etc., have been proposed in the past. Several approaches exist that adopt a somewhat different approach, such as
decision trees, apse, rule induction, neural, nearest neighbours, and, more recently, the support vector machines.
Because current automatic classifiers are not faultless there is still opportunity for advancement, automated text
classification remains an important field of research.
Due to its simplicity and effectiveness, Naive Bayes is commonly used in text classification applications and
experiments. However, its efficacy is often hindered by its limited text modelling skills. Schneider explained
the problems and showed how to solve them easily. The empirical evaluation results of a Bernoulli multinet
classifier created utilising a unique technique to learning enormously big tree-like Proposed approaches were
reported to Klopotek and Woch. The findings of the study show that tree-like Probabilistic models are
incredibly precise enough to cope with a recognition task containing 100,000 variables. Back propagation
machines (SVM) are quite accurate but do have a poor recall rate when employed for text classification. One
technique to optimise a machine learning approach for higher recall is to adjust the machine's threshold.
Shanahan and Roma provided an automated method of adjusting thresholds to improve the efficiency of generic
SVM. Lim produced a strategy that boosts the performance of multilayer perceptron classification task by
makes use of well-estimated features, whereas Johnson et al. released a quick decision tree construction
approach that takes advantage of the sparse of text data. In order to find the optimal conditional probability, k
value, overall feature set, many variants of the kNN method were constructed and compared. A neural network
- based network called a corner characterization (CC) system might be utilised for fast document
categorization. We provide TextCC, a novel training approach. The difficulty of text classification tasks may
vary naturally. The amount of both the training set needed for a given wide range of courses increases
exponentially with the task's difficulty. When tasked with classifying a text into many categories, some of them
will be more difficult to categorise than others. Causes might include (1) a lack of suitable training data, or (2) a
lack of strong predictive qualities for the relevant class. The training of a base classification per category is
possible in text categorization by considering all papers in the training sample that belong to such a type as skill
enhancement data and all articles belonging to all other categories as – anti training data. When there is a large
set of categories but only a small number of documents allotted to each category, a "imbalanced data problem"
occurs, causing an overwhelming quantity of irrelevant training documents. Even more so, this presents a
challenge for classifiers, which may achieve high precision by simply labelling all cases as negative. To address
this problem, we need to learn more efficiently at lower costs. This article explores the sustainability of many
different text classifiers. Vinciarelli presents the results of his text categorization experiments done on noisy
text. Unclean texts are those that have been erroneously obtained from places else than written text (e.g.
transcriptions of speech recordings extracted with a recognition system). The performance of the classification
system is measured using both clean and chaotic (Word Error Rate of 10% and 50%) versions of the same texts.
The garbled texts are the result of a hybrid approach using Handwriting Recognition and fake Optical Character
Recognition. Here, the results show that a little drop in performance is acceptable. Other authors have argued
that text classification should be parallelized and decentralised as well. This strategy has the potential to
improve classifiers' accuracy and productivity. Recently, in the area of Machine Learning, the concept of
merging learners has been proposed as a novel technique to improve the efficiency of individual classifiers.
Several strategies have been presented for creating a group of classifiers to work together. The following
techniques are used to generate ensembles of classifiers: I employing several learning approaches to the same
data; ii) using different training parameters for the same training approach (for instance, using under ci, or what
would be judged the proper category) It's a metric for evaluating the precision with which a classifier can place
a document in a category, as contrast to the uniformity with which all documents in a given category are placed
there.:
TPi
 
i TPi  FPi
For each given document dx, recall I is the likelihood that it will be assigned to category (ci) if that is the best
fit.
TPi
 
i TPi  FNi
The effectiveness of a classification method is often evaluated according to how well it performs. Contrarily
to precision and recall, accuracy levels are significantly more sensitive to changes in the proportion of right
judgments made.
Each network during an ensemble has its own unique set of weights to start with; iii) The ensemble uses a variety
of learning techniques.
Ai 
TPi TNi
TPi TNi FPi FNi
Combining several classifiers for character recognition has been found to increase classification accuracy by a
number of studies [1, 29].
When comparing the combined technique's performance to that of the best individual classifier, the combined
method emerges victorious [2]. As an alternative to manual text classification, the "boosting"-based learners
described by Nardiello et al. [21] have shown promising results in automated text classification tasks.
6 Evaluation
There are various methods to determine
When organising texts, it's not unusual to come across a few instances that qualify as fascinating. Since
information retrieval problems tend to cluster in the negative class, the accuracy score is typically misleading.
For skewed datasets, accuracy is not a helpful indicator, so recall and precision are used rather to judge the
performance of classification systems. It is also common practise to combine correctness and recall to give a
fuller picture of the classifier's performance. For this reason, the following formula may be used to merge
them.:

precision, recall, and accuracy are more often employed, but "effectiveness" is also a valid synonym. So as to find
these,The first step in analysing a document's categorization is determining if that was a correct positive (TP), true
positive (FP), a negative result (TN), or a negative test (FN) (see Table 3).
TP determined as a document that belongs in a

certain category.
FP identified as a piece of writing that has been
mistakenly attributed to a certain category.
FN found to be a file that should be filed under a
certain heading but isn't yet tagged as such.
TN Not-categorized documents that really need to
be there.
Table 3. Classification of a document
A document's accuracy I is measured by its likelihood of being correctly labelled given a sample of documents.
around where P y R represent the ability to plan ahead and remember information, respectively. The evaluation's
goals may be satisfied with the help of a positive parameter, represented by. When precision is given more weight
than recall, its value drops close to zero. On the other hand, converges to infinity if remembering is so much more
important than anticipating. Setting this to 1 means that each presentation and recall is given the same importance.
To facilitate research, Reuters, Ltd. has made available a corpus of more than 800,000 manually annotated
newswire articles (referred to as Reuters Corpus Volume I, or RCV1). Research methods may be compared using
the information provided here. Despite years of studies suggesting that training corpus could influence
categorization performance, the underlying causes were not well investigated. The authors of this work seek to
present a method to keep building training corpuses for enhanced classification performance by studying the
characteristics of teaching corpora and proposing and algorithm for their moderately development..
7 Conclusion
Since there is a plethora of textual information available online as part of websites, emails, forum postings, and
certain other digital files, the problem of classifying this information has become an area of research in the science
of Artificial Intelligence.
It has been shown that even given a stated classification approach, the outputs of classifiers based on different
training text corpora might vary, and in certain cases, these disparities can be rather considerable. Given these
results, it seems reasonable to assume that (a) the quality of the training corpus has an effect on performance of
classifier it yields, and (b) better-quality training corpora may provide better-performing classifiers. Using text
corpora for training has not been explored much in the literature as a means of improving classifier accuracy.
The topic of "which feature selection algorithms are both physically efficient and high-performing across
classes and collections" remains unanswered. Is there any method that can successfully cope with the many text
collections available today? Is it possible to boost efficiency by combining methods that each show promise when
used alone? Consider replacing your current mental model of individual words in a subspaces with one of concepts
instead. Determine whether the approach of concept-level feature selection may enhance text categorization via
investigation. Boost the efficiency of dimensionality reduction for massive datasets. Additional problems in text
mining include the existence of polysemy and synonymy. Polysemy refers to a word's capacity to be understood in
several ways. Learning the context that a word is used is essential for the process of word sense disambiguation. In
this context, "synonymy" means that two sounds may mean the same thing.
References:
[1] Bao Y. and Ishii N., “Combining Multiple kNN Classifiers for Text Categorization by Reducts”, LNCS
2534, 2002, pp. 340-347
[2] Bi Y., Bell D., Wang H., Guo G., Greer K., ”Combining Multiple Classifiers Using Dempster's Rule of
Combination for Text Categorization”, MDAI, 2004, 127-138.
[3] Brank J., Grobelnik M., Milic-Frayling N., Mladenic D., “Interaction of Feature Selection Methods and
Linear Classification Models”, Proc. of the 19th International Conference on Machine Learning, Australia,
2002.
[4] Ana Cardoso-Cachopo, Arlindo L. Oliveira, An Empirical Comparison of Text Categorization Methods,
Lecture Notes in Computer Science,
Volume 2857, Jan 2003, Pages 183 - 196
[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P., “SMOTE: Synthetic Minority Over-
sampling Technique,” Journal of AI Research, 16 2002, pp. 321-357.
[6] Forman, G., An Experimental Study of Feature Selection Metrics for Text Categorization. Journal of
Machine Learning Research, 3 2003, pp. 1289-1305
[7] Fragoudis D., Meretakis D., Likothanassis S., “Integrating Feature and Instance Selection for Text
Classification”, SIGKDD ’02, July 23-26, 2002, Edmonton, Alberta, Canada.
[8] Guan J., Zhou S., “Pruning Training Corpus to Speedup Text Classification”, DEXA 2002, pp. 831-840
[9] D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz, “A decision-tree-based symbolic rule induction system for
text categorization”, IBM Systems Journal, September 2002.
[10] Han X., Zu G., Ohyama W., Wakabayashi T., Kimura F., Accuracy Improvement of Automatic Text
Classification Based on Feature Transformation and Multi-classifier Combination, LNCS, Volume 3309,
Jan 2004, pp. 463-468
[11] Ke H., Shaoping M., “Text categorization based on Concept indexing and principal component
analysis”, Proc. TENCON 2002 Conference on Computers, Communications, Control and Power
Engineering, 2002, pp. 51- 56.
[12] Kehagias A., Petridis V., Kaburlasos V., Fragkou P., “A Comparison of Word- and Sense-Based Text
Categorization Using Several Classification Algorithms”, JIIS, Volume 21, Issue 3, 2003, pp. 227-247.
[13] B. Kessler, G. Nunberg, and H. Schutze.
Proceedings of the Thirty-Fifth ACL and EACL, pages 32–38, 1997.

[14] Kim S. B., Rim H. C., Yook D. S. and Lim
H. S., “Effective Methods for Improving Naive Bayes Text Classifiers”, LNAI 2417, 2002, pp. 414-423
[15] Klopotek M. and Woch M., “Very Large Bayesian Networks in Text Classification”,
ICCS 2003, LNCS 2657, 2003, pp. 397-406
[16] Leopold, Edda & Kindermann, Jörg, “Text Categorization with Support Vector Machines. How to
Represent Texts in Input Space?”, Machine Learning 46, 2002, pp. 423 - 444.
[17] Lewis D., Yang Y., Rose T., Li F., “RCV1: A New Benchmark Collection for Text Categorization
Research”, Journal of Machine Learning Research 5, 2004, pp. 361-397.
[18] Heui Lim, Improving kNN Based Text Classification with Well Estimated Parameters, LNCS, Vol. 3316,
Oct 2004, Pages 516 - 523.
[19] Madsen R. E., Sigurdsson S., Hansen L. K. and Lansen J., “Pruning the Vocabulary for Better Context
Recognition”, 7th International Conference on Pattern Recognition, 2004
[20] Montanes E., Quevedo J. R. and Diaz I., “A Wrapper Approach with Support Vector Machines for Text
Categorization”, LNCS 2686, 2003, pp. 230-237
[21] Nardiello P., Sebastiani F., Sperduti A., “Discretizing Continuous Attributes in AdaBoost for Text
Categorization”, LNCS,
Volume 2633, Jan 2003, pp. 320-334
[22] Novovicova J., Malik A., and Pudil P., “Feature Selection Using Improved Mutual Information for Text
Classification”, SSPR&SPR 2004, LNCS 3138, pp. 1010– 1017, 2004
[23] Qiang W., XiaoLong W., Yi G., “A Study of Semi-discrete Matrix Decomposition for LSI in Automated
Text Categorization”, LNCS, Volume 3248, Jan 2005, pp. 606-615.
[24] Schneider, K., Techniques for Improving the Performance of Naive Bayes for Text Classification, LNCS,
Vol. 3406, 2005, 682- 693.
[25] Sebastiani F., “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, vol. 34
(1),2002, pp. 1-47.
[26] Shanahan J. and Roma N., Improving SVM Text Classification Performance through Threshold
Adjustment, LNAI 2837, 2003, 361- 372
[27] Soucy P. and Mineau G., “Feature Selection Strategies for Text Categorization”, AI 2003, LNAI 2671,
2003, pp. 505-509
[28] Sousa P., Pimentao J. P., Santos B. R. and Moura-Pires F., “Feature Selection Algorithms to Improve
Documents Classification Performance”, LNAI 2663, 2003, pp. 288-296
[29] Sung-Bae Cho, Jee-Haeng Lee, Learning Neural Network Ensemble for Practical Text Classification,
Lecture Notes in Computer Science, Volume 2690, Aug 2003, Pages 1032
– 1036.
[30] Torkkola K., “Discriminative Features for Text Document Classification”, Proc. International
Conference on Pattern Recognition, Canada, 2002.
[31] Vinciarelli A., “Noisy Text Categorization, Pattern Recognition”, 17th International Conference on
(ICPR'04) , 2004, pp. 554-557
[32] Y. Yang, J. Zhang and B. Kisiel., “A scalability analysis of classifiers in text categorization”, ACM
SIGIR'03, 2003, pp 96- 103
[33] Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information
Retrieval, 1(1/2):67–88, 1999.
[34] Zhenya Zhang, Shuguang Zhang, Enhong Chen, Xufa Wang, Hongmei Cheng, TextCC: New Feed
Forward Neural Network for Classifying Documents Instantly, Lecture Notes in Computer Science,
Volume 3497, Jan 2005, Pages 232 – 237.
[35] Shuigeng Zhou, Jihong Guan, Evaluation and Construction of Training Corpuses for Text
Classification: A Preliminary Study, Lecture Notes in Computer Science, Volume 2553, Jan 2002, Page 97-
108.
[36] Verayuth Lertnattee, Thanaruk Theeramunkong, Parallel Text Categorization for Multi-
dimensional Data, Lecture Notes in Computer Science, Volume 3320, Jan 2004,
Pages 38 - 41
[37] Wang Qiang, Wang XiaoLong, Guan Yi, A Study of Semi-discrete Matrix Decomposition for LSI in
Automated Text Categorization, Lecture Notes in Computer Science, Volume 3248, Jan 2005, Pages 606 –
615.
[38] Zu G., Ohyama W., Wakabayashi T., Kimura F., "Accuracy improvement of automatic text
classification based on feature transformation": Proc: the 2003 ACM Symposium on Document
Engineering, November 20-22, 2003, pp.118-120
View publication stats

Task 3

Uploaded by

Copyright:

Available Formats

Task 3

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Task 3

Uploaded by

Copyright:

Available Formats

Assignment 1

Assignment 1 task 3 :Text Classiﬁcation Using Machine Learning Techniques

Read Tokenize Text Stemming

Vector Representation of Delete

Feature Selection and/or Feature Learning

text categorization places documents into one of the sets of categories.

2 Vector space document representations

language use Vocabulary V.

Metrics mathematical forms

  P ci | t log P ci | t 

Metrics mathematical forms

Relative DFn w, db c

Table 1. Feature Selection metrics

5 Machine learning algorithms

TP determined as a document that belongs in a

Proceedings of the Thirty-Fifth ACL and EACL, pages 32–38, 1997.

You might also like