Document Classification
275 Followers
Recent papers in Document Classification
Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally more demanding. These applications include data mining (identifying a "pattern",... more
In recent years, XML has been established as a major means for information management, and has been broadly utilized for complex data representation (e.g. multimedia objects). Owing to an unparalleled increasing use of the XML standard,... more
To help the growing qualitative and quantitative demands for information from the WWW, efficient automatic Web page classifiers are urgently needed. However, a classifier applied to the WWW faces a huge-scale dimensionality problem since... more
The web contains a wealth of product reviews, but sifting through them is a daunting task. Ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features,... more
The widespread use of information technologies for construction is considerably increasing the number of electronic text documents stored in construction management information systems. Consequently, automated methods for organizing and... more
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in... more
Frequent itemset mining (FIM) is a core operation for several data mining applications as association rules computation, correlations, document classification, and many others, which has been extensively studied over the last decades.... more
— With the increasing availability of electronic documents and the rapid growth of the World Wide Web, the task of automatic categorization of documents became the key method for organizing the information and know-ledge discovery. Proper... more
TWLT is an acronym of Twente Workshop(s) on Language Technology. These workshops on natural language theory and technology are organised by the Parlevink Project, a language theory and technology project of the . For each workshop... more
With the increasing availability of electronic documents and the rapid growth of the World Wide Web, the task of automatic categorization of documents became the key method for organizing the information and knowledge discovery. Proper... more
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. A mis... more
Integrating Different Strategies for Cross-Language Information Retrieval in the MIETTA Project Paul Buitelaar, Klaus Netter, Feiyu Xu DFKI Language Technology Lab Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany {paulb, netter, feiyu}@... more
Automatic classification has become an important research area due to the rapid increase of digital information. Evidently, manual classification of documents is a tough work due to occurrences of vocabulary ambiguities of classification... more
The use of ontology in order to provide a mechanism to enable machine reasoning has continuously increased during the last few years. This paper suggests an automated method for document classification using an ontology, which expresses... more
An increasing and overwhelming amount of biomedical information is available in the research literature mainly in the form of free-text. Biologists need tools that automate their information search and deal with the high volume and... more
Automatic document classification due to its various applications in data mining and information technology is one of the important topics in computer science. Classification plays a vital role in many information management and retrieval... more
A method of document comparison based on a hierarchical dictionary of topics (concepts) is described. The hierarchical links in the dictionary are supplied with the weights that are used for detecting the main topics of a document and for... more
It is well known that links are an important source of information when dealing with Web collections. However, the question remains on whether the same techniques that are used on the Web can be applied to collections of documents... more
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
The combination of multiple features or views when representing documents or other kinds of objects usually leads to improved results in classification (and retrieval) tasks. Most systems assume that those views will be available both at... more
Pattern classification has been successfully applied in many problem domains, such as biometric recognition, document classification or medical diagnosis. Missing or unknown data are a common drawback that pattern recognition techniques... more
The amount of narrative clinical text documents stored in Electronic Patient Records (EPR) of Hospital Information Systems is increasing. Physicians spend a lot of time finding relevant patient-related information for medical decision... more
With the increased use of Internet, a large number of consumers first consult on line resources for their healthcare decisions. The problem of the existing information structure primarily lies in the fact that the vocabulary used in... more
In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. The classification schemes are compared with... more
The amount of narrative clinical text documents stored in Electronic Patient Records (EPR) of Hospital Information Systems is increasing. Physicians spend a lot of time finding relevant patient-related information for medical decision... more
The goal of the reported research is the development of a computational approach that could help a cognitive scientist to interactively represent a learner's mental models, and to automatically validate their coherence with respect to the... more
Classifier for the Internet Resource Discovery (ACIRD), which uses machine learning techniques to organize and retrieve Internet documents. ACIRD consists of a knowledge acquisition process, document classifier and two-phase search... more
0-7803-7868-7/03/$17.00 0 2003 IEEE.
Quantifying the concept of co-occurrence and iterated co-occurrence yields indices of similarity between words or between documents. These similarities are associated with a reversible Markov transition matrix, the formal properties of... more
We propose a simple Bayesian network-based text classifier, which may be considered as a discriminative counterpart of the generative multinomial naive Bayes classifier. The method relies on the use of a fixed network topology with the... more
We propose a method which, given a document to be classified, automatically generates an ordered set of appropriate descriptors extracted from a thesaurus. The method creates a Bayesian network to model the thesaurus and uses... more
This paper uses Systemic Functional Linguistic (SFL) theory as a basis for extracting semantic features of documents. We focus on the pronominal and determination system and the role it plays in constructing interpersonal distance. By... more
Email has become an important means of electronic communication but the viability of its usage is marred by Un-solicited Bulk Email (UBE) messages. UBE poses technical and socio-economic challenges to usage of emails. Besides, the... more
A new algorithm based on learning vector quantisation classifier is presented based on a modified proximity-measure, which enforces a predetermined correct classification level in training while using sliding-mode approach for stable... more
ABSTRACT: Improvements in hardware, communication technology and database have led to the explosion of multimedia information repositories. In order to provide the quality of information retrieval and the quality of services, it is... more
Feature selection is of paramount concern in document classification process which improves the efficiency and accuracy of text classifier. Vector Space Model is used to represent the "Bag of Word" BOW of the documents with term weighting... more
In this paper we propose a matching algorithm for measuring the structural similarity between an XML document and a DTD. The matching algorithm, by comparing the document structure against the one the DTD requires, is able to identify... more
We present a novel approach for classifying documents that combines different pieces of evidence (e.g., textual features of documents, links, and citations) transparently, through a data mining technique which generates rules associating... more
In this paper we present the Dual Support Apriori for Temporal data (DSAT) algorithm. This is a novel technique for discovering Jumping Emerging Patterns (JEPs) from time series data using a sliding window technique. Our approach is... more