The PPM data compression scheme has set the performance standard in lossless compression of text throughout the past decade. PPM is a finite-context statistical modelling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. This paper gives a brief introduction to PPM, and describes a variant of the algorithm, called PPM*, which exploits contexts of unbounded length. Although requiring considerably greater computational resources (in both time and space), PPM* reliably achieves compression superior to the benchmark PPMC version. Its major contribution is to show that the full information available from all substrings of the input can be used effectively to generate high-quality predictions. Hence, it provides a useful tool for exploring the bounds of compression.
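To make the blending concrete, here is a minimal sketch of PPM-style prediction with PPMC ("method C") escape estimation: prediction falls back from the longest matching context to shorter ones, multiplying in an escape probability at each step. This is an illustration of the general technique under stated simplifications (no exclusions, bounded order), not the paper's implementation.

```python
from collections import defaultdict

class SimplePPM:
    """Minimal sketch of PPM-style blended prediction with PPMC
    ("method C") escapes. Hypothetical illustration: exclusions and
    the PPM* unbounded-context strategy are omitted for brevity."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # counts[context][symbol] = frequency of symbol after context
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # Record the symbol under every context suffix up to max_order.
        for k in range(min(self.max_order, len(history)) + 1):
            self.counts[history[len(history) - k:]][symbol] += 1

    def probability(self, history, symbol, alphabet):
        # Fall back from the longest matching context; each escape
        # multiplies in the escape probability of the level above.
        escape_mass = 1.0
        for k in range(min(self.max_order, len(history)), -1, -1):
            syms = self.counts.get(history[len(history) - k:])
            if not syms:
                continue
            total, distinct = sum(syms.values()), len(syms)
            # PPMC: escape mass = distinct / (total + distinct)
            if symbol in syms:
                return escape_mass * syms[symbol] / (total + distinct)
            escape_mass *= distinct / (total + distinct)
        return escape_mass / len(alphabet)  # order -1: uniform fallback
```

Training mirrors decoding: at each position the coder assesses text[i] in context text[:i] and then calls update(text[:i], text[i]), so the encoder's and decoder's models stay synchronized.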
The state of the art in lossless text compression is the PPM data compression scheme. Two approaches to the problem of selecting the context models used in the scheme are described. One uses an a priori upper bound on the lengths of the contexts, while the other is unbounded. Several techniques that improve the probability estimation are described, including four new methods: partial update exclusions for the unbounded approach, deterministic scaling, recency scaling, and multiple probability estimators. Each of these methods improves performance for both the bounded and unbounded approaches. In addition, further savings are possible by combining the two approaches.
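Where the bounded and unbounded approaches differ most visibly is in which context prediction starts from. The hedged sketch below, reusing the counts structure from the sketch above, contrasts them: PPM*, the unbounded variant, starts at the shortest deterministic matching context and otherwise falls back to the longest match, whereas bounded PPM always starts at its fixed maximum order. The helper name and interface are hypothetical.

```python
def select_context(counts, history):
    """Sketch of PPM*-style context selection (hypothetical helper;
    `counts` maps context strings to per-symbol counts, as above).

    PPM* starts prediction at the shortest *deterministic* matching
    context, one in which only a single distinct symbol has been seen;
    if none exists, it falls back to the longest matching context.
    Bounded PPM instead always starts at the fixed maximum order."""
    matching = [history[i:] for i in range(len(history) + 1)
                if history[i:] in counts]
    deterministic = [c for c in matching if len(counts[c]) == 1]
    if deterministic:
        return min(deterministic, key=len)   # shortest deterministic
    return max(matching, key=len) if matching else ""
```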
This paper describes the participation of the School of Informatics, University of Wales, Bangor in the 2004 Text Retrieval Conference. We present additions and modifications to the QITEKAT system, initially developed as an entry for the 2003 QA evaluation, including automated regular expression induction, improved question matching, and application of our knowledge framework to the modified question types presented in the 2004 track. Results are presented which show improvements on last year's performance, and we discuss future directions for the system.
Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks, 2004
We present ParCop, a decentralized peer-to-peer (P2P) computing system. In ParCop, the data and tasks are mobilized and flow freely between the computational resources (peers). ParCop allows each peer to utilize as well as to offer computing resources. ParCop uses the P2P model to guard against common problems that other systems suffer from, such as server failure and connection bottlenecks.
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03, 2003
We suggest a way of locating duplicates and plagiarism in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using ...
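A direct, quadratic-time reading of that definition is sketched below, before any of the suffix-structure speedups the truncated sentence presumably names. The normalization constant, taken here as the maximum possible sum n(n+1)/2, is an assumption.

```python
def r_measure(doc, others):
    """Naive quadratic sketch of the R-measure; the paper computes it
    efficiently with suffix structures. Normalizing by n*(n+1)/2, the
    maximum possible sum, is an assumption here."""
    if not doc:
        return 0.0
    corpus = "\x00".join(others)  # separator assumed absent from texts
    n, total = len(doc), 0
    for i in range(n):
        # Longest prefix of the suffix doc[i:] occurring elsewhere;
        # binary search works because occurrence is monotone in length.
        lo, hi = 0, n - i
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if doc[i:i + mid] in corpus:
                lo = mid
            else:
                hi = mid - 1
        total += lo
    return 2 * total / (n * (n + 1))
```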
We describe the background and motivation for a logic-based framework, based on the theory of “Knowing-Aboutness”, and its specific application to Question-Answering. We present the salient features of our system, and outline the benefits of our framework in terms of a more integrated architecture that is more easily evaluated. Favourable results are presented in the TREC 2004 Question-Answering evaluation.
Chinese is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of presegmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.
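One concrete way to realise "interpolates word boundaries so as to maximize the compression obtained" is a beam search over space insertions, scored by a character model trained on segmented text. The logprob(context, ch) interface below is an assumption, not the paper's API; a minimal sketch:

```python
import heapq

def segment(text, logprob, order=2, beam=20):
    """Beam-search sketch of compression-based segmentation.

    `logprob(context, ch)` is an assumed interface returning the
    log-probability of `ch` after `context` under a character model
    trained on space-delimited text. At each input character we either
    emit it as-is or emit a space first, and keep the `beam` hypotheses
    whose emitted stream (spaces included) compresses best."""
    hyps = [(0.0, "", "")]  # (cost, recent_context, output_so_far)
    for ch in text:
        candidates = []
        for cost, ctx, out in hyps:
            for emit in (ch, " " + ch):    # without / with a boundary
                c, k = cost, ctx
                for e in emit:
                    c -= logprob(k, e)     # accumulate negative log p
                    k = (k + e)[-order:]   # slide the context window
                candidates.append((c, k, out + emit))
        hyps = heapq.nsmallest(beam, candidates, key=lambda h: h[0])
    return min(hyps, key=lambda h: h[0])[2].strip()
```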
A number of powerful modelling techniques have been developed in recent years to compress natural language text. The best of these are adaptive models operating at the character and word levels, which perform almost as well as humans at predicting text. We show how to apply character-based methods to five areas where language modelling is critical, providing novel solutions to each of these problems.
Text categorization is the problem of assigning text to any of a set of pre-specified categories. It is useful in indexing documents for later retrieval, as a stage in natural language processing systems, for content analysis, and in many other roles. We wish to use language models developed for text compression as the basis of a text categorization scheme and ...
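The decision rule such a scheme typically uses is minimum cross-entropy: code the document under a model trained for each category and assign it to the category whose model needs the fewest bits. A minimal sketch, again assuming a hypothetical logprob(context, ch) model interface:

```python
import math

def categorize(doc, models):
    """Sketch of compression-based categorization (minimum cross-entropy).

    `models` maps each category name to a model exposing an assumed
    `logprob(context, ch)` method, trained on that category's text.
    The document goes to the category whose model codes it in the
    fewest bits, i.e. compresses it best."""
    def bits(model):
        total, ctx = 0.0, ""
        for ch in doc:
            total -= model.logprob(ctx, ch) / math.log(2)  # nats -> bits
            ctx = (ctx + ch)[-8:]   # bounded context for the sketch
        return total
    return min(models, key=lambda cat: bits(models[cat]))
```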
Proceedings of the Fifteenth Annual Conference Companion on Genetic and Evolutionary Computation, 2013
This paper describes a novel approach to multi-agent simulation where agents evolve freely within their environment. We present Template Based Evolution (TBE), a genetic evolution algorithm that evolves behaviour for embodied, situated agents whose fitness is tested implicitly through repeated trials in an environment. All agents that survive in the environment breed freely, creating new agents based on the average genome of two parents. This paper describes the design of the algorithm and applies it to a model in which virtual migratory creatures are evolved to survive in the simulated environment. Comparisons between the evolutionary responses of the artificial creatures and observations of natural systems support the strength of the methodology for species simulation.
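The breeding step the abstract describes, a child built from the average genome of two parents, reduces to an element-wise mean. In the sketch below the Gaussian mutation term is an added assumption, since the abstract specifies no mutation operator:

```python
import random

def breed(parent_a, parent_b, mutation_sigma=0.05):
    """Sketch of the averaging crossover described in the abstract:
    the child's genome is the element-wise mean of its parents'.
    The Gaussian mutation term is an assumption for illustration;
    the abstract does not specify a mutation operator."""
    return [
        (a + b) / 2 + random.gauss(0.0, mutation_sigma)
        for a, b in zip(parent_a, parent_b)
    ]
```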
Thinning is one of the critical processes for many applications in image analysis, in particular for Optical Character Recognition (OCR). The accuracy of OCR relies on the effectiveness of thinning algorithms. However, little previous attention has been paid to thinning algorithms for Arabic script, and there is a lack of quantitative performance measures of thinning techniques for Arabic script; consequently, it is unclear which thinning algorithms are most appropriate for it. In this paper, a new thinning algorithm for Arabic script is proposed, along with several new performance metrics. An experiment is conducted to evaluate the proposed algorithm against two well-established thinning algorithms with respect to the proposed objective performance metrics. The experimental results show that the new algorithm outperforms the other two thinning algorithms.
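The abstract does not detail the proposed algorithm, so for concreteness here is the classic Zhang-Suen (1984) two-subiteration thinning pass, the kind of well-established baseline such evaluations compare against. It is an illustration only, not the paper's Arabic-specific algorithm:

```python
def zhang_suen(image):
    """Classic Zhang-Suen thinning pass, shown as a generic baseline;
    not the Arabic-specific algorithm the paper proposes.
    `image` is a mutable grid (list of lists) of 0/1 pixels."""
    def neighbours(y, x):
        i = image  # P2..P9, clockwise from the north neighbour
        return [i[y-1][x], i[y-1][x+1], i[y][x+1], i[y+1][x+1],
                i[y+1][x], i[y+1][x-1], i[y][x-1], i[y-1][x-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, len(image) - 1):
                for x in range(1, len(image[0]) - 1):
                    if image[y][x] != 1:
                        continue
                    p = neighbours(y, x)
                    b = sum(p)  # number of black neighbours
                    # 0 -> 1 transitions around the neighbourhood
                    a = sum(p[k] == 0 and p[(k + 1) % 8] == 1
                            for k in range(8))
                    cond = (p[0] * p[2] * p[4] == 0 and
                            p[2] * p[4] * p[6] == 0) if step == 0 else (
                            p[0] * p[2] * p[6] == 0 and
                            p[0] * p[4] * p[6] == 0)
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((y, x))
            for y, x in to_clear:
                image[y][x] = 0
            changed = changed or bool(to_clear)
    return image
```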