In this paper we discuss the application of web service technology to the language technology (LT) and corpus processing domain. Motivated by the host of language technology tools that are widely available but lack common technical standards for integrating them into LT applications, we discuss the implementation of a web service-based API for web-based processing and accessing of large linguistic data resources.
... Access to Government Services through Telephone Interpreting. 10:30-11:00 Tea break. 11:00-11:30 Sonja E. BOSCH ... for Afrikaans, Based on Morphological Analysis. 14:00-14:30 David JOFFE, Gilles-Maurice DE SCHRYVER & DJ PRINSLOO: Introducing TshwaneLex ...
In this paper we introduce a measure for calculating statistically significant collocation sets that is related to the Poisson distribution. We show that results calculated using this measure are comparable to those of well-known measures such as the log-likelihood measure. Additionally, we ...
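As an illustration, a Poisson-related significance score of this kind can be sketched as the negative log-probability of observing the co-occurrence count under a Poisson null model. This is a hedged sketch, not necessarily the paper's exact formula; the function name `poisson_sig` and the normalisation by log n are assumptions.

```python
import math

def poisson_sig(k, a, b, n):
    """Significance of observing k co-occurrences of two words with
    individual frequencies a and b in a corpus of n sentences.
    Under independence, the expected co-occurrence count is
    lam = a * b / n; the score is -log P(X = k) for X ~ Poisson(lam),
    normalised by log n. Sketch only; the published measure may differ."""
    lam = a * b / n
    # -log P(X = k) = lam - k*log(lam) + log(k!), with log k! via lgamma
    neg_log_p = lam - k * math.log(lam) + math.lgamma(k + 1)
    return neg_log_p / math.log(n)
```

Larger scores mean the observed co-occurrence count is less likely under chance, so pairs can be ranked by this value to obtain a collocation set.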
Abstract. Starting from text corpus analysis with linguistic and statistical analysis algorithms, an infrastructure for text mining is described which uses collocation analysis as a central tool. This text mining method may be applied to different domains as well as languages. Some examples taken from large reference databases motivate its applicability to knowledge management using declarative standards of information structuring and description. The ISO/IEC Topic Map standard is introduced as a candidate for rich metadata description of information resources, and it is shown how text mining can be used for automatic topic map generation.
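A minimal sketch of how collocation output could feed automatic topic map generation: each term becomes a topic and each collocation pair an association. The element layout is a simplified approximation of the XTM serialisation of ISO/IEC 13250, not the paper's actual generator; a plain `href` attribute stands in for XTM's `xlink:href`.

```python
import xml.etree.ElementTree as ET

def collocations_to_topic_map(collocations):
    """Turn (term1, term2, significance) triples into a simplified
    XTM-like topic map: one <topic> per distinct term, one
    <association> per collocation pair. Sketch only."""
    tm = ET.Element("topicMap")
    terms = set()
    for t1, t2, _sig in collocations:
        terms.update((t1, t2))
    for term in sorted(terms):
        topic = ET.SubElement(tm, "topic", id=term)
        name = ET.SubElement(topic, "baseName")
        ET.SubElement(name, "baseNameString").text = term
    for t1, t2, _sig in collocations:
        assoc = ET.SubElement(tm, "association")
        for term in (t1, t2):
            member = ET.SubElement(assoc, "member")
            # XTM uses xlink:href; a plain href is used here for brevity
            ET.SubElement(member, "topicRef", href="#" + term)
    return tm
```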
In most languages, named entities form regular patterns. Usually, a surname has a preceding first name, which in turn might have a preceding title or profession. Similar rules hold for different kinds of named entities consisting of more than one word. Moreover, some ...
We describe an infrastructure for the collection and management of large amounts of text, and discuss the possibility of information extraction and visualisation from text corpora with statistical methods. The paper gives an overview of processing steps, the contents of our text databases as well as different query facilities. Our focus is on the extraction and visualisation of collocations and their usage for aiding web searches.
This paper describes the application of statistical analysis of large corpora to the problem of extracting semantic relations from unstructured text. We regard this approach as a viable method for generating input for the construction of ontologies, as ontologies use well-defined semantic relations as building blocks (cf. van der Vet & Mars 1998). Starting from a short description of our corpora and our language analysis tools, we discuss in depth the automatic generation of collocation sets. We further give examples of different types of relations that may be found in collocation sets for arbitrary terms. The central question we deal with here is how to postprocess statistically generated collocation sets in order to extract named relations. We show that for different types of relations, such as cohyponymy or instance-of relations, different extraction methods as well as additional sources of information can be applied to the basic collocation sets in order to verify the existence of a specific type of semantic relation for a given set of terms.
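One common way to operationalise cohyponym detection from collocation sets is to compare the sets themselves: terms of the same kind tend to share collocates. The following sketch ranks cohyponym candidates by Jaccard overlap; the function name, threshold, and scoring are illustrative assumptions, not the paper's method.

```python
def cohyponym_candidates(colloc_sets, term, threshold=0.3):
    """Given a dict mapping each term to its set of collocates, rank other
    terms by Jaccard overlap with `term`'s collocation set. High overlap
    is taken as evidence of cohyponymy. Illustrative sketch only."""
    base = colloc_sets[term]
    scores = {}
    for other, cset in colloc_sets.items():
        if other == term:
            continue
        overlap = len(base & cset) / len(base | cset)  # Jaccard index
        if overlap >= threshold:
            scores[other] = overlap
    return sorted(scores.items(), key=lambda kv: -kv[1])
```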
This paper describes the application of filtering techniques to collocation sets calculated for very large text corpora. Additional information such as patterns, grammatical information, subject areas and numerical values associated with the collocations is used to identify collocations with a given semantic structure. Various examples and different techniques for applying such filters are described. We also give several examples of practical applications for this type of information extraction.
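A pattern filter of the kind described above can be sketched as follows: a collocation pair is kept only if the corpus contains a sentence instantiating a lexico-syntactic template for the pair. The template syntax with `{a}`/`{b}` placeholders is an assumption made for this sketch.

```python
import re

def filter_collocations(collocations, sentences, pattern):
    """Keep only collocation pairs (a, b) for which some sentence matches
    a lexico-syntactic pattern, e.g. "{a} such as {b}". Illustrative
    simplification of pattern-based collocation filtering."""
    kept = []
    for a, b in collocations:
        regex = re.compile(pattern.format(a=re.escape(a), b=re.escape(b)),
                           re.IGNORECASE)
        if any(regex.search(s) for s in sentences):
            kept.append((a, b))
    return kept
```

In practice such filters would be combined with grammatical information and subject-area restrictions, as the abstract notes.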
Abstract. In this paper we describe a flexible, portable and language-independent infrastructure for setting up large monolingual language corpora. The approach is based on collecting a large amount of monolingual text from various sources. The input data is processed on the ...
This paper describes a flexible, modular system for the automatic keyword assignment of texts, built on top of a text mining engine. It is based on a method of differential corpus analysis: the text to be processed is analysed in comparison with a large reference corpus, and differences in relative frequency classes drive the selection of suitable keywords. In addition, databases are employed that allow terms to be expanded with respect to base forms, spelling variants, synonyms and multi-word expressions. The system is implemented as a web service and can easily be integrated into content management systems.
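The differential frequency-class idea can be sketched in a few lines. A word's frequency class is commonly defined as floor(log2(f_max / f(w))); words whose class in the document is much lower (i.e. relatively more frequent) than in the reference corpus become keyword candidates. Function name and the handling of unseen words are assumptions of this sketch.

```python
import math
from collections import Counter

def keywords(doc_tokens, ref_freq, top=5):
    """Differential corpus analysis sketch: rank words by the difference
    between their frequency class in a reference corpus and in the
    document. Large positive differences suggest keywords."""
    doc_freq = Counter(doc_tokens)
    doc_max = max(doc_freq.values())
    ref_max = max(ref_freq.values())

    def fclass(f, fmax):
        # frequency class: floor(log2(f_max / f))
        return math.floor(math.log2(fmax / f))

    diffs = {}
    for w, f in doc_freq.items():
        rf = ref_freq.get(w, 1)  # assumption: unseen words get the rarest class
        diffs[w] = fclass(rf, ref_max) - fclass(f, doc_max)
    return [w for w, _ in sorted(diffs.items(), key=lambda kv: -kv[1])[:top]]
```

Function words like "the" have a low class in both corpora and thus a difference near zero, so they are ranked below genuine content words.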
The regularity of named entities is used to learn names and to extract named entities. Starting from only a few name elements and a set of patterns, the algorithm learns new names and their elements. A verification step assures quality using a large background corpus. Further improvement is reached by classifying the newly learnt elements on the character level.
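The bootstrapping loop described above can be sketched as follows. This hypothetical simplification hard-codes a single FIRST LAST bigram pattern and uses a frequency threshold over the sentences as a stand-in for the verification step; the real algorithm works with a set of patterns and a separate background corpus.

```python
def learn_names(seed_first, seed_last, sentences, min_count=2):
    """Bootstrapping sketch: where a known first name is followed by an
    unknown capitalised token, that token is proposed as a surname.
    Candidates seen at least min_count times are accepted (a crude
    verification step). Illustrative simplification only."""
    first, last = set(seed_first), set(seed_last)
    candidates = {}
    for sent in sentences:
        toks = sent.split()
        for t1, t2 in zip(toks, toks[1:]):
            if t1 in first and t2 not in last and t2[:1].isupper():
                candidates[t2] = candidates.get(t2, 0) + 1
    for tok, count in candidates.items():
        if count >= min_count:
            last.add(tok)
    return last
```

A full implementation would iterate this step, feed accepted surnames back as anchors for learning new first names, and add the character-level classification mentioned in the abstract.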
Papers by U. Quasthoff