Papers by Maciej Ogrodniczuk
This paper presents a novel multilingual framework integrating linguistic services around a Web-based content management system. The language tools provide a semantic foundation for advanced CMS functions such as machine translation, automatic categorization or text summarization. The tools are integrated into processing chains on the basis of the UIMA architecture and use a uniform annotation model. The CMS is used to prepare two sample online services illustrating the advantages of applying language technology to content administration.
1 In fact, as the dates cited show, the project took only eight months to carry out (from the beginning of May to the end of December 2009).
The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs. Current work is focused on the creation of new LRTs and, especially, the enhancement of existing LRTs, such as parallel ...
Creating a coreference resolution tool for a new language is a challenging task due to the substantial effort required to develop the associated linguistic data, regardless of whether the approach is rule-based or statistical. In this paper, we test the translation- and projection-based method for an inflectional language, evaluate the result on a corpus of general coreference and compare the results with state-of-the-art solutions of this type for other languages.
This paper reports on a preliminary experiment aimed at verifying whether nominal facts corresponding to world knowledge can be effectively extracted from both structured and unstructured data, and whether the results can be used as a source of pragmatic knowledge for coreference resolution in Polish. Being only a proof of concept, this approach is work in progress and is intended to be further validated in a full-scale project.
The aim of this demo is to present multilingual resources made available in the new META-SHARE open infrastructure by partners of the CESAR consortium (CEntral and South-east europeAn Resources, a European CIP ICT-PSP project, Grant Agreement 271022, http://www.cesar-project.net) in November 2011, within the first batch of resources to be delivered in 2011-2013.
This paper attempts a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually annotated Polish Coreference Corpus by providing analyses of various statistical properties related to mentions, clusters and near-identity links. Among other things, the frequency of mentions, zero subjects and singleton clusters is presented, as well as the average mention and cluster size. We also show that some coreference clustering constraints, such as gender or number agreement, frequently do not hold in the case of Polish. The need for lemmatization in automatic coreference resolution is supported by an empirical study. The correlation between cluster and mention count within a text is investigated, with a short characterization of outlier cases. We also examine this correlation in each of the 14 text domains present in the corpus and show that none of them has an abnormal frequency of outlier texts with respect to the cluster/mention ratio. Finally, we report on our negative experiences concerning the annotation of the near-identity relation. In the conclusion we put forward some guidelines for future research in the area.
Keywords: coreference, reference, identity, near-identity
This paper presents a robust linguistic Web service framework for Polish, combining several mature offline linguistic tools in a common online platform. The toolset comprises paragraph-, sentence- and token-level segmenters, a morphological analyser, a disambiguating tagger, shallow and deep parsers, a named entity recognizer and a coreference resolver. Uniform access to processing results is provided by means of a stand-off packaged adaptation of the National Corpus of Polish TEI P5-based representation and interchange format.
This paper presents the results of the first attempt at coreference resolution for Polish running on true mention boundaries and using a few rich rules, corresponding to syntactic constraints (elimination of nested nominal groups), syntactic filters (elimination of syntactically incompatible heads), semantic filters (wordnet-derived compatibility) and selection (weighted scoring). The results are compared to human annotation and presented in four sets: two common baselines (all-singletons and head-match) and two slightly more complex settings with four and five rules.