Papers by Peter Bourgonje
Lecture Notes in Computer Science
Coreference resolution is the process of identifying all words and phrases in a text that refer to the same entity. It has proven to be a useful intermediary step for a number of natural language processing applications. In this paper, we describe three implementations for performing coreference resolution: rule-based, statistical, and projection-based (from English to German). After a comparative evaluation on benchmark datasets, we conclude with an application of these systems to German and English texts from different scenarios in digital curation, such as an archive of personal letters, excerpts from a museum exhibition, and regional news articles.
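The three implementations are not detailed in this abstract. Purely as a toy illustration of the simplest rule-based idea (exact string match between mentions), and explicitly not one of the systems described in the paper, a sketch might look like this:

```python
# Toy sketch: link mentions whose surface strings match exactly.
# Real rule-based coreference systems use far richer rules (agreement,
# syntactic constraints, salience); this only illustrates the concept.
def string_match_chains(mentions):
    chains = {}
    for i, mention in enumerate(mentions):
        chains.setdefault(mention.lower(), []).append(i)
    # keep only groups with at least two coreferent mentions
    return [ids for ids in chains.values() if len(ids) > 1]

mentions = ["Angela Merkel", "the chancellor", "Angela Merkel", "she"]
print(string_match_chains(mentions))  # -> [[0, 2]]
```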
We explore to what extent knowledge about the pre-trained language model that is used is beneficial for the task of abstractive summarization. To this end, we experiment with conditioning the encoder and decoder of a Transformer-based neural model on the BERT language model. In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size. We also explore how locality modeling, i.e., the explicit restriction of calculations to the local context, can affect the summarization ability of the Transformer. This is done by introducing 2-dimensional convolutional self-attention into the first layers of the encoder. The results of our models are compared to a baseline and to state-of-the-art models on the CNN/Daily Mail dataset. We additionally train our model on the SwissText dataset to demonstrate usability on German. Both models outperform the baseline in ROUGE scores on two datasets and show their superiority in a manual...
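The abstract does not spell out how BERT-windowing works internally. A minimal sketch of the general idea, encoding overlapping windows and averaging token representations where windows overlap, is given below; the model name, window size and stride are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of chunk-wise ("windowed") encoding of inputs longer than the
# encoder's maximum length; window/stride values are assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
model = BertModel.from_pretrained("bert-base-german-cased")

def windowed_encode(text, window=512, stride=256):
    ids = tokenizer.encode(text, add_special_tokens=False)
    reps = torch.zeros(len(ids), model.config.hidden_size)
    counts = torch.zeros(len(ids), 1)
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + window]
        with torch.no_grad():
            out = model(torch.tensor([chunk])).last_hidden_state[0]
        reps[start:start + len(chunk)] += out
        counts[start:start + len(chunk)] += 1
        if start + window >= len(ids):
            break
    # average representations for positions covered by several windows
    return reps / counts
```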
The work described in this paper was carried out in the context of task 1 of the CUTE workshop, which took place at the Digital Humanities conference for the DACH area in Bern in February 2017. The data released for this task consisted of four transcripts of debates held in the German parliament (on October 28th, 1999, December 16th, 2004, November 15th, 2007 and March 17th, 2011), a series of letters from Goethe’s Die Leiden des jungen Werther, the section Zur Theorie des Kunstwerks from Adorno’s Ästhetische Theorie, and books 3 to 6 of Wolfram von Eschenbach’s Parzival. These four data sets represent four different domains and display diversity in language (standard German as spoken today, and Middle High German in Parzival) and in language usage (spoken, i.e., the transcripts of the parliament debates, and written, i.e., the other two corpora). This means that typical entity recognition tools without any specific training data from these four domains are expected to perform...
Digital content and online media have reached an unprecedented level of relevance and importance. In the context of a research and technology transfer project on Digital Curation Technologies for online content, we develop a Semantic Storytelling prototype. The approach is based on the semantic analysis of document collections, in which, among other things, individual analysis results are, where possible, mapped to external knowledge bases. We interlink key information contained in the documents of the collection, which can essentially be conceptualised as automatic hypertext generation. With this semantic layer on top of the set of documents in place, we attempt to identify interesting, surprising, eye-opening relationships between different concepts or entities mentioned in the document collection. In this article we concentrate on the current state of the user interfaces of our Semantic Storytelling prototype.
In this paper we outline easily implementable procedures to leverage multilingual Linked Open Data (LOD) resources such as DBpedia in open-source Statistical Machine Translation (SMT) systems such as Moses. Using open standards such as RDF (Resource Description Framework) Schema, NIF (Natural Language Processing Interchange Format), and SPARQL (SPARQL Protocol and RDF Query Language) queries, we demonstrate the efficacy of translating named entities and thereby improving the quality and consistency of SMT output. We also give a brief overview of two funded projects that are actively working on this topic: (1) the BMBF-funded project DKT (Digitale Kuratierungstechnologien) on digital curation technologies, and (2) the EU Horizon 2020-funded project FREME (Open Framework of e-services for Multilingual and Semantic Enrichment of Digital Content). This is a step towards designing a Semantic Web-aware Machine Translation (MT) system and keeping SMT algorithms up-to-date with...
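As a concrete illustration of the kind of lookup described here (not the projects' actual code), a SPARQL query against the public DBpedia endpoint can retrieve the target-language label of a named entity, which could then be enforced in the SMT output. The entity URI is a hypothetical example:

```python
# Sketch: fetch the German label of an English DBpedia entity via SPARQL.
# Error handling is omitted; the chosen entity is illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Cologne> rdfs:label ?label .
        FILTER (lang(?label) = "de")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"])  # -> "Köln"
```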
Proceedings of the 28th International Conference on Computational Linguistics
In this paper we focus on connective identification and sense classification for explicit discourse relations in German, as two individual sub-tasks of the overarching Shallow Discourse Parsing task. We successively augment a purely empirical approach based on contextualised embeddings with linguistic knowledge encoded in a connective lexicon. In this way, we improve over published results for connective identification, achieving a final F1-score of 87.93; and we introduce, to the best of our knowledge, the first results for German sense classification, achieving an F1-score of 87.13. Our approach demonstrates that a connective lexicon can be a valuable resource for those languages that do not have a large PDTB-style annotated corpus available.
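How the lexicon is combined with the embedding-based classifier is not specified in this abstract. One natural combination, sketched below under that assumption, is to use the lexicon to restrict which tokens are even considered as connective candidates before a contextual classifier disambiguates them. The toy lexicon entries and example sentence are illustrative only:

```python
# Minimal sketch of lexicon-filtered connective identification.
# The lexicon contents are placeholders; a real system would load a
# full connective lexicon and fine-tune its own contextual classifier.
CONNECTIVE_LEXICON = {"aber", "weil", "obwohl", "denn", "während"}

def candidate_connectives(tokens):
    """Restrict classification to tokens the lexicon licenses as connectives."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in CONNECTIVE_LEXICON]

tokens = "Er blieb zu Hause , weil es regnete .".split()
for i in candidate_connectives(tokens):
    # A fine-tuned contextual classifier would decide here whether the
    # candidate actually functions as a discourse connective in context.
    print(i, tokens[i])
```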
Shallow Discourse Parsing (SDP), the identification of coherence relations between text spans, relies on large amounts of training data, which so far exists only for English; any other language is, in this respect, under-resourced. For those languages where machine translation from English is available with reasonable quality, MT in conjunction with annotation projection can be an option for producing an SDP resource. In our study, we translate the English Penn Discourse TreeBank into German and experiment with various methods of annotation projection to arrive at a German counterpart of the PDTB. We describe the key characteristics of the corpus as well as some typical sources of errors encountered during its creation. We then evaluate the German PDTB by training components for selected sub-tasks of discourse parsing on this silver data, and compare their performance to the same components trained on the gold, original PDTB corpus.
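The abstract leaves the projection methods unspecified; the core mechanism common to such approaches is mapping annotated source-side token spans to the target side through a word alignment. A minimal sketch of that step, with a toy alignment, might look as follows (a generic illustration, not one of the paper's specific strategies):

```python
# Sketch of span projection through a word alignment; indices and the
# alignment are toy examples.
def project_span(span, alignment):
    """Map a set of source token indices to target indices via alignment pairs."""
    targets = {t for s, t in alignment if s in span}
    if not targets:
        return None  # unaligned spans are a typical source of projection errors
    return (min(targets), max(targets))  # take the covering target span

# English connective "because" at source index 3, aligned to German "weil"
# at target index 4 in a toy sentence pair.
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)]
print(project_span({3}, alignment))  # -> (4, 4)
```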
ArXiv, 2020
We present a workflow manager for the flexible creation and customisation of NLP processing pipelines. The workflow manager addresses challenges in interoperability across various different NLP tasks and in hardware-based resource usage. Based on the four key principles of generality, flexibility, scalability and efficiency, we present the first version of the workflow manager, providing details on its custom definition language, explaining the communication components, and describing the general system architecture and setup. The system is currently being implemented; it is grounded in and motivated by real-world industry use cases in several innovation and transfer projects.
In recent years, the automatic detection of abusive language, offensive language and hate speech in several different forms of online communication has received a lot of attention from the Computational Linguistics and Language Technology community. While most approaches work on English data, publications on languages other than English are rare. This paper, submitted to the GermEval 2018 Shared Task on the Identification of Offensive Language, provides the results of several experiments on the classification of offensive language in German-language tweets.
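The abstract does not enumerate the individual experiments. As a hedged illustration of the kind of baseline commonly used for this task, and not necessarily one of the paper's systems, a character n-gram classifier over the GermEval coarse labels could be set up like this:

```python
# Sketch of a typical tweet-classification baseline (not the paper's
# exact feature set): character n-grams + a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["...", "..."]          # German tweets (placeholder data)
labels = ["OFFENSE", "OTHER"]    # GermEval 2018 coarse-grained labels

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(),
)
clf.fit(tweets, labels)
print(clf.predict(["..."]))      # predict the label of a new tweet
```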
Named Entity Recognition tools and datasets are widely used. The standard pre-trained models, however, often do not cover specific application needs, as these models are too generic. We introduce a methodology to automatically induce fine-grained classes of named entities for the legal domain. Specifically, given a corpus which has been annotated with instances of coarse entity classes, we show how to induce fine-grained, domain-specific (sub-)classes. The method relies on predictions of the masked tokens generated by a pre-trained language model. These predictions are then collected and clustered, and the clusters are taken as the new candidate classes. We develop an implementation of the introduced method and experiment with a large legal corpus in German that is manually annotated with almost 54,000 named entities.
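A minimal sketch of the masking-and-prediction step is given below; the model choice, the example sentences, and the final clustering setup are assumptions for illustration, not the paper's implementation:

```python
# Sketch of the induction idea: replace an annotated entity mention with
# [MASK], collect the LM's top predictions, and cluster those predictions
# into candidate sub-classes.
from collections import defaultdict
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-german-cased")

# Toy sentences whose masked slot originally held a coarse ORG entity.
sentences = [
    "Die Klage wurde vor dem [MASK] verhandelt.",
    "Der Vertrag wurde beim [MASK] eingereicht.",
]
predictions = defaultdict(int)
for sent in sentences:
    for p in fill(sent, top_k=10):
        predictions[p["token_str"]] += 1

# In the full method, these prediction tokens would be embedded and
# clustered (e.g. with k-means); each cluster is a candidate
# fine-grained sub-class.
print(sorted(predictions, key=predictions.get, reverse=True)[:10])
```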
ArXiv, 2019
In this paper, we focus on the classification of books using short descriptive texts (cover blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach, we achieve considerably better results on the classification task. For a more coarse-grained classification using eight labels we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.
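The abstract says the representations are combined but not how; a common realisation, sketched here as an assumption rather than the paper's exact architecture, is to concatenate the pooled BERT output with a metadata/knowledge-graph feature vector before a linear classification layer:

```python
# Sketch of combining a BERT text representation with metadata features
# by concatenation; model name and dimensions are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel

class BlurbClassifier(nn.Module):
    def __init__(self, n_labels, meta_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-german-cased")
        self.out = nn.Linear(self.bert.config.hidden_size + meta_dim, n_labels)

    def forward(self, input_ids, attention_mask, meta_vec):
        # pooled [CLS] representation of the blurb text
        pooled = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        # concatenate with metadata / author-embedding features
        return self.out(torch.cat([pooled, meta_vec], dim=-1))
```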
We present the Potsdam Commentary Corpus 2.2, a German corpus of news editorials annotated on several different levels. New in version 2.2 of the corpus are two additional annotation layers for coherence relations following the Penn Discourse TreeBank framework. Specifically, we add relation senses to an already existing layer of discourse connectives and their arguments, and we introduce a new layer with additional coherence relation types, resulting in a German corpus that mirrors the PDTB. The aim is to increase the usability of the corpus for the task of shallow discourse parsing. In this paper, we provide inter-annotator agreement figures for the new annotations and compare corpus statistics based on the new annotations to the equivalent statistics extracted from the PDTB.
We present a new version of the Potsdam Commentary Corpus, a German corpus of news commentary articles annotated on several different layers. This new release includes additional annotation layers for dependency trees and information-structural aboutness topics, as well as some bug fixes. In addition to discussing the additional layers, we demonstrate the added value of loading the corpus in ANNIS3, a tool that merges different annotation layers on the same corpus and allows for queries combining information from different layers. Using several cross-layer example queries, we demonstrate its suitability for corpus analysis in various areas.
Online media are ubiquitous and consumed by billions of people globally. Recently, however, several phenomena have emerged around online media that pose a severe threat to media consumption and reception, and that carry the potential of manipulating opinions, and thus (re)actions, on a large scale. Lumped together under the label “fake news”, these phenomena comprise, among others, maliciously manipulated content, bad journalism, parodies, satire, propaganda and several other types of false news; related phenomena are the often-cited filter bubble (echo chamber) effect and the amount of abusive language used online. In an earlier paper we described an architectural and technological approach to empower users to handle these online media phenomena. In this article we provide a first approach to a metadata scheme that will eventually enable the standardised annotation of these phenomena in online media. We also show an initial version of a tool that enables the creation, visualisation...
We develop a system that aims at generating stories, or rather potential story paths, based on the semantic analysis of multiple source documents (including news articles) using template filling. The final system will be realised with additional methods, also taking specific domains and topics into account. For the processing we use NLP methods such as named entity recognition; we also use a triple store and classic document indexing modules. The analysis information is filtered, rearranged and recombined to fit the respective template. The system’s use case is to support knowledge workers (journalists, editors, curators etc.) in their task of processing large amounts of (incoming) documents: to identify important entities and relationships between entities, and to suggest individual story paths between entities, eventually arriving at more efficient processes for content curation.
In all domains and sectors, the demand for intelligent systems to support the processing and generation of digital content is rapidly increasing. The availability of vast amounts of content and the pressure to publish new content quickly and in rapid succession require faster, more efficient and smarter processing and generation methods. With a consortium of ten partners from research and industry and a broad range of expertise in AI, Machine Learning and Language Technologies, the QURATOR project, funded by the German Federal Ministry of Education and Research, develops a sustainable and innovative technology platform that provides services to support knowledge workers in various industries in addressing the challenges they face when curating digital content. The project's vision and ambition is to establish an ecosystem for content curation technologies that significantly pushes the current state of the art and transforms its region, the metropolitan area Berlin-Brandenburg...
We present an approach to the extraction of arguments for explicit discourse relations in German, as a sub-task of the larger task of shallow discourse parsing for German. Using the Potsdam Commentary Corpus, we evaluate two methods (one based on constituency trees, the other on dependency trees) for extracting both the internal and the external argument, for which our best results are 86.73 and 77.85, respectively. We demonstrate the portability of this set of heuristics to another language, and put these scores into perspective by applying the same method to English and comparing the results to published figures.
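The concrete heuristics are not given in this abstract. A simplified dependency-based illustration of the general idea, taking the subtree headed by the connective's syntactic head as the internal argument, could look like the following; the spaCy model and the rule itself are assumptions for illustration, not the paper's evaluated method:

```python
# Heuristic sketch (not the paper's exact rules): treat the subtree of
# the connective's syntactic head as the internal argument.
import spacy

nlp = spacy.load("de_core_news_sm")  # assumes the German model is installed
doc = nlp("Er blieb zu Hause, weil es regnete.")

connective = next(tok for tok in doc if tok.text == "weil")
internal = sorted(connective.head.subtree, key=lambda t: t.i)
print(" ".join(t.text for t in internal))  # clause introduced by "weil"
```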
We present a concept for extending a multilingual verb lexicon to also include German. In this lexicon, verbs are grouped by meaning and by semantic properties (following frame semantics) to form multilingual classes linking Czech and English verbs. Entries are further linked to external lexical resources like VerbNet and PropBank. In this paper, we present our plan to include German verbs by experimenting with word alignments to obtain candidates linked to existing English entries, and we identify possible approaches for obtaining semantic role information. We further identify German-specific lexical resources to link to. This small-scale pilot study aims to provide a blueprint for extending a lexical resource with a new language.