The STEVIN programme was not only an important scientific endeavour in the Low Countries, but also a quite rare case of tight inter-institutional, cross-border collaboration within the Dutch-speaking linguistic area. Four funding agencies, three ministerial departments and one intergovernmental organisation in Flanders and the Netherlands were involved in this programme. STEVIN is an excellent illustration of how a medium-sized European language can set an example in the domain of language (technology) policy. It remains extremely important that citizens can use their native language in all circumstances, including when they deal with modern ICT and leisure devices. For example, a very recent trend is that devices such as smartphones and television sets are becoming voice-controlled. But usually English-speaking people are the first to benefit from such an evolution; other linguistic communities have to wait, some forever. Not only does this pose a danger of reducing the overall functionality of a language (and an impoverishment of an entire culture), but it also threatens those groups in society that do not master the universal language. Elderly or disabled people, for example, who most deserve to enjoy the blessings of modern technology, are in many cases the last to benefit from it. Therefore, R&D programmes that support the local language are needed. The Dutch Language Union will continue to emphasise this issue in the future. Many individuals have contributed to making STEVIN a success story, all of whom I sincerely want to thank for their commitment. A particular mention goes to the funding government organisations from the Netherlands and Flanders. I am confident that the STEVIN results will boost research in academia and technology development in industry, so that the Dutch language can continue to "serve" its speakers well under all circumstances. Hence, it is with great pleasure that I invite you to discover the scientific results of the STEVIN programme.
International Conference on Language Resources and Evaluation, 2016
This paper presents a framework and methodology for the annotation of perspectives in text. In the last decade, different aspects of the linguistic encoding of perspectives have been targeted as separate phenomena by different annotation initiatives. We propose an annotation scheme that integrates these phenomena. We use a multilayered annotation approach, splitting the annotation of different aspects of perspectives into small subsequent subtasks in order to reduce the complexity of the task and to better monitor interactions between layers. Currently, we have included four layers of perspective annotation: events, attribution, factuality and opinion. The annotations are integrated in a formal model called GRaSP, which provides the means to represent instances (e.g. events, entities) and propositions in the (real or assumed) world in relation to their mentions in text. The relation between the source and target of a perspective is then characterised by means of perspective annotations. This enables us to place alternative perspectives on the same entity, event or proposition next to each other.
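To make the layered setup concrete, the sketch below encodes the core GRaSP distinction between things in the (real or assumed) world and their mentions in text, plus a perspective relation between a source and a target. The class and field names are our own illustration, not the model's actual vocabulary; the attribution, factuality and opinion values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mention:
    """A span of text referring to an instance or proposition."""
    doc_id: str
    start: int
    end: int
    text: str

@dataclass
class Instance:
    """An entity, event or proposition in the (real or assumed) world."""
    uri: str
    mentions: List[Mention] = field(default_factory=list)

@dataclass
class PerspectiveRelation:
    """How a source views a target, one value per annotation layer.
    Field names and value sets are illustrative, not GRaSP's own."""
    source: Instance   # e.g. the author or a quoted speaker
    target: Instance   # the entity, event or proposition at issue
    attribution: str   # whom the statement is attributed to
    factuality: str    # e.g. "certain", "probable", "counterfactual"
    opinion: str       # e.g. "positive", "negative", "neutral"

author = Instance("ex:author")
event = Instance("ex:event/1", [Mention("doc1", 10, 18, "resigned")])
rel = PerspectiveRelation(author, event, "author", "probable", "negative")
print(rel)
```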
Revista de Procesamiento de Lenguaje Natural (SEPLN), 2017
Sentiment Analysis is a well-known task of Natural Language Processing that has been studied in different domains such as movies, phones or hotels. However, other areas, such as the medical domain, remain largely unexplored. In this paper we study different polarity classification techniques applied to the health domain. We present a corpus of patient reviews composed of a Dutch part (COPOD: Corpus of Patient Opinions in Dutch) and a Spanish part (COPOS: Corpus of Patient Opinions in Spanish). Experiments have been carried out using a supervised method (SVM), a cross-domain method (OpeNER) and a dictionary lookup method for both languages. The results beat the baseline in almost all cases and are higher than those of other polarity classifiers in the patient domain. As to bilingualism, the systems developed for Dutch and Spanish perform similarly in terms of F1-measure and accuracy.
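As a rough illustration of the supervised setup, the following sketch trains a linear SVM on TF-IDF character n-gram features. The toy reviews and labels are placeholders; the paper's actual features, preprocessing and SVM settings are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder training data; in practice the COPOD/COPOS reviews go here.
reviews = ["de arts was erg behulpzaam", "el trato fue horrible",
           "uitstekende behandeling", "no recomiendo esta clinica"]
labels = ["positive", "negative", "positive", "negative"]

# Character n-grams sidestep language-specific tokenisation, which is a
# convenient (though not the paper's) choice when sketching one setup
# that must handle both Dutch and Spanish text.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = LinearSVC().fit(vectorizer.fit_transform(reviews), labels)

# Predict the polarity of an unseen review.
print(clf.predict(vectorizer.transform(["muy buena atencion"])))
```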
In this paper we propose a method to build fine-grained subjectivity lexicons including nouns, verbs and adjectives. The method, which is applied to Dutch, is based on a comparison of word frequencies across three corpora: Wikipedia, news and news comments. The corpora are compared with two measures: the log-likelihood ratio and a percentage difference calculation. The first step of the method involves subjectivity identification, i.e. determining whether a word is subjective or not. The second step aims at the identification of more fine-grained subjectivity, namely the distinction between actor subjectivity and speaker/writer subjectivity. The results suggest that this approach can be usefully applied to produce subjectivity lexicons of high quality.
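For concreteness, this is how the two corpus-comparison measures can be computed for a single word, following the standard Rayson-and-Garside-style log-likelihood formulation; the example frequencies and corpus sizes are invented.

```python
import math

def log_likelihood(freq1: int, freq2: int, total1: int, total2: int) -> float:
    """Log-likelihood ratio for a word's frequency in two corpora
    (Rayson & Garside formulation): higher means more distinctive."""
    e1 = total1 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 1
    e2 = total2 * (freq1 + freq2) / (total1 + total2)  # expected count, corpus 2
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / e1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / e2)
    return 2 * ll

def pct_diff(freq1: int, freq2: int, total1: int, total2: int) -> float:
    """Percentage difference between relative frequencies (per million words)."""
    rel1 = 1_000_000 * freq1 / total1
    rel2 = 1_000_000 * freq2 / total2
    return 100 * (rel1 - rel2) / rel2 if rel2 else float("inf")

# A word much more frequent in news comments than in Wikipedia is a
# candidate subjective word (invented counts, 1M-word corpora assumed).
print(log_likelihood(250, 40, 1_000_000, 1_000_000))
print(pct_diff(250, 40, 1_000_000, 1_000_000))
```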
One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology. A semantic database for Dutch is a crucial component of this infrastructure for three reasons: (1) it enables the development of …
Cornetto is a two-year project, funded by the Flemish-Dutch Taalunie in the STEVIN programme (project number STE05039). It produces a lexical semantic database for Dutch. The database combines Wordnet (Fellbaum 1998) with FrameNet-like information. The data is derived from two existing lexical resources …
Opinion mining is a natural language analysis task aimed at obtaining the overall sentiment regarding a particular topic. This paper presents a prototype that shows the overall sentiment on a topic based on the geographical distribution of the sources covering it. The prototype was developed in a single day during the hackathon organised by the OpeNER project in Amsterdam last year. The OpeNER infrastructure was used to process a large set of news articles in four different languages. Using these tools, an overall sentiment analysis was obtained for a set of topics mentioned in the news articles and presented on an interactive world map.
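A minimal sketch of the aggregation step behind such a map: average the polarity of topic mentions per source country. The input records are invented for illustration; the actual OpeNER pipeline exchanges much richer (KAF/NAF) documents.

```python
from collections import defaultdict

# Hypothetical per-mention output of a sentiment pipeline.
mentions = [
    {"topic": "elections", "country": "NL", "polarity": 0.6},
    {"topic": "elections", "country": "NL", "polarity": -0.2},
    {"topic": "elections", "country": "ES", "polarity": 0.1},
]

# Group polarity scores by (topic, country) and average them; each
# average would drive the colour of one country on the world map.
scores = defaultdict(list)
for m in mentions:
    scores[(m["topic"], m["country"])].append(m["polarity"])

for (topic, country), vals in scores.items():
    print(topic, country, sum(vals) / len(vals))
```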
This paper presents a lexicon model for subjectivity description of Dutch verbs that offers a framework for the development of sentiment analysis and opinion mining applications based on a deep syntactic-semantic approach. The model aims to describe the detailed subjectivity relations that exist between the participants of the verbs, expressing multiple attitudes for each verb sense. Validation is provided by an annotation study that shows that these subtle subjectivity relations are reliably identifiable by human annotators.
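As an illustration of what such participant-level subjectivity relations might look like when encoded, here is a small sketch; the field names and the example entry are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Attitude:
    """An attitude one verb participant holds towards another."""
    holder: str    # participant role expressing the attitude, e.g. "subject"
    target: str    # participant role the attitude is about, e.g. "object"
    polarity: str  # "positive" or "negative"

@dataclass
class VerbSense:
    lemma: str
    sense_id: int
    attitudes: List[Attitude]  # multiple attitudes per verb sense

# "bewonderen" (to admire): the subject holds a positive attitude
# towards the object (illustrative entry).
bewonderen = VerbSense("bewonderen", 1,
                       [Attitude("subject", "object", "positive")])
print(bewonderen)
```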
Semantic change and concept drift are studied in many different academic fields. Different domains have different understandings of what a concept, and thus concept drift, is, which makes it harder for researchers to build upon work in other disciplines. In this paper, we aim to address this challenge and propose definitions of these phenomena that apply across fields. We provide formal definitions and illustrate how concept drift and related phenomena can be modeled in RDF through the use of context. We explain and support the definitions through an example from historical research and argue that formally modeling semantic change in RDF can help to better interpret data.
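A hedged sketch of the general idea of contextualised concepts in RDF, using rdflib: the same concept gets time-indexed versions whose descriptions may differ across contexts. The namespace, vocabulary and example are invented and do not reproduce the paper's actual modelling.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace and vocabulary, for illustration only.
EX = Namespace("http://example.org/")

g = Graph()
# Two time-indexed versions of one concept: its label (and, in a fuller
# model, its intension and extension) differs between historical contexts.
for ctx, label in [("1800", "Batavian Republic"), ("2000", "Netherlands")]:
    version = EX[f"Netherlands_{ctx}"]
    g.add((version, RDF.type, EX.ConceptVersion))
    g.add((version, EX.versionOf, EX.Netherlands))
    g.add((version, EX.validDuring, Literal(ctx)))
    g.add((version, RDFS.label, Literal(label)))

print(g.serialize(format="turtle"))
```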
Many sentiment-analysis methods for the classification of reviews use training and test data based on star ratings provided by reviewers. However, when reading reviews it appears that the reviewers' ratings do not always give an accurate measure of the sentiment of the review. We performed an annotation study which showed that reader perceptions can also be expressed in ratings in a reliable way and that they are closer to the text than the reviewer ratings. Moreover, we applied two common sentiment-analysis techniques and evaluated them on both reader and reviewer ratings. We conclude that it would be better to train models on reader ratings rather than on reviewer ratings (as is usually done).
In this paper we present some of the features of the model used in the DOT project. The aim of this pilot project is to find out how to deal with official governmental terminological data in an efficient, consistent and multifunctional way, assuring a maximum of accessibility and user-friendliness. The project started on 1 January 1999 and will be finalised at the end of June 2000. Although the project has both representational and acquisitional aspects (such as term acquisition), this paper focuses only on aspects of the data model such as entities and links.
In this paper we focus on the creation of general-purpose (as opposed to domain-specific) polarity lexicons in five languages: French, Italian, Dutch, English and Spanish, using WordNet propagation. WordNet propagation is a commonly used method to generate these lexicons, as it gives high coverage of general-purpose language, and the semantically rich WordNets, in which concepts are organised in synonym, antonym and hyperonym/hyponym structures, seem well suited to the identification of positive and negative words. However, WordNets of different languages may vary in many ways, such as the way they are compiled, the number of synsets, the number of synonyms and the number of semantic relations they include. In this study we investigate whether this variability translates into differences in performance when these WordNets are used for polarity propagation. Although many variants of the propagation method have been developed for English, little is known about how they perform with WordNets of other languages.
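To show the flavour of the propagation method, here is a minimal variant over the English WordNet via NLTK: seed polarities spread to synset members and flip across antonym links. Real variants also follow hyperonym/hyponym relations and weight or decay the scores; this sketch is illustrative only.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def propagate(seeds, iterations=2):
    """Spread seed polarities to synset members; flip across antonyms.
    setdefault keeps the first (closest-to-seed) score a word receives."""
    polarity = dict(seeds)
    for _ in range(iterations):
        for word, score in list(polarity.items()):
            for synset in wn.synsets(word):
                for lemma in synset.lemmas():
                    polarity.setdefault(lemma.name(), score)
                    for ant in lemma.antonyms():
                        polarity.setdefault(ant.name(), -score)
    return polarity

lexicon = propagate({"good": 1.0, "bad": -1.0})
print(lexicon.get("unspoiled"), lexicon.get("evil"))  # expected: 1.0 -1.0
```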
In this paper we present the Vaccination Corpus, a corpus of texts related to the online vaccination debate that has been annotated with three layers of information about perspectives: attribution, claims and opinions. Additionally, events related to the vaccination debate are also annotated. The corpus contains 294 documents from the Internet which reflect different views on vaccinations. It has been compiled to study the language of online debates, with the final goal of experimenting with methodologies to extract and contrast perspectives in the framework of the vaccination debate.
The complexity of event data in texts makes it difficult to assess their content, especially when considering larger collections in which different sources report on the same or similar situations. We present a system that makes it possible to visually analyze complex event and emotion data extracted from texts. We show that we can abstract from different data models for events and emotions to a single data model that can show the complex relations in four dimensions. The visualization has been applied to analyze 1) dynamic developments in how people both conceive and express emotions in theater plays and 2) how stories are told from the perspective of their sources, based on rich event data extracted from news or biographies.
Results are presented of an ongoing project of the Dutch TST-centre for language and speech technology aimed at linking various lexical databases. The project involves four Dutch monolingual lexicons: WlNT05, e-Lex, RBN and RBBN. These databases differ in organisational structure and content. To enable linkage between these lexicons, we developed a common feature value set and a common organisational structure. Both are based upon existing standards for the creation and reusability of lexicons: the Lexical Markup Framework and the EAGLES standard. Examples of the content and structure of each of the lexical databases are presented in their original form. The structure and content are also shown when mapped onto the common framework and feature value set. Thus, the commonalities and the complementarity of the lexical databases become more readily apparent. Moreover, this elaboration of the databases opens up the opportunity for mutual enrichment.
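As a toy illustration of mapping lexicon-specific values onto a common feature value set, consider part-of-speech tags; the tag names below are invented, whereas the real mapping follows the Lexical Markup Framework and EAGLES.

```python
# Map (lexicon, local tag) pairs onto a shared feature value set so that
# entries from different databases align. Tags here are hypothetical.
COMMON_POS = {
    ("e-Lex", "N"): "noun",
    ("RBN", "znw"): "noun",
    ("e-Lex", "WW"): "verb",
    ("RBN", "ww"): "verb",
}

def to_common(lexicon: str, tag: str) -> str:
    """Return the common feature value, or 'unknown' if unmapped."""
    return COMMON_POS.get((lexicon, tag), "unknown")

print(to_common("RBN", "znw"))  # -> "noun"
```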