Proceedings of The Eighteenth Joint ACL - ISO Workshop on Interoperable Semantic Annotation, 2022
Reasoning about spatial information is fundamental in natural language to fully understand relationships between entities and/or between events. However, the complexity underlying such reasoning makes it hard to formally represent spatial information. Despite the growing interest in this topic, and the development of some frameworks, many problems persist regarding, for instance, the coverage of a wide variety of linguistic constructions and of languages. In this paper, we present a proposal for integrating ISO-Space into an ISO-based multilayer annotation scheme designed to annotate news in European Portuguese. This scheme already enables annotation at three levels, temporal, referential and thematic, by combining postulates from ISO 24617-1, 4 and 9. Since the corpus comprises news articles, and spatial information is relevant in this kind of text, a more detailed account of space was required. The main objective of this paper is to discuss the process of integrating ISO-Space with the existing layers of our annotation scheme, assessing the compatibility of the aforementioned parts of ISO 24617, and the problems posed by the harmonization of the four layers and by some specifications of ISO-Space.
... Steve Moyle (1), Alípio Jorge (2)(3). (1) Oxford University Computing Laboratory, UK. (2) LIACC, University of Porto, Rua Campo Alegre 823, 4150 Porto, Portugal. (3) Faculty of Economics, University of Porto, Portugal. [email protected], http://www.niaad.liacc.up.pt/~amjorge ...
In this paper we study a new technique we call post-bagging, which consists of resampling parts of a classification model rather than the data. We do this with a particular kind of model: large sets of classification association rules, in combination with ordinary best-rule and weighted-voting approaches. We empirically evaluate the effects of the technique in terms of classification accuracy. We also discuss the predictive power of different metrics used for association rule mining, such as confidence, lift, conviction and χ². We conclude that, under the described experimental conditions, post-bagging improves classification results and that the best metric is conviction.
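Under their standard definitions, the rule metrics mentioned above can all be computed from the contingency counts of a rule A → C. A minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
def rule_metrics(n, n_a, n_c, n_ac):
    """Metrics for a rule A -> C given:
    n    : total number of transactions
    n_a  : transactions containing the antecedent A
    n_c  : transactions containing the consequent C
    n_ac : transactions containing both A and C
    """
    confidence = n_ac / n_a
    supp_c = n_c / n
    lift = confidence / supp_c
    # conviction compares the expected and observed rates of rule failure
    conviction = float("inf") if confidence == 1 else (1 - supp_c) / (1 - confidence)
    # chi-squared statistic over the 2x2 contingency table (A vs C)
    observed = [
        (n_ac, n_a - n_ac),               # A row: with C, without C
        (n_c - n_ac, n - n_a - n_c + n_ac)  # not-A row
    ]
    rows, cols = [n_a, n - n_a], [n_c, n - n_c]
    chi2 = sum(
        (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2) for j in range(2)
    )
    return {"confidence": confidence, "lift": lift,
            "conviction": conviction, "chi2": chi2}
```

For example, with 100 transactions, 40 containing A, 50 containing C and 30 containing both, the rule A → C has confidence 0.75, lift 1.5 and conviction 2.0.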
MODEL-BASED COLLABORATIVE FILTERING FOR TEAM BUILDING SUPPORT. Miguel Veloso, Enabler Solutions for Retailing, Av. da Boavista, 1223, 4100-130 Porto, Portugal. [email protected], [email protected] ...
We present a system to monitor the quality of the meta-data used to describe content in web portals. It implements meta-data analysis using statistics, visualization and association rules. The system enables the site's editor to detect and correct problems in the description of contents, thus improving the quality of the web portal and the satisfaction of its users. We have developed this system and tested it on a Portuguese portal for business executives.
The analysis, design and maintenance of Web sites involve two significant challenges: first, managing the services and content available; and second, dynamically adapting the site to users' needs. The Site-O-Matic project (SOM) aims to develop a comprehensive framework for automating several of the management activities of a Web site. Such a framework must include a suitable database infrastructure where all the information about the activity of the site is stored. In this paper we propose a data warehouse for that purpose and relate it to the most common Web mining applications. We also present a case study where we integrate the data warehouse with an existing Web site through a simple recommender system.
In this paper we describe a platform that enables Web site automation and monitoring. The platform automatically gathers high-quality site activity data, both from the server and client sides. Web adapters, such as recommender systems, can be easily plugged into the platform and take advantage of the up-to-date activity data. The platform also includes a module that supports the editor of the site in monitoring and assessing the effects of automation. We illustrate the features of the platform on a case study, where we show how it can be used to gather information not only about the behavior of users but also about the impact of the personalization mechanism.
To emphasize the common themes, we will give a combined summary of the contributions in this volume. To make it easier to understand the papers in the organizational context for which they were written and in which they were discussed, we have ordered them by workshop in the table of contents.
From very early on, the human species has felt the need to keep records of its activity, so that they could easily be consulted in the future. Our own evolution depends, to a large extent, on this iterative process in which each iteration builds on these records. The appearance of the web and its success significantly increased the availability of information, which quickly became ubiquitous. However, the absence of editorial control gives rise to great heterogeneity in several respects. Traditional information retrieval techniques prove insufficient for this new medium. Web information retrieval is the natural evolution of the information retrieval field for the web medium. In this article we present a retrospective and, we hope, comprehensive analysis of this area of human knowledge.
The web is a vast repository of human knowledge that is widely explored as an information source; however, web characteristics such as volume, dynamics and heterogeneity require significant effort to satisfy private or institutional information needs. Topic-focused portals try to reduce the effort required of end-users at the expense of extra time and effort required of an editor, who manages the portal on behalf of all its end-users. This editorial effort can be reduced by semi-automating some of the editor's tasks. The broader aim of this work is to contribute to the development of a computational framework able to assist web site editors in retrieving and managing web content. We expect to reduce the cost of maintaining web sites and also to improve end-user satisfaction.
In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled examples to train a classifier. In such circumstances it is common to have massive corpora where a few examples are labeled (typically a minority) while others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled examples to improve classification models. However, these techniques assume that the labeled examples cover all the classes to be learned, which might not be the case. Moreover, in the presence of an imbalanced class distribution, getting labeled examples from minority classes might be very costly, requiring extensive labeling, if queries are randomly selected. Active learning asks an oracle to label new, judiciously selected examples, aiming to reduce the labeling effort. D-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we discuss the performance of d-Confidence over text corpora. We show empirically that d-Confidence reduces the number of queries required to identify examples from all the classes to be learned in the presence of imbalanced data in text corpora.
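The pool-based setting d-Confidence operates in can be illustrated with a generic active-learning loop. The coverage-style criterion below (query the unlabeled point farthest from every labeled example) only approximates the goal of reaching unseen minority classes quickly; it is not the paper's d-Confidence criterion, and all names are illustrative:

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def query_farthest(pool, labeled):
    """Score each candidate by its distance to the closest labeled example
    and pick the one lying farthest from everything already labeled."""
    return max(pool, key=lambda x: min(euclid(x, y) for y in labeled))

def active_learning(pool, oracle, seed, budget):
    """Pool-based loop: repeatedly ask the oracle to label the selected
    example until the query budget runs out."""
    labeled = dict(seed)  # point -> label
    for _ in range(budget):
        candidates = [x for x in pool if x not in labeled]
        if not candidates:
            break
        x = query_farthest(candidates, list(labeled))
        labeled[x] = oracle(x)
    return labeled
```

With one labeled seed point and a single query, the loop reaches the most distant region of the pool first, which is where unlabeled minority-class examples tend to hide.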
In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled examples to train a classifier. In such circumstances it is common to have massive corpora where a few examples are labeled (typically a minority) while others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled examples to improve classification models. However, these techniques assume that the labeled examples cover all the classes to be learned, which might not hold. In the presence of an imbalanced class distribution, getting labeled examples from minority classes might be very costly if queries are randomly selected. Active learning asks an oracle to label new, judiciously selected examples, and does not assume previous knowledge of all classes. D-Confidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we discu...
We present a generic model and software module of spreading activation, and its specialisation to support a number of specific models in the literature. We hope the unification thus provided helps understand spreading activation in general and compare specific models. We also provide a new specific model, Watermark, that reduces the number of parameters of a class of specific models.
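The generic mechanism can be sketched as plain spreading activation over a weighted graph. The parameters below (a decay factor and a fixed iteration count) are a common textbook parameterisation, not the paper's Watermark model:

```python
def spread(graph, activation, decay=0.5, iterations=3):
    """graph: node -> {neighbor: edge weight}; activation: node -> energy.
    Each round, every node sends decay * its current activation along its
    outgoing edges, split proportionally to the edge weights."""
    act = dict(activation)
    for _ in range(iterations):
        nxt = dict(act)  # nodes keep their activation and also propagate it
        for node, out in graph.items():
            if not out:
                continue  # no outgoing edges, nothing to propagate
            energy = act.get(node, 0.0) * decay
            total = sum(out.values())
            for neighbor, weight in out.items():
                nxt[neighbor] = nxt.get(neighbor, 0.0) + energy * weight / total
        act = nxt
    return act
```

For instance, a single node "a" with activation 1.0 and one edge to "b" passes 0.5 to "b" after one iteration with decay 0.5.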
In this paper we discuss how trip time prediction can be useful for operational optimization in mass transit companies and which machine learning techniques can be used to improve results. Firstly, we analyze which departments need trip time prediction and when. Secondly, we review related work and, thirdly, we present the analysis of trip time over a particular path. We proceed by presenting experimental results conducted on real data with the forecasting techniques we found most adequate, and conclude by discussing guidelines for future work.
In this paper we discuss how trip time prediction can be useful for operational optimization in mass transit companies and how data mining techniques can be used to improve results. Firstly, we analyze which departments need trip time prediction and when. Secondly, we review related work and, thirdly, we present the analysis of trip time over a particular path. We proceed by presenting experimental results conducted on real data with the forecasting techniques we found most adequate, and conclude by discussing guidelines for future work.
The first multidimensional algorithm for recommender systems is the well-known combined reduction-based algorithm, which treats additional dimensions as labels for segmenting/filtering sessions, using the segmented sessions to build the recommendation model. This algorithm only uses the additional dimensions when doing so outperforms the traditional two-dimensional algorithm. Otherwise, it reverts to the traditional two-dimensional algorithm to generate the top-N recommendations. In this paper, we propose to improve the combined reduction-based algorithm by using the DaVI approach, which handles additional dimensions as virtual items. By incorporating the DaVI approach into the combined reduction-based algorithm, the multidimensional algorithm uses the additional dimensions not only as labels for segmenting sessions but also as virtual items to improve the recommendation model. The empirical results demonstrate that our proposal reduces the need to revert to the traditional two-dimensional algorithm to gener...
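The core idea of handling additional dimensions as virtual items can be sketched as a simple session transformation; session and dimension names below are illustrative, and the real DaVI pipeline of course does more than this step:

```python
def sessions_with_virtual_items(sessions):
    """Each session is (items, context), where context maps a dimension
    name (e.g. day of week) to its value for that session. Context values
    are appended to the item list as tagged virtual items, so a standard
    two-dimensional recommender can consume them like ordinary items."""
    enriched = []
    for items, context in sessions:
        virtual = [f"{dim}={val}" for dim, val in sorted(context.items())]
        enriched.append(list(items) + virtual)
    return enriched
```

A session of items ["i1", "i2"] recorded on a Saturday would become ["i1", "i2", "day=sat"], letting co-occurrence between items and context surface in the model.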
We present a schema for documenting and classifying completed Data Mining, Decision Support and Text and Web Mining cases. Project descriptions from these areas are unified in a hierarchically structured relational database. The main objectives and benefits of the repository are presented and discussed.
Papers by Alipio Jorge