Papers by Jyrki Nummenmaa
Applied Sciences
In developing NoSQL databases, a major motivation is to achieve more efficient query performance than relational databases offer. The graph database is a NoSQL paradigm where navigation is based on links instead of joining tables. Links can be implemented as pointers, and following a pointer is a constant-time operation, whereas joining tables is more complicated and slower, even in the presence of foreign keys. Therefore, link-based navigation has been seen as a more efficient query approach than using join operations on tables. Existing studies strongly support this assumption. However, query complexity has received less attention. For example, in enterprise information systems, queries are usually complex, so data need to be collected from several tables or by traversing paths of graph nodes of different types. In the present study, we compared the query performance of a graph-based database system (Neo4j) and relational database systems (MySQL and MariaDB). The effect of dif...
IEEE/WIC/ACM International Conference on Web Intelligence
The widespread use of information systems has become a valuable source of semi-structured data. In this context, Entity Resolution (ER) emerges as a fundamental task to integrate multiple knowledge bases or identify similarities between data items (i.e., entities). Since ER is an inherently quadratic task, blocking techniques are often used to improve efficiency. Beyond the challenges related to the data volume and heterogeneity, blocking techniques also face two other challenges: streaming data and incremental processing. To address these challenges, we propose PRIME, a novel incremental schema-agnostic blocking technique that utilizes parallelism to enhance blocking efficiency. The proposed technique deals with streaming and incremental data using a distributed computational infrastructure. To improve efficiency, the technique avoids unnecessary comparisons and applies a time window strategy to prevent excessive memory consumption.
2018 IEEE International Conference on Progress in Informatics and Computing (PIC), 2018
We introduce Multilingual Grammatical Question Answering (MuG-QA), a system for answering questions in the English, German, Italian and French languages over DBpedia. The natural language modelling and parsing are implemented using Grammatical Framework (GF), a grammar formalism with natural support for multilinguality. The question analysis is based on forming an abstract conceptual grammar from the questions, and then using linearisation of the abstract grammar into different languages to parse the questions. Once a natural language question is parsed, the resulting abstract grammar tree is matched with the knowledge base schema and contents to formulate a SPARQL query. A particular strength of our approach is that once the abstract grammar has been designed, implementation for a new concrete language is relatively quick, provided that the language has basic support in the GF Resource Grammar Library. MuG-QA has been tested with data from the QALD-7 benchmark and showed competitive results.
Journal of Intelligent Information Systems, 2021
Recently, group recommendations have gained much attention. Nevertheless, most approaches consider only one round of recommendations. In a real-life scenario, the history of previous recommendations is expected to be exploited to tailor the recommendations to the needs of the group members. Such history should include not only which items the system suggested, but also the reaction of the members to these items. This work introduces the problem of sequential group recommendations, exploiting the concepts of satisfaction and disagreement. Satisfaction describes how well the group received the suggested items. Disagreement describes the satisfaction bias among the group members. We utilize these concepts in three new aggregation methods, SDAA, SIAA and Average+, designed to address the specific challenges introduced by sequential group recommendations. We experimentally show the effectiveness of our methods using large real datasets for both stable and ephem...
Advanced Data Mining and Applications, 2018
Multiplayer Online Battle Arena (MOBA) games are currently one of the most popular genres of online games. In a MOBA game, players in a team compete against an opposing team. Typically, each MOBA game is a larger battle composed of a series of combat events. During a combat, the behavior of each player varies, and the outcome of a game is determined both by the variation of each player’s behavior and by the interactions within each instance of combat. However, both the variation and the interaction are highly dynamic and difficult to master, making it hard to predict the outcome of a game. In this paper, we present a player behavior model (called pb-model). The model allows us to predict the result of a game once we have collected enough data on the behavior of the players. We first use convolution to extract the features of player behavior variation in each combat and model them as sequences by time. Then we use a recurrent neural network to process the interaction among these sequences. Finally, we combine these two structures in a network to predict the result of a game. Experiments performed on a typical MOBA game dataset verify that our pb-model is effective and achieves as high as 87.85% prediction accuracy.
Journal of the Association for Information Science and Technology, 2021
We present a search system for grammatically analyzed corpora of Finnish parliamentary records and interviews with former parliamentarians, annotated with metadata of talk structure and involved parliamentarians, and discuss their use through carefully chosen digital humanities case studies. We first introduce the construction, contents, and principles of use of the corpora. Then we discuss the application of the search system and the corpora to study how politicians talk about power, how ideological terms are used in political
Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, 2017
In this paper, a new multi-attribute and high capacity image-based watermarking technique for relational data is proposed. The embedding process introduces low distortion into the data while respecting the usability restrictions defined over the marked relation. The conducted experiments show the high resilience of the proposed technique against tuple deletion and tuple addition attacks. We also analyze an interesting trend in the extracted watermark: within certain limits, when the number of embedded marks is small, the watermark signal, far from being compromised, improves modestly under tuple addition attacks. According to the results, when 13% of the attributes are marked and the relation is subjected to a 100% tuple addition attack, 96% of the watermark is extracted. Also, while previous techniques embed up to 61% of the watermark, under the same conditions, we guarantee to embed 99.96% of the marks.
With the development of e-commerce, online shopping has become increasingly popular. Very often, online shopping customers read reviews written by other customers to compare similar items. However, the number of customer reviews is typically too large to look through in a reasonable amount of time. To extract information that can be used for online shopping decision support, this paper investigates a novel data mining problem of mining distinguishing customer focus sets from customer reviews. We demonstrate that this problem has many applications and, at the same time, is challenging. We present dFocus-Miner, a mining method with various techniques that make the mined results interpretable and user-friendly. Our experimental results on real-world data sets verify the effectiveness and efficiency of our method.
Database Systems for Advanced Applications, 2021
Database Systems for Advanced Applications, 2021
Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020
Nonnegative matrix factorization (NMF) is a popular approach to modeling data; however, most models are unable to flexibly take into account multiple matrices across sources and time, or apply only to integer-valued data. We introduce a probabilistic, Gaussian Process-based, more inclusive NMF-based model which jointly analyzes nonnegative data, such as the word content of text data, from multiple sources in a temporally dynamic manner. The model collectively models observed matrix data, source-wise latent variables, and their dependencies and temporal evolution with a full-fledged hierarchical approach including flexible nonparametric temporal dynamics. Experiments on simulated data and real data show that the model outperforms comparable models. A case study on social media and news demonstrates that the model discovers semantically meaningful topical factors and their evolution.
Uncertainty Management with Fuzzy and Rough Sets, 2019
This chapter shows an application of fuzzy set theory to preventive health support systems, where adherence to medical treatment is an important measure to promote health and reduce health care costs. The design of preventive health care information technology systems includes ensuring adherence to treatment through Just-In-Time Adaptive Interventions (JITAI). Determining the timing of the intervention and the appropriate intervention strategy are two of the main difficulties facing current systems. In this work, a JITAI system called Health-e-living (Heli) was developed for a group of patients with type-2 diabetes. During the development stages of Heli, it was verified that the state of each user is fuzzy and that it is difficult to find the right moment to send a motivational message without being annoying. A fuzzy formula is proposed to measure the adherence of patients to their goals. As the adherence measurement needed more data, the DisCo software toolset was introduced for formal specificati...
Increasing the on-time rate of bus service can increase people’s willingness to travel by bus, which is an effective measure to mitigate urban traffic congestion. Queries on bus arrival data can be used to identify and analyze the various kinds of non-on-time events that happen during a bus journey, which helps in detecting the factors behind delays and provides decision support for optimizing bus schedules. We propose a data management model, called Bus-OLAP, for querying bus monitoring data, considering the characteristics of bus monitoring data and the scenarios of on-time analysis. While fulfilling typical requirements of bus monitoring data analysis, Bus-OLAP not only provides a flexible way to manage the data and to implement data queries and updates at multiple granularities, but also supports distributed query and computation. The experiments on real-world bus monitoring data verify that Bus-OLAP is effective and efficient.
We introduce PIHVI: a novel interactive system for visualizing and exploring a large hierarchical text corpus of online forum postings. The main view of the visual interface shows a large-scale scatter plot, created by flexible nonlinear dimensionality reduction based on text contents of the postings, and we couple it with a coloring optimized by a second dimensionality reduction to represent the forum hierarchy. We exploit the hierarchy to provide data-driven summaries of plot areas at multiple levels of detail, allowing the user to quickly see and compare both the content-based similarity of groups of posts and how close to each other they arise in the forum hierarchy. A user can move between hierarchy levels, mark posts or spots of interest, filter posts by content similarity and by location within the hierarchy, and inspect post contents. Experiments show the interface can reveal hidden semantic relationships between postings that would be hard to find based on the known hierarchy alone. ACM Cl...
Information hiding techniques have been used to pass secret messages unnoticed since ancient times, and nowadays they also serve to prove ownership of digital assets. The growth of internet services has made illegal or unauthorized copies of datasets easy to access, so piracy is thriving. With watermarking emerging as a tool for ownership proof, traitor tracing, etc., several techniques have been developed for multimedia data, but far fewer for relational data. Because these data types differ, watermarking schemes for relational data must be conceived from a different angle, and must also deal with the new problems that have emerged. With our research, we seek to develop a robust technique, based on meaningful signals, for watermarking relational data. The watermark must be resilient against common updates, and it must also be resilient against bit-level attacks that try to destroy the watermark.
With the availability of diverse data reflecting people’s behavior, behavior analysis has been studied extensively. Detecting anomalies can improve the monitoring and understanding of the objects’ (e.g., people’s) behavior. This work considers the situation where objects behave significantly differently from their previous (past) similar objects. We call this locally anomalous behavior change. Locally anomalous behavior change detection is relevant to various practical applications, e.g., detecting elderly people with abnormal behavior. In this paper, making use of objects, behaviors, and their associated attributes, as well as the relations between them, we propose a behavior information sequence (BIS) constructed from behavior data, and design a novel graph information propagation autoencoder framework called LOCATE (locally anomalous behavior change detection) to detect the anomalies involving locally anomalous behavior change in the BIS. Two real-world datasets were used to a...
Feature Engineering for Machine Learning and Data Analytics, 2018
Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020
Currently, a large number of information systems produce a large amount of data continuously. Since these sources may have overlapping knowledge, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Considering the quadratic cost of the ER task, blocking techniques are often used to improve efficiency. Such techniques face two main challenges related to data volume (i.e., large data sources) and variety (i.e., heterogeneous data). Besides these challenges, blocking techniques also face two other ones: streaming data and incremental processing. To address these four challenges simultaneously, we propose PI-Block, a novel incremental schema-agnostic blocking technique that utilizes parallelism (through distributed computational infrastructure) to enhance blocking efficiency. In our experimental evaluation, we use four real-world data source pairs, and highlight that PI-Block achieves better results regarding efficiency and effectiveness compared to the state-of-the-art technique.
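To illustrate why blocking reduces the quadratic cost of ER, the following is a minimal sketch of plain schema-agnostic token blocking. It is not PI-Block (it is neither incremental nor parallel); the entity identifiers, attribute names, and the functions `token_blocks` and `candidate_pairs` are assumptions made up for this example.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical entities from two sources with different schemas.
entities = {
    "a1": {"name": "Ada Lovelace", "born": "London"},
    "a2": {"name": "Grace Hopper", "born": "New York"},
    "b1": {"fullName": "Lovelace, Ada", "job": "mathematician"},
    "b2": {"label": "Alan Turing", "place": "London"},
}

def token_blocks(entities):
    """Group entity ids by every token found in any attribute value.

    Attribute names are ignored, which is what makes the scheme
    schema-agnostic. Blocks with a single entity are dropped because
    they produce no comparisons.
    """
    blocks = defaultdict(set)
    for entity_id, record in entities.items():
        for value in record.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks):
    """Only entities sharing at least one block are compared."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

pairs = candidate_pairs(token_blocks(entities))
print(sorted(pairs))
```

With four entities there are six possible pairs; in this toy data only the pairs that share a token are kept as candidates for the expensive comparison step.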
Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020
Recommender systems
• Content-based method: recommended items are similar to items that the user has already consumed.
• Collaborative filtering method: a relevance function uses items from sufficiently similar users (based on ratings or other feedback, such as textual reviews or like/dislike) to produce the list of relevant items for the target user.
• Group recommendations: a group can make a query as well; a recommendation method is applied to each member individually, and the separate lists are then aggregated into one for the group.
Group recommendation process
• Average approach: calculate the average of all the group members' preference scores for an item. In this way, all members of the group are considered equal.
• Least misery approach: use the minimum function rather than the average.
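The average and least misery approaches can be sketched in a few lines. This is a minimal illustration under stated assumptions: the member names, item names, and scores are invented, and `aggregate` and `rank` are hypothetical helper names, not part of any system described here.

```python
from statistics import mean

# Hypothetical per-member preference scores: member -> {item: score}.
scores = {
    "alice": {"film_a": 5.0, "film_b": 2.0, "film_c": 4.0},
    "bob":   {"film_a": 1.0, "film_b": 4.0, "film_c": 4.0},
    "carol": {"film_a": 5.0, "film_b": 3.0, "film_c": 3.5},
}

def aggregate(scores, combine):
    """Combine the members' scores for each item scored by every member."""
    items = set.intersection(*(set(member) for member in scores.values()))
    return {
        item: combine([member[item] for member in scores.values()])
        for item in items
    }

def rank(group_scores):
    """Order items by descending group score, breaking ties by name."""
    return sorted(group_scores, key=lambda item: (-group_scores[item], item))

average_scores = aggregate(scores, mean)       # all members count equally
least_misery_scores = aggregate(scores, min)   # least satisfied member decides

print(rank(average_scores))
print(rank(least_misery_scores))
```

In this toy data `film_a` ranks above `film_b` under the average approach, but falls to last place under least misery because one member strongly dislikes it.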
Missouri Journal of Mathematical Sciences, 2007
We consider the problem of finding all the permutations corresponding to a given Young tableau. We introduce a recursive algorithm that deletes the elements from the tableau in the reverse of the order in which they were inserted by the Schensted algorithm.
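The idea of undoing Schensted insertions can be sketched as follows. This is an illustrative reconstruction from the description above, not necessarily the paper's exact algorithm; the function names `insert`, `delete`, and `permutations_with_tableau` are assumptions. A tableau is represented as a list of rows, each a strictly increasing list of integers.

```python
def insert(tableau, x):
    """Schensted row insertion of x; returns a new tableau."""
    rows = [row[:] for row in tableau]
    for row in rows:
        for i, y in enumerate(row):
            if y > x:
                # Bump the leftmost entry larger than x into the next row.
                row[i], x = x, y
                break
        else:
            row.append(x)
            return rows
    rows.append([x])
    return rows

def delete(tableau, r):
    """Undo the insertion that ended in the corner cell closing row r.

    Returns the smaller tableau and the element whose insertion is undone.
    """
    rows = [row[:] for row in tableau]
    x = rows[r].pop()
    if not rows[r]:
        rows.pop(r)  # an emptied corner row is always the last row
    for i in range(r - 1, -1, -1):
        row = rows[i]
        # Reverse bump: x replaces the rightmost entry smaller than x.
        j = max(k for k, y in enumerate(row) if y < x)
        row[j], x = x, row[j]
    return rows, x

def permutations_with_tableau(tableau):
    """All sequences whose Schensted insertion tableau equals `tableau`."""
    if not tableau:
        return [[]]
    result = []
    for r, row in enumerate(tableau):
        is_corner = r + 1 == len(tableau) or len(tableau[r + 1]) < len(row)
        if is_corner:
            smaller, x = delete(tableau, r)
            # x was inserted last, so it ends the permutation.
            result.extend(p + [x] for p in permutations_with_tableau(smaller))
    return result

target = [[1, 2], [3]]
found = permutations_with_tableau(target)
print(sorted(found))
```

Each corner cell of the tableau is a possible position of the last insertion, so the recursion branches over the corners and prepends the permutations of the smaller tableau.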