Challenges for Recommender Systems Evaluation
Francesco Ricci
David Massimo
Antonella De Angeli
[email protected]
[email protected]
[email protected]
Free University of Bozen-Bolzano
Bolzano, Italy
offline and online analysis of the same RS. We discuss some of their limitations by referring to a specific case study in the travel and tourism domain. In particular, we retrospectively discuss the offline and online evaluation that we carried out to assess the quality of a tourism RS that suggests the next POI to users visiting a city [23–25]. We compare the recommendation accuracy, novelty and utility measured offline with the explicit feedback that users gave to the recommendations presented in a live user study. This case study is used to ground a more general discussion of new research directions for HCI and RS to overcome the identified issues and limitations of RS evaluation methods.
2 OFFLINE EXPERIMENTS ward that users in a cluster obtain by performing specific POI-visits.
The quality of RSs has traditionally been assessed offline by computing prediction accuracy metrics [8, 20]. Accuracy relates to the ability of the RS to predict either the observed user choices or the explicit evaluations of items. In the first case one measures precision, the fraction of recommended items that are relevant, and recall, the fraction of relevant items that are recommended. In the second case accuracy is measured by regression-type error metrics, such as the Mean Absolute Error (MAE) or the Root Mean Square Error (RMSE) [12, 27].
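As a concrete reference, the following minimal Python sketch (ours, for illustration only; the POI identifiers and rating values are invented) computes the four metrics just defined for a single user:

```python
def precision_recall(recommended, relevant):
    # Top-N accuracy: share of recommendations that are relevant,
    # and share of relevant items that were recommended.
    hits = len(set(recommended) & set(relevant))
    return (hits / len(recommended) if recommended else 0.0,
            hits / len(relevant) if relevant else 0.0)

def mae_rmse(predicted, actual):
    # Regression-type accuracy over predicted vs. observed ratings.
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return mae, rmse

# 2 of the 3 recommended POIs appear among the 4 relevant ones:
# precision = 2/3, recall = 2/4.
print(precision_recall(["duomo", "uffizi", "boboli"],
                       ["uffizi", "duomo", "bargello", "pitti"]))
print(mae_rmse([4.1, 3.0, 4.8], [4, 3, 5]))
```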
However, it has been pointed out that optimising an RS for accuracy yields suggestions that are often uninteresting, as they too closely match what the user is observed to typically choose. Moreover, such recommendations self-reinforce the consumption of blockbuster items [4, 26, 30, 34]. In [4, 26] it is argued that a proper assessment of an RS should be based on a wider spectrum of recommendation quality indicators. Hence, specific metrics for measuring the novelty of recommendations are proposed in [30, 31]. Furthermore, in [22] a metric is proposed to assess the similarity of the properties of the suggested items to those in the test set. Finally, [10, 12] indicate metrics that assess global properties of an RS.
Offline studies offer a quick and inexpensive tool for evaluating RS performance, at the cost of the restrictive assumption that only the evaluations present in the test set define the users' preferences. But preferences for the novel items that an RS could suggest are not present in that data set, i.e., items not yet "evaluated" by the user are all treated as bad recommendations. Moreover, the interaction context is not considered in offline studies: it is erroneously assumed that a user who has evaluated an item in one context would give the same evaluation in another.
In our recent work [23, 25], we have designed an RS that can assist users in sequential decision making in a tourism application. We dealt with the next-POI recommendation problem: supporting a tourist looking for an additional attraction to visit after having visited some other attractions. Our RS tries not only to predict what a user would visit anyway, which should correspond to the POI visits contained in the test set, but also to suggest novel (not yet known) items that may lead to "rewarding" tourism experiences. As we mentioned above, assessing whether a "novel" recommendation is also relevant is conceptually impossible in a standard offline analysis. Hence, we optimise the "reward" a user obtains from a POI visit, with a method that can estimate this reward also for novel items.
Operatively, we first cluster the observed POI-visit trajectories so that users with similar interests and behaviour are grouped together. We then apply Inverse Reinforcement Learning (IRL) [1, 3] to learn a generalised user behaviour model, one per cluster, i.e., the (unknown) reward that the users in a cluster obtain by performing specific POI visits. Next-POI recommendations are then generated by exploiting the reward function estimated by the IRL algorithm [23, 24].
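The pipeline can be sketched as follows. This is our simplified illustration, not the code of [23, 24]: trajectories are clustered on mean POI-feature vectors, and the IRL step is replaced by a naive frequency-weighted reward purely as a placeholder for the actual IRL estimation.

```python
import numpy as np
from sklearn.cluster import KMeans

def trajectory_features(trajectory, poi_features):
    # Represent a trajectory by the mean feature vector of its POIs.
    return np.mean([poi_features[p] for p in trajectory], axis=0)

def cluster_trajectories(trajectories, poi_features, k=4):
    # Group similarly behaving users; one behaviour model per cluster.
    X = np.array([trajectory_features(t, poi_features) for t in trajectories])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

def fit_cluster_reward(cluster_trajs, poi_features):
    # Placeholder for IRL: weight each POI feature by how strongly the
    # cluster's observed visits exhibit it; reward = weighted feature sum.
    w = np.mean([trajectory_features(t, poi_features) for t in cluster_trajs], axis=0)
    return lambda poi: float(w @ poi_features[poi])
```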
In Table 1 we show the results obtained by the two IRL-based recommendation strategies that we devised, Q-BASE and Q-POP PUSH, plus a baseline RS named SKNN. Q-BASE exploits the learnt user behaviour model by suggesting the most rewarding POI visit given the last visit performed by the target user. Q-POP PUSH instead simultaneously optimises two criteria: a) the reward of the next POI visit (as Q-BASE does); and b) the popularity of the POI. SKNN is a nearest-neighbour RS that, given the POI-visit trajectory of a user, searches for users who performed similar visits and recommends the most popular next POI visits performed by those similar users.
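In schematic form, the three strategies rank a candidate next POI as below. The linear reward/popularity trade-off with coefficient alpha and the overlap-based trajectory similarity are our assumptions, since the exact formulations are given in [23, 24]:

```python
from collections import Counter

def q_base(candidates, reward):
    # Most rewarding next POI according to the learnt behaviour model.
    return max(candidates, key=reward)

def q_pop_push(candidates, reward, popularity, alpha=0.5):
    # Jointly favour estimated reward and POI popularity.
    return max(candidates,
               key=lambda p: alpha * reward(p) + (1 - alpha) * popularity[p])

def sknn(user_traj, all_trajs, k=20):
    # Neighbours: trajectories sharing the most POIs with the user's one;
    # recommend the next POI most popular among those neighbours.
    sim = lambda t: len(set(t) & set(user_traj))
    neighbours = sorted(all_trajs, key=sim, reverse=True)[:k]
    votes = Counter(p for t in neighbours for p in t if p not in user_traj)
    return votes.most_common(1)[0][0] if votes else None
```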
Table 1 reports three performance indicators: Reward, the average increase in reward that a user obtains by acting as recommended rather than as she actually did (in the test set); Precision, the proportion of the recommendations found in the test set; and Novelty, the percentage of recommendations that fall among the less popular items in the data (see [23, 24]).
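Under our reading of these definitions (the exact formulations are in [23, 24]; the popularity threshold separating the long tail is an assumption), the three indicators can be computed as:

```python
def offline_metrics(recs, test_visits, reward, popularity, pop_threshold):
    # Reward: average gain from following the recommendation instead of
    # the next visit actually observed in the test set (paired lists).
    reward_lift = sum(reward(r) - reward(v)
                      for r, v in zip(recs, test_visits)) / len(recs)
    # Precision: share of recommendations occurring in the test set.
    precision = sum(r in set(test_visits) for r in recs) / len(recs)
    # Novelty: share of recommendations drawn from the long tail.
    novelty = sum(popularity[r] < pop_threshold for r in recs) / len(recs)
    return reward_lift, precision, novelty
```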
The first observation is that Q-BASE, compared to the SKNN baseline, suggests POI visits that are more rewarding and more novel, at the cost of a lower precision. Secondly, we note that, by trying to optimise both the reward and the popularity of the recommended POIs, Q-POP PUSH actually loses Q-BASE's capability to suggest high-reward POIs and scores similarly to SKNN. This clearly illustrates the "dilemma" of many RSs: either be more precise, and hence recommend more popular and possibly obvious items, or recommend novel items that are nevertheless estimated to be relevant for the user.
Figure 1: Evaluation GUI. From top to bottom: itinerary detail; info box; recommendations and POI details
3 USER STUDY

In a second stage we conducted a live user study to understand to what extent the results of the offline study could be confirmed by the evaluations of real users. In general, in a user study a few alternative RSs are compared by letting users try them while performing a decision-making task [6, 11, 21, 28]. When a recommendation list is presented to a user, she may select one or more items as a sign of positive evaluation. The collected interactions of the users with the recommendations are then analysed. Hence, the online situation is very different from the simulated offline evaluation, where there is a single, static reference set of hypothesised good recommendations, i.e., the items in the user's test set [11]. In online experiments the user examines the recommended items and decides which ones are relevant. Hence, the "ground truth" is generated online and depends on the observed recommendations. In other words, online there are no stored "preferences" or choices that the RS must "predict", as there are offline; rather, preferences and choices are "constructed" while the user is interacting with the RS.
We evaluated our RS by implementing a web-based application accessible from desktop and mobile browsers [24]. 158 participants who had visited Florence (an ecological requirement of the study) were recruited via social media and mailing lists. The study goal was to measure the users' perceived novelty and reward of the next-POI visit recommendations generated by Q-BASE, Q-POP PUSH and SKNN. The application first profiles users by asking them to report as many previously visited POIs as possible. To help users recall their past activities, and to offer ecologically valid stimuli, the user interface shows pictures and descriptions (from Wikipedia) of the POIs. The selected (known) POIs are used to build for each user an itinerary (Figure 1, top) that the user should imagine she has completed up to the recommendation request, i.e., up to the moment when she has to decide which POI to visit next. Users are then asked to evaluate a list of next-POI recommendations (Figure 1, bottom) generated by combining the suggestions computed by the three RSs, without being informed which RS recommends what; a sketch of a possible combination policy is given below. Users should judge whether each recommended POI is previously "visited", "liked" or "novel". By doing so, we aimed at eliciting behavioural responses as in a real situation.
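The paper does not detail how the three suggestion lists were combined; a plausible sketch is a round-robin interleaving with de-duplication, so that the origin of each POI is hidden from the user (the function below is hypothetical):

```python
def blend_suggestions(ranked_lists, n=9):
    # Round-robin over the RSs' ranked lists, skipping duplicates, so the
    # final list does not reveal which system produced which POI.
    merged, seen = [], set()
    for rank in range(max(len(l) for l in ranked_lists)):
        for l in ranked_lists:
            if rank < len(l) and l[rank] not in seen:
                merged.append(l[rank])
                seen.add(l[rank])
    return merged[:n]
```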
In practice, we built an interaction design that enables users to reason as if they were going to perform a real next-POI visit; the experimental system tries to recreate the specific context of a true visit. The users should imagine the real context and make the decisions they would likely take when faced with that decision task. A critical point is surely the possibility of actually expressing a reliable "like" judgement on a POI that the user does not know, i.e., a "novel" POI, by relying only on the system's presentation of the POI.

Table 2 shows the measured probability that a user marks a POI recommended by a specific RS as "visited", "novel", "liked", or both "liked" and "novel". Probabilities are computed by dividing the total number of items marked as visited (respectively liked, novel, and both liked and novel) for each RS by the total number of items suggested by that RS.
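For instance, given a log of per-suggestion feedback, these probabilities can be computed as follows (a minimal sketch; the log format is our assumption):

```python
from collections import Counter

def feedback_probabilities(log):
    # log: iterable of (rs_name, marks) pairs, one per suggested POI,
    # where marks is a subset of {"visited", "liked", "novel"}.
    shown, marked = Counter(), Counter()
    for rs, marks in log:
        shown[rs] += 1
        for m in marks:
            marked[rs, m] += 1
        if {"liked", "novel"} <= marks:
            marked[rs, "liked+novel"] += 1
    # Probability = marked count / number of items the RS suggested.
    return {(rs, m): marked[rs, m] / shown[rs] for (rs, m) in marked}
```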
Q-BASE recommends POIs that are less likely to have been already visited by the user, and more likely to be novel, than those suggested by Q-POP PUSH and SKNN.
Table 2: Probability to evaluate a POI recommendation as visited, novel, liked and both novel and liked.
It is interesting to note that Q-POP PUSH and SKNN perform similarly. These results are in accordance with the offline study outcome. But Q-BASE offers fewer POIs that are liked, compared to the other two recommendation strategies. Hence, these results seem to: a) confirm the standard assumption that RSs evaluated offline as more precise also recommend online items that users will like more; and b) falsify our hypothesis that optimising the reward of a recommendation, as Q-BASE does, will produce recommendations that the user will like. However, Q-BASE suggests more novel POIs and, interestingly, more recommendations that are both liked and novel (last column in Table 2).
To better interpret these results, we further analysed the users' evaluations by computing the probability that a user will like a recommendation under three conditions: a) she knows the item but has not yet visited it; b) she has already visited it; c) the item is totally novel. From this analysis (precise results are not shown here for lack of space) we derived the following conclusions. The novel POI-visit suggestions generated by SKNN and Q-POP PUSH are liked more than those produced by Q-BASE. This is due to the tendency of Q-BASE to suggest items that have the properties typically liked by the user, e.g., items of the same historical period as the POIs the user liked, but that are not popular; hence they are also probably novel and hard to appreciate.
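This conditional analysis can be expressed over the same assumed log format as above (another hedged sketch; this helper is ours, not the authors' code):

```python
def like_probability(marks_log, condition):
    # P(liked | condition) over the marks collected for one RS, where
    # condition is "visited", "novel", or "known" (known but unvisited,
    # i.e., marked as neither visited nor novel).
    def holds(marks):
        if condition == "visited":
            return "visited" in marks
        if condition == "novel":
            return "novel" in marks
        return "visited" not in marks and "novel" not in marks
    matching = [m for m in marks_log if holds(m)]
    return (sum("liked" in m for m in matching) / len(matching)
            if matching else 0.0)
```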
The very general conclusion that we have derived from this study is that users tend to like more the items they are familiar with, e.g., previously visited items or items that are not novel. In fact, for all three RSs, the probability that a user likes a recommended next POI that she has visited tends to be much larger than the probability of liking a novel one. This has been confirmed by a post-study survey in which participants declared that it is difficult to like something unknown. Hence, when an item is not immediately "recognised" by the user, the evaluation is necessarily mild and doubtful. This points out a basic difficulty of the online evaluation of RSs: how can users judge items that they have not yet experienced? How can an evaluation based on the utility a user perceives for an item measure the actual utility that the user will gain in the real experience with the recommended item?
4 DISCUSSION

By looking at the outcome of the offline and online experiments, one could conclude that the typical assumption made while testing RSs seems to hold: RSs with a high (offline) precision are also liked most by real users. Hence, we could state the superiority of Q-POP PUSH and SKNN over Q-BASE.

We have explained the rationale of this experimental outcome. High precision and a large probability of liking the online recommendations are both associated with the popularity of the recommended items: these items are often in the users' test sets, and users are likely to be familiar with them. Travellers are keen to positively judge POIs that they did not experience when advertisements or other information sources make them more desirable [2, 9]. This is also supported by our analysis of the user feedback. Users are more likely to express their appreciation (like) for known items (not novel, and already visited), which are mostly suggested by the popularity-biased RSs Q-POP PUSH and SKNN. But, also according to some users' feedback, useful recommendations do not necessarily come from (offline) precise and (online) liked recommendations. In fact, popular POIs can be autonomously discovered by the users, e.g., on travel portals, and there is no need to implement a sophisticated RS for accomplishing that goal.

Moreover, despite its lower (offline) precision and the lower (online) liking of its recommendations, Q-BASE is the RS that best accomplishes the true goal of a tourism RS: it suggests more next POIs that are both liked and novel. So, by optimising the reward, Q-BASE is capable of discovering novel items that are also liked (when users are able to assess them).

We have also illustrated that the two evaluation methods tend to reveal a coherent scenario. But, unfortunately, both of them are still limited in assessing the full usefulness of an RS. This is due to the fact that both evaluation methods are based on (different) simulations of the real usage conditions of an RS. Hence, we believe that in order to overcome this limitation we need to devise novel solutions that combine HCI and RS practices and knowledge.

We think that it is important not only to combine offline and online analyses, but also to better match the performance indicators with the true RS goals. This would allow a better understanding of the effectiveness of the RS. For instance, in our user study, notwithstanding the effort made to create a reliable testing platform, the context of the visit to a POI could not be adequately taken into consideration by the users. In fact, in a user study it is very difficult to consider and manage all the many factors that can influence the decision-making process of a user. Most importantly, in a user study the user relates the recommendations to her knowledge and memory of similar experiences. When a user evaluates recommended items that are new to her, she has to reason by analogy on the basis of her past experiences. This often leads to biased evaluations of the recommendations: if recommendations are judged as relevant or not on the basis of the user's memory, then these evaluations are likely to be wrong for items that do not match well with past experiences [16].