Challenges for Recommender Systems Evaluation
Francesco Ricci
David Massimo
Antonella De Angeli
[email protected]
[email protected]
[email protected]
Free University of Bozen-Bolzano
Bolzano, Italy
offline and online analysis of the same RS. We discuss some of their limitations by referring to a specific case study in the travel and tourism domain. In particular, we retrospectively discuss the offline and online evaluation that we carried out to assess the quality of a tourism RS that suggests the next POI to users visiting a city [23–25]. We compare the recommendation accuracy, novelty and utility measured offline with the explicit feedback that users gave to the recommendations presented in a live user study. This case study is used to ground a more general discussion of new research directions for HCI and RS to overcome the identified issues and limitations of RS evaluation methods.
2 OFFLINE EXPERIMENTS ward that users in a cluster obtain by performing specific POI-visits.
The quality of RSs has traditionally been assessed offline by computing prediction accuracy metrics [8, 20]. Accuracy relates to the ability of the RS to predict either the observed user choices or the explicit evaluations of items. In the first case one measures precision, the fraction of recommended items that are relevant, and recall, the fraction of relevant items that are recommended. In the second case accuracy is measured by regression-type error metrics, such as the Mean Absolute Error (MAE) or the Root Mean Square Error (RMSE) [12, 27].
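As a concrete reference, the following minimal Python sketch (ours, for illustration only; the POI identifiers and rating values are invented) computes the four metrics just defined for a single user:

```python
def precision_recall(recommended, relevant):
    # Top-N accuracy: share of recommendations that are relevant,
    # and share of relevant items that were recommended.
    hits = len(set(recommended) & set(relevant))
    return (hits / len(recommended) if recommended else 0.0,
            hits / len(relevant) if relevant else 0.0)

def mae_rmse(predicted, actual):
    # Regression-type accuracy over predicted vs. observed ratings.
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return mae, rmse

# 2 of the 3 recommended POIs appear among the 4 relevant ones:
# precision = 2/3, recall = 2/4.
print(precision_recall(["duomo", "uffizi", "boboli"],
                       ["uffizi", "duomo", "bargello", "pitti"]))
print(mae_rmse([4.1, 3.0, 4.8], [4, 3, 5]))
```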
However, it has been pointed out that optimising an RS for accuracy yields suggestions that are often uninteresting, as they too closely match what the user is observed to typically choose. Moreover, such recommendations self-reinforce the consumption of blockbuster items [4, 26, 30, 34]. In [4, 26] it is argued that a proper assessment of an RS should be based on a wider spectrum of recommendation quality indicators. Hence, specific metrics for measuring the novelty of recommendations are proposed in [30, 31]. Furthermore, in [22] a metric is proposed to assess the similarity of the properties of the suggested items to those in the test set. Finally, [10, 12] indicate metrics that assess global properties of an RS.
Offline studies offer a quick and inexpensive tool for evaluating RS performance, at the cost of the restrictive assumption that only the evaluations present in the test set define the users' preferences. But preferences for the novel items that an RS could suggest are not present in that data set, i.e., items not yet "evaluated" by the user are all treated as bad recommendations. Moreover, the interaction context is not considered in offline studies: it is erroneously assumed that a user who has evaluated an item in one context would give the same evaluation in another.
In our recent work [23, 25], we have designed an RS that can assist users in sequential decision making in a tourism application. We dealt with the next-POI recommendation problem: supporting a tourist looking for an additional attraction to visit after having visited some other attractions. Our RS tries not only to predict what a user would visit anyway, which should correspond to the POI visits contained in the test set, but also to suggest novel (not yet known) items that may lead to "rewarding" tourism experiences. As we mentioned above, assessing whether a "novel" recommendation is also relevant is conceptually impossible in a standard offline analysis. Hence, we optimise the "reward" a user obtains from a POI visit, with a method that can estimate this reward also for novel items.
Operatively, we first cluster the observed POI-visit trajectories so that users with similar interests and behaviour are grouped together. We then apply Inverse Reinforcement Learning (IRL) [1, 3] to learn a generalised user behaviour model, one per cluster, i.e., the (unknown) reward that the users in a cluster obtain by performing specific POI visits. Next-POI recommendations are then generated by exploiting the reward function estimated by the IRL algorithm [23, 24].
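The pipeline can be sketched as follows. This is our simplified illustration, not the code of [23, 24]: trajectories are clustered on mean POI-feature vectors, and the IRL step is replaced by a naive frequency-weighted reward purely as a placeholder for the actual IRL estimation.

```python
import numpy as np
from sklearn.cluster import KMeans

def trajectory_features(trajectory, poi_features):
    # Represent a trajectory by the mean feature vector of its POIs.
    return np.mean([poi_features[p] for p in trajectory], axis=0)

def cluster_trajectories(trajectories, poi_features, k=4):
    # Group similarly behaving users; one behaviour model per cluster.
    X = np.array([trajectory_features(t, poi_features) for t in trajectories])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

def fit_cluster_reward(cluster_trajs, poi_features):
    # Placeholder for IRL: weight each POI feature by how strongly the
    # cluster's observed visits exhibit it; reward = weighted feature sum.
    w = np.mean([trajectory_features(t, poi_features) for t in cluster_trajs], axis=0)
    return lambda poi: float(w @ poi_features[poi])
```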
In Table 1 we show the results obtained by the two IRL-based recommendation strategies that we devised, Q-BASE and Q-POP PUSH, plus a baseline RS named SKNN. Q-BASE exploits the learnt user behaviour model by suggesting the most rewarding POI visit given the last visit performed by the target user. Q-POP PUSH instead simultaneously optimises two criteria: a) the reward of the next POI visit (as Q-BASE does); and b) the popularity of the POI. SKNN is a nearest-neighbour RS that, given the POI-visit trajectory of a user, searches for users who performed similar visits and recommends the most popular next POI visits performed by those similar users.
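In schematic form, the three strategies rank a candidate next POI as below. The linear reward/popularity trade-off with coefficient alpha and the overlap-based trajectory similarity are our assumptions, since the exact formulations are given in [23, 24]:

```python
from collections import Counter

def q_base(candidates, reward):
    # Most rewarding next POI according to the learnt behaviour model.
    return max(candidates, key=reward)

def q_pop_push(candidates, reward, popularity, alpha=0.5):
    # Jointly favour estimated reward and POI popularity.
    return max(candidates,
               key=lambda p: alpha * reward(p) + (1 - alpha) * popularity[p])

def sknn(user_traj, all_trajs, k=20):
    # Neighbours: trajectories sharing the most POIs with the user's one;
    # recommend the next POI most popular among those neighbours.
    sim = lambda t: len(set(t) & set(user_traj))
    neighbours = sorted(all_trajs, key=sim, reverse=True)[:k]
    votes = Counter(p for t in neighbours for p in t if p not in user_traj)
    return votes.most_common(1)[0][0] if votes else None
```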
Table 1 reports three performance indicators: Reward, the average increase in reward that a user obtains by acting as recommended rather than as she actually did (in the test set); Precision, the proportion of the recommendations found in the test set; and Novelty, the percentage of recommendations that fall among the less popular items in the data (see [23, 24]).
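Under our reading of these definitions (the exact formulations are in [23, 24]; the popularity threshold separating the long tail is an assumption), the three indicators can be computed as:

```python
def offline_metrics(recs, test_visits, reward, popularity, pop_threshold):
    # Reward: average gain from following the recommendation instead of
    # the next visit actually observed in the test set (paired lists).
    reward_lift = sum(reward(r) - reward(v)
                      for r, v in zip(recs, test_visits)) / len(recs)
    # Precision: share of recommendations occurring in the test set.
    precision = sum(r in set(test_visits) for r in recs) / len(recs)
    # Novelty: share of recommendations drawn from the long tail.
    novelty = sum(popularity[r] < pop_threshold for r in recs) / len(recs)
    return reward_lift, precision, novelty
```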
The first observation is that Q-BASE, compared to the SKNN baseline, suggests POI visits that are more rewarding and more novel, at the cost of a lower precision. Secondly, we note that, by trying to optimise both the reward and the popularity of the recommended POIs, Q-POP PUSH actually loses Q-BASE's capability to suggest high-reward POIs and scores similarly to SKNN. This clearly illustrates the "dilemma" of many RSs: either be more precise, and hence recommend more popular and possibly obvious items, or recommend novel items that are nevertheless estimated to be relevant for the user.
Figure 1: Evaluation GUI. From top to bottom: itinerary detail; info box; recommendations and POI details
3 USER STUDY

In a second stage we conducted a live user study to understand to what extent the results of the offline study could be confirmed by the evaluations of real users. In general, in a user study a few alternative RSs are compared by letting users try them while performing a decision-making task [6, 11, 21, 28]. When a recommendation list is presented to a user, she may select one or more items as a sign of positive evaluation. The collected interactions of the users with the recommendations are then analysed. Hence, the online situation is very different from the simulated offline evaluation, where there is a single, static reference set of hypothesised good recommendations, i.e., the items in the user's test set [11]. In online experiments the user examines the recommended items and decides which ones are relevant. Hence, the "ground truth" is generated online and depends on the observed recommendations. In other words, online there are no stored "preferences" or choices that the RS must "predict", as there are offline; rather, preferences and choices are "constructed" while the user is interacting with the RS.
We evaluated our RS by implementing a web-based application accessible from desktop and mobile browsers [24]. 158 participants who had visited Florence (an ecological requirement of the study) were recruited via social media and mailing lists. The study goal was to measure the users' perceived novelty and reward of the next-POI visit recommendations generated by Q-BASE, Q-POP PUSH and SKNN. The application first profiles users by asking them to report as many previously visited POIs as possible. To help users recall their past activities, and to offer ecologically valid stimuli, the user interface shows pictures and descriptions (from Wikipedia) of the POIs. The selected (known) POIs are used to build for each user an itinerary (Figure 1, top) that the user should imagine she has completed up to the recommendation request, i.e., up to the moment when she has to decide which POI to visit next. Users are then asked to evaluate a list of next-POI recommendations (Figure 1, bottom) generated by combining the suggestions computed by the three RSs, without being informed which RS recommends what; a sketch of a possible combination policy is given below. Users should judge whether each recommended POI is previously "visited", "liked" or "novel". By doing so, we aimed at eliciting behavioural responses as in a real situation.
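The paper does not detail how the three suggestion lists were combined; a plausible sketch is a round-robin interleaving with de-duplication, so that the origin of each POI is hidden from the user (the function below is hypothetical):

```python
def blend_suggestions(ranked_lists, n=9):
    # Round-robin over the RSs' ranked lists, skipping duplicates, so the
    # final list does not reveal which system produced which POI.
    merged, seen = [], set()
    for rank in range(max(len(l) for l in ranked_lists)):
        for l in ranked_lists:
            if rank < len(l) and l[rank] not in seen:
                merged.append(l[rank])
                seen.add(l[rank])
    return merged[:n]
```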
In practice, we built an interaction design that enables users to reason as if they were going to perform a real next-POI visit; the experimental system tries to recreate the specific context of a true visit. The users should imagine the real context and make the decisions they would likely take when faced with that decision task. A critical point is surely the possibility of actually expressing a reliable "like" judgement on a POI that the user does not know, i.e., a "novel" POI, by relying only on the system's presentation of the POI.

Table 2 shows the measured probability that a user marks a POI recommended by a specific RS as "visited", "novel", "liked", or both "liked" and "novel". Probabilities are computed by dividing the total number of items marked as visited (respectively liked, novel, and both liked and novel) for each RS by the total number of items suggested by that RS.
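For instance, given a log of per-suggestion feedback, these probabilities can be computed as follows (a minimal sketch; the log format is our assumption):

```python
from collections import Counter

def feedback_probabilities(log):
    # log: iterable of (rs_name, marks) pairs, one per suggested POI,
    # where marks is a subset of {"visited", "liked", "novel"}.
    shown, marked = Counter(), Counter()
    for rs, marks in log:
        shown[rs] += 1
        for m in marks:
            marked[rs, m] += 1
        if {"liked", "novel"} <= marks:
            marked[rs, "liked+novel"] += 1
    # Probability = marked count / number of items the RS suggested.
    return {(rs, m): marked[rs, m] / shown[rs] for (rs, m) in marked}
```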
Q-BASE recommends POIs that are less likely to have been already visited by the user, and more likely to be novel, than those suggested by Q-POP PUSH and SKNN.
Table 2: Probability to evaluate a POI recommendation as visited, novel, liked and both novel and liked.
It is interesting to note that Q-POP PUSH and SKNN perform similarly. These results are in accordance with the offline study outcome. But Q-BASE offers fewer POIs that are liked, compared to the other two recommendation strategies. Hence, these results seem to: a) confirm the standard assumption that RSs evaluated offline as more precise also recommend online items that users will like more; and b) falsify our hypothesis that optimising the reward of a recommendation, as Q-BASE does, will produce recommendations that the user will like. However, Q-BASE suggests more novel POIs and, interestingly, more recommendations that are both liked and novel (last column in Table 2).
To better interpret these results, we further analysed the users' evaluations by computing the probability that a user will like a recommendation under three conditions: a) she knows the item but has not yet visited it; b) she has already visited it; c) the item is totally novel. From this analysis (precise results are not shown here for lack of space) we derived the following conclusions. The novel POI-visit suggestions generated by SKNN and Q-POP PUSH are liked more than those produced by Q-BASE. This is due to the tendency of Q-BASE to suggest items that have the properties typically liked by the user, e.g., items of the same historical period as the POIs the user liked, but that are not popular; hence they are also probably novel and hard to appreciate.
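This conditional analysis can be expressed over the same assumed log format as above (another hedged sketch; this helper is ours, not the authors' code):

```python
def like_probability(marks_log, condition):
    # P(liked | condition) over the marks collected for one RS, where
    # condition is "visited", "novel", or "known" (known but unvisited,
    # i.e., marked as neither visited nor novel).
    def holds(marks):
        if condition == "visited":
            return "visited" in marks
        if condition == "novel":
            return "novel" in marks
        return "visited" not in marks and "novel" not in marks
    matching = [m for m in marks_log if holds(m)]
    return (sum("liked" in m for m in matching) / len(matching)
            if matching else 0.0)
```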
The very general conclusion that we have derived from this study is that users tend to like more the items they are familiar with, e.g., previously visited items or items that are not novel. In fact, for all three RSs, the probability that a user likes a recommended next POI that she has visited tends to be much larger than the probability of liking a novel one. This has been confirmed by a post-study survey in which participants declared that it is difficult to like something unknown. Hence, when an item is not immediately "recognised" by the user, the evaluation is necessarily mild and doubtful. This points out a basic difficulty of the online evaluation of RSs: how can users judge items that they have not yet experienced? How can an evaluation based on the utility a user perceives for an item measure the actual utility that the user will gain in the real experience with the recommended item?
4 DISCUSSION

By looking at the outcome of the offline and online experiments, one could conclude that the typical assumption made while testing RSs seems to hold: RSs with a high (offline) precision are also liked most by real users. Hence, we could state the superiority of Q-POP PUSH and SKNN over Q-BASE.

We have explained the rationale of this experimental outcome. High precision and a large probability of liking the online recommendations are both associated with the popularity of the recommended items: these items are often in the users' test sets, and users are likely to be familiar with them. Travellers are keen to positively judge POIs that they did not experience when advertisements or other information sources make them more desirable [2, 9]. This is also supported by our analysis of the user feedback. Users are more likely to express their appreciation (like) for known items (not novel, and already visited), which are mostly suggested by the popularity-biased RSs Q-POP PUSH and SKNN. But, also according to some users' feedback, useful recommendations do not necessarily come from (offline) precise and (online) liked recommendations. In fact, popular POIs can be autonomously discovered by the users, e.g., on travel portals, and there is no need to implement a sophisticated RS for accomplishing that goal.

Moreover, despite its lower (offline) precision and the lower (online) liking of its recommendations, Q-BASE is the RS that best accomplishes the true goal of a tourism RS: it suggests more next POIs that are both liked and novel. So, by optimising the reward, Q-BASE is capable of discovering novel items that are also liked (when users are able to assess them).

We have also illustrated that the two evaluation methods tend to reveal a coherent scenario. But, unfortunately, both of them are still limited in assessing the full usefulness of an RS. This is due to the fact that both evaluation methods are based on (different) simulations of the real usage conditions of an RS. Hence, we believe that in order to overcome this limitation we need to devise novel solutions that combine HCI and RS practices and knowledge.

We think that it is important not only to combine offline and online analyses, but also to better match the performance indicators with the true RS goals. This would allow a better understanding of the effectiveness of the RS. For instance, in our user study, notwithstanding the effort made to create a reliable testing platform, the context of the visit to a POI could not be adequately taken into consideration by the users. In fact, in a user study it is very difficult to consider and manage all the many factors that can influence the decision-making process of a user. Most importantly, in a user study the user relates the recommendations to her knowledge and memory of similar experiences. When a user evaluates recommended items that are new to her, she has to reason by analogy on the basis of her past experiences. This often leads to biased evaluations of the recommendations: if recommendations are judged as relevant or not on the basis of the user's memory, then these evaluations are likely to be wrong for items that do not match well with past experiences [16].