
How scales influence user rating behaviour in recommender systems

Behaviour & Information Technology

Many websites allow users to rate items and share their ratings with others, for social or personalisation purposes. In recommender systems in particular, personalised suggestions are generated by predicting ratings for items that users are unaware of, based on the ratings users provided for other items. Explicit user ratings are collected by means of graphical widgets referred to as 'rating scales'. Each system or website normally uses a specific rating scale, in many cases differing from scales used by other systems in their granularity, visual metaphor, numbering or availability of a neutral position. While many works in the field of survey design reported on the effects of rating scales on user ratings, these, however, are normally regarded as neutral tools when it comes to recommender systems. In this paper, we challenge this view and provide new empirical information about the impact of rating scales on user ratings, presenting the results of three new studies carried out in different domains. Based on these results, we demonstrate that a static mathematical mapping is not the best method to compare ratings coming from scales with different features, and suggest when it is possible to use linear functions instead.

This is the author's final version of the contribution published as: Cena, Federica; Gena, Cristina; Grillo, Pierluigi; Kuflik, Tsvi; Vernero, Fabiana; Wecker, Alan J. How scales influence user rating behaviour in recommender systems. BEHAVIOUR & INFORMATION TECHNOLOGY, pp. 1-20. DOI: 10.1080/0144929X.2017.1322145. The publisher's version is available at: https://www.tandfonline.com/doi/pdf/10.1080/0144929X.2017.1322145. When citing, please refer to the published version. This full text was downloaded from iris - AperTO, the University of Turin's Institutional Research Information System and Open Access Institutional Repository: https://iris.unito.it/

Behaviour & Information Technology, ISSN: 0144-929X (Print), 1362-3001 (Online). Published online: 09 May 2017. © 2017 Informa UK Limited, trading as Taylor & Francis Group.

Federica Cena (a), Cristina Gena (a), Pierluigi Grillo (a), Tsvi Kuflik (b), Fabiana Vernero (a) and Alan J. Wecker (b)
(a) Department of Computer Science, University of Torino, Torino, Italy; (b) Department of Information Systems, The University of Haifa, Haifa, Israel
Contact: Fabiana Vernero, [email protected]

Article history: Received 11 May 2016; Accepted 17 April 2017.
Keywords: rating scales; recommender system; user studies; human–machine interface
1. Introduction

Following the advent of Web 2.0, many web-based applications (including social media and e-commerce websites) give users the opportunity to rate content, for social or for personalisation purposes. For example, YouTube allows users to rate videos and share their ratings with others. Users provide their ratings by means of 'rating scales', that is, graphical widgets that are characterised by specific features (granularity, numbering, presence of a neutral position, etc.) (Gena et al. 2011). On the web, we can find various examples of rating scales, differing, among other things, in their visual appearance, which is probably their most salient feature, for example: Amazon, aNobii and Barnes & Noble use stars; Facebook and YouTube use thumbs; TripAdvisor uses circles; LateRooms uses squares and Criticker uses bare numbers.

User ratings are especially valuable pieces of information for most recommender systems (Adomavicius and Tuzhilin 2005), where users express their preferences for items by rating them and, based on these ratings, they receive personalised suggestions, hopefully suited to their needs. It is, therefore, important to be able to correctly interpret and compare ratings even when they were expressed using different rating scales, as described in the following scenarios.

For example, an application A (e.g. IMDb) may be able to import users' ratings from another system B (e.g. YouTube) in the same domain in order to improve its user models with more information about users. Since the two systems use two different rating scales (e.g. 'stars' and 'thumbs' with different granularities), system A needs a mechanism to correctly translate ratings from scale B to its scale, for example, from a 2-point scale of thumbs to a 10-point scale of stars. In the same way, a new web-based application, which is not able to provide recommendations due to the cold-start problem, can import user ratings from other systems which the user is served by (each with its own rating scale), and use them to generate recommendations. Thus, it needs a method for converting these heterogeneous ratings data to the single, homogeneous scale it uses. Similarly, in a Web 2.0 scenario, social aggregators or mash-up systems, such as Trivago in the tourism domain, may aggregate different services with different features and provide an aggregation (or a summary) of the information from all the systems involved. Thus, they need to correctly aggregate users' ratings expressed with different rating scales. Moreover, application developers may want to change the rating scale in order to better satisfy users' needs, as YouTube did in 2009, moving from 5-point (stars) to 2-point (thumbs). In this case, the application needed to keep all previous ratings collected in the past, and to be able to properly convert them to the new scale. Finally, application developers may want to give users an opportunity to use different rating scales for different tasks (e.g. stars to configure their interests in the user model and thumbs to rate items, as can be seen in certain websites, such as booking.com for instance) or in certain moments (e.g. when users gain experience with the system, they may switch from one scale to another).
Furthermore, application developers might give their users the possibility to rate items according to different parameters, each of which can be assessed on a specific rating scale. For example, an application in the tourism domain might allow users to rate hotels according to their cleanness and location, providing a 5-star rating scale for cleanness and a thumbs-up/thumbs-down one for location. Thus, in principle a system could allow users to choose which rating scale to use (Cena, Vernero, and Gena 2010). Finally, in the context of the so-called cross-domain recommendation, Cantador, Berkovsky, and Cremonesi (2015) propose to leverage all the available user data provided in various systems and domains in order to generate a more complete user model and better recommendations. As a practical application of knowledge transfer in the context of a single domain (e.g. movies), a target system may import user ratings of overlapping items from a source system – for example, representing user rating with a given rating scale – as auxiliary data to a target system – representing user rating with another rating scale with different granularity – in order to address the data-sparsity problem. In all these cases, these systems need a way to compare ratings given in different rating scales. However, this is not a trivial task, since the rating scales themselves have an effect on the ratings, as many studies point out (Garland 1991; Friedman and Amoo 1999; Amoo and Friedman 2001). In fact, how people respond to different rating scales is primarily an issue of psychology rather than a mathematical question (Cummins and Gullone 2000). This paper particularly focuses on the influence the rating scales exert on user rating behaviour, that is, their capacity to induce users to assign higher or lower ratings than they would assign with a different rating scale. Some interesting insights about the effect of rating scale features on the user rating behaviour in recommender systems can be found in the work of Vaz, Ribeiro, and de Matos (2013). Their results suggest that using a rating scale with a smaller granularity obtains better results, in terms of a lower mean absolute error (MAE), in a rating prediction task. While the work of these authors deals with the impact of rating scales on user rating behaviour in a broad sense, we are particularly focused on their influence on user ratings. While there does not seem to be much debate around this issue in the field of recommender systems and intelligent user interfaces, a notable exception is represented by Cosley et al. (2003). The authors found out that ratings on different scales correlated well, and suggested that designers might allow users to choose their favourite rating scale and compute recommendations by means of mathematically normalised scores. Conversely, in a similar experiment done by Cena, Vernero, and Gena (2010), where users were asked to rate the same item on different rating scales, it was observed that 40% of the ratings departed considerably from mathematical proportion, suggesting that rating scales themselves might induce users to be more or less optimistic (or strict, or meticulous). This insight has been also confirmed in two further experiments done by Gena et al. (2011). In our vision, this effect is, at least in part, a consequence of the set of features (granularity, numbering, presence of a neutral position, etc.) that characterise and differentiate a certain scale from the others. 
For example, a rating scale exploiting a human metaphor and offering only one point (like the simple 'thumbs-up') is perceived very differently in comparison with a scale consisting of bare numbers ranging from 0 to 100. Consequently, the ratings given by means of the two scales can be expected to be considerably different as well.

The main contribution of this paper is twofold: first, we provide new empirical evidence about the effects of rating scales on user ratings focusing on recommender systems with studies on large samples and in different domains and contexts, both on and off the web; second, we demonstrate that a static mapping is not the optimal solution to compare ratings originating from rating scales with different features. By static mapping, we refer to a standard mapping which does not take into account user behaviour, but only the number of points offered by a certain scale, and maps them to a chosen interval based on mathematical proportion. For example, in case the destination interval is 0–1 (i.e. the lowest value should be mapped to 0 and the highest to 1), the new rating can be obtained by applying the following formula:

new rating = (point_position − 1) / (|points| − 1)

where |points| indicates the number of points of the original scale and point_position indicates the position of the point corresponding to the original rating. For example, for mapping 3 out of 5 in a 5-point rating scale to a destination interval of 0–1, |points| corresponds to 5, point_position corresponds to 3 and the obtained value is 0.5.

The paper is structured as follows. Section 2 provides the background of previous work on rating scales, while Section 3 provides an analysis of rating scales in order to point out their features and a general description of the three empirical evaluations; Sections 4, 5 and 6 describe the different experiments in three different contexts: several movie websites, a museum guide and a controlled web-based experiment. Afterwards, Sections 7 and 8 discuss and conclude the paper with considerations and future directions for work.

2. Related work

2.1. Specific features and their impact on user ratings

The effect of rating scales on user ratings is often reported in research works in different fields, such as economics, psychology and survey design. These works generally focused on the study of how specific features, such as granularity, neutral point or labelling, can affect the final rating.

Granularity. The optimal number of points in rating scales (granularity) and its possible effect on user responses are a much-debated issue in the literature, even if the results obtained so far do not appear to be conclusive. Lim (2008) had 137 participants repeatedly assess their overall level of happiness with 4-, 5-, 7- and 11-point Likert-like rating scales, with the aim of investigating whether the number of points can affect the respondents' ratings. The author directly rescaled all ratings to the 11-point scale for comparison purposes and found that the mean happiness value was significantly higher for the 11-point scale with respect to the 4- and 7-point scales, while there were no significant differences between participants' mean ratings for the 11- and 5-point scales. This implicitly implies that higher granularity causes higher ratings.
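To make this concrete, the following is a minimal sketch of the static (proportional) mapping defined above, together with the kind of direct rescaling to a common granularity that Lim used for comparison purposes. The function names and example values are ours, introduced only for illustration.

```python
def static_map(point_position, n_points):
    """Proportionally map a rating on an n-point scale to the interval [0, 1]."""
    return (point_position - 1) / (n_points - 1)


def rescale(point_position, n_points_from, n_points_to):
    """Rescale a rating between two discrete scales via the 0-1 interval,
    e.g. Lim-style rescaling of 4-, 5- and 7-point ratings to 11 points."""
    return 1 + round(static_map(point_position, n_points_from) * (n_points_to - 1))


print(static_map(3, 5))    # 0.5, the worked example above (3 out of 5 points)
print(rescale(3, 5, 11))   # 6, the midpoint of an 11-point scale
print(rescale(2, 2, 10))   # 10, a 'thumbs up' mapped onto a 10-point scale
```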
Dawes (2002), in a study that is related to our experiment 3 (Section 6.2), found that data collected with 5-point scales could be easily translated to make them comparable with data collected with 11-point scales. Similarly, in a subsequent study, they found that five- and seven-point scales can easily be rescaled; comparing 5- or 7-point data to 10-point data, a straightforward rescaling and arithmetic adjustment easily facilitated the comparison (Dawes 2008). Other works studied the effect of granularity on the reliability of the response. Preston and Colman (2000) examined Likert-like scales with 2–11 points, as well as a 101-point scale, and found that ratings expressed by means of scales with 2, 3 or 4 points were the least reliable, valid and discriminating; in addition, participants at their study preferred scales with a relatively 3 high number of points (i.e. 7, 9 or 10), even if scales with 2, 3 or 4 points were judged quicker to use. Thus, they concluded that high granularity is more reliable and preferred by users. Similar results were obtained by Weng (2004), who investigated how granularity can impact the test–retest reliability and found that scales with more response options (at least three) have a better chance of attaining higher reliability. More recently, Shaftel, Nash, and Gillmor (2012) compared different rating scales (a 5-point and a 7-point standard Likert rating scales, a 9-point numerical scale with verbal labels only at the endpoints, a 0–100 visual analogical scale marked in tens (i.e. 0, 10, 20, etc.) and a 6-point verbally labelled scale) and found that some items were ineffective at collecting distinguishing information with some rating scales but effective with others, concluding that the best granularity depends on the item content and purpose, and thus the decision is domain-dependent. Neutral point. In Garland (1991), the author produces some evidence that the presence or absence of a neutral point on a scale produces some distortion in the results. In particular, they found that some respondents may choose the midpoint in order to provide a less negative answer, because of a social desirability bias. On the other hand, rating scales with no midpoint force the real indifferent to make a choice, causing a distortion towards higher or lower answers, depending on the content which is being assessed. Weijters, Cabooter, and Schillewaert (2010) also investigated whether the presence or absence of a neutral midpoint can affect user responses. In particular, they found that it causes a higher NARS (‘net acquiescence response style’, i.e. the tendency to show more agreement than disagreement), a lower ERS (‘extreme response style’) and a lower MR (‘misresponse to reversed items’). In relation to our goals, we can say that with neutral points in the rating scale, we will have less extreme responses and higher ratings. Labelling. Various researchers observed that the labelling of a scale influences user responses. Weijters, Cabooter, and Schillewaert (2010) took into account the format of labels, comparing a case where all points were labelled and one where only the endpoints were labelled. They observed that labelling all the points (w.r.t. only the extremes) causes a higher NARS, a lower ERS and a lower MR. They concluded that the format of a rating scale can bias the mean, variance and internal consistency of the collected data, so that pieces of information obtained with different rating scales are not directly comparable. 
Other authors studied the effect of the polarity of the labels, providing evidence of a bias towards the left side of a scale, possibly due to factors 4 F. CENA ET AL. such as reading habits, a primacy effect or pseudoneglect, that is, an asymmetry in spatial attention which favours the left side of space (Holzinger, Scherer, and Ziefle 2011). Friedman, Herskovitz, and Pollack (1993), for instance, collected students’ attitudes towards their college with two different scales, one where the items were labelled ‘strongly agree’, ‘agree’, ‘undecided’, ‘disagree’ and ‘strongly disagree’ and one where the labels appeared in the opposite order, and found that the first scale resulted in a significantly greater degree of agreement. Similarly, Yan and Keusch (2015) varied the direction of an 11-point rating scale with numeric labels (from 0 to 10, and vice versa) and found that the mean ratings were shifted towards the left end of the scale. Moreover, in Amoo and Friedman (2001), the authors show that the negative evaluation side of a scale is perceived as more negative when it is labelled with negative rather than positive numbers (e.g.−4 rather than 1), and this causes higher average ratings when scales with negative numerical labels are used. The same result was later achieved by Tourangeau, Couper, and Conrad (2007). Interestingly, the authors also observed an analogous (even if less extreme) effect with colours: in fact, they found that user ratings tend to be higher when the endpoints of a scale are shaded in different hues, as compared to scales where both ends are shaded in the same hue.10 2.2. Using different rating scales in recommender systems Differently from the above-mentioned works, we focus on rating scales as a whole, trying to highlight the influence of scales on user ratings in recommender systems. As seen in Section 1, this topic is particularly relevant, since the recommender’s performance depends on ratings. However, apart from Vaz, Ribeiro, and de Matos (2013) and Cosley et al. (2003), who studied the effect of different scales on user ratings also in relation to MAE, most of the studies in this area focused on other aspects, such as the design of rating scales and the effects of users’ personality on rating behaviour. Vaz, Ribeiro, and de Matos (2013) carried out an experiment where they compared the performance of a collaborative filtering algorithm using ratings expressed on two scales with different granularity. More specifically, the authors obtained two different datasets by mapping their original set of ratings, which were expressed on a 5-point scale, to a 3-point scale (dislike/ neutral/like), converting ‘1’ and ‘2’ ratings to ‘dislike’, ‘3’ ratings to ‘neutral’, and ‘4’ and ‘5’ ratings to ‘like’. Their results suggest that using a rating scale with a smaller granularity obtains better results, in terms of a lower MAE, in a rating prediction task. The intuition behind such a result is that general preferences, expressed at the level of like and dislike, are more stable than preferences expressed with a high degree of detail. While this work deals with the impact of rating scales on user rating behaviour in a broad sense, we are particularly focused on their influence on user ratings. Cosley et al. (2003) asked their subjects to rerate 3 sets of movies on MovieLens (already evaluated by means of the original 5-point rating scale) with a binary scale (thumbs up–down), a no-zero scale (range: −3, +3) and a half-star scale (range: 0.5, 5). 
The authors found out that ratings on all three scales correlated well with original user ratings, and suggested that designers might allow users to choose their favourite rating scale and compute recommendations by means of mathematically normalised scores. However, they also observed that users tended to give higher mean ratings on the binary and on the no-zero scales, and that new ratings on the binary scale correlated less strongly with original ratings than new ratings on the no-zero and half-star scales. Differently from these studies, we aimed at conducting a more comprehensive study of rating scales, considering granularity, neutral point and visual metaphor and doing this in a variety of different usage scenarios. 2.3. The design of rating scales Referring to the design of rating scales, Swearingen and Sinha (2002) suggested to adopt a mix of different types of questions (e.g. expressing binary liking versus rating items on a Liker-like scale) and to provide constant feedback on user contributions in order to keep users from getting bored or frustrated during the rating. Herlocker et al. (2004) pointed out that the granularity of user preferences with respect to recommended contents may be different from the granularity managed by the specific recommender system. Thus, an appropriate rating scale should allow users to distinguish among exactly as many levels of liking as it makes sense to them. In van Barneveld and van Setten (2004), the authors defined the main elements of interface aspects for presenting system predictions and collecting explicit user feedback in the context of a TV recommender system: (1) presentation form; (2) scale of the prediction or rating (including range, precision, symmetric versus asymmetric and continuous versus discrete); (3) visual symmetry or asymmetry; and (4) the use of colour. They also found that most users prefer to have predictions presented by means of 5-star interfaces, while they are less in agreement regarding interfaces to provide input to the system, consistent with the findings of Cena, Vernero, and Gena (2010). BEHAVIOUR & INFORMATION TECHNOLOGY Also Nobarany et al. (2012) concentrated their work on the design of ‘opinion measurement interfaces’11 (as they called the rating scales). They identified two axes: Measurement Scale (absolute rating vs. relative ranking) and Recall Support (previously recorded opinions). They experimented the use of two prototypes with different rating and ranking scales (Stars, Stars + Recall, Binary, List). The measures used to compare the scales were speed, accuracy, mental demand, suitability of organisation, fun to use and overall preference. Quantitative and qualitative final results showed that (1) a rating interface that provides recall support with examples of users’ previous choices is preferred, and (2) rating accuracy is perceived as more important than rating speed. For Usman et al. (2010), the star rating system is the de facto standard for rating a product, being regarded as one of the most appealing rating systems for direct interaction with users. However, due to its limitation when comparing items with different numbers of raters (i.e. stars do not convey any information regarding the number of people who reviewed a certain item), the authors argue that the visual strength of the five stars is not enough to declare that they are the best option also for recommender systems. 
For this reason, they proposed a Relative Ranking, where a benchmark item is used to compare other items until they reach its number of points. According to the authors, this method is more realistic and useful at a glance than the five-star system.

Sparling and Sen (2011) investigated the costs, in terms of mental effort and time, which are associated with rating scales with different granularities and carried out an online survey where they compared the following four scales: unary ('like it'), binary ('thumbs up'/'thumbs down'), five-star and 100-point slider. They found that users' average rating time increases with the granularity of rating scales, while there are no significant differences as far as cognitive load is concerned, if the unary scale (which requires significantly less work on the part of users) is excluded. Moreover, the participants in their survey preferred the stars and the thumbs, disliking the scales at the ends of the granularity spectrum (unary scale and slider).

2.4. User personality and its impact on user ratings

Many studies in recommender systems focused on differences in rating behaviour that can be related to users' personality. This is a complementary perspective with respect to the one proposed in our paper, but we report it here for completeness. As explained in Schafer et al. (2007), 'one optimistic happy user may consistently rate things 4 out of 5 stars that a pessimistic sad user rates 3 out of 5 stars', even if they actually mean the same. For example, Hu and Pu (2013) showed the influence of user personality characteristics on user rating behaviours. They conducted an online survey with 122 participants: they used the Five-factor Model (openness to experience, conscientiousness, extraversion, agreeableness and neuroticism) and the Big Five Inventory (John and Srivastava 1999) in order to describe the personality of each user in the first part of the survey. An adjustment using averages could be adopted in order to compensate for idiosyncratic behaviour. For example, Schafer et al. (2007) predicted user ratings for an item i as a positive or negative variation with respect to their average rating, and the amount of such variation is determined as a function of the difference between the rating other users assigned to item i and their average rating. Similar approaches are described in Herlocker et al. (2004), Adomavicius and Tuzhilin (2005) and Goldberg et al. (2001).
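The average-adjustment idea mentioned above can be illustrated with a short sketch. This is not the exact formulation used in the cited works, which weight each neighbour by a similarity score; it is a simplified, unweighted version, and the data layout (nested dictionaries) is ours.

```python
def predict(user, item, ratings):
    """Predict `user`'s rating for `item` as the user's mean rating plus the
    average deviation that other raters of the item show from their own means.
    `ratings` maps each user to a dict of {item: rating}."""
    def user_average(r):
        return sum(r.values()) / len(r)

    user_mean = user_average(ratings[user])
    deviations = [r[item] - user_average(r)
                  for u, r in ratings.items()
                  if u != user and item in r]
    if not deviations:   # nobody else rated the item: fall back to the user's mean
        return user_mean
    return user_mean + sum(deviations) / len(deviations)


ratings = {
    'optimist':  {'m1': 4, 'm2': 5, 'm3': 4},
    'pessimist': {'m1': 3, 'm2': 4, 'm3': 2},
    'alice':     {'m1': 4, 'm2': 4},
}
print(round(predict('alice', 'm3', ratings), 2))   # 3.33: below alice's mean of 4.0
```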
3. Empirical evaluations

We carried out three user studies in different domains and contexts, all of them with the aim of validating our main hypothesis regarding if and how rating scales have a personality and thus are able to influence users' ratings. The first evaluation (Section 4) focused on existing movie websites and aimed to evaluate how the same movies are rated using different scales. The second one (Section 5) exploited a system designed by some of the authors, which acts as a museum visitor's guide for the Hecht museum in Haifa, Israel, and offers different rating scales (Kuflik et al. 2015). This second evaluation took place in a different context from the web, in order to assess the external validity of the other results. The third one (Section 6) is a controlled experiment where participants have to explicitly translate ratings from one scale to another. The three evaluations differ in various aspects, thus allowing us to explore the issue of rating scale personality from different points of view:

- Domain: the first and third evaluations were carried out in the movies domain, while the second one was in the museum/cultural heritage domain;
- Procedure: the first and second evaluations used real-world data, generated by people using a particular system in situ, while the third evaluation was a controlled experiment;
- Number of participants: evaluation 1 is based on large datasets with thousands of ratings per system, while evaluations 2 and 3 use data from about 300 subjects each;
- Type of comparison: the first and second evaluations compare ratings given to a certain item by different people, and each person is expected to have used only one website/rating scale; in the third evaluation, each participant rated all the items with all the rating scales;
- Context: evaluation 1 used real-world data obtained from ratings given in a context of interaction with real websites, while evaluation 2 used real-world data coming from ratings given in the context of interaction with a museum guide and, finally, evaluation 3 was a controlled experiment performed using an ad hoc web-based interface.

Rating scales used in our evaluations (with their corresponding websites, as far as the first evaluation is concerned) were selected with an eye to the fact that they allowed us to cover different possible values for the most distinctive features we identified in a previous work (Cena and Vernero 2015), that is, visual metaphor, icon, granularity, range, positive/negative, neutral position and point mutability. A description of the meaning of these features is provided in Table 1. In fact, according to our vision, the personality of a rating scale is somehow related to its objective features, so that scales with different features can be expected to have different effects on users' ratings. For the first evaluation, rating scale selection was also affected by the kind of data access facilities (e.g. public Application Programming Interfaces, or APIs) offered by websites. Thus, we selected the following:

- The free text numerical scale ranging from 1 to 100, which is used in Criticker;
- The 6-point rating scale with icons representing a hypothetical movie viewer (with mood from frustrated to enthusiast) exploited by Filmeter;
- The 4-star rating scale used in FilmCrave, which offers half-star ratings;
- The 5-star rating scale used in MovieLens, which offers half-star ratings;
- The 10-star rating scale used in IMDb.

For the second evaluation, we used the following:

- A 5-star rating scale (it was included in all three evaluations because of its status of 'standard');
- A 2-point, a 3-point and a 5-point rating scale with icons representing smiling, sad or neutral faces;
- A 3-point numerical rating scale with the available positions explicitly labelled as '−1', '0' and '+1'.

Finally, for the third evaluation, we reused all the rating scales selected for the first two evaluations. A visual summary of the selected rating scales is presented in Figure 1, while a concise description of their distinctive features is given in Table 2.

4. Real movie ratings analysis

Goal. The goal of this evaluation was to investigate the values of ratings given on heterogeneous rating scales (and using different systems) on the same items. Our hypothesis was that static normalisation was not enough for mapping user ratings expressed with different rating scales. In studies done by Cena, Vernero, and Gena (2010) and Gena et al.
(2011), it was observed that ratings expressed on the same items by means of different rating scales depart considerably from mathematical proportion, and so we can assume that rating scales actually have an influence on user ratings. Thus, we decided to extend the experiments to a broader set of data in order to externally validate those findings.

Hypothesis. Our main hypothesis was that ratings given to the same movies using different rating scales could produce different results due to the features of the scales.

Design. Five factors (the five movie portals), between-subjects design.

Subjects. Anonymous subjects who left ratings on 1199 selected movies on the systems described below. Since we collected only anonymous ratings, we do not know exactly the total number of subjects who left ratings on these systems, but just the total number of ratings (see Table 3), since one subject may have rated several times.

Table 1. The most distinctive features included in the model of rating scales devised by Cena and Vernero (2015).
- Visual metaphor: the metaphor used in the visual appearance of a rating scale, which can impact on its interpretation and emotional connotation (e.g. a smiley face exploits a metaphor related to human emotions). Not all visual presentation forms make use of metaphors.
- Icon: the specific image or presentation form used in a rating scale.
- Granularity: the number of positions in the rating scale.
- Range: the minimum and maximum values of the scale (e.g. from 0 to 10).
- Positive/negative: the presence of either only positive, or only negative, or both kinds of points.
- Neutral position: the presence of a neutral, intermediate position (middle point).
- Point mutability: whether the points in a rating scale are represented in the same way or not.

Figure 1. The rating scales analysed in our work, chosen from existing websites and an ad hoc experiment.

Table 2. The analysed rating scales described under different features: source, visual metaphor, icon, granularity, range, positive/negative, neutral position and point mutability.

Table 3. Main statistics for the movie rating data: number of ratings.
                                      Criticker    Filmeter   FilmCrave   IMDb          MovieLens
Number of total ratings               5,311,371    167,798    346,521     116,601,797   11,872,347
Average number of ratings per movie   4429.83      142.56     292.92      97,249.21     9926.71
Standard deviation                    2668.28      126.66     249.19      80,499.32     9892.37
Note that for selecting the movies, we used an availability sampling approach (also known as a sample of convenience): the selected movies are those that have ratings on all the six movie portals, and similarly the same subjects and their ratings.

Apparatus and materials. The raw data on ratings were collected manually, through public APIs and through proprietary crawlers (see the detailed description below).

Procedure. In order to validate our hypothesis, we collected the ratings given to 1199 different movies by using five different rating scales, which differ in metaphor and granularity (see Section 2), belonging to five different popular movie portals:

- Criticker, which exploits as rating scale a free text numerical scale from 0 to 100 (see Figure 2);
- Filmeter, which exploits a 1–6 rating scale with icons representing a hypothetical movie viewer (from frustrated to enthusiast mood) (see Figures 3 and 4);
- FilmCrave, which exploits a 1–4 stars rating scale (with the possibility of half-star ratings) (see Figure 5);
- IMDb, which exploits a 1–10 stars rating scale (see Figure 6);
- MovieLens, which exploits a 1–5 stars rating scale (with the possibility of half-star ratings) (see Figure 7).

Figure 2. Criticker's rating scale and average rating display.
Figure 3. Filmeter's rating scale.
Figure 4. Filmeter's average rating display.
Figure 5. FilmCrave's rating scale and average rating display.
Figure 6. IMDb's rating scale and average rating display.
Figure 7. MovieLens' rating scale.

Table 4. Main descriptive statistics for the movie average rating values.
                     Criticker   Filmeter   FilmCrave   IMDb      MovieLens
Mean                 0.6715      0.5655     0.5703      0.6935    0.6537
Std. error of mean   0.00295     0.0042     0.00363     0.00298   0.00333
Std. deviation       0.10201     0.1441     0.1247      0.10305   0.11518
Minimum              0.24        0          0.06        0.19      0.19
Maximum              0.88        1          0.85        0.96      0.88

At the beginning of 2014, we started to collect data from Criticker, since, at that time, it provided an API for extracting movies' data. With the Criticker API, we selected the most popular films (the ones that had a higher number of ratings). We extracted movie information such as title and id for these films, as well as the corresponding IMDb identifier and users' ratings. In particular, for every film we extracted all the ratings (in the range 0–100) provided by the users. Then we calculated the average rating values for every movie.

IMDb does not offer any kind of API to query its huge database of films and film-related information. To overcome this constraint and collect data about ratings in IMDb, we used the IMDb movie id extracted from the Criticker film list to reach the movie's URL, and then a softbot automatically parsed the corresponding pages and extracted the average ratings and the corresponding number of ratings. Unlike Criticker, from the IMDb pages we did not collect all the single ratings, but only the average ratings and the total number of ratings (and the total number of users) for each movie. Similarly, we manually collected the average ratings and the total number of ratings for the same movies from Filmeter and FilmCrave, since there were no APIs available, while for MovieLens we utilised the dataset consisting of 20 million ratings and selected the corresponding movies through the IMDb id. Similarly to what was done for Criticker, we calculated the average rating values for each movie. We then compared the values obtained from the five rating scales.
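The aggregation step just described (per-movie average ratings, matched across portals through the IMDb identifier) can be sketched with pandas as follows. The file and column names are hypothetical and only stand in for the actual exports and crawler output.

```python
import pandas as pd

# Hypothetical per-rating export from Criticker: one row per (imdb_id, rating in 0-100).
criticker = pd.read_csv("criticker_ratings.csv")              # columns: imdb_id, rating
criticker_avg = (criticker.groupby("imdb_id")["rating"]
                 .agg(["mean", "count"])
                 .rename(columns={"mean": "criticker_avg", "count": "criticker_n"}))

# Hypothetical per-movie exports from portals where only averages were collected.
imdb = pd.read_csv("imdb_averages.csv").set_index("imdb_id")            # imdb_avg, imdb_n
movielens = pd.read_csv("movielens_averages.csv").set_index("imdb_id")  # movielens_avg, movielens_n

# Keep only the movies that have ratings on every portal considered.
movies = pd.concat([criticker_avg, imdb, movielens], axis=1, join="inner")
print(movies.head())
```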
In order to be able to compare the different scales, the users' ratings were normalised onto a 0–1 scale, according to the static mapping described in Section 1. With this conversion mechanism, all the values are represented in a comparable 0–1 range. We saved all the converted data in a database and calculated descriptive statistics about ratings (mean, standard deviation, etc.), inferential statistics (Kruskal–Wallis test), correlations and regressions.

Results. The total number of ratings collected from each system is shown in Table 4, together with the mean and standard deviation. We are aware of the fact that the average number of ratings for movies greatly varies between one system and another. However, these numbers represent the actual photograph of the ratings for movies at the time of evaluation on these systems.

As we can notice in Table 5, the mean values show differences that need to be evaluated in order to discover if they are significant. In particular, IMDb and Criticker show higher average ratings, followed by MovieLens, FilmCrave and Filmeter. Our hypothesis in interpreting these results is that higher granularity and the presence of a neutral point encourage users to rate higher. Criticker offers the finest explicit granularity (and probably pushes the users to be more precise and even stricter) and, in terms of granularity, it is followed by IMDb, whose granularity is probably perceived as very close to that of Criticker (100 vs. 10). MovieLens has the same granularity as IMDb through the use of half-star ratings, but it offers the possibility to choose an explicit neutral point, which collects 23.6% of ratings (see Figure 8). Figure 8 indeed shows the percentage of ratings per rating position. We can observe that the half-star positions are not frequently used with respect to full-star positions. So we can conclude that, generally, users perceive as the main granularity of the scales the one associated with the full-star positions. The same probably happens with FilmCrave; however, we could not compute these calculations since we only have the average ratings per movie available. FilmCrave shows an average value closer to that of Filmeter, which shows the lowest average: they both lack a neutral point, and their granularities are close and based on even numbers. Indeed, Filmeter shows the lowest average, together with the widest rating range (0–1), probably because the icons associated with the rating positions push users towards the lowest rates, and because the scale lacks a neutral point.

Figure 8. Distribution of ratings per scale position in MovieLens.

In order to evaluate the significance of these average results, we performed the Kruskal–Wallis one-way analysis of variance (ANOVA) by ranks, a non-parametric method equivalent to the parametric one-way ANOVA, on the movie averages per system. When the Kruskal–Wallis test leads to significant results, at least one of the samples is different from the other samples. The Kruskal–Wallis test rejected the null hypothesis and confirmed that the values are different with significance at the .05 level. However, we also wanted to determine which of these groups significantly differ from each other. Thus, we ran a post hoc pairwise comparison that showed that the differences between the means are all significant.
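A sketch of the significance analysis described above is given below, using SciPy. The omnibus Kruskal–Wallis test is applied to the normalised per-movie averages of each portal; for the post hoc step, whose exact procedure is not named here, the sketch assumes pairwise Mann–Whitney tests with a Bonferroni correction (the procedure used later in the museum study). The values are toy data.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Normalised (0-1) per-movie average ratings, one sample per portal (toy values).
portals = {
    "Criticker": [0.67, 0.71, 0.64, 0.70, 0.66, 0.73],
    "IMDb":      [0.69, 0.72, 0.66, 0.71, 0.68, 0.74],
    "Filmeter":  [0.55, 0.60, 0.52, 0.58, 0.56, 0.61],
}

# Omnibus test: is at least one portal's distribution of averages different?
h_stat, p_value = kruskal(*portals.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.4f}")

# Post hoc pairwise comparisons, Bonferroni-corrected.
pairs = list(combinations(portals, 2))
for a, b in pairs:
    _, p = mannwhitneyu(portals[a], portals[b], alternative="two-sided")
    print(f"{a} vs {b}: corrected p = {min(p * len(pairs), 1.0):.4f}")
```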
In order to understand the general trend of rating behaviour, we first plotted the rating trends on a graph that orders the Criticker rating values from the smallest to the largest, and the other ratings accordingly (see Figure 9). From this general view, we can see that there are emerging trends between IMDb and Criticker (in Figure 9(a), the Criticker ratings are partially covered by those of IMDb), followed by MovieLens, FilmCrave and Filmeter.

In order to investigate the nature of these relationships, we calculated correlations. In particular, we calculated Pearson's r coefficient to highlight significant linear correlations among pairs of scales and then we performed a linear regression analysis to define linear functions which allow predicting user ratings on a particular scale, given their ratings on another scale. As we can notice in Table 6, all the rating values are significantly correlated. Following Cohen's classification system (Cohen 1988), we consider only large relationships, that is, the ones with r > 0.5 and explained variability > 0.25. All the correlations among the scales are large, and the ones between Criticker, MovieLens, IMDb and FilmCrave are extremely large.

Table 5. Main statistics for the movie average rating values: correlations (Pearson's r; all correlations are significant at the .01 level).
            Criticker   Filmeter   FilmCrave   IMDb     MovieLens
Criticker   1           0.791      0.943       0.944    0.946
Filmeter    0.791       1          0.783       0.767    0.740
FilmCrave   0.943       0.783      1           0.934    0.915
IMDb        0.944       0.767      0.934       1        0.933
MovieLens   0.946       0.740      0.915       0.933    1
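The pairwise correlation and regression analysis can be sketched as follows, assuming the normalised per-movie averages of two portals are aligned by movie. SciPy's pearsonr and linregress provide the r coefficient and the linear function Y = aX + b used to translate ratings from one scale to another; the data shown are toy values.

```python
from itertools import permutations
from scipy.stats import linregress, pearsonr

# Normalised average ratings for the same movies on different portals (toy, aligned by movie).
scales = {
    "Criticker": [0.62, 0.71, 0.55, 0.80, 0.67, 0.74],
    "Filmeter":  [0.48, 0.60, 0.41, 0.72, 0.55, 0.63],
}

for x_name, y_name in permutations(scales, 2):
    x, y = scales[x_name], scales[y_name]
    r, p = pearsonr(x, y)
    fit = linregress(x, y)
    print(f"{x_name} -> {y_name}: r = {r:.3f} (p = {p:.3f}), "
          f"Y = {fit.slope:.3f}X {fit.intercept:+.3f}, R^2 = {r ** 2:.2f}")
```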
Our main hypothesis was that ratings given to the same presentations using different rating scales would produce different results due to the influence of each rating scale. Design. Five factors (the five rating scales), betweensubjects design. Subjects. The participants (about 600, out of which only 251 logs were used due to various problems) were normal museum visitors who used a museum visitors’ guide during their visit. The visitors were not identified (only by a user ID that was defined when the guide was given to them). No personal information was collected. Apparatus and Materials. In early 2011, a museum visitors’ guide system was introduced at the Hecht museum, a small archaeological museum located at the University of Haifa, Israel. The system is described in Kuflik et al. (2015). It is a web-based system that allows users to freely walk around in the museum, wearing a 12 F. CENA ET AL. Table 6. Main statistic for the movie average rating values: regressions. Independent variable Dependent variable Criticker Filmeter FilmCrave IMDb MovieLens Criticker FilmCrave IMDb MovieLens Criticker Filmeter IMDb MovieLens Criticker Filmeter FilmCrave MovieLens Criticker Filmeter FilmCrave IMDb Filmeter FilmCrave IMDb MovieLens Pearson’s r Explained variance p Regression coefficient Α Equation 0.791 0.943 0.944 0.946 0.791 0.783 0.767 0.740 0.943 0.783 0.934 0.915 0.944 0.767 0.934 0.933 0.946 0.740 0.915 0.933 62% 88% 89% 89% 62% 61% 59% 55% 88% 61% 87% 84% 89% 59% 87% 87% 89% 55% 84% 87% .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 1.114 1.162 0.953 1.067 0.562 0.683 0.549 0.592 0.765 0.897 0.763 0.838 0.934 1.071 1.143 1.042 0.838 0.925 0.999 0.836 −0.182 −0.209 +0.053 −0.063 +0.354 +0.184 +0.383 +0.319 +0.235 +0.053 +0.258 +0.157 +0.024 −0.177 −0.222 −0.069 +0.124 −0.039 −0.082 +0.147 Y = 1.114X − 0.182 Y = 1.162X − 0.209 Y = 0.953X + 0.053 Y = 1.067X − 0.063 Y = 0.562X + 0.354 Y = 0.683X + 0.184 Y = 0.549X + 0.383 Y = 0.592X + 0.319 Y = 0.765X + 0.235 Y = 0.897X + 0.053 Y = 0.763X + 0.258 Y = 0.838X + 0.157 Y = 0.934X + 0.024 Y = 1.071X − 0.177 Y = 1.143X − 0.222 Y = 1.042X − 0.069 Y = 0.838X + 0.124 Y = 0.925X − 0.039 Y = 0.999X − 0.082 Y = 0.836X + 0.147 small proximity sensor and carrying an iPod touch. When they are detected in the vicinity of a point of interest, they are offered a selection of multimedia presentations about objects of interest. Once they have selected a presentation and viewed it, they are required to provide feedback about their satisfaction from the presentation before continuing the visit (i.e. providing feedback is mandatory before the user continues to use the system). As part of the design of the user interface, five different feedback mechanism designs (presented in Figure 10) were implemented and integrated into the system in order to explore whether the interface design of the rating scales has an impact on the ratings, as suggested by Kaptein, Nass, and Markopoulos (2010). The scales in this experiment differ in granularity (from 2 to 5 points), in metaphors (human emotions: the smiley faces; school marks/degrees: the numerical scale; scoring/ranking: the stars), in the presence of a neutral position (present in stars, in 3-point faces or in the numerical scale) and in numbering (there is a scale consisting of −1, 0, 1). They were selected due to their popularity (stars and faces) and due to the fact that stars are neutral, while smiley faces have an emotional aspect. 
Numbers were chosen in order to see what may be the impact of a clear negative value on ratings.

Figure 10. The five rating scales that could be randomly selected for use in the museum visitors' guide system.
Figure 11. Average, median and standard deviation of the ratings of the five rating scales used in the museum experiment.

Table 7. Linear regression analysis: results for five stars as the independent factor (Pearson's r, explained variance and regression equation; all p = .000).
All participants:
- Ten stars: r = 0.874, explained variance 76%, Y = 0.927X + 0.049
- Numbers: r = 0.83, explained variance 69%, Y = 0.838X + 0.146
- Five smileys: r = 0.823, explained variance 68%, Y = 0.934X + 0.084
- Filmeter: r = 0.804, explained variance 65%, Y = 0.896X + 0.090
- Three smileys: r = 0.791, explained variance 63%, Y = 1.306X − 0.043
- −1, 0, +1: r = 0.781, explained variance 61%, Y = 1.429X − 0.081
- Two smileys: r = 0.772, explained variance 59%, Y = 1.851X − 0.322
- Four stars: r = 0.761, explained variance 58%, Y = 1.036X − 0.064
Italians:
- Ten stars: r = 0.86, explained variance 74%, Y = 0.919X + 0.055
- Numbers: r = 0.81, explained variance 66%, Y = 0.838X + 0.137
- Filmeter: r = 0.80, explained variance 63%, Y = 0.884X + 0.101
- Five smileys: r = 0.79, explained variance 62%, Y = 0.906X + 0.118
- Three smileys: r = 0.77, explained variance 60%, Y = 1.266X − 0.005
- −1, 0, +1: r = 0.76, explained variance 58%, Y = 1.351X − 0.016
- Two smileys: r = 0.76, explained variance 58%, Y = 1.816X − 0.285
- Four stars: r = 0.75, explained variance 56%, Y = 1.045X − 0.064
Israelis:
- Ten stars: r = 0.90, explained variance 82%, Y = 0.936X + 0.047
- Five smileys: r = 0.90, explained variance 80%, Y = 0.972X + 0.026
- Numbers: r = 0.88, explained variance 77%, Y = 0.838X + 0.168
- Three smileys: r = 0.83, explained variance 68%, Y = 1.391X − 0.118
- Filmeter: r = 0.82, explained variance 67%, Y = 0.923X + 0.075
- −1, 0, +1: r = 0.81, explained variance 66%, Y = 1.518X − 0.162
- Two smileys: r = 0.79, explained variance 62%, Y = 1.916X − 0.393
- Four stars: r = 0.79, explained variance 62%, Y = 1.016X − 0.061

Procedure. We collected visit logs as part of our regular operation, where the visitors' guide was offered to normal museum visitors during regular museum opening hours. For experimentation purposes, whenever a visitor logged in and started using the system, a randomly selected rating scale was activated for them and used throughout the entire visit. The interactions of the visitors with the system were logged, where the log contained (among other data) the location of the visitor, the presentation viewed, whether it was completed or stopped, and the rating the visitor gave to the presentation. The experimentation started in October 2011 and ended in January 2014, when we had about 600 logs of visitors; out of them, 292 logs were good (there were quite a few logs that had problems: they were part of another specific experimentation, they were not recorded properly, etc.) and 251 that had 3 ratings or more were analysed. On average, a visitor rated 14.8 presentations (min = 3, max = 91, median = 12, STD = 12.2). With respect to the rated presentations, out of 346 different presentations available for the users to rate, 288 were actually rated by users during the visits. We had 3747 rating records; hence, a presentation got an average of 13 ratings over the 5 rating scales, resulting in an average of 2.6 ratings for every rating scale. Moreover, in many cases, items were not rated in all five rating scales but only in some of them. This study differs considerably from the other studies reported here with respect to the number of ratings available for analysis.
Hence, we provide only the overall results, without any further statistical analysis that might not provide additional insights given the relatively small number of ratings per rating scale and item.

Results. In general, all visitors scored the presentations high. Figure 11 presents the average ratings of the five different methods used in the experiment. In order to be able to compare the various scales, all were converted to the 0–1 scale according to the above-presented static mapping; that is, when there were 2 values, then 0 and 1 were used; when there were 3 values, then 0, 0.5 and 1 were used. Looking at Figure 11, we can see that the average values of the stars (1–5) and 5 faces are close, as are the numerical values (−1, 0, +1) and 3 faces. It is interesting to see among the face scales that 2 faces provided the highest scores, then 3 faces and then 5 faces. Looking at the medians, it seems that all the scales with the smallest range (2 stars and then −1, 0, +1 and 3 stars) had similar scores, while the two scales with the highest but identical range (five faces and stars) had, again, a lower but similar score. Given the specific setting of the museum, it is no wonder that in general the scores were high, as is also common in other areas. The museum is a pleasant environment where visitors come to enjoy themselves, and the presentations are informative, interesting and of high quality. Still, when visitors had the choice, that is, when the scale had more granularity, they were a bit more critical. Statistical analysis (Wilcoxon test and Mann–Whitney with Bonferroni correction) revealed that the differences between all scales are significant, with one exception: the difference between '−1, 0, +1' and '3 faces' was not statistically significant. Given all the above, we can say that whenever the range of the rating scales is similar, the conversion between them should be relatively easy, as here they have similar averages and medians. We can also note that when the range of scores becomes smaller, the average score becomes higher: users tend to rate positively, but when there is more flexibility, they use it. So if you would like to get better feedback, then a range of 5 enables it.

6. Users' rating behaviour in a web-based user study

Goals. We set up a web-based user study in order to further analyse the differences within the set of rating scales used in our previous studies (see Figure 1). The main goal of this study was understanding how users map their ratings on different rating scales. More specifically, we focused on the conversion of three input ratings (2, 3 and 4 stars) from the 5-star rating scale to all other scales. The 3-star rating was chosen since the use of a midpoint in rating scales is much debated in the literature and we aimed at observing the way people translate it when they are forced to make a choice. The other two ratings were chosen in order to understand how people convert ratings that, although being quite clearly 'low' or 'high', are not extreme values (1 and 5 ratings were ignored as these are extreme values that are easy to convert). The 5-star scale was chosen as the input scale due to its familiarity to most users. In fact, according to a pilot survey we carried out on 45 web-based rating systems in the book, movie and travel domains, stars were the most popular icon (20/45), while 5 points was the most exploited granularity (16/45).

Hypothesis.
We hypothesised that there are statistically significant differences between the ratings of users when they convert a given rating in a given scale to different rating scales.

Design. A 3 × 8 factorial design, that is, input rating × output rating scale, within subjects.

Subjects. We recruited 330 participants among the Facebook contacts of the authors, according to an availability sampling strategy. Of them, 206 were Italians, 105 were Israelis and 19 were from other countries. Of these, 266 participants (174 Italians, 74 Israelis and 18 from other countries) completed the whole test.

Apparatus and materials. Participants took part in our study through web pages (in English) that they could access on their own, at their convenience. All questions were accompanied by figures or interactive representations of the input ratings and/or rating scales they mentioned.

Procedure. This study was carried out in 2013. Participants were asked to think of three movies that they would rate with 2, 3 and 4 stars out of 5, respectively, and to rate them using each one of the eight output scales (Figure 12).

Figure 12. An example web page used for our user study.

Results. To allow comparison, users' ratings were converted to a 0–1 range following the same static mapping we used for the previous experiments. Results are presented in Figure 13, where many small differences can be noticed. Considering user conversions of the 2-, 3- and 4-star ratings separately, the Friedman test confirmed our intuition that there are significant differences among user ratings on the eight output scales in all three cases. Then, we compared all possible pairs of scales using the Wilcoxon signed-ranks test (note that with 8 rating scales, there are 28 different pairs). Results showed statistically significant differences for most of them, thus supporting our idea that scales can actually exert an influence on user ratings. However, an agreement (i.e. no significant differences, indicating that two scales have a similar influence on users) could still be observed in at least 5 cases out of 28 for each one of the three rating conversions.

Figure 13. Two, three and four stars rating interpretation in eight different rating scales.

The most significant cases of agreement are the following:

- Three and 5 smileys were found to be in agreement with 10 stars in the translation of 2 and 3, and of 3 and 4 stars, respectively. Similarly, they were in agreement with Filmeter in the translation of 2- and 3-star ratings (3 smileys) and 3- and 4-star ratings (5 smileys).
- Three smileys and five smileys themselves were found to be in agreement for the translation of 2- and 3-star ratings, which is not surprising since these two scales are quite similar (same metaphor, use of colour, use of explicitly negative points, close granularity and a neutral point in both cases).
- Three smileys were found to be in agreement with '−1, 0, +1' for the translation of 3- and 4-star ratings. Also for these two scales, which were already found to be in agreement in the museum experiment (Section 'Rating scales in the museum'), similarity in user ratings can be explained by the fact that they have the same granularity, a neutral point and explicitly represented negative positions, even if they have very different visual metaphors.
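The within-subjects analysis described above (a Friedman omnibus test over the eight output scales, followed by pairwise Wilcoxon signed-rank tests on the 28 pairs) can be sketched with SciPy as follows; each list holds the normalised conversions produced by the same participants, and the values are toy data.

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# Normalised conversions of the same input rating by the same participants (toy data).
conversions = {
    "ten_stars":    [0.44, 0.56, 0.44, 0.56, 0.44, 0.67, 0.56, 0.44],
    "five_smileys": [0.50, 0.50, 0.50, 0.75, 0.50, 0.75, 0.50, 0.25],
    "two_smileys":  [0.00, 1.00, 0.00, 1.00, 0.00, 1.00, 1.00, 0.00],
}

# Omnibus test across the output scales (within subjects).
stat, p = friedmanchisquare(*conversions.values())
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.4f}")

# Pairwise Wilcoxon signed-rank tests, Bonferroni-corrected.
pairs = list(combinations(conversions, 2))
for a, b in pairs:
    _, p_pair = wilcoxon(conversions[a], conversions[b])
    print(f"{a} vs {b}: corrected p = {min(p_pair * len(pairs), 1.0):.4f}")
```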
Since we noticed some differences between Israelis and Italians, we analysed their answers separately. Considering the 2-star rating (Figure 14), we can observe that in cases where the means do not agree, the Israelis were usually more critical than the Italians (the exception being the ‘numbers’ scale). Statistical analysis (MANOVA30 with Bonferroni correction) revealed that there are significant differences between the two national groups in the ‘numbers’ scale, in the ‘−1, 0, +1’ scale and in the 5-smiley scale (p < .05). Notice that the observed average difference is smaller for numbers than for the other two scales, while relatively large differences between Italians and Israelis (e.g. in the case of the 3-smiley scale) were not found to be significant.
Figure 14. Comparison between Italians and Israelis: two stars rating conversion.
As far as the 3-star rating is concerned, visual inspection of Figure 15 suggests that differences between the two groups are systematically larger, with the exceptions of the ‘numbers’ scale (smaller difference) and of the 4-star, 10-star and Filmeter scales, which again elicited very close rating interpretations in the two groups. Again, the Israelis were a bit more critical in their ratings. Statistical analysis (MANOVA) revealed that there are significant differences regarding the 5-, 3- and 2-smiley scales, and the ‘−1, 0, +1’ scale (p < .05).
Figure 15. Comparison between Italians and Israelis: three stars rating conversion.
On the contrary, differences are generally very small for the 4-star rating (Figure 16), and they are statistically significant only for the interpretation of the numbers and 5-smiley scales (p < .05).
Figure 16. Comparison between Italians and Israelis: four stars rating conversion.
Following the same approach we described in Section 4, we joined the data deriving from users’ translations of the three input ratings and performed further analyses to study whether and how ratings can be translated from one scale to another, concentrating on the existence of simple linear correlations. More specifically, we considered pairs consisting of 5 stars and each one of the other scales. Five stars played the role of the independent variable in the linear regression study and was chosen because it was used as the standard input scale in our web-based study. Results are presented in Table 8, including the linear regression-based equations that allow ratings to be translated from one scale to another. Notice that scales are ordered according to the value of r (or, equivalently, to the percentage of explained variability), so that those which show a higher linear correlation with the independent variable are listed first. All results are significant, with p = .01, and following Cohen’s classification system (Cohen 1988), all correlations can be considered very large.

Table 8. Main statistics for the web-based user study: correlations (Pearson’s r).
            Numbers  Filmeter  Four stars  Ten stars
Numbers     1        0.768     0.731       0.868
Filmeter    0.768    1         0.706       0.817
Four stars  0.731    0.706     1           0.780
Ten stars   0.868    0.817     0.780       1
Note: all the correlations are significant at the .01 level.

The strongest correlation exists between 5 stars and 10 stars (r = 0.874). This result is not surprising, since we previously observed statistical agreement between ratings on these two scales, and we could already find an extremely large correlation in the real movie ratings analysis, where MovieLens was treated as a 5-point scale by excluding half-star ratings.
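As a side note on method, the correlation coefficients and the regression-based conversion equations reported in this section (Tables 8 and 9) correspond to ordinary least-squares fits between paired ratings expressed by the same users on two scales. A minimal, self-contained sketch with placeholder data (the numbers below are illustrative and not taken from our studies) is:

```python
import numpy as np
from scipy.stats import linregress

# Paired ratings by the same users on two scales, normalised to 0-1.
x = np.array([0.25, 0.50, 0.75, 0.25, 0.50, 0.75, 0.50, 1.00])  # e.g. input scale
y = np.array([0.22, 0.48, 0.80, 0.30, 0.55, 0.70, 0.45, 0.95])  # e.g. output scale

fit = linregress(x, y)
print(f"Pearson's r = {fit.rvalue:.3f}, explained variance = {fit.rvalue ** 2:.0%}")
print(f"conversion equation: Y = {fit.slope:.3f}X {fit.intercept:+.3f}")

# Translating a new rating from the first scale to the second one.
new_x = 0.75
predicted_y = fit.slope * new_x + fit.intercept
print(predicted_y)
```

The slope and intercept of such a fit are what is reported in the ‘Regression coefficient’ and ‘α’ columns of Table 9 below.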
Moreover, ratings on the 5-star scale are good predictors for ratings on the numbers, 5-smiley and Filmeter scales. While 5 stars, 5 smileys, 10 stars and numbers all have ‘related’ granularity, Filmeter is similar to the other scales in that it also has a fine granularity. Considering the two national populations separately, we found no big differences between Italians and Israelis, but we can observe that correlations are systematically higher for Israelis. In fact, the average value of the Pearson correlation is 0.83 for Israelis and 0.78 for Italians.
In order to allow comparison with the real movie ratings analysis, we also studied the existence of linear relationships in pairs consisting of the following scales: numbers (Criticker), Filmeter, 4 stars (FilmCrave) and 10 stars (e.g. IMDb) (see Table 9). For simplicity, the scale with the finest granularity played the role of the independent variable. Our data show that there are large relationships for all the pairs, thus confirming our findings for the real movie ratings analysis.

Table 9. Linear regression analysis: results for numbers, Filmeter, 10 stars and 4 stars pairs.
Independent variable  Dependent variable  Pearson’s r  Explained variance  p     Regression coefficient  α       Equation
Numbers               Ten stars           0.868        75%                 .000  0.912                   −0.003  Y = 0.912X − 0.003
Numbers               Filmeter            0.768        59%                 .000  0.849                   +0.058  Y = 0.849X + 0.058
Numbers               Four stars          0.731        53%                 .000  0.986                   −0.105  Y = 0.986X − 0.105
Ten stars             Filmeter            0.817        67%                 .000  0.859                   +0.098  Y = 0.859X + 0.098
Ten stars             Four stars          0.780        61%                 .000  1.000                   −0.060  Y = 1.000X − 0.060
Filmeter              Four stars          0.706        50%                 .000  0.862                   −0.012  Y = 0.862X − 0.012
Note: all relationships are large according to Cohen’s (1988) classification.

Summing up our main findings, we can observe the following:
• In most cases, there are statistically significant differences between user ratings on any two rating scales we tested. This result allows us to accept our hypothesis and claim that rating scales have an influence on users’ rating behaviour.
• Cases of agreement between two scales can be explained by the fact that they have various objective features in common. In most cases, the common feature is granularity, which can be equivalent (as for 3 smileys and ‘−1, 0, +1’), sometimes quite close (5 smileys and Filmeter) and sometimes directly related (5 smileys and 10 stars). From the cases of agreement we observed, only 3 smileys did not have a similar granularity to 5 smileys and (particularly) 10 stars.
• All rating scales are significantly correlated, so that ratings can be translated from one scale to another using linear regression-based equations.
• The way different rating scales are perceived is likely to be influenced by cultural background. Israelis usually seem more critical than Italians; at the same time, they seem to be more consistent when rating items on different scales.

7. Discussion
Our work complements and adds to the large body of knowledge about rating scales that already exists. Our study in particular focused on the impact of the appearance of rating scales on user ratings in recommender systems. In this paper, we have provided new evidence that rating scales do have distinctive features that influence users’ rating behaviour. This is confirmed by our statistical analysis, which showed that the average differences between rating scales (when transformed to a common scale) are significant in most cases, for all three empirical evaluations we carried out (Sections 4, 5 and 6). In particular, in experiments 1 (Section 4) and 3 (Section 6), most of the rating data are significantly correlated, thus showing a similar behaviour of the users while they rate items, but with up/down differences, in our opinion due to the specific characteristics of the rating scales. Notice that, for experiment 2 (Section 5), the smaller number of ratings (2.6 per rating scale per item) prevents us from calculating correlations.
7.1 The impact of granularity on rating
In particular, concerning experiment 1, IMDb, Criticker and MovieLens show higher average ratings, followed by FilmCrave and Filmeter. Our hypothesis is that 10-point granularity encourages users to rate higher. Criticker offers the finest granularity, and probably pushes users to be more precise and even stricter. IMDb offers an explicit 10-point granularity and, in comparison with MovieLens (which offers an implicit 10-point granularity through half-star ratings and gives the possibility to choose a neutral point), its average results are indeed higher. MovieLens offers half-star ratings, which are used much less frequently than full-star ratings (see Figure 8). Filmeter shows the lowest average, also associated with the widest rating range (0–1). We can hypothesise that the absence of a neutral point, and the icons associated with rating positions, push users to express lower ratings (see Figure 2). Neutral icons such as stars do not seem to affect users as much: when they interact with star-based scales, users are probably more influenced by the granularity and the presence or absence of a neutral point. On the contrary, very descriptive and representational icons, such as those of Filmeter, exert a greater influence on users’ ratings.
At the same time, performing correlational studies and linear regression analyses, we have highlighted strong relationships among user ratings on different scales: such relationships tell us that, when using these scales, users express their ratings in a comparable and consistent way, so that ratings on one scale could easily be predicted, provided that users’ ratings on a related scale are known. In particular, regarding experiment 1, we observed very large correlations between Criticker, MovieLens, IMDb and FilmCrave, all with r > 0.9, and large correlations (0.740 ≤ r ≤ 0.791) between Filmeter (which is characterised by more influential icons) and the remaining ones. All these correlations were confirmed in experiment 3,31 where we also identified a large correlation between 5 and 10 stars. Interestingly, we observed that all correlated scales show similar or closely related granularity, as well as sharing other objective features. Likewise, we observed some cases of statistical agreement between pairs of rating scales (e.g. three smileys and the ‘−1, 0, +1’ scale): again in this case, the involved scales usually have similar objective features, and they always have the same or a closely related granularity. Moreover, we observed that scales with a larger granularity seem to stimulate reasoning and, therefore, the expression of pondered ratings, while scales with lower granularity lead to extreme ratings, in accordance with state-of-the-art work (Preston and Colman 2000).
Thanks to experiment 2, we could extend our results to the museum context, and confirm that ratings given to the same item with different rating scales are significantly different. Also in this case, granularity was useful to explain the gathered data: on the one hand, ratings were closer for scales with a similar granularity; on the other hand, user ratings became higher when granularity was lower. Interestingly, this last insight differs from our results from experiment 1, where users tended to give higher ratings when granularity was higher (as in Lim 2008). A simple way to explain these apparently contrasting behaviours is to ascribe them to domain-specific effects (museum vs. cinema). In general, users liked and appreciated the high-quality multimedia presentations they watched and found them informative, hence the high evaluations they received. This may explain why, with the rating scales with two or three options, the ratings were very high in general, while with the rating scales with five options (star and smiley ratings were similar), the evaluations were a bit more moderate. However, if we consider that rating scale granularity ranges from 4 to 101 in experiment 1 and from 2 to 5 in experiment 2, our data may actually highlight a tendency to give higher ratings with scales that have an ‘extreme’ granularity, namely, either very low or very high, and to be more critical when using scales with an intermediate granularity.
Considering all three empirical studies, our intuition is that granularity is the main and most clearly identifiable feature responsible for the effects that rating scales can exert on user ratings, an insight that also confirms the assumption that granularity is one of the most important features of rating scales, in accordance with the model we adopted (Section 2) and related work (such as Lim 2008 and Dawes 2002). According to our experiments, other scale features, such as the presence of a neutral point or particularly expressive icons, influence users’ ratings in a less predictable way.
7.2 Comparing rating scales and their impact on recommender systems’ performance
Another intuition garnered from our results is that rating scales also have an impact on the performance of the recommender, as already introduced by Vaz, Ribeiro, and de Matos (2013). This implies that comparing, using MAE for instance, the results of recommender systems that collected user preferences by means of different rating scales may lead to incomparable results (a small numerical illustration is given at the end of this subsection). As highlighted by all three experiments (see Sections 4, 5 and 6), some scales seem to push users towards a similar, and thus comparable, rating behaviour, while many other scales do not. Thus, our suggestion is that MAE and RMSE can be compared only when recommender systems use the same rating scale to collect user preferences. As far as rating conversion is concerned, our results show that a uniform, predefined mapping is not enough if we want to compensate for the effects of rating scales. Luckily, we have found that strong linear relationships exist between most pairs of rating scales, which can be used for rating translation. However, the specific coefficients to use for rating conversions differed in our evaluations, thus preventing us from providing a ‘one-size-fits-all’ recipe for rating translation. Our intuition is that, given the existence of linear relationships between two types of scales, a specific conversion formula should be derived from an analysis of sample data which can be considered representative, given the domain and audience that a certain service/website aims at targeting.
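To make the point about error metrics concrete, the toy example below (hypothetical numbers, not from our experiments) shows that two systems whose predictions are each off by one scale step report the same raw MAE, even though one step means something quite different on a 5-point and on a 10-point scale; only after mapping both onto a common range does the difference become visible, and even then the behavioural effects discussed above are not accounted for:

```python
import numpy as np

def mae(true, pred):
    """Mean absolute error between two equally long rating vectors."""
    return float(np.mean(np.abs(np.asarray(true) - np.asarray(pred))))

# Hypothetical data: predictions are off by one scale step in both systems.
true_5, pred_5 = [2, 3, 4, 5], [3, 4, 3, 4]        # 5-point scale
true_10, pred_10 = [4, 6, 8, 10], [5, 7, 7, 9]     # 10-point scale

print(mae(true_5, pred_5), mae(true_10, pred_10))  # 1.0 and 1.0: apparently equal

# After normalising both scales to 0-1, the errors are clearly not equivalent.
norm = lambda r, n: (np.asarray(r) - 1) / (n - 1)
print(mae(norm(true_5, 5), norm(pred_5, 5)))       # 0.25
print(mae(norm(true_10, 10), norm(pred_10, 10)))   # ~0.11
```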
8. Conclusions and future work
The studies reported here confirmed our past results that there are significant differences in users’ average ratings on different scales. However, we also confirmed Cosley et al.’s (2003) result that ratings expressed on different scales correlate well in most cases. This means that ratings on a scale can be predicted based on ratings on other scales, which is good, but a static, predefined mapping is not enough, in that it does not fully take into account the effects of rating scales on users’ rating behaviour. According to our understanding, such effects are responsible for the differences we could observe in users’ average ratings. When there is a strong positive correlation (according to Pearson’s r coefficient), we suggest translating ratings using linear equations derived from regression analysis. For example, this can be done when ratings are translated from a 10-point star scale to a 101-point scale where users input bare numbers. Based on our studies, however, there are no fixed parameters for such equations, and we suggest deriving them from an analysis of real ratings expressed by the target group of users in the target domain. As future work, we aim to conduct further studies on rating scales, in different contexts and different domains, examining the relationship between rating behaviour and rating scale features. This will be conducted with the goal of trying to determine a general model capable of explaining users’ rating behaviour given a particular scale.

Notes
1. http://www.youtube.com
2. http://www.amazon.com
3. http://www.anobii.com
4. http://www.barnesandnoble.com
5. http://www.facebook.com
6. http://www.tripadvisor.com
7. http://www.laterooms.com
8. http://www.criticker.com
9. http://www.booking.com
10. A detailed analysis of the impact of colour on user ratings is out of the scope of the current paper. The interested reader can refer to Tourangeau, Couper, and Conrad (2007) for more information on their study or to Stickel et al. (2009) for a more general discussion of how colours can enhance usability and user experience.
11. Notice that, by the term ‘opinion measurement interfaces’, the authors specifically refer to rating scales.
12. In the context of Web 2.0, and with the proliferation of user-generated contents, however, the analysis of people’s opinions, attitudes and emotions expressed through natural language texts is gaining relevance. The interested reader can refer to Petz et al. (2015) for a discussion of the challenges arising from user-generated content (with respect to more formal texts) and an evaluation of the effectiveness of several text pre-processing algorithms.
13. http://mushecht.haifa.ac.il/Default_eng.aspx
14. http://www.criticker.com
15. http://www.filmeter.net. Note that the data were gathered from Filmeter in 2013. Currently, the system has changed its rating scale.
16. http://www.filmcrave.com
17. http://www.movielens.umn.edu
18. http://www.imdb.com
19. Even though random sampling is the best way of obtaining a representative sample, such strategies require a great deal of time and money. Therefore, much research in psychology is based on samples obtained through non-random selection, such as availability sampling – that is, a sampling of convenience, based on subjects available to the researcher, often used when the population source is not completely defined (Royce and Straits 1999).
20. http://www.criticker.com
21. http://www.filmeter.net. Notice that the data were gathered from Filmeter in 2013. Currently, the system has changed its rating scales.
22. http://www.filmcrave.com
23. http://www.imdb.com
24. http://www.movilens.com
25. Filmeter merely shows the integer value of the averages, while other systems, such as IMDb and FilmCrave, show the rounded averages.
26. http://www.grouplens.org/taxonomy/term/14
27. Availability sampling is a sampling of convenience, based on subjects available to the researcher, often used when the population source is not completely defined. Even though random sampling is the best way of obtaining a representative sample, such strategies require a great deal of time and money. Therefore, much research in psychology is based on samples obtained through non-random selection (Royce and Straits 1999).
28. Notice that both rating scales (in the first and second parts) and adjectives (in the second part) were always presented to participants in a fixed order. This might be considered a limitation of this study, since it might determine some order-related bias.
29. The Friedman test is a non-parametric statistical test similar to ANOVA. It is used to detect whether at least one of the examined samples/items is significantly different from the others.
30. MANOVA was used instead of ANOVA since we had to consider two dependent variables, that is, ratings by Israelis and ratings by Italians.
31. Notice, however, that here we used 5-star and 4-star rating scales with no half-star ratings, differently from MovieLens and FilmCrave.

Acknowledgements
We are very grateful to Fabrizio Garis and Sathya Del Piano, the two students who helped us in collecting the large amount of data we studied in the real movie ratings analysis (Section 4). We are also very grateful to Martina Deplano, who helped us in carrying out the web-based user study (Section 6).

Disclosure statement
No potential conflict of interest was reported by the authors.

References
Adomavicius, G., and A. Tuzhilin. 2005. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-art and Possible Extensions.” IEEE Transactions on Knowledge and Data Engineering 17: 734–749.
Amoo, T., and H. H. Friedman. 2001. “Do Numeric Values Influence Subjects’ Responses to Rating Scales?” Journal of International Marketing and Marketing Research 26: 41–46.
Cantador, I., I. Fernández-Tobías, S. Berkovsky, and P. Cremonesi. 2015. “Cross-Domain Recommender Systems.” In Recommender Systems Handbook, 919–959. Boston, MA: Springer.
Cena, F., and F. Vernero. 2015. “A Study on User Preferential Choices About Rating Scales.” International Journal of Technology and Human Interaction 11 (1): 33–54.
Cena, F., F. Vernero, and C. Gena. 2010. “Towards a Customization of Rating Scales in Adaptive Systems.” In User Modeling, Adaptation, and Personalization – 18th International Conference, UMAP 2010, Big Island, HI, USA, June 20–24, edited by P. De Bra, A. Kobsa, and D. N. Chin. Proceedings, Volume 6075 of Lecture Notes in Computer Science, 369–374.
Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cosley, D., S. K. Lam, I.
Albert, J. A. Konstan, and J. Riedl. 2003. “Is Seeing Believing?: How Recommender System Interfaces Affect Users’ Opinions.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, 585–592. New York: ACM.
Cummins, R., and E. Gullone. 2000. “Why We Should Not Use 5-Point Likert Scales: The Case for Subjective Quality of Life Measurement.” In Proceedings of the Second International Conference on Quality of Life in Cities, 74–93. Singapore: The School.
Dawes, J. 2002. “Five-Point vs. Eleven-Point Scales: Does It Make a Difference to Data Characteristics?” Australasian Journal of Market Research 10: 39–47.
Dawes, J. 2008. “Do Data Characteristics Change According to the Number of Scale Points Used? An Experiment Using 5-Point, 7-Point and 10-Point Scales.” International Journal of Market Research 50: 61–77.
Friedman, H. H., and T. Amoo. 1999. “Rating the Rating Scales.” Journal of Marketing Management 9 (3): 114–123.
Friedman, H., P. Herskovitz, and S. Pollack. 1993. “The Biasing Effect of Scale-Checking Styles on Response to a Likert Scale.” In Proceedings of the Joint Statistical Meeting, 792–795. San Francisco, CA: ASA.
Garland, R. 1991. “The Mid-Point on a Rating Scale: Is It Desirable?” Marketing Bulletin 2: 66–70.
Gena, C., R. Brogi, F. Cena, and F. Vernero. 2011. “The Impact of Rating Scales on User’s Rating Behavior.” In User Modeling, Adaption and Personalization – 19th International Conference, UMAP 2011, Girona, Spain, July 11–15. Proceedings, Volume 6787 of Lecture Notes in Computer Science, 123–134.
Goldberg, K. Y., T. Roeder, D. Gupta, and C. Perkins. 2001. “Eigentaste: A Constant Time Collaborative Filtering Algorithm.” Information Retrieval 4 (2): 133–151.
Herlocker, J. L., J. A. Konstan, L. G. Terveen, and J. Riedl. 2004. “Evaluating Collaborative Filtering Recommender Systems.” ACM Transactions on Information Systems 22 (1): 5–53.
Holzinger, A., R. Scherer, and M. Ziefle. 2011. “Navigational User Interface Elements on the Left Side: Intuition of Designers or Experimental Evidence?” In Human-Computer Interaction – INTERACT 2011, Lisbon, Portugal, September 5–9, edited by P. Campos, N. Graham, J. Jorge, N. Nunes, P. Palanque, and M. Winckler. Proceedings, Volume 6947 of Lecture Notes in Computer Science, 162–177.
Hu, R., and P. Pu. 2013. “Exploring Relations Between Personality and User Rating Behaviors.” UMAP Workshops 2013, CEUR-WS.org.
John, O. P., and S. Srivastava. 1999. “The Big Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives.” In Handbook of Personality: Theory and Research, 102–138.
Kaptein, M. C., C. Nass, and P. Markopoulos. 2010. “Powerful and Consistent Analysis of Likert-Type Rating Scales.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10, 2391–2394. New York: ACM.
Kuflik, Tsvi, Alan J. Wecker, F. Cena, and C. Gena. 2012. “Evaluating Rating Scales Personality.” In User Modeling, Adaptation, and Personalization – 20th International Conference, UMAP 2012, Montreal, Canada, July 16–20, edited by J. Masthoff, B. Mobasher, M. C. Desmarais, and R. Nkambou. Proceedings, Volume 7379 of Lecture Notes in Computer Science, 310–315.
Kuflik, Tsvi, Alan J. Wecker, Joel Lanir, and Oliviero Stock. 2015. “An Integrative Framework for Extending the Boundaries of the Museum Visit Experience: Linking the Pre, During and Post Visit Phases.” Journal of IT & Tourism 15 (1): 17–47.
Lim, H. E. 2008.
“The Use of Different Happiness Rating Scales: Bias and Comparison Problem?” Social Indicators Research 87: 259–267.
Nobarany, S., L. Oram, V. K. Rajendran, C.-H. Chen, J. McGrenere, and T. Munzner. 2012. “The Design Space of Opinion Measurement Interfaces: Exploring Recall Support for Rating and Ranking.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, 2035–2044. New York: ACM.
Petz, G., M. Karpowicz, H. Fürschuß, A. Auinger, V. Stríteský, and A. Holzinger. 2015. “Reprint of: Computational Approaches for Mining User’s Opinions on the Web 2.0.” Information Processing & Management 51 (4): 510–519.
Preston, C., and A. Colman. 2000. “Optimal Number of Response Categories in Rating Scales: Reliability, Validity, Discriminating Power, and Respondent Preferences.” Acta Psychologica 104 (1): 1–15.
Royce, S. A., and B. C. Straits. 1999. Approaches to Social Research. 3rd ed. New York: Oxford University Press.
Schafer, J. B., D. Frankowski, J. Herlocker, and S. Sen. 2007. “Collaborative Filtering Recommender Systems.” In The Adaptive Web, 291–324. Berlin: Springer-Verlag.
Shaftel, J., B. L. Nash, and S. C. Gillmor. 2012. Effects of the Number of Response Categories on Rating Scales. Roundtable presented at the Annual Conference of the American Educational Research Association. https://cete.ku.edu/sites/cete.drupal.ku.edu/files/docs/Presentations/2012_04_Shaftel%20et%20al.,%20Number%20of%20Response%20Categories,%204-9-12.pdf.
Sparling, E. I., and S. Sen. 2011. “Rating: How Difficult Is It?” In Proceedings of the ACM Conference on Recommender Systems (RecSys 2011), edited by B. Mobasher, R. D. Burke, D. Jannach, and G. Adomavicius, 149–156. New York: ACM.
Stickel, C., K. Maier, M. Ebner, and A. Holzinger. 2009. “The Modeling of Harmonious Color Combinations for Improved Usability and UX.” ITI 2009, 323–328.
Swearingen, K., and R. Sinha. 2002. “Interaction Design for Recommender Systems.” In Proceedings of Designing Interactive Systems 2002. New York: ACM Press.
Tourangeau, R., M. P. Couper, and F. Conrad. 2007. “Color, Labels, and Interpretive Heuristics for Response Scales.” Public Opinion Quarterly 71 (1): 91–112.
Usman, Z., F. A. Alghamdi, A. Tariq, and T. N. Puri. 2010. “Relative Ranking – A Biased Rating.” In Innovations and Advances in Computer Sciences and Engineering, edited by T. Sobh, 25–29. Dordrecht: Springer.
van Barneveld, J., and M. van Setten. 2004. “Designing Usable Interfaces for TV Recommender Systems.” In Personalized Digital Television. Targeting Programs to Individual Users, edited by L. Ardissono, A. Kobsa, and M. Maybury, 259–285. Dordrecht: Kluwer Academic Publishers.
Vaz, P. C., R. Ribeiro, and D. M. de Matos. 2013. “Understanding the Temporal Dynamics of Recommendations Across Different Rating Scales.” UMAP Workshops 2013, CEUR-WS.org.
Weijters, B., E. Cabooter, and N. Schillewaert. 2010. “The Effect of Rating Scale Format on Response Styles: The Number of Response Categories and Response Category Labels.” International Journal of Research in Marketing 27 (3): 236–247.
Weng, L. J. 2004. “Impact of the Number of Response Categories and Anchor Labels on Coefficient Alpha and Test-Retest Reliability.” Educational and Psychological Measurement 64 (6): 956–972.
Yan, T., and F. Keusch. 2015. “The Effects of the Direction of Rating Scales on Survey Responses in a Telephone Survey.” Public Opinion Quarterly 79 (1): 145–165.