
Assessing usability and fun in educational software

2005, Proceedings of the 2005 Conference on Interaction Design and Children (IDC '05)

Abstract

We describe an investigation into the relationship between usability and fun in educational software designed for children. Twenty-five children aged between 7 and 8 participated in the study. Several evaluation methods were used; some collected data from observers, and others collected reports from the users. Analysis showed that ratings for fun and usability were correlated in both the observational data and the user reports, but that there was no significant correlation between the observed data and the reported data. We discuss the possible reasons for these findings, and describe a method that was successful in eliciting opinions from young children about fun and usability.

Assessing Usability and Fun in Educational Software

Stuart MacFarlane, Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom, +44 (0)1772 893291, [email protected]

Gavin Sim, Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom, +44 (0)1772 895162, [email protected]

Matthew Horton, Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom, +44 (0)1772 895151, [email protected]

Keywords
Children, Usability, Fun, Evaluation, Educational Software, Computer-Assisted Learning.

INTRODUCTION

Educational Multimedia & SATs
In England, targets have been established for school leavers to be accredited in ICT (Information and Communication Technology), and all schools were due to be connected to the Internet by 2002 [1]. Children from the age of 5 are developing basic ICT skills as a consequence of these policies, and a wide range of study material in digital format has emerged to support their learning in all subject domains. Digital content, in the form of multimedia applications, can offer greater opportunities to engage children in learning environments, through interactive games, targeted immediate feedback, and the use of multiple sensory modalities in presenting the content.

One area in which multimedia software has emerged in England is supporting children in the preparation for the SAT (Standard Attainment Task) tests at all levels of the curriculum. In England, SAT tests are used at the end of Key Stage 1 (age 7), Key Stage 2 (age 11) and Key Stage 3 (age 14) as a means of measuring the progress and attainment of children in the national curriculum. These tests are seen by many parents to be an important indicator of achievement. This paper examines three software packages designed to assist children in their preparation for Key Stage 1 SAT tests in Science, exploring issues relating in particular to usability, fun and learning.

Usability and Fun
Usability is an important factor in establishing whether educational software will facilitate the acquisition of knowledge. ISO 9241-11 [2] defines usability as the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use. If users perceive that a system is very difficult to use, that perception may influence their ability to absorb material provided by the system [3].

Fun is harder to define. Carroll [4] (pp. 38-39) summarises it well: "Things are fun when they attract, capture, and hold our attention by provoking new or unusual emotions in contexts that typically arouse none, or arousing emotions not typically aroused in a given context." Perhaps it needs to be added that the emotions should be happy ones. He is not writing about children here, and children may not experience fun in exactly the same way. Draper [5] suggests that fun is associated with playing for pleasure, and that activities should be done for their own sake through freedom of choice. In other words, fun is not goal-related. It should be noted that 'fun' is not the same as 'satisfaction' in the definition of usability above: satisfaction involves progress towards goals, while fun does not. Carroll [4] suggests that the concept of usability should be extended to include fun, but here we are using the ISO definition of usability, and treating fun as a completely separate (though not entirely independent) construct.

An objective of educational software for children is to provide an engaging learning environment, keeping children's attention by providing fun [6]. This is usually achieved through games. However, one danger of adopting computer technology into education is that learning is devalued by being seen as fun and entertainment [7].

Assessing usability and fun
There are several approaches to measuring usability or fun as part of a user study. One is to observe what happens, noting evidence of usability or fun as it occurs during the interaction; another is to ask the users for their own assessments of the usability or fun in the interaction. The first of these is analogous to a doctor diagnosing patients' medical conditions by observing signs that indicate particular conditions; the second is analogous to the doctor diagnosing by asking the patients to describe their symptoms. Signs and symptoms are not always measures of the same things; consequently, in medicine, the best diagnosis is obtained by considering both signs and symptoms. In the same way, in a user study, measures of usability or fun taken from observations are likely to be measuring different things from measures taken from users' own reports of usability and fun. A third approach is to assess usability or fun using 'expert' methods such as Heuristic Evaluation [8] or Cognitive Walkthrough [9]. These methods are analogous to medical diagnosis based on laboratory analysis of samples, which is carried out in the absence of the patient; the results can also be used in diagnosis.

In medicine, signs, symptoms, and pathology test results all provide overlapping information about the state of health of the patient, though none of them, in general, provides all of the salient information on its own. 'Health' is a complex and ill-defined construct, and an assessment of it is best made using a variety of methods to obtain a number of measures. Similarly, 'usability' and 'fun' are complex constructs, which can be assessed in a number of ways. Survey methods such as interviews and questionnaires collect 'symptoms', observation methods collect 'signs', and expert methods collect 'pathology results'.

Markopoulos and Bekker [10] discuss how evaluation methods for interactive products for children can be compared, and Gray and Salzman [11] criticise a number of earlier comparative studies of evaluation methods. These papers are concerned with usability, but there is no reason to believe that assessing methods for evaluating fun is going to be any easier. It is clear that there is no simple way to compare methods; there are too many criteria, and too many variables that might be relevant. The aim of a research programme comparing evaluation methods should be to produce and test guidelines for which methods are likely to work in particular circumstances, and for particular aspects of the construct being evaluated.

THE SOFTWARE
Three commercially available software products designed to assist with the teaching of Science were tested in these experiments. We refer to them here as S1, S2, and S3; screenshots from each appear in Figures 1, 2, and 3. Two of the applications (S1 and S3) disguised the assessment activities within a gaming context, whilst the other (S2) presented the material in a more formal and linear structure.

Figure 1: Screenshot from product S1
Figure 2: Screenshot from product S2
Figure 3: Screenshot from product S3

The experimental aims were to evaluate the three pieces of software to investigate the relationships between usability, fun and learning in educational software for children. This paper is concerned particularly with results relating to usability and fun; more detailed discussion of the learning effects and educational merits of the products has appeared elsewhere [12]. In this study we assessed 'observed usability' and 'reported usability' separately, and similarly, 'observed fun' and 'reported fun' were both measured.

HYPOTHESES
A number of hypotheses were drawn up prior to the research study concerning the relationship between usability, fun and learning. It was hypothesised that observed fun and observed usability would be correlated, as any usability problems may hinder the children's ability to use the software, affecting their level of fun. Similarly, we might expect a correlation between reported usability and reported fun. Further hypotheses were that learning would be correlated with both fun and usability. In edutainment there is an implied relationship between fun and learning, and any usability problems may impede the learning process. The relationship between fun and learning may be more complicated; it is likely that increased fun could lead to more learning (this is the theoretical justification for edutainment products), but it is also possible that too much fun could interfere with the learning process, and that the amount of learning being achieved might affect the user's enjoyment of the process. We were interested in both 'observed' and 'reported' usability and fun, and hypothesised that 'observed usability' might be correlated with 'reported usability' and that 'observed fun' might be correlated with 'reported fun'.

METHOD

Sample
The sample consisted of 25 children of both genders, aged between 7 years 4 months and 8 years 3 months, from a primary school (age range 4-11 years) in Lancashire, England. The whole age group from the school participated in the experiment; all parents gave consent for their children to take part. The sample covered the normal range of ability. Some of the children needed help with reading the questions and instructions. Not all of the children had English as their first language, but all spoke English fluently. All of the children had studied the National Curriculum for at least one year. They had completed the tests for these topics a few months earlier, so the subject matter of the software was largely familiar to them. They were about one year older than the age range for which the products were intended, which meant that they were reasonably familiar with the scientific content of the software. The data collection ran over four days, and we were very fortunate that all of the 25 children were present at school throughout, so that we obtained a full set of data from each of them.

Procedure
The experimental design was piloted with a small sample from a different local primary school, and as a consequence a number of questions in the pre- and post-tests were redesigned. The design was within-subjects, single factor, with three conditions: S1, S2 and S3 (the three software products). To determine the order in which children used the three applications, a 3 x 3 Latin Square was used. Product S2 presented a mixture of topics, but S1 and S3 both allowed users to choose the science topic. The experimental design ensured that each child saw a different topic on each of these two products ('Life Processes' on one and 'The Solar System' on the other) in order to minimise learning effects across the software products. These particular topics were chosen because, firstly, they were treated similarly on the two products, and, secondly, they are presented in the National Curriculum as the simplest and the hardest science topics in the Key Stage 1 curriculum. The experimental work was carried out at the school, in a room close to the children's classroom.

Three similar laptops were used to conduct the experiments; they were arranged in the corners of the room to minimise distractions. One researcher sat by each laptop and explained the tasks to the children. An assistant, whose job was to note the children's reactions and engagement with the tasks, accompanied each researcher. The children were withdrawn from their classroom in groups of two or three and were directed to the software. Each child came to the test as a volunteer and was given the opportunity to leave the research activity both before and during the work; none of them did so. All were keen to take part and seemed to enjoy the experience.

Prior to using the software, each child was shown its box and first screen, and was asked to indicate on a smileyometer (Fig. 4) [13] how good they thought the application was going to be. The rationale for this was that it gave a measure of expectation that could indicate whether or not the child was subsequently let down by the activity, or pleasantly surprised.

Figure 4: Smileyometer

Each child was then given a paper-based pre-test based on questions found within the software to establish their prior knowledge of the subject domain. Following this, the children were given instruction by one of the researchers outlining the tasks to be performed; in each case the task was to complete a revision exercise using the software. Where children finished the task quickly, they were allowed to try some other parts of the software. The tasks were chosen to be as comparable as possible across the three products.
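As a concrete illustration of the 3 x 3 Latin Square ordering of conditions described earlier in this section, the short sketch below assigns presentation orders to 25 participants. The assignment scheme, participant numbering and function name are illustrative assumptions, not taken from the study materials.

```python
from itertools import cycle

# A 3 x 3 Latin Square over the three products: each product appears exactly
# once in each row (presentation order) and once in each session position.
LATIN_SQUARE = [
    ["S1", "S2", "S3"],
    ["S2", "S3", "S1"],
    ["S3", "S1", "S2"],
]

def assign_orders(n_children):
    """Cycle through the Latin Square rows so that presentation orders are
    spread as evenly as possible across the children."""
    rows = cycle(LATIN_SQUARE)
    return {child: next(rows) for child in range(1, n_children + 1)}

if __name__ == "__main__":
    for child, order in assign_orders(25).items():
        print(f"child {child:2d}: {' -> '.join(order)}")
```

With 25 children and 3 orders the balance cannot be perfect (9, 8 and 8 children per order), but each product is still seen equally often in each session position within each complete cycle.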
The children were given approximately 10 minutes to use the software, after which a post-test was administered to establish any learning effect. They were then asked to rate the software using a second smileyometer to give a rating for their 'actual' experience. For each of the activities, the researchers and assistants recorded usability problems, facial gestures and comments to establish the level of fun. Over the course of three days every child used each of the three applications once.

A week later, the researchers returned to the school to ask the children to assess their experiences with the three pieces of software. A 'fun sorter' methodology [13] was used for this final evaluation. The fun sorter questionnaire (see Fig. 5) required the children to rank the three products in order of preference on three separate criteria: fun, ease of use, and how good they were for learning. The method used here was to give each child a form with three spaces for each question, and some 'stickers' with pictures of the products. They were asked to rank the three products by sticking the three stickers into the spaces on the form in the appropriate order. Additionally, they were asked to specify which of the three products they would choose, and which one they thought the class teacher would choose. Due to the constraints of the school timetable, this activity was done in the classroom, where it was impractical to prevent the children from comparing notes. Some will inevitably have been influenced by peer pressure in their choices. Fig. 5 shows an example of a completed sheet.

Figure 5: A completed ranking questionnaire

RESULTS

Usability and Fun
During the experiments each child was observed by two observers: the researcher, who concentrated on noting usability issues, and an assistant, who concentrated on observing the child and noting indicators of enjoyment and engagement, such as comments, smiles, laughter, or positive body language, as well as signs of lack of enjoyment and frustration, including, for example, sighs and looking around the room. We considered using checklists for the observers, but decided to use free-form note-taking, since pilot testing indicated that there would be a wide range of responses and issues.

Scoring was done simply by counting the positive issues noted regarding usability and subtracting the number of negative issues; a similar calculation was used to get a fun score for each child on each product. It is clear that these scores are very approximate; there was variability between the observers in what they wrote down, and interpreting the notes and deciding which of them were issues was a subjective process. It should be noted that the researchers and assistants were rotated throughout the experiments in order to reduce observer effects as far as possible. Sometimes the researchers wrote comments about fun, or the assistants wrote about usability; issues were counted in the same way, whoever had noted them. Comments duplicated by both observers were counted only once.

There is clearly a complex relationship between observed fun and usability. We had hypothesised that observed fun and observed usability would be correlated, and they were. The correlation is not strong (Spearman's rho = 0.239) but it is statistically significant (p = 0.039). However, neither observed fun nor observed usability correlated significantly with learning.
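To make this scoring and correlation analysis concrete, the sketch below computes an observation score as the number of positive issues minus the number of negative issues, and then Spearman's rho between the usability and fun scores. The tallies and variable names are invented placeholders, not the study's data.

```python
from scipy.stats import spearmanr

# Hypothetical observation tallies for five child/product sessions:
# (positive usability notes, negative usability notes,
#  positive fun notes, negative fun notes). Values are invented.
sessions = [
    (2, 1, 3, 0),
    (0, 3, 1, 2),
    (1, 0, 2, 1),
    (3, 2, 4, 1),
    (1, 4, 0, 3),
]

# Score for each session = number of positive issues minus number of negatives.
usability_scores = [u_pos - u_neg for u_pos, u_neg, _, _ in sessions]
fun_scores = [f_pos - f_neg for _, _, f_pos, f_neg in sessions]

# Spearman's rank correlation between observed usability and observed fun.
rho, p_value = spearmanr(usability_scores, fun_scores)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
```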
Specific observed usability issues with the software
By analysing the qualitative data that had been recorded, all three pieces of software were found to have usability problems. Here are some specific issues that were problematic for a number of children independently.

In S1 a question was displayed at the bottom of the screen, with four answers positioned to the right (Fig. 1). Each answer had a coloured box around it and there was a corresponding coloured alien for each answer. These aliens popped in and out of craters in random locations, and the object of the game was to feed a burger to the correct coloured alien. The aliens were the same colour as the boxes around the answers. The burger acted as a timer and appeared to be gradually eaten over a period of time; when the burger disappeared, a life was lost. Once a child had lost three lives, their game was over. The game could be paused, which would have allowed additional time to read the question; however, none of the children appeared to notice the message informing them how to pause the game, and none of them actually paused it. Another observation was that 13 of the 25 children appeared to be clicking the screen randomly in their early attempts to play this game, and 16 of the children needed an explanation of how to play; following the explanation they seemed to grasp the concept of the game. The game play itself was too difficult for 10 of the children, as the aliens moved too quickly, not allowing enough time to answer the question. The game did incorporate a certain level of feedback; for example, when they answered incorrectly the correct answer would be subtly highlighted, but only a few of the children appeared to notice this. In summary, the major usability issues with this game were that the procedure for answering the questions was too complex, and that answering the questions required levels of hand-eye coordination and mouse dexterity that not all of the children had. It should be remembered that the subjects here were older than the target audience for the software.

In S2 the children were given audio instructions to enter their name and then press enter (Fig. 6). They encountered difficulties from the start, as 15 of them needed assistance after typing their name because they did not know which was the enter key. Our laptops did not have a key marked 'enter'; the designers ought to have anticipated such problems and given more generally applicable instructions.

Figure 6: S2 first screen

The children were then given verbal instruction to select the 'revise' section. Once in this section, the software provided audio instructions telling them to answer the question and press the tick button, which was located at the bottom right-hand corner of the screen (see Fig. 2). The software provided an audio and visual representation of the question, and each question appeared individually on the screen. Despite the audio instructions, it was noted that 10 of the children did not press the tick button after answering the question. The software allowed them to progress to the next question without comment, even though they had not recorded their answer to the previous one.

In the final software product (S3) the children had to select the appropriate topic from the list available, and then choose the 'revise' option. They were then required to enter their name and age before beginning the questions.
Similar problems with vocabulary were encountered in this software as with S2, in that the children did not know where the 'enter' key was located; this caused difficulty for 18 of the children. The questions appeared at the top of the screen with the four possible answers in boxes underneath (see Fig. 3). A timer limited the amount of time the children could play the game, and they were allowed to get only two questions wrong before they were returned to the main menu. Three "question busters" (mechanisms to make the questions easier, such as a hint) were provided as animated icons in the top left-hand corner of the screen. If the child answered a question incorrectly they would receive audio instruction to use one of the three question busters; however, none of the children actually used this feature, they would simply select another answer. It was apparent that the question buster feature had not been explained in a way that the users could understand. After answering a few questions the children were rewarded with a brief arcade-type game. Most of the children found these games too difficult, and failed within a few seconds, mostly returning a score of zero. Nevertheless, the children liked these games even when they were completely unsuccessful in them, and this product did best in their assessment of fun.

Evaluation by children
After all of the testing was finished, the children recorded their preferences for fun, learning, ease of use, which product the teacher would choose, and which one they would choose. This was done using the 'fun sorter' method [13], described in the 'Method' section above (see Fig. 5). All of the children did this ranking successfully after a brief explanation. Most of them ranked the products differently for each criterion, indicating that they were able to distinguish between the criteria. Much of the thought that went into the ranking was done out loud, enabling us to confirm informally that they were indeed understanding the different criteria.

The children's own reports of how much fun the products were to use, and of how usable they were, were positively correlated (Spearman's rho = 0.350, p = 0.002), just as they were for the observations of fun and usability. The children were asked which software they would choose for themselves: 13 of the children chose S3, 10 chose S1, and just 2 chose S2. There was a very strong correlation between the software that the children thought was the most fun and the one that they would choose (rho = 0.450, p < 0.0005). This shows that fun is a major criterion in the children's assessment of whether they want to use a product, which is no surprise. Less predictably, there were also significant positive correlations between whether they would choose the software for themselves and their rankings for ease of use (rho = 0.294, p = 0.010) and (encouragingly for teachers and parents!) for how good they thought it was for learning (rho = 0.234, p = 0.043).
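As an illustration of how analyses of this kind could be reproduced from fun-sorter data, the sketch below pairs up hypothetical rankings child by child and product by product, computes Spearman's rho between two criteria, and tallies the children's choices. The data, the child identifiers and the pairing scheme are assumptions made for illustration; the paper does not spell out the exact coding that was used.

```python
from collections import Counter
from scipy.stats import spearmanr

# Hypothetical fun-sorter data for a handful of children: each child ranks the
# three products (1 = best, 3 = worst) for 'fun' and 'ease of use', and names
# the product they would choose for themselves. All values are invented.
children = {
    "child_01": {"fun": {"S1": 2, "S2": 3, "S3": 1},
                 "ease": {"S1": 3, "S2": 2, "S3": 1}, "choice": "S3"},
    "child_02": {"fun": {"S1": 1, "S2": 3, "S3": 2},
                 "ease": {"S1": 1, "S2": 2, "S3": 3}, "choice": "S1"},
    "child_03": {"fun": {"S1": 3, "S2": 2, "S3": 1},
                 "ease": {"S1": 2, "S2": 3, "S3": 1}, "choice": "S3"},
    "child_04": {"fun": {"S1": 2, "S2": 3, "S3": 1},
                 "ease": {"S1": 1, "S2": 3, "S3": 2}, "choice": "S3"},
}

products = ("S1", "S2", "S3")

# One paired observation per child/product combination, then Spearman's rho
# between the fun rankings and the ease-of-use rankings.
fun_values = [children[c]["fun"][p] for c in children for p in products]
ease_values = [children[c]["ease"][p] for c in children for p in products]
rho, p_value = spearmanr(fun_values, ease_values)
print(f"fun vs ease of use: rho = {rho:.3f}, p = {p_value:.3f}")

# Tally which product each child said they would choose for themselves.
print(Counter(child["choice"] for child in children.values()))
```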
One plausible explanation for why all of these factors (fun, learning, ease of use) are correlated with the children's choice would be that they had failed to distinguish between the constructs, but further examination of the data reveals no significant correlations between their rankings for ease of use and learning, or between fun and learning. Hence we conclude that the children could distinguish between the constructs, and that all three of them are important to children in their preferences for software.

There was a negative (non-significant) correlation between the software they perceived to be the most fun and the one that they thought their teacher would choose. There was a significant positive correlation (rho = 0.269, p = 0.020) between the children's assessment of how good a product was for learning, and whether they thought that a teacher would select it. A possible explanation of these results is that children do not see a product that is fun as being 'suitable' for classroom use.

Children were asked at the end of each test to complete a smileyometer to indicate how good their experience had been. It would be nice to be able to validate some of the above findings by comparing them with the smileyometer results collected during the experiments. However, there is little correlation between these scores and the fun ratings. We suspect that the problem is in the smileyometer results rather than the ranking results, for two possible reasons. The first is that most children answered the questions by selecting the 'brilliant' option, so that there is very little variation in the answers; this tendency for over-enthusiasm when young children use such measures has been noted before [13]. The second reason is that the question was poorly chosen; we asked how good using the software was rather than asking how much fun they had had. Hence we were probably not measuring the same construct in the smileyometer question as in the ranking questionnaire.

CONCLUSIONS
All three software products evaluated had significant usability problems that were obvious even in a brief study such as this. Our observations showed that the children appeared to have less fun when their interactions had more usability problems. Their own assessments of the products for fun and usability were similarly correlated, and both usability and fun appear to be factors influencing whether they would choose a product for themselves. The conclusion is that usability does matter to children, so getting it right should be more of a priority for designers and manufacturers than it appears to be currently.

The results also highlight the fact that the children's preference is for fun in software, which is no surprise. They clearly identified the software which presented the questions in a more formal, linear manner, and which had no games, as the least fun.

An important finding was that children as young as 7-8 were able to distinguish between concepts such as usability, fun, and potential for learning. We asked them to rank the products separately on each of these criteria; they were able to do this after a very brief explanation, and their answers showed that they were differentiating between the concepts in a consistent way. It seems that ranking methods like this are a useful way of getting opinions about interactive products from children; this method was more successful here than a method where they were asked to score each product separately.

We measured 'observed' usability and fun, and 'reported' usability and fun. It was interesting that correlations were found, as hypothesised, between observed usability and observed fun, and between reported usability and reported fun, but no correlations were found between observed usability and reported usability, or between observed fun and reported fun.
We conclude that both usability and fun are complex constructs, and that methods based on observation assess different subsets of these constructs from methods based on users' reports. Hence attempts to validate evaluation methods by comparing their findings with those of different evaluation methods may be doomed to failure unless care is taken to ensure that the methods are indeed assessing the same construct.

FUTURE WORK
We have begun a series of heuristic evaluations of these pieces of software, using heuristics for usability, for fun, and for educational design. These evaluations are being conducted independently of the tests. It will be interesting to find out whether there is again a correlation between the findings for fun and for usability.

There is scope for refinement of the 'fun sorter' ranking method [13], but it appears to be a promising evaluation tool for use with young children, and not only for assessing fun. We are also planning a range of further investigations of evaluation methods for children's interactive products, for both usability and fun. These will include investigations of the components of the usability and fun constructs that are particularly critical for children's products, and experiments involving children as evaluators, rather than as evaluation subjects.

ACKNOWLEDGEMENTS
We would like to acknowledge the invaluable assistance of the children and teachers of English Martyrs School, Preston. Thanks are due also to Emanuela Mazzone, Janet Read and the MSc and PhD students who assisted us in the data collection and experimental design.

REFERENCES
1. Department for Education and Employment, Connecting the Learning Society: National Grid for Learning. 1997, Department for Education and Employment: London.
2. ISO, Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) -- Part 11: Guidance on Usability. 1998, ISO 9241-11.
3. Anajaneyulu, K.S.R., R.A. Singer, and A. Harding, Usability Studies of a Remedial Multimedia System. Journal of Educational Multimedia and Hypermedia, 1998. 7(2/3): p. 207-236.
4. Carroll, J.M., Beyond Fun. Interactions, 2004. 11(5): p. 38-40.
5. Draper, S.W., Analysing Fun as a Candidate Software Requirement. Personal Technology, 1999. 3: p. 117-122.
6. Alessi, S.M. and S.R. Trollip, Multimedia for Learning: Methods and Development. 3rd ed. 2001, Massachusetts: Allyn & Bacon.
7. Okan, Z., Edutainment: Is Learning at Risk? British Journal of Educational Technology, 2003. 34(3): p. 255-264.
8. Nielsen, J., Heuristic Evaluation, in Usability Inspection Methods, J. Nielsen and R.L. Mack, Editors. 1994, John Wiley: New York.
9. Polson, P., et al., Cognitive Walkthroughs: A Method for Theory-Based Evaluation of User Interfaces. International Journal of Man-Machine Studies, 1992. 36: p. 741-773.
10. Markopoulos, P. and M. Bekker, On the Assessment of Usability Testing Methods for Children. Interacting with Computers, 2003. 15: p. 227-243.
11. Gray, W.D. and M.C. Salzman, Damaged Merchandise? A Review of Experiments that Compare Usability Evaluation Methods. Human-Computer Interaction, 1998. 13(3): p. 203-261.
12. Sim, G.R., S.J. MacFarlane, and J.C. Read, Measuring the Effects of Fun on Learning in Software for Children. In CAL2005, 2005, Bristol, UK.
13. Read, J.C., S.J. MacFarlane, and C. Casey, Endurability, Engagement and Expectations: Measuring Children's Fun. In Interaction Design and Children, 2002, Eindhoven: Shaker Publishing.