Assessing Usability and Fun in Educational Software
Stuart MacFarlane
Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom
+44(0)1772 893291
[email protected]

Gavin Sim
Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom
+44(0)1772 895162
[email protected]

Matthew Horton
Child-Computer Interaction Group, Department of Computing, University of Central Lancashire, Preston, Lancashire, PR1 2HE, United Kingdom
+44(0)1772 895151
[email protected]
ABSTRACT
We describe an investigation into the relationship between
usability and fun in educational software designed for
children. Twenty-five children aged between 7 and 8
participated in the study. Several evaluation methods were
used; some collected data from observers, and others
collected reports from the users. Analysis showed that in
both observational data, and user reports, ratings for fun
and usability were correlated, but that there was no
significant correlation between the observed data and the
reported data. We discuss the possible reasons for these
findings, and describe a method that was successful in
eliciting opinions from young children about fun and
usability.
Keywords
Children, Usability, Fun, Evaluation, Educational
Software, Computer-Assisted Learning.
INTRODUCTION
Educational Multimedia & SATs
In England targets have been established for school leavers
to be accredited in ICT (Information and Communication
Technology), and all schools were due to be connected to
the Internet by 2002 [1]. Children from the age of 5 are
developing basic ICT skills as a consequence of these
policies, and a wide range of study material in digital
format has emerged to support their learning in all subject
domains. Digital content, in the form of multimedia
applications, can offer greater opportunities to engage the
children in learning environments, through interactive
games, targeted immediate feedback, and the use of sensory modalities in presenting the content.
One area in which multimedia software has emerged in England is in supporting children in their preparation for the SAT (Standard Attainment Task) tests at all levels of the curriculum. In England, SAT tests are used at the end of Key Stage 1 (age 7), Key Stage 2 (age 11) and Key Stage 3 (age 14) as a means of measuring the progress and attainment of children in the National Curriculum. These tests are seen by many parents as an important indicator of achievement. This paper examines three software packages designed to assist children in their preparation for the Key Stage 1 SAT tests in Science, exploring issues relating in particular to usability, fun and learning.

Usability and Fun
Usability is an important factor in establishing whether educational software will facilitate the acquisition of knowledge. ISO 9241-11 [2] defines usability as the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use. If users perceive that a system is very difficult to use, this perception may influence their ability to absorb material provided by the system [3].

Fun is harder to define. Carroll [4] (pp. 38-39) summarises it well: “Things are fun when they attract, capture, and hold our attention by provoking new or unusual emotions in contexts that typically arouse none, or arousing emotions not typically aroused in a given context.” Perhaps it needs to be added that the emotions should be happy ones. He isn’t writing about children here, and children may not experience fun in exactly the same way. Draper [5] suggests that fun is associated with playing for pleasure, and that activities should be done for their own sake through freedom of choice. In other words, fun is not goal-related.

It should be noted that ‘fun’ is not the same as ‘satisfaction’ in the definition of usability above. Satisfaction involves progress towards goals, while fun doesn’t.

Carroll [4] suggests that the concept of usability should be extended to include fun, but here we are using the ISO
definition of usability, and treating fun as a completely
separate (though not entirely independent) construct.
An objective of educational software for children is to
provide an engaging learning environment, keeping
children’s attention by providing fun [6]. This is usually
achieved through games. However, one danger of adopting
computer technology into education is that learning is
devalued by being seen as fun and entertainment [7].
THE SOFTWARE
Three commercially available software products designed
to assist with the teaching of Science were tested in these
experiments. We refer to them here as S1, S2, and S3;
screen shots from each appear in Figures 1, 2, and 3. Two
of the software applications (S1 and S3) disguised the
assessment activities within a gaming context whilst the
other (S2) presented the material in a more formal and
linear structure.
Assessing usability and fun
It should be noted that there are several approaches to
measuring usability or fun as part of a user study; one is to
observe what happens, noting evidence of usability or fun
as it occurs during the interaction, and another is to ask the
users for their own assessments of the usability or fun in
the interaction. The first of these is analogous to a doctor
diagnosing patients’ medical conditions by observing
signs that indicate particular medical conditions; the second is analogous to the doctor diagnosing by asking the
patients to describe their symptoms. Signs and symptoms
are not always measures of the same things; consequently,
in medicine, the best diagnosis is obtained by considering
both signs and symptoms. In the same way, in a user
study, measures of usability or fun taken from
observations are likely to be measuring different things to
measures taken from users’ own reports of usability and
fun.
A third approach is to assess usability or fun using ‘expert’
methods such as Heuristic Evaluation [8] or Cognitive
Walkthrough [9]. These methods are analogous to medical
diagnosis based on laboratory analysis of samples, which
are carried out in the absence of the patient. The results
can also be used in diagnosis. In medicine, signs,
symptoms, and pathology test results all provide
overlapping information about the state of health of the
patient, though none of them, in general, provides all of
the salient information on its own. ‘Health’ is a complex
and ill-defined construct, and an assessment of it is best
made using a variety of methods to obtain a number of
measures. Similarly, ‘usability’ and ‘fun’ are complex
constructs, which can be assessed in a number of ways.
Survey methods such as interviews and questionnaires
collect ‘symptoms’, observation methods collect ‘signs’,
and expert methods collect ‘pathology results’.
Markopoulos and Bekker [10] discuss how evaluation
methods for interactive products for children can be
compared, and Gray and Salzman [11] criticise a number of
earlier comparative studies of evaluation methods. These
papers are concerned with usability, but there is no reason
to believe that assessing methods for evaluating fun is
going to be any easier. It is clear that there is no simple
way to compare methods; there are too many criteria, and
too many variables that might be relevant. The aim of a
research programme comparing evaluation methods should
be to produce and test guidelines for which methods are
likely to work in particular circumstances, and for
particular aspects of the construct being evaluated.
Figure 1: Screenshot from product S1
Figure 2: Screenshot from product S2
The experimental aims were to evaluate the three pieces of software to investigate the relationships between usability, fun and learning in educational software for children. This paper is concerned particularly with results relating to usability and fun; more detailed discussion of the learning effects and educational merits of the products has appeared elsewhere [12].

In this study we assessed ‘observed usability’ and ‘reported usability’ separately, and similarly, ‘observed fun’ and ‘reported fun’ were both measured.

HYPOTHESES
A number of hypotheses were drawn up prior to the research study concerning the relationship between usability, fun and learning. It was hypothesised that observed fun and observed usability would be correlated, as any usability problems may hinder the children’s ability to use the software, affecting their level of fun. Similarly, we might expect a correlation between reported usability and reported fun. Further hypotheses were that learning would be correlated with both fun and usability. In edutainment there is an implied relationship between fun and learning, and any usability problems may impede the learning process. The relationship between fun and learning may be more complicated; it is likely that increased fun could lead to more learning (this is the theoretical justification for edutainment products), but it is also possible that too much fun could interfere with the learning process, and also that the amount of learning being achieved by the user might affect their enjoyment of the process.

We were interested in both ‘observed’ and ‘reported’ usability and fun, and hypothesised that ‘observed usability’ might be correlated with ‘reported usability’ and that ‘observed fun’ might be correlated with ‘reported fun’.

METHOD
Sample
The sample consisted of 25 children of both genders, aged between 7 years 4 months and 8 years 3 months, from a Primary School (age range 4-11 years) in Lancashire, England. The whole age group from the school participated in the experiment; all parents gave consent for their children to take part. The sample covered the normal range of ability. Some of the children needed help with reading the questions and instructions. Not all of the children had English as their first language, but all spoke English fluently. All of the children had studied the National Curriculum for at least one year. They had completed the tests for these topics a few months earlier, so the subject matter of the software was largely familiar to them. They were about one year older than the age range for which the products were intended, which meant that they were reasonably familiar with the scientific content of the software. The data collection ran over four days, and we were very fortunate that all of the 25 children were present at school throughout, so that we obtained a full set of data from each of them.
Figure 3: Screenshot from product S3
Procedure
The experimental design was piloted with a small sample
from a different local primary school, and as a consequence
a number of questions in the pre- and post-tests were
redesigned. The study used a single-factor, within-subjects design
with three conditions: S1, S2 and S3 (the three software
products). To determine the order in which children used
the three applications, a 3 x 3 Latin Square was used.
Product S2 presented a mixture of topics, but S1 and S3
both allowed users to choose the science topic. The
experimental design ensured that each child saw different
topics – ‘Life Processes’ on one, and ‘The Solar System’
on the other – in order to minimise learning effects across
the software products. These particular topics were chosen
because, firstly, they were treated similarly on the two
products, and, secondly, they are presented in the National
Curriculum as the simplest and the hardest science topics
in the Key Stage 1 curriculum.
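To make the counterbalancing concrete, the sketch below shows one way the order of conditions and the topic allocation could be generated from a 3 x 3 Latin square. It is a minimal illustration only: the particular square, the child identifiers and the function names are assumptions, not a record of the allocation actually used in the study.

```python
from itertools import cycle

# A 3 x 3 Latin square over the three products: each product appears
# exactly once in each row (presentation order) and each column (session position).
LATIN_SQUARE = [
    ("S1", "S2", "S3"),
    ("S2", "S3", "S1"),
    ("S3", "S1", "S2"),
]

def assign_orders(child_ids):
    """Cycle the children through the three row orders of the square."""
    rows = cycle(LATIN_SQUARE)
    return {child: next(rows) for child in child_ids}

def assign_topics(order):
    """Alternate the two topics between the game-based products (S1 and S3);
    S2 presents a mixture of topics itself."""
    topics = iter(["Life Processes", "The Solar System"])
    return {product: ("mixed" if product == "S2" else next(topics))
            for product in order}

if __name__ == "__main__":
    orders = assign_orders([f"child_{i:02d}" for i in range(1, 26)])
    first = orders["child_01"]
    print(first, assign_topics(first))
```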
The experimental work was carried out at the school, in a
room close to the children’s classroom. Three similar
laptops were used to conduct the experiments; they were
arranged in the corners of the room to minimise
distractions. One researcher sat by each laptop, and
explained the tasks to the children. An assistant, whose
job was to note the children’s reactions and engagement
with the tasks, accompanied each researcher. The children
were withdrawn from their classroom in groups of two or
three and were directed to the software. Each child came to
the test as a volunteer and was given the opportunity to
leave the research activity both before and during the work;
none of them did so. All were keen to take part and
seemed to enjoy the experience.
Prior to using the software each child was shown its box
and first screen, and was asked to indicate on a
smileyometer (Fig. 4) [13] how good they thought the
application was going to be. The rationale for this was
that this gave a measure of expectation that could indicate
whether or not the child was subsequently let down by the
activity, or pleasantly surprised.
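A minimal sketch of how the smileyometer responses could be coded for this comparison follows; the 1-5 coding of the faces and the function name are illustrative assumptions rather than a description of the exact coding scheme used in the study.

```python
# Hypothetical 1-5 coding of the five smileyometer faces
# (1 = awful ... 5 = brilliant).
SMILEY_CODES = {"awful": 1, "not very good": 2, "good": 3,
                "really good": 4, "brilliant": 5}

def expectation_gap(expected_face, actual_face):
    """Positive value = pleasantly surprised, negative = let down."""
    return SMILEY_CODES[actual_face] - SMILEY_CODES[expected_face]

# Example: a child expected the product to be 'really good'
# but rated the actual experience 'brilliant'.
print(expectation_gap("really good", "brilliant"))  # -> 1
```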
Figure 4: Smileyometer
Each child was then given a paper based pre-test based on
questions found within the software to establish their prior
knowledge of the subject domain. Following this the
children were given instruction by one of the two
researchers outlining the tasks to be performed; in each case the task was to complete a revision exercise using the
software. Where children finished the task quickly, they
were allowed to try some other parts of the software. The
tasks were chosen to be as comparable as possible across
the three products. The children were given approximately
10 minutes to use the software, after which a post-test was
administered to establish any learning effect. They were
then asked to rate the software using a second
smileyometer to give a rating for ‘actual’ experience. For
each of the activities the researchers and assistants recorded
usability problems, facial gestures and comments to
establish the level of fun. Over the course of three days
every child used each of the three applications once.
A week later, the researchers returned to the school to ask
the children to assess their experiences with the three
pieces of software. A ‘fun sorter’ methodology [13] was
used for this final evaluation. The fun sorter questionnaire
(see Fig. 5) required the children to rank the three products
in order of preference on three separate criteria: fun, ease of
use, and how good they were for learning. The method
used here was to give each child a form with three spaces
for each question, and some ‘stickers’ with pictures of the
products. They were asked to rank the three products by
sticking the three stickers into the spaces on the form in
the appropriate order. Additionally they were asked to
specify which of the three products they would choose, and
which one they thought the class teacher would choose.
Due to the constraints of the school timetable, this activity
was done in the classroom, where it was impractical to
prevent the children from comparing notes. Some will
inevitably have been influenced by peer pressure in their
choices. Fig. 5 shows an example of a completed sheet.
Figure 5: A completed ranking questionnaire
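The sketch below shows one possible way of recording a completed fun sorter form for later analysis; the data structure, the rank coding (1 = best) and the child identifier are illustrative assumptions, not the format actually used.

```python
# One child's completed fun sorter form, recorded as ranks
# (1 = best, 3 = worst) for each criterion, plus the two choice questions.
fun_sorter_record = {
    "child": "child_07",                      # hypothetical identifier
    "fun":         {"S1": 2, "S2": 3, "S3": 1},
    "ease_of_use": {"S1": 1, "S2": 3, "S3": 2},
    "learning":    {"S1": 2, "S2": 1, "S3": 3},
    "own_choice": "S3",
    "teacher_choice": "S2",
}

def ranks_for(records, criterion, product):
    """Collect one product's ranks on one criterion across all children,
    ready for rank-correlation analysis."""
    return [record[criterion][product] for record in records]

print(ranks_for([fun_sorter_record], "fun", "S3"))  # -> [1]
```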
RESULTS
Usability and Fun
During the experiments each child was observed by two observers: the researcher, who concentrated on noting usability issues, and an assistant, who concentrated on observing the child and noting indicators of enjoyment and engagement (such as comments, smiles, laughter, or positive body language) as well as signs of lack of enjoyment and frustration (such as sighs and looking around the room). We considered using
checklists for the observers, but decided to use free-form
note-taking, since pilot testing indicated that there would
be a wide range of responses and issues. Scoring was done
simply by counting positive issues noted regarding
usability, and subtracting the number of negative issues; a
similar algorithm was used to get a fun score for each child
on each product. It is clear that these scores are very
approximate; there was variability between the observers in
what they wrote down, and interpreting the notes and
deciding which of them were issues was a subjective
process. It should be noted that the researchers and
assistants were rotated throughout the experiments in order
to reduce observer effects as far as possible.
Sometimes, the researchers wrote comments about fun, or
the assistants wrote about usability; issues were counted
up in the same way, whoever had noted them. Duplicated
comments made by both observers were counted only
once.
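As a rough illustration of the scoring just described, the sketch below counts positive and negative issues per child and product, counting comments noted by both observers only once. The note format, field names and example notes are assumptions made for the sketch, not the study's actual coding scheme.

```python
# Each observation note: (child, product, construct, polarity, text),
# e.g. ("child_03", "S1", "fun", "+", "laughed at the alien").
def observed_score(notes, child, product, construct):
    """Positive issues minus negative issues for one child on one product,
    counting comments duplicated by the two observers only once."""
    seen = set()
    score = 0
    for c, p, k, polarity, text in notes:
        if (c, p, k) != (child, product, construct):
            continue
        key = text.strip().lower()
        if key in seen:          # duplicate note from the other observer
            continue
        seen.add(key)
        score += 1 if polarity == "+" else -1
    return score

notes = [
    ("child_03", "S1", "fun", "+", "laughed at the alien"),
    ("child_03", "S1", "fun", "+", "Laughed at the alien"),  # duplicate
    ("child_03", "S1", "fun", "-", "sighed at the burger timer"),
]
print(observed_score(notes, "child_03", "S1", "fun"))  # -> 0
```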
It is clear that there is a complex relationship between
observed fun and usability; we had hypothesised that
observed fun and observed usability would be correlated,
and they were. The correlation is not strong (Spearman’s
rho = 0.239) but it is statistically significant (p=0.039).
However, neither observed fun nor observed usability
correlated significantly with learning.
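A minimal sketch of the kind of rank-correlation computation reported here is shown below, using scipy; the score lists are invented placeholders for illustration, not the study's data.

```python
from scipy.stats import spearmanr

# Invented placeholder scores, one value per child-product interaction;
# the study itself had 25 children x 3 products = 75 such pairings.
observed_usability = [3, 1, -2, 0, 4, 2, -1, 1, 0, 3]
observed_fun       = [2, 0, -1, 1, 3, 2,  0, 1, -1, 2]

rho, p_value = spearmanr(observed_usability, observed_fun)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
```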
Specific observed usability issues with the software
By analysing the qualitative data that had been recorded,
all three pieces of software were found to have usability
problems. Here are some specific issues that were
problematic for a number of children independently.
In S1 a question was displayed at the bottom of the screen,
with four answers positioned to the right (Fig. 1). Each
answer had a coloured box around it and there was a
corresponding coloured alien for each answer. These aliens
popped in and out of craters in random locations and the
object of the game was to feed a burger to the correct
coloured alien. The aliens were the same colour as the
boxes around the answers. The burger acted as a timer and appeared to be gradually eaten over a period of time; when the burger disappeared, a life was lost. Once a child had lost three lives, the game was over. The game could be paused, which would have allowed additional time to read
the question; however, none of the children appeared to
notice the message informing them how to pause the
game, and none of them actually paused it. Another
observation recorded was that 13 of the 25 children
appeared to be randomly clicking the screen in their early
attempts to play this game, and 16 of the children needed
an explanation on how to play the game. Following the
explanation they seemed to grasp the concept of the game.
The game play itself was too difficult for 10 of the
children, as the aliens moved too quickly, not allowing
enough time to answer the question. The game did incorporate a certain level of feedback; for example, when a child answered incorrectly, the correct answer was subtly highlighted, but only a few of the children appeared to notice this. In summary, the major usability issues with
this game were that the procedure of answering the
questions was too complex, and that answering the
questions required levels of hand-eye coordination and
mouse dexterity that not all of the children had. It should
be remembered that the subjects here were older than the
target audience for the software.
In S2 the children were given audio instructions to enter
their name and then press enter (Fig. 6). They encountered
difficulties from the start: 15 of them needed assistance after typing their name because they did not know which was the enter key. Our laptops did not have a key marked
‘enter’; the designers ought to have anticipated such
problems and given more generally applicable instructions.
Figure 6: S2 first screen
The children were then given verbal instruction to select
the ‘revise’ section. Once in this section, the software
provided audio instructions telling them to answer the
question and press the tick button, which was located at
the bottom right-hand corner of the screen (see Fig. 2). The
software provided an audio and visual representation of the
question and each question appeared individually on the
screen. Despite the audio instructions, it was noted that
10 of the children did not press the tick button after
answering the question. The software allowed them to
progress to the next question without comment, even
though they had not recorded their answer to the previous
one.
In the final software product (S3) the children had to select
the appropriate topic from the list available, and then
choose the ‘revise’ option. They were then required to
enter their name and age before beginning the questions.
Similar problems with the vocabulary used were encountered in this software as with S2, in that the
children did not know where the ‘enter’ key was located;
this caused difficulty for 18 of the children. The questions
appeared at the top of the screen with the four possible
answers in boxes underneath (see Fig. 3). A timer limited
the amount of time the children could play the game, and
they were allowed only to get two questions wrong before
they were returned to the main menu. Three “question
busters” (mechanisms to make the questions easier, such as
a hint) were provided as animated icons in the top left
hand corner of the screen. If the child answered a question
incorrectly, they received an audio instruction to use one of the three question busters; however, none of the children actually used this feature; they simply selected another answer. It was apparent that the question buster feature
had not been explained in a way that the users could
understand. After answering a few questions the children
were rewarded with a brief arcade-type game. Most of the
children found the games too difficult, and failed within a
few seconds, mostly returning a score of zero.
Nevertheless, the children liked these games even when
they were completely unsuccessful in them, and this
product did best in their assessment of fun.
Evaluation by children
After all of the testing was finished, the children recorded their preferences for fun, learning and ease of use, and indicated which product they thought the teacher would choose and which one they would choose themselves.
This was done using the ‘fun sorter’ method [13],
described in the ‘Method’ section above (see Fig. 5).
All of the children did this ranking successfully after a
brief explanation. Most of them ranked the products
differently for each criterion, indicating that they were able
to distinguish between the criteria. Much of the thought
that went into the ranking was done out loud, enabling us
to confirm informally that they were indeed understanding
the different criteria.
The children’s own reports of how much fun the products
were to use, and of how usable they were, were positively
correlated (Spearman’s rho = 0.350, p = 0.002), just as they were for observations of fun and usability.
The children were asked which software they would choose
for themselves: 13 of the children chose S3, 10 chose S1 and
just 2 chose S2. There was a very strong correlation
between the software that the children thought was the
most fun and the one that they would choose (rho = 0.450,
p < 0.0005). This shows that fun is a major criterion in the
children’s assessment of whether they want to use a
product; this is no surprise. Less predictably, there were
also significant positive correlations between whether they
would choose the software for themselves and their
rankings for ease of use (rho = 0.294, p = 0.010) and
(encouragingly for teachers and parents!) for how good they
thought it was for learning (rho = 0.234, p = 0.043). One
plausible explanation for why all of these factors (fun,
learning, ease of use) are correlated with the children’s
choice would be that they had failed to distinguish
between the constructs, but further examination of the data
reveals no significant correlations between their rankings
for ease of use and learning, or between fun and learning.
Hence we conclude that the children could distinguish
between the constructs, and that all three of them are
important to children in their preferences for software.
There was a negative correlation (non-significant) between
the software they perceived to be the most fun and the one
that they thought their teacher would choose. There was a
significant positive correlation (rho = 0.269, p = 0.020)
between the children’s assessment of how good a product
was for learning, and whether they thought that a teacher
would select it. A possible explanation of these results is
that children do not see a product that is fun as being
‘suitable’ for classroom use.
Children were asked at the end of each test to complete a
smileyometer to indicate how good their experiences had
been. It would be nice to be able to validate some of the
above findings by comparing them with the smileyometer
results collected during the experiments. However, there is
little correlation between these scores and the fun ratings.
We suspect that the problem is in the smileyometer results
rather than the ranking results. There are two possible
reasons for this. The first is that most children answered
the questions by selecting the ‘brilliant’ option, so that there is very little variation in the answers; this tendency for over-enthusiasm when young children use such measures has been noted before [13]. The second reason is that the way the question was asked was poorly chosen; we asked how good using the software was rather than asking how much fun they had had. Hence we were probably not measuring the same construct in the smileyometer question as in the ranking questionnaire.
CONCLUSIONS
All three software products evaluated had significant
usability problems that were obvious even in a brief study
such as this. Our observations showed that the children
appeared to have less fun when their interactions had more
usability problems. Also, their own assessments of the
products for fun and usability were similarly correlated,
and both usability and fun appear to be factors
influencing whether they would choose a product for
themselves. The conclusion is that usability does matter
to children; so getting it right should be more of a priority
for designers and manufacturers than it appears to be
currently.
The results also highlight the fact that the children’s
preference is for fun in software, which is no surprise.
They clearly identified the software which presented the
questions in a more formal linear manner, and which had
no games, as the least fun.
An important finding was that children as young as 7-8
were able to distinguish between concepts such as
usability, fun, and potential for learning. We asked them
to rank the products separately on each of these criteria;
they were able to do this after a very brief explanation, and
their answers showed that they were differentiating between
the concepts in a consistent way. It seems that ranking
methods like this are a useful way of getting opinions
about interactive products from children; this method was
more successful here than a method where they were asked
to score each product separately.
We measured ‘observed’ usability and fun, and ‘reported’
usability and fun. It was interesting that correlations were
found, as hypothesised, between observed usability and
observed fun, and between reported usability and reported
fun, but no correlations were found between observed
usability and reported usability, or between observed fun
and reported fun. We conclude that both usability and fun
are complex constructs, and that methods based on
observation are assessing different subsets of the constructs
to methods based on users’ reports. Hence attempts to
validate evaluation methods by comparing their findings to
those of different evaluation methods may be doomed to
failure unless care is taken to ensure that the methods are
indeed assessing the same construct.
FUTURE WORK
We have begun a series of heuristic evaluations of these
pieces of software, using heuristics for usability, for fun,
and for educational design. These evaluations are being
conducted independently of the tests. It will be interesting
to find out whether there is again a correlation between the
findings for fun and for usability.
There is scope for refinement of the ‘fun sorter’ ranking
method [13], but it appears to be a promising evaluation
tool for use with young children, and not only for
assessing fun.
We are also planning a range of further investigations of
evaluation methods for children’s interactive products, for
both usability and fun. These will include investigations
of the components of the usability and fun constructs that
are particularly critical for children’s products, and
experiments involving children as evaluators, rather than as
evaluation subjects.
ACKNOWLEDGEMENTS
We would like to acknowledge the invaluable assistance of
the children and teachers of English Martyrs School,
Preston. Thanks are due also to Emanuela Mazzone, Janet
Read and the MSc and PhD students who assisted us in
the data collection and experimental design.
REFERENCES
1. Department for Education and Employment, Connecting the Learning Society: National Grid for Learning. 1997, Department for Education and Employment: London.
2. ISO, Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs) -- Part 11: Guidance on Usability. 1998, ISO 9241-11.
3. Anajaneyulu, K.S.R., R.A. Singer, and A. Harding, Usability Studies of a Remedial Multimedia System. Journal of Educational Multimedia and Hypermedia, 1998. 7(2/3): p. 207-236.
4. Carroll, J.M., Beyond Fun. Interactions, 2004. 11(5): p. 38-40.
5. Draper, S.W., Analysing Fun as a Candidate Software Requirement. Personal Technology, 1999. 3: p. 117-122.
6. Alessi, S.M. and S.R. Trollip, Multimedia for Learning: Methods and Development. 3rd ed. 2001, Massachusetts: Allyn & Bacon.
7. Okan, Z., Edutainment: Is Learning at Risk? British Journal of Educational Technology, 2003. 34(3): p. 255-264.
8. Nielsen, J., Heuristic Evaluation, in Usability Inspection Methods, J. Nielsen and R.L. Mack, Editors. 1994, John Wiley: New York.
9. Polson, P., et al., Cognitive Walkthroughs: A Method for Theory-Based Evaluation of User Interfaces. International Journal of Man-Machine Studies, 1992. 36: p. 741-773.
10. Markopoulos, P. and M. Bekker, On the Assessment of Usability Testing Methods for Children. Interacting with Computers, 2003. 15: p. 227-243.
11. Gray, W.D. and M.C. Salzman, Damaged Merchandise? A Review of Experiments that Compare Usability Evaluation Methods. Human-Computer Interaction, 1998. 13(3): p. 203-261.
12. Sim, G.R., S.J. MacFarlane, and J.C. Read. Measuring the Effects of Fun on Learning in Software for Children. in CAL2005. 2005. Bristol, UK.
13. Read, J.C., S.J. MacFarlane, and C. Casey. Endurability, Engagement and Expectations: Measuring Children's Fun. in Interaction Design and Children. 2002. Eindhoven: Shaker Publishing.