ETCHA Sketches: Lessons Learned From Collecting Sketch Data: M O and C A and R D
Abstract
We present ETCHA Sketches, an Experimental Test Corpus of Hand Annotated Sketches, with the goal of facilitating the development of a standard test corpus for sketch understanding research. To date we have collected sketches from four domains: circuit diagrams, family trees, floor plans and geometric configurations. We have also labeled many of the strokes in these data sets with geometric primitive labels (e.g., line, arc, polyline, polygon, and ellipse). We found accurate labeling of data to be a more complex task than may be anticipated. The complexity arises because labeled data can be used for different purposes with different requirements, and because some strokes are ambiguous and can legitimately be put into multiple categories. We discuss several different labeling methods and some properties of the sketches that became apparent from the process of collecting and labeling the data. The data sets are available online at http://rationale.csail.mit.edu/ETCHASketches.
Introduction
In recent years, improvements have been made in both low-level stroke classification and high level symbol understanding (Davis, Landay, & Stahovich 2002). Although each of these tasks requires the analysis of sketch data, to date there has been no standard test corpus to use in developing or evaluating these systems. Instead, each researcher has collected his or her own data, a process which is both time consuming and makes it difficult to compare sketch recognition technologies. ETCHA Sketches (an Experimental Test Corpus of Hand Annotated Sketches) addresses these problems by providing a corpus of freely-drawn sketches in a number of domains and a medium through which sketch recognition researchers can share their data. Such corpora have proved extremely useful in other research areas, e.g., spoken language systems and computational linguistics (Linguistic Data Consortium 2004).
First, we describe the sketches we collected from a variety of domains. Our goal was to have data from several different domains so that we could explore the differences between them and so that our data set would avoid the particular biases of any one domain.
Second, in order to use the data for evaluation and training tasks it was necessary to label the strokes with the primitive shapes that they depict: line, arc, ellipse, polyline, polygon, or other. The process of assigning labels turned out to be more subtle than it first appeared. We ended up collecting four sets of labels with slightly different semantics and with different intended purposes. For example, when evaluating a low level classifier we may be interested in both how often it outputs the correct answer and how often it outputs one of a set of acceptable answers. We wanted to support both of these types of evaluations and several others that we describe below. We describe how we collected these different labels and how we anticipate them being used.
Third, we share some observations about the sketches and their labellings that revealed some interesting properties of sketching. For example, we often observed that users did not draw complex shapes with consecutive strokes, but rather drew part of one object, then switched to another before returning to complete the first one.
[Figure 1: Sample sketches from several domains. Panels: (a) Circuit Diagram, (b) Floorplan, (c) Family tree, (d) Geometry]
Stroke Collection
Our goal was to gather data from a variety of different domains, to gain an understanding of different types of sketches and to see if they varied in substantive ways. We have restricted ourselves initially to sketches made with very simple interfaces (described below), in order to study sketching styles that are not influenced by specific interface capabilities. Here, we describe the four domains we gathered sketches from, the setup we used to collect them, and the representations we use for the data.
Sketch Collection in Different Domains
By studying a broad range of domains we hope to identify a representative and realistic set of issues facing sketch recognition systems. Representative sketches collected from four different domains can be seen in Figure 1. The collection process for each domain is described below.
Circuit Diagrams. To collect the circuit data, we solicited users with circuit design experience and asked them to produce sketches with certain properties. For example, we asked them to sketch a circuit with 1 battery, 3 resistors, 1 capacitor, 1 transistor and ground. The users were members of the MIT community and were all familiar with circuit design and had significant training and experience producing sketches of circuits from coursework and design.
This domain was one of the two based on simple compositions of geometric shapes (e.g., lines for a resistor and an arrow in a circle for a current source). Two features of the circuit domain were the heavy use of lines in many different shapes (e.g., wires, resistors, and ground symbols) and a tendency for users to draw multiple shapes with a single stroke (e.g., two resistors and a wire connecting them were frequently drawn with one stroke).
Family Trees. Like circuit diagrams, family tree diagrams were a good source of simple geometric shapes drawn for a particular task. They included a number of shapes that did not appear or were rare in circuit diagrams, such as quadrilaterals and ellipses.
Family tree sketches were collected by asking subjects to draw a tree (not necessarily their own) using ellipses for females, quadrilaterals for males, lines for marriages, jagged lines for divorces and arrows for parent to child links. Some subjects used text to fill in the names of people and others did not. We did not use the sketches with text in the stroke labeling process because we see the handling of text and mixed text graphics to be a higher level task than the classification of shapes and wanted to avoid attacking that particular problem at this point. We could have used the sketches with the text removed but this was unnecessary because we had sufficient numbers of sketches without text.
Floor Plans. The floor plan domain was an interesting contrast to the other domains because floor plans are not based on compositions of geometric shapes. The important features in floor plans are usually the rooms created by the walls and not the walls themselves. This changes the focus of the drawing task. We wanted to contrast the more geometric nature of the other domains with a more free form domain such as this one, while staying away from completely free form sketches such as maps or artistic drawings.
When collecting floor plan sketches we asked users to draw simple bird’s eye views of single story apartments. They were asked first to draw their current apartment (as a warm up task), brainstorm several apartments they would like to live in, choose one of them to make a cleaned up drawing of, and finally, to verbally describe the design to the moderator while redrawing it one final time. The subjects were not architects and had no explicit sketching experience, but the task was accessible because of people’s familiarity with floor plans.
Geometric Configurations. We also wanted to include one domain in which the users were not performing any particular task. In this data set, users were simply asked to draw a number of different geometric shapes and configurations. For example: “Draw two concentric circles” or “Draw two lines that meet at an acute angle.” As we discuss later, strokes from this domain were more consistently labeled than strokes from the circuit and family tree domains. This implies that the strokes were less ambiguous and it highlights the importance of collecting data in realistic contexts.
Equipment and Software
We collected all of the sketches on Tablet-PCs using pen sized styli. We chose Tablet-PCs over technologies that record physical ink (Mimio, CrossPad, HP’s Digital Pen) because our goal is to build interactive recognition systems that work with digital ink. In all cases users were presented with a simple interface that contained a description of what they should draw and a large area in which to draw the figure. There were no editing capabilities (e.g., copy, paste) in the interface. However, the eraser end of the stylus could be used to delete strokes. The use of the eraser had one significant impact because it was configured to delete full strokes rather than pixels. Users informed us that they quickly learned to compensate for this by drawing more and smaller strokes to avoid deleting more than they intended. In the future we plan to support the more natural pixel-based deletion mechanism.
No recognition feedback was given to the users because it can significantly modify their drawing behavior. The modification occurs because users adapt their drawing style so that their figures are recognized more reliably. Furthermore, as described in (Hong et al. 2002) users do not necessarily want recognition feedback. As a result we avoided giving the users any feedback.
The sketching interface was implemented in C# using Microsoft’s Ink API. This allowed us to display pressure sensitive ink that was visually realistic.
Data Storage
In collecting our data we have found it useful to have three different storage media. First, the sketches are collected using Microsoft’s instrumented GIF images, which contain both an image and the stroke data. Second, we extract just
the information that concerns the strokes, including: x position, y position, and time at each data point. This format is much simpler than Microsoft’s, and is easily accessible without depending upon the Microsoft Ink API. In the future we plan to include pressure information, but have not yet done so because we have yet to use that property.
Third, we have found it useful to organize the data in an SQL database. Postgres has built-in support for geometric data types and was a natural choice. The database has been extremely useful in organizing the large amounts of data we have collected that spans many dimensions, such as different domains, tasks within domains, authors, sketches, strokes, and labeling information. With the database we can retrieve, for example, all of the strokes over 20 pixels long that were labeled as lines from any sketch in any domain. Without the database such a query requires extensive indexing (essentially a custom built database) or huge numbers of file accesses to load all of the strokes we have collected. The database can answer such queries efficiently, is easy to update, and provides a convenient central storage for our data.
The data is published on our website at http://rationale.csail.mit.edu/ETCHASketches/ as a downloadable archive of text based files. We are currently considering APIs or web based UIs to allow researchers elsewhere to directly browse and contribute to the database. Suggestions from the community as to how this may be useful would be appreciated. A summary of the quantity of data acquired from each domain is listed in Table 2.
Stroke Labeling
If the data is to be useful for evaluation and training, we need to know what the right answer(s) are. As an initial step we have focused on the labeling of individual strokes as one of the following primitive shapes: line, arc, polyline, polygon, ellipse, and other. We anticipated that the assignment of strokes into classes of primitive shapes would be straightforward, but the process was more complex than we anticipated. It taught us some lessons that we feel are widely relevant.
The primary issue we encountered was that labels can be used for different purposes, such as evaluating two classifiers, one that returns a single class and one that returns ranked outputs. The two evaluations need different types of data. As a result, we have collected four different label sets in an attempt to cover a wide range of possible uses of the labeled strokes. Each of these label sets was collected with a slightly different process, described below.
The Four Label Sets
Best in Isolation. The first label set indicates the most likely interpretation of the stroke when it is evaluated by itself. These labels are best suited for training and evaluating low level classifiers by asking the question: How many times does the classifier return the correct label for a stroke? In this case we define the correct label to be the one that the stroke most resembles when it is evaluated in isolation. We chose this definition because it matches the common purpose of low level classifiers, to classify single strokes. This definition is also the most appropriate for training a statistical classifier that operates on individual strokes. It would be misleading to have strokes in the training set labeled in context, because this information is not available to a single stroke classifier.
Context. The second label set contains labels that take into account the context of the entire sketch. These labels are more suited for comparison with a higher level recognition system that analyzes more than a single stroke at a time. This label set also allows us to study the role of context in classifying strokes, both for the people doing the labeling and for an automated system.
IsA and CanBeA. The third and fourth label sets assign a set of labels to each stroke. For example, a stroke that looks like both a line and an arc (because it is only slightly curved) would get two labels. The distinction between the IsA and CanBeA sets is in the semantics of class membership. For IsA the semantics are that the stroke is an instance of each assigned type. For CanBeA the semantics are that the stroke could possibly be an instance of each assigned type. The two sets have the same general purpose but the second allows for a larger range of possible interpretations. The most important question these label sets can answer is: How well matched to the list of possible labels from the label set is the output of a classifier that ranks multiple possible interpretations? Good performance on this task is extremely important when using the output of a stroke classifier as the input to a higher level recognition process, because the higher level process must have a large enough set of options to perform recognition accurately but too many possibilities lead to prohibitively inefficient recognition.
The Labeling Processes
After breaking down the types of labels we wanted to gather, we constructed four different graphical interfaces to implement the labeling schemes described above (pictured in Figure 2).
In order to collect a large number of labels quickly we

                         Best  CanBeA  Context  IsA  Total(*)
# of labelers              19      50       19   44       105
# of strokes labeled      466     175      543  186       814
# of strokes in corpus    387     154      467  162       750
(*) Some labelers and strokes appeared in multiple conditions, so the total is not equal to the sum of the columns.
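The two evaluation questions posed for the label sets (how often a classifier returns the single correct label, and how well its ranked output matches a set of acceptable labels) can be illustrated with a minimal sketch. The function names, the stroke identifiers, and the choice of recall and precision over the top-k labels are assumptions made for illustration; the paper does not prescribe a particular metric.

```python
from typing import Dict, List, Set, Tuple


def top1_accuracy(ranked: Dict[str, List[str]], best: Dict[str, str]) -> float:
    """Fraction of strokes whose top-ranked label equals the single Best label."""
    scored = [sid for sid in best if ranked.get(sid)]
    if not scored:
        return 0.0
    hits = sum(1 for sid in scored if ranked[sid][0] == best[sid])
    return hits / len(scored)


def set_match(ranked: Dict[str, List[str]],
              acceptable: Dict[str, Set[str]],
              k: int) -> Tuple[float, float]:
    """Mean (recall, precision) of the top-k labels against the acceptable set.

    Recall is low when acceptable interpretations are missing from the top k;
    precision is low when the top k is padded with unacceptable labels.
    """
    recalls, precisions = [], []
    for sid, labels in acceptable.items():
        top_k = set(ranked.get(sid, [])[:k])
        if not labels or not top_k:
            continue  # nothing to compare for this stroke
        overlap = len(top_k & labels)
        recalls.append(overlap / len(labels))
        precisions.append(overlap / len(top_k))
    if not recalls:
        return 0.0, 0.0
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)


# Hypothetical classifier output (ranked labels per stroke) and label sets.
ranked_output = {
    "s1": ["line", "arc", "polyline"],
    "s2": ["ellipse", "arc"],
    "s3": ["polyline", "line"],
}
best_labels = {"s1": "line", "s2": "arc", "s3": "polyline"}
can_be_a = {"s1": {"line", "arc"}, "s2": {"arc"}, "s3": {"polyline", "polygon"}}

accuracy = top1_accuracy(ranked_output, best_labels)
recall, precision = set_match(ranked_output, can_be_a, k=2)
print(f"top-1 accuracy={accuracy:.3f} recall@2={recall:.3f} precision@2={precision:.3f}")
```

Reporting recall and precision separately mirrors the trade-off described in the text: a higher level recognizer needs enough candidate interpretations, but not so many that recognition becomes inefficient.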
Table 2: Quantity of data acquired from each domain.
                     Floor plans  Circuits  Family trees  Geometry  Total
Number of sketches           127       360            70      3055   3612
Number of strokes          11261      4349          1990      5797  23397
Number of users               23        10            10        27     70
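The simplified stroke format (x position, y position, and time at each data point) and the example query from the Data Storage section (all strokes over 20 pixels long that were labeled as lines) can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for the Postgres database, and the table names, column names, and sample strokes are hypothetical rather than the actual ETCHA schema.

```python
import math
import sqlite3
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, time), as in the simplified stroke format


def stroke_length(points: List[Point]) -> float:
    """Path length of a stroke in pixels, summed over consecutive points."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1, _), (x2, y2, _) in zip(points, points[1:])
    )


# In-memory SQLite database; the schema below is an assumed, simplified layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE strokes (id INTEGER PRIMARY KEY, domain TEXT, length_px REAL)"
)
conn.execute(
    "CREATE TABLE labels (stroke_id INTEGER REFERENCES strokes(id), "
    "label_set TEXT, label TEXT)"
)

# Hypothetical sample strokes: (domain, points, Best label).
samples = [
    ("circuit", [(0, 0, 0), (30, 40, 15)], "line"),                    # 50 px
    ("family_tree", [(0, 0, 0), (3, 4, 10)], "line"),                  # 5 px
    ("geometry", [(0, 0, 0), (0, 30, 10), (30, 30, 20)], "polyline"),  # 60 px
]
for domain, points, label in samples:
    cur = conn.execute(
        "INSERT INTO strokes (domain, length_px) VALUES (?, ?)",
        (domain, stroke_length(points)),
    )
    conn.execute(
        "INSERT INTO labels VALUES (?, 'Best', ?)", (cur.lastrowid, label)
    )

# All strokes over 20 pixels long that were labeled as lines, in any domain.
long_lines = conn.execute(
    "SELECT s.id, s.domain, s.length_px "
    "FROM strokes s JOIN labels l ON l.stroke_id = s.id "
    "WHERE l.label = 'line' AND s.length_px > 20 "
    "ORDER BY s.id"
).fetchall()
print(long_lines)
```

Storing a precomputed length per stroke is one possible design; a Postgres deployment could instead compute lengths from its geometric types at query time.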
[Figure 2: The labeling interfaces. Panels: (a) Best in context, (b) Is the stroke a..., (c) Can it be a...]
Table 3: Agreement between labelers using the labels from the Best labeling condition and on a per study basis.
Table 4: The number of strokes with multiple labels as compared with the total number of strokes in each category
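As a rough illustration of the agreement-based inclusion criterion discussed in this section, the sketch below keeps a label for a stroke only when every labeler assigned it, then computes the two kinds of percentage that Table 4 compares: the share of included strokes with more than one label, and the share of all labeled strokes that received more than one label from at least one labeler. The data layout, function names, and sample labels are assumptions; the exact criteria used for the corpus may differ.

```python
from typing import Dict, List, Set, Tuple


def unanimous_labels(per_labeler: List[Set[str]]) -> Set[str]:
    """Labels that every labeler assigned to the stroke (unanimous agreement)."""
    if not per_labeler:
        return set()
    return set.intersection(*per_labeler)


def multi_label_fractions(raw: Dict[str, List[Set[str]]]) -> Tuple[float, float]:
    """Return (fraction after the criterion, fraction before the criterion).

    After: among strokes with at least one unanimously agreed label, the
    fraction that keep more than one label.
    Before: among all labeled strokes, the fraction where at least one
    labeler assigned more than one label.
    """
    included = {sid: unanimous_labels(sets) for sid, sets in raw.items()}
    included = {sid: labels for sid, labels in included.items() if labels}
    multi_after = sum(1 for labels in included.values() if len(labels) > 1)
    multi_before = sum(
        1 for sets in raw.values() if any(len(s) > 1 for s in sets)
    )
    after = multi_after / len(included) if included else 0.0
    before = multi_before / len(raw) if raw else 0.0
    return after, before


# Hypothetical labels from two labelers per stroke.
raw_labels = {
    "s1": [{"line", "arc"}, {"line"}],         # agree on one label only
    "s2": [{"line", "arc"}, {"line", "arc"}],  # agree on both labels
    "s3": [{"ellipse"}, {"ellipse"}],          # single agreed label
    "s4": [{"arc"}, {"polyline"}],             # no agreement, excluded
}
after, before = multi_label_fractions(raw_labels)
print(f"multi-label after criterion: {after:.1%}, before: {before:.1%}")
```

The example shows the effect described in the text: labelers often agree on one label but not on a second, so requiring unanimity shrinks the multi-label share.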
the differences between the CanBeA and IsA label sets. After analyzing the current state of the two data sets we have determined that the CanBeA set should be used instead of the IsA data set. After compiling the final label sets with our current, agreement based, criteria for which labels to include, we found that in the IsA label set only 3.7% of strokes meeting the agreement criteria had more than one label. This was somewhat surprising because before applying the criterion 24.2% of all the strokes labeled in the IsA condition had received more than one label by at least one of the two labelers. This means that although labelers generally agreed on at least one label they rarely agreed on a second label. The percentages for the CanBeA case are higher: 12.3% of the strokes meeting the inclusion criterion had multiple labels, as did 46.9% of the total number of labeled strokes. We believe that these percentages (summarized in Table 4), especially for the IsA case, are artificially low as a result of requiring unanimous agreement for a label to be included. We plan to reevaluate this label set when more labelers have evaluated each stroke and we have experimented with other label inclusion criteria. However, visual inspection of the stroke labellings included in the current CanBeA label set suggests that they capture a reasonable amount of variation without being overly liberal with label assignments, and can therefore be used for cases needing multiple stroke labels.
One of our goals in collecting more label data is to increase the number of strokes that are labeled in all four conditions. In the current corpus there is a relatively small number of such strokes because we selected strokes randomly from a pool to avoid order effects in the labeling process. However, our initial pool of strokes was too large and we did not get a dense labeling of the space, and therefore there are fewer strokes that appear in all the label sets than we had hoped. Having a denser labeling of strokes would allow more complete comparisons between the different label sets.
Qualitative Observations
In addition to providing a test corpus, the stroke and label data tells us how people make free sketches. Based on our current data sets and our observations of people using our systems over the past couple of years, we have observed several phenomena that contradict assumptions made by designers of some interactive sketching systems:
• Observation #1: Users do not always draw each object with a sequence of consecutive strokes. We observed numerous sketches in which the user drew part of an object, left it unfinished, drew a second object, and only then returned to finish the first.
• Observation #2: Users drew more than one object using a single stroke. In circuit diagrams, for example, users often drew several resistors, wires, and even voltage sources (circles) all with a single pen stroke.
• Observation #3: Erasing whole strokes instead of individual pixels affects drawing style. Some users were surprised that the eraser removed an entire stroke instead of just part of it. They mentioned that they learned to compensate for this limitation by drawing shorter strokes.
• Observation #4: Users draw differently when using an interactive system than when freely sketching. We informally observed that many users drew more precisely when using an interactive system that displayed recognition feedback than when using a system that performed no recognition.
Related Work
We are unaware of any publicly available corpus of categorized and labeled sketch data. Other work relating to the analysis of sketches has been done in analyzing the relationships between sketching and a number of cognitive processes. Tversky studies the importance of sketching in the early stages of design (Tversky 2002). Kavakli et al. investigate style differences between novice and expert architects (Kavakli & Gero 2002) and between designers performing different tasks, including tasks at different stages of the design process (Kavakli, Scrivener, & Ball 1998). These studies have provided us insight into the types of variations we might expect and have suggested key variables to record and analyze. It would be worthwhile to reproduce these types of studies to see how the results vary with the use of digital ink and to see if the differences between different tasks and users can be more quantitatively characterized.
Conclusion
We have created the ETCHA Sketches database that, to date, contains sketches from four domains and four different sets of labels which can be used for different evaluation and training tasks. We have presented some of our insights into both the process of collecting and labeling sketches and some properties of free sketches.
The ETCHA Sketches database is publicly available to the community through our group’s web page: http://rationale.csail.mit.edu/ETCHASketches. In the future we intend to provide a more powerful web based interface to facilitate the addition of new sketches and the selective retrieval and browsing of the data sets. As a work in progress, the datasets will continue to grow and cover more domains. Researchers at other institutions with similar classes of datasets or even just raw stroke data are encouraged to contact the authors to incorporate their data into the database.
References
Carletta, J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2):249–254.
Davis, R.; Landay, J.; and Stahovich, T., eds. 2002. Sketch Understanding. AAAI Press.
Hong, J.; Landay, J.; Long, A. C.; and Mankoff, J. 2002. Sketch recognizers from the end-user’s, the designer’s, and the programmer’s perspective. Sketch Understanding, Papers from the 2002 AAAI Spring Symposium 73–77.
Kavakli, M., and Gero, J. S. 2002. The structure of concurrent cognitive actions: A case study on novice and expert designers. Design Studies 23(1):25–40.
Kavakli, M.; Scrivener, S. A. R.; and Ball, L. J. 1998. Structure in idea sketching behaviour. Design Studies 19(4):485–517.
Linguistic Data Consortium. 2004. LDC–linguistic data consortium. http://www.ldc.upenn.edu.
Tversky, B. 2002. What do sketches say about thinking? Sketch Understanding, Papers from the 2002 AAAI Spring Symposium 148–151.