ETCHA Sketches: Lessons Learned from Collecting Sketch Data

Mike Oltmans, Christine Alvarado, and Randall Davis


MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street
Cambridge, MA 02139
{moltmans, calvarad, davis}@csail.mit.edu

Abstract
We present ETCHA Sketches (an Experimental Test Corpus of Hand Annotated Sketches) with the goal of facilitating the development of a standard test corpus for sketch understanding research. To date we have collected sketches from four domains: circuit diagrams, family trees, floor plans and geometric configurations. We have also labeled many of the strokes in these data sets with geometric primitive labels (e.g., line, arc, polyline, polygon, and ellipse). We found accurate labeling of data to be a more complex task than may be anticipated. The complexity arises because labeled data can be used for different purposes with different requirements, and because some strokes are ambiguous and can legitimately be put into multiple categories. We discuss several different labeling methods and some properties of the sketches that became apparent from the process of collecting and labeling the data. The data sets are available online at http://rationale.csail.mit.edu/ETCHASketches.

Introduction
In recent years, improvements have been made in both low-level stroke classification and high level symbol understanding (Davis, Landay, & Stahovich 2002). Although each of these tasks requires the analysis of sketch data, to date there has been no standard test corpus to use in developing or evaluating these systems. Instead, each researcher has collected his or her own data, a process which is both time consuming and makes it difficult to compare sketch recognition technologies. ETCHA Sketches (an Experimental Test Corpus of Hand Annotated Sketches) addresses these problems by providing a corpus of freely-drawn sketches in a number of domains and a medium through which sketch recognition researchers can share their data. Such corpora have proved extremely useful in other research areas, e.g., spoken language systems and computational linguistics (Linguistic Data Consortium 2004).
First, we describe the sketches we collected from a variety of domains. Our goal was to have data from several different domains so that we could explore the differences between them and so that our data set would avoid the particular biases of any one domain.
Second, in order to use the data for evaluation and training tasks it was necessary to label the strokes with the primitive shapes that they depict: line, arc, ellipse, polyline, polygon, or other. The process of assigning labels turned out to be more subtle than it first appeared. We ended up collecting four sets of labels with slightly different semantics and with different intended purposes. For example, when evaluating a low level classifier we may be interested in both how often it outputs the correct answer and how often it outputs one of a set of acceptable answers. We wanted to support both of these types of evaluations and several others that we describe below. We describe how we collected these different labels and how we anticipate them being used.
Third, we share some observations about the sketches and their labellings that revealed some interesting properties of sketching. For example, we often observed that users did not draw complex shapes with consecutive strokes, but rather drew part of one object, then switched to another before returning to complete the first one.

Figure 1: Sample sketches from several domains: (a) circuit diagram, (b) floor plan, (c) family tree, (d) geometry.

Stroke Collection
Our goal was to gather data from a variety of different domains, to gain an understanding of different types of sketches and to see if they varied in substantive ways. We have restricted ourselves initially to sketches made with very simple interfaces (described below), in order to study sketching styles that are not influenced by specific interface capabilities. Here, we describe the four domains we gathered sketches from, the setup we used to collect them, and the representations we use for the data.

Sketch Collection in Different Domains
By studying a broad range of domains we hope to identify a representative and realistic set of issues facing sketch recognition systems. Representative sketches collected from four different domains can be seen in Figure 1. The collection process for each domain is described below.

Circuit Diagrams. To collect the circuit data, we solicited users with circuit design experience and asked them to produce sketches with certain properties. For example, we asked them to sketch a circuit with 1 battery, 3 resistors, 1 capacitor, 1 transistor and ground. The users were members of the MIT community; all were familiar with circuit design and had significant training and experience producing sketches of circuits from coursework and design.
This domain was one of the two based on simple compositions of geometric shapes (e.g., lines for a resistor and an arrow in a circle for a current source). Two features of the circuit domain were the heavy use of lines in many different shapes (e.g., wires, resistors, ground symbols, etc.) and a tendency for users to draw multiple shapes with a single stroke (e.g., two resistors and a wire connecting them were frequently drawn with one stroke).

Family Trees. Like circuit diagrams, family tree diagrams were a good source of simple geometric shapes drawn for a particular task. They included a number of shapes that did not appear or were rare in circuit diagrams, such as quadrilaterals and ellipses.
Family tree sketches were collected by asking subjects to draw a tree (not necessarily their own) using ellipses for females, quadrilaterals for males, lines for marriages, jagged lines for divorces and arrows for parent to child links. Some subjects used text to fill in the names of people and others did not. We did not use the sketches with text in the stroke labeling process because we see the handling of text and mixed text and graphics as a higher level task than the classification of shapes, and we wanted to avoid attacking that particular problem at this point. We could have used the sketches with the text removed but this was unnecessary because we had sufficient numbers of sketches without text.

Floor Plans. The floor plan domain was an interesting contrast to the other domains because floor plans are not based on compositions of geometric shapes. The important features in floor plans are usually the rooms created by the walls and not the walls themselves. This changes the focus of the drawing task. We wanted to contrast the more geometric nature of the other domains with a more free form domain such as this one, while staying away from completely free form sketches such as maps or artistic drawings.
When collecting floor plan sketches we asked users to draw simple bird’s eye views of single story apartments. They were asked first to draw their current apartment (as a warm up task), then to brainstorm several apartments they would like to live in, to choose one of them and make a cleaned up drawing of it, and finally to verbally describe the design to the moderator while redrawing it one final time. The subjects were not architects and had no explicit sketching experience, but the task was accessible because of people’s familiarity with floor plans.

Geometric Configurations. We also wanted to include one domain in which the users were not performing any particular task. In this data set, users were simply asked to draw a number of different geometric shapes and configurations. For example: “Draw two concentric circles” or “Draw two lines that meet at an acute angle.” As we discuss later, strokes from this domain were more consistently labeled than strokes from the circuit and family tree domains. This implies that the strokes were less ambiguous, and it highlights the importance of collecting data in realistic contexts.

Equipment and Software
We collected all of the sketches on Tablet-PCs using pen-sized styli. We chose Tablet-PCs over technologies that record physical ink (Mimio, CrossPad, HP’s Digital Pen) because our goal is to build interactive recognition systems that work with digital ink. In all cases users were presented with a simple interface that contained a description of what they should draw and a large area in which to draw the figure. There were no editing capabilities (e.g., copy, paste) in the interface. However, the eraser end of the stylus could be used to delete strokes. The use of the eraser had one significant impact because it was configured to delete full strokes rather than pixels. Users informed us that they quickly learned to compensate for this by drawing more and smaller strokes to avoid deleting more than they intended. In the future we plan to support the more natural pixel-based deletion mechanism.
No recognition feedback was given to the users because it can significantly modify their drawing behavior. The modification occurs because users adapt their drawing style so that their figures are recognized more reliably. Furthermore, as described in (Hong et al. 2002), users do not necessarily want recognition feedback. As a result we avoided giving users any feedback.
The sketching interface was implemented in C# using Microsoft’s Ink API. This allowed us to display pressure sensitive ink that was visually realistic.

Data Storage
In collecting our data we have found it useful to have three different storage media. First, the sketches are collected using Microsoft’s instrumented GIF images, which contain both an image and the stroke data. Second, we extract just
                         Best   CanBeA   Context   IsA   Total(*)
# of labelers              19       50        19    44        105
# of strokes labeled      466      175       543   186        814
# of strokes in corpus    387      154       467   162        750
(*) Some labelers and strokes appeared in multiple conditions, so the total is not equal to the sum of the columns.

Table 1: Sizes of the different label sets.

the information that concerns the strokes, including: x position, y position, and time at each data point. This format is much simpler than Microsoft’s, and is easily accessible without depending upon the Microsoft Ink API. In the future we plan to include pressure information, but have not yet done so because we have yet to use that property.
Third, we have found it useful to organize the data in an SQL database. Postgres has built-in support for geometric data types and was a natural choice. The database has been extremely useful in organizing the large amounts of data we have collected, which span many dimensions, such as different domains, tasks within domains, authors, sketches, strokes, and labeling information. With the database we can retrieve, for example, all of the strokes over 20 pixels long that were labeled as lines from any sketch in any domain. Without the database such a query requires extensive indexing (essentially a custom built database) or huge numbers of file accesses to load all of the strokes we have collected. The database can answer such queries efficiently, is easy to update, and provides a convenient central storage for our data.
The data is published on our website at http://rationale.csail.mit.edu/ETCHASketches/ as a downloadable archive of text based files. We are currently considering APIs or web based UIs to allow researchers elsewhere to directly browse and contribute to the database. Suggestions from the community as to how this may be useful would be appreciated. A summary of the quantity of data acquired from each domain is listed in Table 2.

Stroke Labeling
If the data is to be useful for evaluation and training, we need to know what the right answer(s) are. As an initial step we have focused on the labeling of individual strokes as one of the following primitive shapes: line, arc, polyline, polygon, ellipse, and other. We anticipated that the assignment of strokes into classes of primitive shapes would be straightforward, but the process was more complex than we anticipated. It taught us some lessons that we feel are widely relevant.
The primary issue we encountered was that labels can be used for different purposes, such as evaluating two classifiers, one that returns a single class and one that returns ranked outputs. The two evaluations need different types of data. As a result, we have collected four different label sets in an attempt to cover a wide range of possible uses of the labeled strokes. Each of these label sets was collected with slightly different processes, described below.

The Four Label Sets
Best in Isolation. The first label set indicates the most likely interpretation of the stroke when it is evaluated by itself. These labels are best suited for training and evaluating low level classifiers by asking the question: How many times does the classifier return the correct label for a stroke? In this case we define the correct label to be the one that the stroke most resembles when it is evaluated in isolation. We chose this definition because it matches the common purpose of low level classifiers, to classify single strokes. This definition is also the most appropriate for training a statistical classifier that operates on individual strokes. It would be misleading to have strokes in the training set labeled in context, because this information is not available to a single stroke classifier.

Context. The second label set contains labels that take into account the context of the entire sketch. These labels are more suited for comparison with a higher level recognition system that analyzes more than a single stroke at a time. This label set also allows us to study the role of context in classifying strokes, both for the people doing the labeling and for an automated system.

IsA and CanBeA. The third and fourth label sets assign a set of labels to each stroke. For example, a stroke that looks like both a line and an arc (because it is only slightly curved) would get two labels. The distinction between the IsA and CanBeA sets is in the semantics of class membership. For IsA the semantics are that the stroke is an instance of each assigned type. For CanBeA the semantics are that the stroke could possibly be an instance of each assigned type. The two sets have the same general purpose but the second allows for a larger range of possible interpretations. The most important question these label sets can answer is: How well matched to the list of possible labels from the label set is the output of a classifier that ranks multiple possible interpretations? Good performance on this task is extremely important when using the output of a stroke classifier as the input to a higher level recognition process, because the higher level process must have a large enough set of options to perform recognition accurately, but too many possibilities lead to prohibitively inefficient recognition.

The Labeling Processes
After breaking down the types of labels we wanted to gather, we constructed four different graphical interfaces to implement the labeling schemes described above (pictured in Figure 2).
In order to collect a large number of labels quickly we
                      Floor plans   Circuits   Family trees   Geometry   Total
Number of sketches            127        360             70       3055    3612
Number of strokes           11261       4349           1990       5797   23397
Number of users                23         10             10         27      70

Table 2: The composition of the full (unlabeled) dataset.
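
As a rough illustration of the simplified stroke format and the example query described under Data Storage, the following Python sketch stores each stroke as a list of (x, y, time) points and selects the strokes labeled as lines that are longer than 20 pixels. The class names, the field names, the reading of "long" as path length, and the in-memory list standing in for the database are all assumptions made for illustration; they are not the corpus file format or the Postgres schema.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Point:
    x: float  # x position in pixels
    y: float  # y position in pixels
    t: float  # time of the sample (unit assumed, e.g. milliseconds)


@dataclass
class Stroke:
    domain: str   # e.g. "circuits" (hypothetical field)
    label: str    # e.g. "line", "arc" (hypothetical field)
    points: list  # list of Point, in drawing order


def path_length(stroke):
    """Sum of distances between consecutive sample points."""
    pts = stroke.points
    return sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(pts, pts[1:]))


def long_lines(strokes, min_length=20.0):
    """Strokes labeled 'line' whose path length exceeds min_length pixels."""
    return [s for s in strokes
            if s.label == "line" and path_length(s) > min_length]


# Toy data, not taken from the corpus.
strokes = [
    Stroke("circuits", "line", [Point(0, 0, 0), Point(15, 20, 40)]),    # length 25
    Stroke("geometry", "line", [Point(0, 0, 0), Point(3, 4, 10)]),      # length 5
    Stroke("family trees", "arc", [Point(0, 0, 0), Point(30, 40, 90)]), # not a line
]
selected = long_lines(strokes)
```

With a real database the same filter would be a single query; the point of the sketch is only that the per-point format carries enough information to answer it.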

Figure 2: The four labeling interfaces: (a) Best in context, (b) Is the stroke a..., (c) Can it be a..., (d) Best in isolation.


implemented the interfaces as web-based Java applets and invited users to spend a few minutes labeling strokes in exchange for being entered into a drawing each time they completed a session. The participants were largely from the MIT community, although participation was open to anyone with a Java-compatible web browser. Participants generally did not have any experience with sketch recognition research. We avoided labeling the data ourselves because we didn’t want to bias the labeling with our own definitions of the classes, which have evolved in parallel with our recognizers.
While designing the interfaces we wanted to avoid order effects. For example, when being shown strokes in isolation, after seeing several carefully drawn lines, a slightly curvy line may be more likely to be classified as an arc due to the contrast with the previous strokes. To avoid the problem the interface randomly selects strokes from a pool of all strokes, drawn across different sketches, authors, and data sets.
The four desired label sets fell into two categories: single label and multiple label.

Collecting the Single Label Sets: Best and Context. These two interfaces both asked the user to choose the best interpretation for a stroke. The interface for the Best label set presented the stroke in isolation. The Context interface presented the stroke in the context of the original sketch by coloring it in red while the remainder of the sketch was rendered in blue. Unlike the other conditions, which randomized the stroke ordering, the Context interface presented strokes in the order they were drawn to avoid having to switch back and forth between sketches. These two interfaces also allowed the user to assign the label other if the stroke did not naturally fit into any category. We decided to include this category so that users would not randomly assign labels to strokes that did not fit into any of the categories.

Collecting the Multiple Label Sets: IsA and CanBeA. To collect these two data sets we viewed the classification as a binary decision for each label. To capture this we showed labelers one stroke in isolation and asked them if it belonged in a particular category (line, arc, etc.). To avoid ordering effects the labeling process was organized by the label categories instead of by strokes. This meant that the labelers were asked to classify strokes against a specific category and were then presented with a sequence of five strokes. The category was then changed and they were presented with another set of strokes. For example, they were asked if five strokes were lines and then asked if the next five strokes were arcs. This process repeated until all of the strokes were evaluated for each category. These two interfaces varied only in the instructions presented to the user: “is it a...” versus “can this possibly be a...”.

Generating Label Sets From the Raw Label Data
After collecting the labels for the strokes we compiled the results to produce the four label sets described above. For the sets with single interpretations, Best and Context, this was a straightforward task: we simply included the strokes that both labelers agreed on.
For the IsA and CanBeA label sets, the situation was complicated by the presence of multiple labels per stroke. When labeling the data, each stroke was presented and judged independently. Accordingly we chose to include stroke labels in the final dataset on a per label basis because it maintained the same semantics of the labels and was analogous to the criterion for the single label cases. For example, if one labeler labeled a stroke as a line and an arc but the other labeler labeled it as a line, we included the stroke in the label set but only with the line label.
For the multiple label data sets we considered, and rejected, the criterion of including all of the labels that were assigned by any labeler. Although this criterion captures the idea of including a wider range of possible interpretations, it does not account for errors made by the labelers and therefore fails to ensure the accuracy of the labels.
One drawback to the agreement based inclusion of strokes is that it prefers strokes that are less ambiguous. It is possible that these strokes will be easier for recognizers to classify and will not be representative of the range of strokes present in sketches. To address this problem, we are considering other criteria for including labels in the final label sets. One option, which could be applied to all of the label sets, is to have more than two labelers for each stroke and include labels that were agreed upon at least some percentage of the time. Alternatively, the number of votes for each label could be used to weight or rank the labels assigned to a stroke. Our label collection is ongoing, and when we obtain sufficient data to experiment with these different criteria we will be able to more thoroughly evaluate the suitability of each of these options.
The numbers of strokes included to date from each condition are listed in Table 1.

Discussion
Here we present lessons we learned about the nature of the sketches from different domains, the differences between the label sets, and ways to improve the final labeled corpus.

Observations from the Data and Label Sets
The labeling data revealed an important lesson about the different classes of sketches that we collected. The strokes from the geometry data set were generally easier to label. There was a higher level of agreement between labelers on strokes from the geometry data than from the circuit and family tree domains (shown in Table 3). This indicates that they were less ambiguous. The Kappa value is a measure of agreement that takes into account how likely two labelers are to agree by chance (Carletta 1996). We hypothesize that this difference between conditions is attributable to the fact that the strokes were drawn outside of any realistic context and therefore the person drawing was focused on producing careful, clear figures. The fact that people perceived these strokes differently serves as a warning that training and test data should be selected from data that is collected from contexts similar to the one that will be used in practice.
While we provided some guidance above about how to choose a label set for different purposes, we did not discuss
                % Agreement   Kappa
Geometry               90.4    0.86
Family Trees           80.1    0.73
Circuits               82.0    0.71

Table 3: Agreement between labelers using the labels from the Best labeling condition and on a per study basis.
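
The Kappa values in Table 3 correct raw agreement for the agreement expected by chance: kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed fraction of strokes on which the two labelers agree and P_e is the fraction expected by chance. The following is a minimal sketch of the two-labeler computation, assuming Cohen's formulation, in which P_e is estimated from each labeler's own label frequencies. The text does not state which variant of the statistic was used, and the function and variable names are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers over the same strokes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of strokes given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same class if each
    # labeler chooses independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both labelers used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# Toy example, not corpus data: 3 of 4 strokes agree (75% raw agreement).
labeler_1 = ["line", "line", "arc", "arc"]
labeler_2 = ["line", "arc", "arc", "arc"]
kappa = cohen_kappa(labeler_1, labeler_2)  # observed 0.75, expected 0.5
```

In the toy example raw agreement is 75% but kappa is only 0.5, which is why the two columns of Table 3 can rank domains differently.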

          Labeled strokes                              Final set of labeled strokes
          Total # strokes   # with multiple labels     Total # strokes   # with multiple labels
IsA                   186               45 (24.2%)                 162                 6 (3.7%)
CanBeA                175               82 (46.9%)                 154               19 (12.3%)

Table 4: The number of strokes with multiple labels as compared with the total number of strokes in each category.
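
The agreement-based inclusion criteria behind Table 4 can be sketched as follows, assuming exactly two labelers per stroke and labels held in dictionaries keyed by a stroke identifier. The data layout and function names are hypothetical; only the criteria themselves (same label for the single label sets, per-label agreement for the multiple label sets) come from the text.

```python
def agreed_single(labels_a, labels_b):
    """Best / Context: keep a stroke only if both labelers chose the same label."""
    return {sid: label for sid, label in labels_a.items()
            if sid in labels_b and labels_b[sid] == label}


def agreed_multi(labels_a, labels_b):
    """IsA / CanBeA: keep each label that both labelers assigned to a stroke.

    A stroke with no label in common is dropped from the final set.
    """
    final = {}
    for sid in labels_a.keys() & labels_b.keys():
        common = labels_a[sid] & labels_b[sid]
        if common:
            final[sid] = common
    return final


def multi_label_fraction(label_sets):
    """Fraction of strokes that carry more than one label."""
    if not label_sets:
        return 0.0
    return sum(len(labels) > 1 for labels in label_sets.values()) / len(label_sets)


# Toy data, not taken from the corpus.
labeler_a = {"s1": {"line", "arc"}, "s2": {"ellipse"}, "s3": {"polyline"}}
labeler_b = {"s1": {"line"}, "s2": {"ellipse", "arc"}, "s3": {"polygon"}}

final_sets = agreed_multi(labeler_a, labeler_b)

# Before filtering: strokes given more than one label by at least one labeler.
before = sum(len(labeler_a[s]) > 1 or len(labeler_b[s]) > 1
             for s in labeler_a) / len(labeler_a)
after = multi_label_fraction(final_sets)
```

In this toy case two of three strokes received multiple labels from at least one labeler, yet none keeps more than one label after filtering, because the labelers never agreed on a second label. This is the same effect that separates the left and right halves of Table 4.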

the differences between the CanBeA and IsA label sets. After analyzing the current state of the two data sets we have determined that the CanBeA set should be used instead of the IsA data set. After compiling the final label sets with our current, agreement based, criteria for which labels to include, we found that in the IsA label set only 3.7% of strokes meeting the agreement criteria had more than one label. This was somewhat surprising because before applying the criterion 24.2% of all the strokes labeled in the IsA condition had received more than one label from at least one of the two labelers. This means that although labelers generally agreed on at least one label they rarely agreed on a second label. The results from the CanBeA case are higher: 12.3% of the strokes meeting the inclusion criterion had multiple labels, compared with 46.9% of all labeled strokes. We believe that these percentages (summarized in Table 4), especially for the IsA case, are artificially low as a result of requiring unanimous agreement for a label to be included. We plan to reevaluate this label set when more labelers have evaluated each stroke and we have experimented with other label inclusion criteria. However, visual inspection of the stroke labellings included in the current CanBeA label set suggests that they capture a reasonable amount of variation without being overly liberal with label assignments, and can therefore be used for cases needing multiple stroke labels.
One of our goals in collecting more label data is to increase the number of strokes that are labeled in all four conditions. In the current corpus there is a relatively small number of such strokes because we selected strokes randomly from a pool to avoid order effects in the labeling process. However, our initial pool of strokes was too large and we did not get a dense labeling of the space, and therefore there are fewer strokes that appear in all the label sets than we had hoped. Having a denser labeling of strokes would allow more complete comparisons between the different label sets.

Qualitative Observations
In addition to providing a test corpus, the stroke and label data tells us how people make free sketches. Based on our current data sets and our observations of people using our systems over the past couple of years, we have observed several phenomena that contradict assumptions made by designers of some interactive sketching systems:

• Observation #1: Users do not always draw each object with a sequence of consecutive strokes. We observed numerous sketches in which the user drew part of an object, left it unfinished, drew a second object, and only then returned to finish the first.
• Observation #2: Users drew more than one object using a single stroke. In circuit diagrams, for example, users often drew several resistors, wires, and even voltage sources (circles) all with a single pen stroke.
• Observation #3: Erasing whole strokes instead of individual pixels affects drawing style. Some users were surprised that the eraser removed an entire stroke instead of just part of it. They mentioned that they learned to compensate for this limitation by drawing shorter strokes.
• Observation #4: Users draw differently when using an interactive system than when freely sketching. We informally observed that many users drew more precisely when using an interactive system that displayed recognition feedback than when using a system that performed no recognition.

Related Work
We are unaware of any publicly available corpus of categorized and labeled sketch data. Other work on the analysis of sketches has examined the relationships between sketching and a number of cognitive processes. Tversky studies the importance of sketching in the early stages of design (Tversky 2002). Kavakli et al. investigate style differences between novice and expert architects (Kavakli & Gero 2002) and between designers performing different tasks, including tasks at different stages of the design process (Kavakli, Scrivener, & Ball 1998). These studies have provided us insight into the types of variations we might expect and have suggested key variables to record and analyze. It would be worthwhile to reproduce these types of studies to see how the results vary with the use of digital ink and to see if the differences between different tasks and users can be more quantitatively characterized.

Conclusion
We have created the ETCHA Sketches database that, to date, contains sketches from four domains and four different sets of labels which can be used for different evaluation and training tasks. We have presented some of our insights into both the process of collecting and labeling sketches and some properties of free sketches.
The ETCHA Sketches database is publicly available to the community through our group’s web page: http://rationale.csail.mit.edu/ETCHASketches. In the future we intend to provide a more powerful web based interface to facilitate the addition of new sketches and the selective retrieval and browsing of the data sets. As a work in progress, the datasets will continue to grow and cover more domains. Researchers at other institutions with similar classes of datasets or even just raw stroke data are encouraged to contact the authors to incorporate their data into the database.

References
Carletta, J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2):249–254.
Davis, R.; Landay, J.; and Stahovich, T., eds. 2002. Sketch Understanding. AAAI Press.
Hong, J.; Landay, J.; Long, A. C.; and Mankoff, J. 2002. Sketch recognizers from the end-user’s, the designer’s, and the programmer’s perspective. Sketch Understanding, Papers from the 2002 AAAI Spring Symposium 73–77.
Kavakli, M., and Gero, J. S. 2002. The structure of concurrent cognitive actions: A case study on novice and expert designers. Design Studies 23(1):25–40.
Kavakli, M.; Scrivener, S. A. R.; and Ball, L. J. 1998. Structure in idea sketching behaviour. Design Studies 19(4):485–517.
Linguistic Data Consortium. 2004. LDC–linguistic data consortium. http://www.ldc.upenn.edu.
Tversky, B. 2002. What do sketches say about thinking? Sketch Understanding, Papers from the 2002 AAAI Spring Symposium 148–151.
