EPIC-OSM: A Software Framework for OpenStreetMap Data Analytics
Jennings Anderson
Robert Soden Kenneth M. Anderson Marina Kogan Leysia Palen
University of Colorado Boulder
{jennings.anderson, robert.soden, ken.anderson, marina.kogan, leysia.palen}@colorado.edu
Abstract
An important area of work in big data software
engineering involves the design and development of
software frameworks for data-intensive systems that
perform large-scale data collection and analysis. We
report on our work to design and develop a software
framework for analyzing the collaborative editing
behavior of OpenStreetMap users when working on
the task of crisis mapping. Crisis mapping occurs
after a disaster or humanitarian crisis and involves
the coordination of a distributed set of users who
collaboratively work to improve the quality of the
map for the impacted area in support of emergency
response efforts. Our paper presents the challenges
related to the analysis of OpenStreetMap and how
our software framework tackles those challenges to
enable the efficient processing of gigabytes of
OpenStreetMap data. Our framework has already
been deployed to analyze crisis mapping efforts in
2015 and has an active development community.
1. Introduction
We live at a time when organizations of all kinds
increasingly have the means to generate, collect, and
analyze large volumes of data via software systems.
These systems—collectively known as data-intensive
software systems or “big data” systems—are
challenging to design, develop, and deploy [1]. One
application area that requires the development of
these systems is crisis informatics [12], which
investigates how social computing can impact the
practice of emergency management. Of particular
interest is the use of digital maps to support disaster
response, an activity known as crisis mapping.
However, the analytics of geospatial data are
especially challenging to resolve. This is because a)
map datasets tend to be extremely large—often
consuming terabytes or petabytes of information—
and b) map datasets are not good at conveying how
they were created. That is, for any given version of a
map, all one sees is the final aggregate map, not the
individual edits that were performed to create it. This
is what separates collaboratively-edited geospatial
data from collaboratively-edited text documents—
such as articles on Wikipedia—which can much more
easily display editing history across users.
In the new world of crowdsourced data
generation where information can be produced
quickly for open use, understanding the collaboration
that went into the construction of the map can be as
important as the map itself. This is especially true for
action-oriented communities, like the crisis mapping
community, that are trying to understand their
evolving work practices while they work to produce
maps that can be used to aid crisis response. These
communities seek to understand their work in situ to
improve upon it. Social computing researchers desire
the same understanding to both document what
digital crowds can achieve and with an eye towards
designing better tools to support that work in the
future. For the big data community, this type of
research is important, as it requires the novel use of
data analysis techniques both for the batch processing
of existing data sets as well as the real-time analysis
of edits that stream in during a crisis event.
In this paper, we report on the design and
development of a big data software framework that
can be used to analyze the edit history of
OpenStreetMap (OSM), making it possible to study
the cooperative work that occurs there, including but
not limited to the intensely collaborative periods of
crisis mapping where much is at stake for
humanitarian groups using these maps on the ground.
At the time of this writing, there are no other
frameworks that perform this type of analysis for
OSM data; indeed, use of our software framework
has been steadily increasing since its initial
deployment for studying and monitoring the mapping
activity surrounding the 2015 Nepal Earthquake
event. This increased use is the direct result of the
unique analysis capabilities our framework provides
on top of OSM data.
OpenStreetMap is an open geographic data
initiative that provides a map, and its associated
geospatial data, that anyone can contribute to and
access. Our software framework, known as epic-osm,
can scale to process gigabytes of OSM data by
employing a variety of techniques to both analyze
Figure 1: Port-Au-Prince, Haiti in OSM, before the 2010 earthquake (left) and 4 days after (right).
Remotely-located volunteer mappers added all features by tracing aerial imagery [8].
data for desired metrics and visualize the results in
ways that are meaningful to mappers themselves and
the larger OSM organization and community. It also
makes details of this enormous data-producing
organization [13] available to researchers in the way
that Wikipedia has been studied extensively for years
as a notable site of collaborative data production. Our
framework is more than just a design; our code is
available on GitHub and the software tools that have
been built on top of this framework are in active use.
Our experiences designing and implementing
this framework can be of use to others. We
demonstrate how to address the challenging data
modeling issues that arise in the design of dataintensive software systems [16], as well as issues of
extensibility, scalability, and interoperability.
1.1. Studying the OSM Community
With some notable exceptions [9], the majority
of research on the social organization of the OSM
community has been based upon qualitative research
methods such as participant observation, interviews,
and surveys. These studies have provided insights
into participant motivation \cite{Budhathoki2012}
and demographics \cite{Schmidt2013}. Our team
saw that examination of the OSM database itself—
which contains a complete record of every edit ever
made—is critical to the advancement of the 2.1M
OSM member organization, which needs to better
understand its production functions to manage its
growth [13], as well as for social computing
researchers to characterize the nature of cooperative
crisis mapping. Understanding the social processes
governing the creation of OSM data is especially
important for crisis informatics, since these
behavioral phenomena can affect the quality of the
geographic data produced. This can have real human
consequences as OSM is frequently used as the
primary base map in humanitarian response [18].
One likely reason that so little analytical research
of socio-behavioral phenomena in OSM has been
conducted (in comparison to the vastly-studied
Wikipedia organization) is the challenges of
manipulating OSM data. A complete download of the
OSM history database is over a terabyte in size and is
continuously growing as new edits are made. This
difficulty affects not only scholars, but also the OSM
community itself, which struggles to track its own
activity, and hence its growth and impact [13]. To
address this knowledge gap, we have identified a
number of OSM members who have been willing to
contribute to the development of the epic-osm
framework as well as deploy and test it for a range of
purposes. As will be discussed, this engagement has
helped push the development of our framework and
its surrounding toolset in new directions.
Furthermore, this has catalyzed discussion within the
OSM community about the need for new tools, as the
existing community toolset, prior to the creation of
our framework, is sparse and does not provide indepth analytical capabilities.
1.2. Crisis Informatics and OpenStreetMap
When a major disaster occurs, a subset of the
OSM community rapidly converges on the map
around the impacted geographical area. The first
well-documented case of this was after the 2010 Haiti
Earthquake, where what few mapping products did
exist were lost to the destruction of the office
buildings of the national mapping agency. The
international humanitarian responders converging
onto the scene needed accurate maps to perform their
work [18]. As depicted in Figure 1, hundreds of
remote mappers from all over the world dramatically
improved the digital map coverage of the affected
areas in a matter of days by digitally tracing aerial
imagery to build the map. This map then became the
primary resource used in relief efforts [18].
Known as high-tempo events, these activations
are of interest to the OSM community as a way to
understand and communicate its impact. It is also of
specific interest to crisis informatics researchers
because of the rapid, large-scale convergence of
“digital volunteers” from around the world, which
demonstrates new forms of collective behavior [7,
13]. However, to begin asking questions of how this
collaboration occurs, we must first create new tools
to access and explore the “site of work”—the
database supporting the map itself. This is the
motivation behind the development of epic-osm—to
create the first open framework for easily analyzing
the large OSM dataset. Initially developed to support
crisis informatics research, the use cases we will
discuss are abundant and the framework provides
great flexibility for all types of OSM research.
2. OpenStreetMap
Created in 2004 by students in the UK in
response to restrictive licensing on geographic data
\cite{Chilton2009a}, OSM has become the most
widely used platform for “volunteered geographic
information” \cite{ Elwood2008, Goodchild2007}.
OSM is supported by a worldwide network of
developers and volunteers committed to the open data
values of the platform. Today, OSM has over 2.1M
registered users, a small subset of whom are active
editors [10], and 2.9B individual geographic points
[11]. The website itself is a Ruby on Rails application
on top of a PostgreSQL database. OSM incorporates
an in-browser map editor and provides an API to
interface with external tools.
2.1. OSM Data Structure
Six domain-level data types are found in the
OSM database. Three of these primary objects
construct the map itself: nodes, ways, and relations.
Nodes are the most basic building blocks of the
database and represent single geographic points. A
way is composed of an ordered series of nodes,
representing a line or polygon. A relation is a
collection of nodes and/or ways, such as a country
border or a noncontiguous set of polygons. When an
object is first created, its version is set to “1.” Any
subsequent edit to that object will increment the
version number; such edits also track the user who
performed them and the changeset (discussed below)
to which this edit belongs. Representations of nodes,
ways, and relations are shown in Figure 2.
Beyond the primary map objects, the OSM
database contains changesets, users, and notes. A
changeset is the digital receipt associated with every
edit to the map. Each time a user commits their edits
to the database, a changeset is generated with
information about the editing session. The changeset
id is recorded with every map object it contains,
allowing a user to view a complete grouping of all
the objects edited within a single changeset.
A note object is a geographically-located
comment that a user adds to the map. These notes are
marked as either open or resolved and may contain a
comment thread as users discuss the note. Notes
document a discussion between users on how to
represent a feature on the map, which can be another
important element for understanding map creation.
The OSM user database contains the user display
name, a unique user id, and the date on which the
user created an account on openstreetmap.org. epicosm makes use of the date when a user creates an
account to determine their experience level with
OSM. This facilitates comparison of behavioral
differences between novice and experienced editors.
2.1. Tags
The descriptive, non-spatial characteristic of
each map object within OSM is a set of tags. These
are unrestricted key-value pairs that can be added to
any map object. An active wiki supports discussion
about best tagging practices for consistency within
the map, and editing tools offer default tag
suggestions, but there are no database rules to enforce
tagging schema or structure. For instance, Table 1
shows some of the top keys and common values for
OSM objects in the map for New York City at the
time of writing. From this table we can observe that
information regarding the building footprints and
heights for NYC is of major interest to the subset of
the OSM community mapping in NYC, and is
therefore not representative of all cities within OSM.
This highlights the non-uniform characteristics of
OSM contributions, calling for analysis tools that are
capable of handling this dynamic nature.
Table 1: Top Tags for OSM objects in NYC.
Objects
Key
Most-common values
w/ tag
66%
building
garage, house, school
64%
height
8.2, 8.0
13%
highway residential
11%
name
(various)
2%
amenity
parking, bicycle parking
(a) Node
A drinking fountain as a
single pair of coordinates.
lat: 40.7303993,
lon: -73.9970100,
version: 1,
tags: {
amenity:
drinking_water,
name:
Washington
Square}
(b) Way: Path
A series of 41 nodes which
create this footway
id: 197582876,
changeset: 31859815,
uid: 1306,
version: 2,
timestamp: 2015-0610T03:06:09Z,
tags: {
highway: footway}
(c) Way: Building
Series of 4 nodes that
outline the arch
id: 248166269,
tags: {
building: yes,
height: 20.5,
name:
Washington
Square Arch
tourism:
attraction}
(d) Relation: Path
A collection of 3 ways
creating a footway
members: [
{way: archId},
{way: poolId},
{way: parkId}
]
tags: {
highway:
pedestrian}
Figure 2: OSM Objects as Rendered on openstreetmap.org. Each object shows various aspects of
possible metadata (truncated) associated with OSM objects. Data © OpenStreetMap contributors.
Map rendering software then uses these tags to
properly display an object. For example, a way
tagged with {“highway”:“pedestrian”} represents
a path, while a way tagged as {“building”:”yes”}
represents a building. Examples can be seen in Fig. 2.
The importance of tags in OSM analysis cannot
be overstated. However, given the open and dynamic
nature of tags and tagging practices as the map
evolves, an analysis tool must be robust to handle
filtering by tags. For example, it is common for
current OSM analyses to report summary statistics of
OSM data by reporting on the number of new nodes
added to the database. However, reporting that
956,725 nodes were added to the map in the month
after the 2010 Haiti earthquake reveals very little
about the manner in which the collaborative mapping
was achieved. Filtering and sorting intelligently with
tags instead can achieve results like this:
“308 users added 40,067 roads to the map and 162
users added 20,696 buildings to the map. 148 of these
users were the same, adding buildings and roads.”
Even this first-step expansion is a much richer
summary of user contributions. The requirement,
therefore, to develop a framework that is tag-aware is
critical in understanding the creation of the map. As a
result, epic-osm has advanced support for tags, and a
mechanism for incorporating knowledge about the
types of tags that the OSM community uses to create
its maps (see Section 3.5). It can use this mechanism
to find “all buildings” in a region even though
different users tag buildings in different ways.
2.2. Planet Files
OSM provides its data in a common XML
format via a RESTful API. Unfortunately for our
analysis, this data represents the current state of the
map, or the most recent version of the map objects,
which, as we discussed above, is not of primary
interest to those who study crisis mapping and the
creation of the map itself. More useful are the “fullhistory planet files” that OSM strives to make
available for download on a weekly basis. These files
are bulk exports of the complete OSM database
containing every edit to every object. Available in the
Google protocol buffer format (PBF), these files are
about 60gb in size, whereas the uncompressed history
database in the OSM XML format is over a terabyte
in size. While the PBF exports make obtaining the
full history easier, working with the files requires
specific knowledge of the file format and structure,
and is computationally intensive to manipulate. This
creates a requirement for an analysis framework: any
OSM analytical framework must be able to handle
the processing of full-history PBF files, which will
continually grow in size as the OSM community
continues to work.
3. epic-osm Framework
This
section
describes
the
current
implementation of the framework and its features.
epic-osm has supported crisis informatics research
throughout its development. This iterative, domaindriven approach to development has been shown to
be useful when creating data-intensive systems
\cite{Barrenechea2015a}. As we refined our OSM
research questions, the framework was adapted and
refactored to support the processing of those
questions. This agile development process has
enhanced the usability and capabilities of the
framework, thus supporting a main design goal which
was to encourage the adoption and use of the
framework among the many different communities
interested in better understanding OSM data and
mapping practices.
time buckets for sorting the data returned. All queries
return arrays of the form:
[{start_of_aw, bucket_end, results},
{bucket_start, bucket_end, results},
...,
{bucket_start, end_of_aw, results}]
The first bucket will always start at the
beginning of the analysis window and will end on the
first unit of analysis after that. For example, if the
unit were specified as “month” and the analysis
window started at 2014/06/15, then the first bucket
would include results from this date up to
2014/07/01. The second bucket would include all
data for the range 2014/07/01 to 2014/08/01. This
design decision ensures that the colloquial units of
analysis make sense. If a user is looking to perform
an analysis on months, then their results are returned
in time buckets of the common month, not a grouping
of 28 days starting from the beginning of the analysis
window. In the event no unit of analysis is specified,
then a query will return an array with one item:
[ {start_of_aw, end_of_aw, results} ]
3.1. Features
The central object in our software framework is
called an analysis window (aw). This is a spatiotemporal bounding box for a researcher’s given
geographic area and time frame of interest. All data
analyses operate within the scope of an analysis
window. An analysis window is thus defined by
specific start and end times and a set of polygonal
geographic bounding boxes; in addition, an analysis
window includes the queries to be performed on that
subset of the database and other metadata such as the
the contact person and associated data directories.
The framework does not limit the size or
timeframe of an analysis window. However, we
recommend working with a bounded analysis,
especially during initial research. Since OSM is home
to many different types of mappers with a great deal
of variance around mapping practices, careful
boundedness in space and time will yield results that
are easier to interpret; one can then build on those
results with progressively larger bounds, if desired.
3.2. Queries
Queries are associated with a specific analysis
window and a specific temporal unit of analysis.
Since every OSM object has a date and time
associated with its creation, all queries return data
sorted by these common features. A specific time unit
for analysis can currently be set to hour, day, month,
and year. These increments are then used to create
The framework is therefore designed to treat
time as the default structure for analysis. This design
decision supports the current practices in crisis
informatics research and other observers of time- and
safety-critical events. This makes our framework
unique in comparison to other OSM data services that
return the map data as it exists in real-time such as
the official OSM API. These services are designed to
deliver up-to-date geospatial data and map rendering,
while epic-osm is designed for analysis of user
contributions within a given period of time.
Furthermore, this ensures the results that are
returned by queries represent individual edits, not
necessarily distinct map objects. In other words, the
same map objects with different versions may appear
across multiple buckets of returned results. This
allows users to explore the creation of the map by
tracking changes to individual objects through time.
3.3. Conceptual Framework
In Figure 3, we show the semantic relationships
between the various data objects in our framework.
Note
User
Node
+
OSMObject
Way
*
+
Relation
Changeset
*
Figure 3: The domain objects of epic-osm.
The root class is OSMObject; it has attributes such as
geometry, date created, user id, object id, and version
number. Each OSMObject has an associated user
who edited that particular version of that particular
object. Nodes, ways, relations, and changesets are all
subclasses of OSMObject.
The UML diagram shows that ways consist of
one or more nodes and relations consist of some
number of nodes and ways. While in practice this is
true, our analysis framework performs extra work
during import to ensure that each of these objects
stands on its own. In particular, when importing a
way, we traverse all of its associated nodes and
embed the geographic information of those nodes in
the way itself. We do the same thing for a relation,
accessing all of its associated nodes and/or ways and
embedding these objects into the relation itself.
Therefore, when epic-osm performs a query on ways
or relations, the query only has to access way or
relation objects in epic-osm’s persistence layer.
The decision to perform this extra work during
import was twofold: a) improving run-time
performance and b) reducing complexity during
analysis. With respect to the former, we did not want
to incur a run-time penalty during an analysis
workflow spending time accessing a way or relation’s
constituent parts. With respect to the latter, users may
edit attributes of either the way or relation itself, or
the nodes and/or ways associated with it. In such
cases, the associated objects may not be aware of
these changes. To properly reconstruct the object
requires resolving the geometries based on dates and
changeset ids and “burning-in” the geometry as it
existed in that specific version of a way or relation.
We determined it was best to absorb this
computational cost just once during import. This type
of tradeoff is common in the design of big data
software frameworks.
Changesets contain information about the editing
session such as a geographical bounding box of the
extents of the user’s edits and the length of the
editing session. Changesets themselves are unaware
of the objects contained within the editing session,
but the edited objects contain the changeset id of the
changeset in which they were edited, allowing these
relationships to be established after the fact. Note:
although the semantics of our UML diagram allow
changesets to include other changesets, this does not
happen in practice: each changeset stands on its own
and does not reference other changesets. Finally, our
notes class contains attributes that allow OSM notes
to be retrieved from the database and analyzed.
Figure 4 presents the framework classes that are
used to perform an analysis at run-time. An instance
of EpicOSM acts as a controller for the analysis
QuestionAsker
EpicOSM
DatabaseConnection
AnalysisWindow
+
Query
NodeQuery
WayQuery
RelationQuery
NoteQuery
ChangesetQuery
UserQuery
Figure 4: The run-time objects of epic-osm.
session, creating the requested analysis window,
asking it to connect to the database, and invoking its
associated queries. The QuestionAsker acts as a
proxy for the user who invoked epic-osm, and can
influence where the results of the analysis are stored,
provide other metadata about the invoking user, or
further process the results of the invoked queries. The
classes in Figures 3 and 4 are connected because
query objects return instances of the domain objects.
Thus, node queries will return instances of nodes that
can then be further analyzed.
3.4. Current Technology Stack
In keeping with OSM’s mission of open
geospatial data, our framework is built on open
source technologies. The logic of the framework is
currently written in Ruby and is supported by a
variety of open source libraries, developed by the
greater OSM community and available on GitHub,
for processing and importing OSM planet files.
Given the importance of OSM object tags and
their key-value structure, we chose to use a NoSQL
document database, MongoDB, with inherent keyvalue support for persistence. Mongo stores each
domain-level OSM object in namesake collections
(i.e., nodes, ways, relations, etc.). Common fields
such as date created, user id, changeset id, and
geometry are indexed by MongoDB to speed up most
queries; specific tags such as “highway” or
“building” are indexed as well to support queries
against these objects of interest.
3.4. Flexible Query Language
To support the goal of extensibility, our
framework makes use of metaprogramming
techniques [14] to avoid binding clients of the
framework to a particular set of metrics and query
methods. Metaprogramming facilities have been a
part of programming languages for many years and
include techniques such as “monkey patching” in
Ruby, Python, and Javascript and key-value
observing in Objective C. In epic-osm, we make use
of a feature provided by the Ruby run-time system
known as “method missing.” This feature is invoked
whenever a client calls a method on an object that
does not have an implementation of that method
either within itself, its included modules, or its
superclasses. Though normally this situation would
generate an exception that can crash a running
program, Ruby’s runtime instead calls the object
again this time on a method called method_missing. It
passes to this method a description of the method the
client was trying to invoke. If that object has an
implementation of method_missing and it can handle
the processing of the failed call, the call will instead
succeed. If it cannot handle the invocation, then,
finally, an exception will be raised.
In epic-osm, almost all querying-related methods
are handled by method_missing. This convention
allows us to handle a wide range of possible queries
that can be expressed using a domain-specific
language that our method parses at run-time and
allows for new queries to be added in an incremental
fashion. For instance, a call to the method
nodes_x_year will be interpreted by an analysis
window as a request to return all edited nodes that
fall within its constraints, grouped by year. That same
functionality (retrieving all nodes) can be invoked but
have the data grouped in a different way by simply
calling the method with a different argument after the
‘x’, i.e. nodes_x_month or nodes_x_day.
Since the desired structure of the results is
defined by the name of the function, arguments
passed to the queries are for further filtering of the
results and are passed through epic-osm to MongoDB
unaltered. This allows users to take advantage of
MongoDB query capabilities in their own epic-osm
queries. For instance, the query: ways_x_month(
constraints: {“tags.highway” => “pedestrian”
}) will return every version of a way which
represents a pedestrian footpath which was edited or
created within the analysis window, grouped into
months. In this example, epic-osm handles grouping
the results of the query into months while MongoDB
finds all of the relevant ways while ensuring that all
returned ways have a tag called “highway” with the
value “pedestrian.” For improved performance, users
can externally index the underlying MongoDB
collections to support common queries.
3.5. Question Modules
As shown in Figure 4, query objects target a
specific type of domain object: Node queries return
nodes while note queries return notes. This modular
design allows analysts to focus their queries on just
the domain objects they need. However, many
questions require querying multiple types of objects.
epic-osm provides this type of query via the use of
Ruby’s support for modules.
A specific module is created that contains all of
the code that is needed to query across multiple types
of domain objects; this module exports a single
method that can then be invoked on an analysis
window to execute the query at run-time. As an
example, consider the need to ask an analysis
window about the number of schools that were edited
within its geo-temporal bounding box. For this
particular query, it is important to check both nodes
and ways to find all possible schools “hiding” in the
map. According to OSM’s community guidelines, the
best practice for marking a school on the map is with
the tag: {“amenity”: “school”}. However, the
actual OSM object that should contain this tag is not
strictly defined. Mappers are encouraged to use an
area (a polygon comprised of a closed way) that
outlines the school’s geographic footprint; however,
the Wiki also states that mappers can “place a node in
the middle of the site if [he or she is] in a hurry”
(wiki.openstreetmap.org/wiki/Tag:amenity=school).
As a result, the question of “how many schools
were mapped during the analysis window” becomes
far more complicated than a simple query for objects
with the school amenity tag. Instead, one must query
both the ways and the nodes collection, identify
distinct versions of interest and then resolve any
geographic overlap in which both a node and a way
mark the same school. To illustrate this, Table 2
shows the results of this query for the 2010 Haiti
Earthquake across different types of OSMObjects
and shows how the numbers change when accounting
for geographic overlaps:
Table 2: Differences in use of “school” tag.
Query: “amenity”: Nodes Ways Geo“school”
Unique
Added
145
41
166
Edited
32
27
57
Unique Sum
146
52
173
Ultimately, one may conclude that 173 schools
were edited in Haiti within OSM in the month
following the 2010 Haiti Earthquake.
As mentioned, these more complex queries are
isolated into Ruby modules—that epic-osm calls
question modules since they contain all the code
needed to ask a particular, complex question—that
are then accessed via a single method with all support
code cleanly hidden away from the main classes of
the framework.
If OSM community guidelines change for a
particular tag, just the code in the relevant module
has to change in response. If one analyst has a
broader (or more narrow) definition of what
constitutes a particular entity, they can create their
own module for finding instances of that entity.
These modules can then be easily shared and plugged
into any instance of the framework.
This is important because defining questions
such as “how many schools were edited” as shown
above are not immediately straightforward, so turning
that question into a single method within a reusable
module ensures that all users abide by the same rules
when querying the data.
This modular design has also affected the
development process by encouraging developers to
write many questions in separate modules and then
refactor common helper functions into the analysis
window to make them available to all other question
modules, thereby making the functionality provided
by the core objects more powerful over time.
4.1. Persistence Layer
As mentioned above, MongoDB is used to store
OSM history data and to perform the bulk of the
work with respect to the queries that users specify.
Storing the history data in this way allows users to
have the flexibility to easily track changes to their
queries over time. For example, a user may define an
analysis window for their hometown over the past
month. With each new month, they can create a new
analysis window with the same geographical bounds,
but with new start and end dates. As the user learns
more about their data through defining new
questions, persistence of previous analysis windows
allows them to rerun those questions without having
to re-import the underlying data. Furthermore, using
a database ensures that the size of objects referenced
by an analysis window can scale beyond the physical
memory constraints of a user’s machine. While
MongoDB was selected for its ease of use and
deployment, any key-value store or document store
could be used as the persistence layer for epic-osm.
4. Implementation
4.2. Output
Above, we presented the concepts and
capabilities contained in the epic-osm framework.
Here, we discuss how we have created a set of tools
that use the framework and some of their
implementation-related concerns. The advantage of
creating a framework that can be incorporated into a
wide range of tools is the large number of analysis
use cases that can then be supported. Our initial set of
tools handles the processing of a large amount of
OSM data via the use of batch processing. First,
command line tools are used to download and import
OSM history data into MongoDB. Second, an input
file is used to specify the parameters of a desired
analysis window along with the desired queries.
Third, a command line tool was created to read the
input file, create instances of the objects shown in
Figure 4, and kick off the processing of the specified
queries. The output of that process is a directory of
easily read JSON files. This straightforward set of
tools and components can be used to process
gigabytes of map data, ensuring scalability.
It is important to note that this same framework
can be incorporated into a web application and be
used to dynamically query MongoDB in response to
user commands; indeed, we plan to develop such
tools and, as we discuss later, we have already made
changes to the framework to allow for more real-time
processing of OSM data by analysis windows. Next,
we discuss a few additional implementation-related
concerns in more detail.
In an effort to support interoperability via many
types of analysis and by not forcing OSM researchers
to use a single tool, epic-osm writes output to a predefined file structure: a series of JSON files. These
files can then be easily parsed and visualized by a
variety of libraries and analysis tools, leaving the
visual inspection and analysis environment open to a
user’s preference. Currently, we build a static website
from these JSON files that can be used to view and
easily share the results of the analysis but many other
options for how to make use of these files from more
interactive web-based dashboards to network analysis
toolsets are being pursued, both by our group and the
OSM community. These multiple pursuits validate
our design decision to create a common output
directory of single JSON files.
5. Use of the Framework
At the time of this writing, our framework has
supported academic research by our group as well as
OSM community members. The initial release was in
support of our post hoc research on the growth of the
OSM organization between 2010 and 2014 in
response to two distinct humanitarian events [13].
This required the processing of a month’s worth of
historical OSM data for each event, consisting of
edits by nearly 500 users and 1500 users,
respectively. Since then, the framework has been
Figure 5: Count of OSM Changesets and Users.
Graph shows the by-hour contributions to the map
of Nepal after the April 25, 2015 earthquake.
available on GitHub and has been forked, contributed
to, and adapted to support real time analysis and
statistics of specific OSM mapping events.
For example, MapGive, a mapping initiative
sponsored by the U.S. State Department, used epicosm to visualize results of a competition between two
universities to see which could create more data
(mapgive.state.gov/events/mapoff). Additionally, it
was deployed to monitor the first-ever mapping event
at the White House (mapgive.state.gov/whmapathon).
Another project, moabi.org, is also running an
instance of the framework to monitor the mapping of
logging roads in the Congo (loggingroads.org). The
statistics are used to populate a “leaderboard”
showing the highest-contributing users.
5.1. Nepal Earthquake Deployment &
Improvements for Real Time Analysis
On April 25, 2015 a 7.8 magnitude earthquake
struck central Nepal, killing over 8,500 people and
destroying over 500,000 homes. Due to previous
OSM work in the country [17], the city of
Kathmandu was already mapped in detail. Yet many
of the affected rural areas outside of Kathmandu were
not well covered on the map. In what is believed to
be the largest convergence of OSM mapping activity
to date, over 7,000 contributors from all over the
world mapped roads, buildings, and other features.
Our team deployed an instance of epic-osm
immediately following the earthquake, which proved
to be a valuable test case. A real-time import module
developed by an epic-osm contributor that interfaces
with a newly available OSM changeset streaming
service (github.com/osmlab/osm-meta-util) supported
this instance. Figure 5 illustrates this impressive
convergence as tracked by epic-osm, showing the
number of users editing and the number of
changesets created per hour for the weeks following.
However, tracking this huge mapping activity in
real-time exposed a problem. Designed to be a static
snapshot in time that reads historical edits from a
database, the analysis window could mimic near-real
time results by running new queries every 10 minutes
with bounds that spanned the time from the event to
the current time. This solution worked well until the
second day when the database had grown so large
that the time it took to run the queries was longer
than 10 minutes, creating a backlog.
To resolve this problem, we added a new feature:
a rolling analysis window that would update the
analysis window’s constraints at each run to start at
the top of the hour and end at the current time, thus
never querying more than an hour’s worth of data.
These results were then output to separate directories,
which could be iterated over to create the new totals.
As a result, the framework was able to support a
website providing visualizations of edits over the past
hour. This site received over 1,700 unique visitors
from 79 countries in the first week and was the OSM
community’s primary tool during the response for
tracking its activity. This ad hoc solution worked in
this particular use-case, but more importantly,
exposed the weaknesses in the framework for similar
use cases, which have since generated great interest
in the OSM community.
6. Extensibility and Future Development
The desire to support both historical and realtime analysis of user contributions to OSM is strong
across both industry and academia. At a June 2015
OSM conference (The State of the Map US) held in
New York City, OSM users from the Red Cross, the
US State Department, and three digital cartographyoriented start-up companies held a Birds-of-a-Feather
discussion on the need for developing and supporting
analysis tools such as epic-osm.
6.1. Stream Processing
The real time tracking of mapping activity in
response to the Nepal earthquake identified a very
powerful use-case for epic-osm that will significantly
influence the next development iteration, specifically
the ability to process the edits to the map as an
incoming stream directly, instead of first importing to
a database and extracting distinct time chunks. We
will use contemporary big data solutions such as
Apache Spark and its streaming capabilities to
achieve better real time performance.
6.2. Database Improvements
With an emphasis on stream processing, the role
of the persistence layer will also change in the next
iteration. New user-level models will need to be
developed to track mapping behavior, while the
persistence of the individual object edits should also
be preserved for later analysis, should users desire to
perform new queries post-event. Alternative geospatial database technologies will be explored as
well, which may improve query performance for
geographic oriented analysis, such as “how many
kilometers of a road did a particular user map?”
the Case of OpenStreetMap. American Behavioral
Scientist. 2013. 54(5): 548-575.
[4] Chilton, S. Crowdsourcing is Radically Changing the
Geodata Landscape: Case Study of OpenStreetMap. In
Proc. of UK Cartographic Conference, 2009.
[5] Elwood, S. Volunteered Geographic Information:
Future Research Directions Motivated by Critical,
Participatory, and Feminist GIS. GeoJournal, 72(3&4):
173–183, 2008.
[6] Goodchild, M. Citizens as Sensors: The World of
Volunteered Geography. GeoJournal, 69(4): 211–221,
2007.
7. Conclusions
We have presented and discussed the design of
epic-osm, the first full software framework to support
the analysis of volunteered geographic information
contributed to OSM. The framework was initially
developed to support crisis informatics research
surrounding the production of map data in two major
crisis events, and has continued to grow and gain
exposure to a larger community of developers and
mappers alike, with hopes of allowing the entire
OSM community to better reflect on its production of
open geographical data. Our framework makes use of
a number of techniques to efficiently handle large
volumes of OSM data and serves as an example of
how to design frameworks for data-intensive
software systems. We believe that our framework,
our lessons learned from initial deployments, and our
iterative development approach, which is deeply
grounded in empirical knowledge of a target
domain—in this case, crisis mapping—will be of use
to other designers and researchers of data-intensive
software systems.
8. Acknowledgments
This material is based upon work sponsored by
NSF Grants IIS-0910586 and IIS-1524806. We thank
the OSM community for their involvement,
particularly Mikel Maron of MapGive, Humanitarian
OpenStreetMap Team and Kathmandu Living Labs.
9. References
[1] Anderson, K. Embrace the Challenges: Software
Engineering in a Big Data World. In Proc. 1st International
Workshop on Big Data Software Engineering, Part of 2015
Intl. Conf. on Software Engineering, pp. 19-25, IEEE.
[2] Barrenechea, M., Anderson, K., Palen, L., and White, J.
Engineering Crowdwork for Disaster Events: The HumanCentered Development of a Lost-and-Found Tasking
Environment. In Proc. 48 Hawaii International Conference
on System Sciences, pp. 182-191, IEEE 2015.
th
[7] Keegan, B. Breaking News on Wikipedia: Dynamics,
Structures, and Roles in High-Tempo Collaboration.
In Proc. of CSCW Companion, pp. 315-318, 2012.
[8] Maron, M. Haiti OpenStreetMap Response. Blog. Jan.
14, 2010. Retrieved May 20, 2015 from
http://brainoff.com/weblog/2010/01/14/1518
[9] Mooney, P., and Corcoran, P. Analysis of Interaction
and Co-editing Patterns amongst OpenStreetMap
Contributors. Transactions in GIS, 18(5): 633–659, 2014.
[10] Neis, P., & Zipf, A. Analyzing the Contributor
Activity of a Volunteered Geographic Information
Project—The Case of OpenStreetMap. ISPRS International
Journal of Geo-Information, 2012.
[11] OpenStreetMap Statistics. Accessed June 14, 2015.
http://www.openstreetmap.org/stats/data_stats.html.
[12] Palen, L. and Liu, S. Citizen Communications in
Crisis: Anticipating a Future of ICT-Supported
Participation, In Proc. of 2007 Conference on Human
Factors in Computing Systems, pp. 727-736.
[13] Palen, L., Soden, R., Anderson, J. and Barrenechea,
M. Success and Scale in a Data-Producing Organization:
The Socio-Technical Evolution of OpenStreetMap in
Response to Humanitarian Events In Proc. of 2015 Conf. of
Human Factors in Computing Systems, pp. 4113-4122.
[14] Perrotta, P. Metaprogramming Ruby 2. The Pragmatic
Programmers, LLC. 262 pages. 2014.
[15] Schmidt, M. and Klettner, S. Gender and Experiencerelated Motivators for Contributing to OpenStreetMap. In:
Mooney, P. and Rehrl, K. (eds). International Workshop on
Action and Interaction in Volunteered Geographic
Information, pp. 13–18, 2013.
[16] Schram, A. and Anderson, K. MySQL to NoSQL:
Data modeling challenges in supporting scalability. In Proc.
of 3 Conf. on Systems, Programming, Languages and
Applications: Software for Humanity, pp. 191–202, 2012.
rd
[17] Soden, R., Budhathoki, N., Palen, L. Resilience and
the Crisis Informatics Agenda: Lessons Learned from Open
Cities Kathmandu. In Proc. of Conf. on Information
Systems for Crisis Response and Management, 2014.
[18] Soden, R., & Palen, L. From Crowdsourced Mapping
to Community Mapping: The Post-Earthquake Work of
OpenStreetMap Haiti. In Proc of The 11 Intl. Conference
of the Design of Cooperative Systems, 2014.
th
[3] Budhathoki, N., and Haythornthwaite, C. Motivation for
Open Collaboration Crowd and Community Models and