$$
\begin{bmatrix} s_x & \theta_x & t_x \\ \theta_y & s_y & t_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
$$
Two-dimensional coordinates can then easily be transformed by simply multiplying them by the transformation matrix. Affine transformations also have the useful property that they can be composed together using matrix multiplication. This means that to perform a series of affine transformations, the transformation matrices can first be multiplied together only once, and the resulting matrix can be used to transform coordinates. matplotlib's transformation framework automatically composes (freezes) affine transformation matrices together before transforming coordinates to reduce the amount of computation. Having fast affine transformations is important, because it makes interactive panning and zooming in a GUI window more efficient.

[1] We could also go one step further and draw text using draw_path, removing the need for the draw_text method, but we haven't gotten around to making that simplification. Of course, a backend would still be free to implement its own draw_text method to output real text.
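To make the composition concrete, here is a small sketch using matplotlib's public Affine2D class; the scale and translation values are arbitrary, chosen only to show that two affine transforms collapse into a single 3x3 matrix before any points are transformed.

```python
import numpy as np
from matplotlib.transforms import Affine2D

# Two affine transforms: a scale followed by a translation.
scale = Affine2D().scale(2.0, 0.5)
shift = Affine2D().translate(1.0, -1.0)

# Composing them yields another affine transform; its 3x3 matrix is the
# product of the two matrices, so points only need one multiplication.
composed = scale + shift
print(composed.get_matrix())

points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 3.0]])
print(composed.transform(points))
```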
Figure 11.5: The same data plotted with three different non-affine transformations: logarithmic, polar and Lambert
Non-affine transformations in matplotlib are defined using Python functions, so they are truly arbitrary. Within the matplotlib core, non-affine transformations are used for logarithmic scaling, polar plots and geographical projections (Figure 11.5). These non-affine transformations can be freely mixed with affine ones in the transformation graph. matplotlib will automatically simplify the affine portion and only fall back to the arbitrary functions for the non-affine portion.
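As a rough illustration of how a non-affine transformation is just ordinary Python code, the sketch below subclasses matplotlib's Transform base class; the signed square-root mapping is invented for the example and is not a transform that ships with matplotlib.

```python
import numpy as np
from matplotlib.transforms import Transform

class SignedSqrtTransform(Transform):
    """A toy non-affine transform: signed square root of both coordinates."""
    input_dims = 2
    output_dims = 2
    is_separable = True
    has_inverse = False

    def transform_non_affine(self, values):
        # Called with an (N, 2) array of points; any affine part of the
        # pipeline is handled separately by the framework.
        values = np.asarray(values, dtype=float)
        return np.sign(values) * np.sqrt(np.abs(values))

points = np.array([[4.0, -9.0], [0.25, 1.0]])
print(SignedSqrtTransform().transform(points))
```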
From these simple pieces, matplotlib can do some pretty advanced things. A blended transformation is a special transformation node that uses one transformation for the x axis and another for the y axis. This is of course only possible if the given transformations are separable, meaning the x and y coordinates are independent, but the transformations themselves may be either affine or non-affine. This is used, for example, to plot logarithmic plots where either or both of the x and y axes may have a logarithmic scale. Having a blended transformation node allows the available scales to be combined in arbitrary ways. Another thing the transform graph allows is the sharing of axes. It is possible to link the limits of one plot to another and ensure that when one is panned or zoomed, the other is updated to match. In this case, the same transform node is simply shared between two axes, which may even be on two different figures. Figure 11.6 shows an example transformation graph with some of these advanced features at work: axes1 has a logarithmic x axis; axes1 and axes2 share the same y axis.
Figure 11.6: An example transformation graph (nodes include figure, axes1, axes2, log x, linear x, linear y, blended1, blended2, data1, data2, and display)
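A minimal sketch of the situation Figure 11.6 describes, assuming nothing beyond the standard pyplot API: two axes share their y transform, and the first uses a logarithmic x scale.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.logspace(0, 3, 100)
y = np.sqrt(x)

# axes1 and axes2 share the same y-axis transform node; panning or zooming
# one of them in y updates the other automatically.
fig, (axes1, axes2) = plt.subplots(1, 2, sharey=True)
axes1.set_xscale('log')     # blended transform: log x, linear y
axes1.plot(x, y)
axes2.plot(x, y)
plt.show()
```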
11.5 The Polyline Pipeline
When plotting line plots, there are a number of steps that are performed to get from the raw data to
the line drawn on screen. In an earlier version of matplotlib, all of these steps were tangled together.
They have since been refactored so they are discrete steps in a path conversion pipeline. This
allows each backend to choose which parts of the pipeline to perform, since some are only useful in
certain contexts.
Figure 11.7: A close-up view of the effect of pixel snapping. On the left, without pixel snapping; on the right, with pixel snapping.
Transformation: The coordinates are transformed from data coordinates to figure coordinates. If this is a purely affine transformation, as described above, this is as simple as a matrix multiplication. If this involves arbitrary transformations, transformation functions are called to transform the coordinates into figure space.
Handle missing data: The data array may have portions where the data is missing or invalid.
The user may indicate this either by setting those values to NaN, or using numpy masked arrays.
Vector output formats, such as PDF, and rendering libraries, such as Agg, do not often have a
concept of missing data when plotting a polyline, so this step of the pipeline must skip over
the missing data segments using MOVETO commands, which tell the renderer to pick up the pen
and begin drawing again at a new point.
Clipping: Points outside of the boundaries of the figure can increase the file size by including many invisible points. More importantly, very large or very small coordinate values can cause overflow errors in the rendering of the output file, which results in completely garbled output. This step of the pipeline clips the polyline as it enters and exits the edges of the figure to prevent both of these problems.
Snapping: Perfectly vertical and horizontal lines can look fuzzy due to antialiasing when their centers are not aligned to the center of a pixel (see Figure 11.7). The snapping step of the pipeline first determines whether the entire polyline is made up of horizontal and vertical segments (such as an axis-aligned rectangle), and if so, rounds each resulting vertex to the nearest pixel center. This step is only used for raster backends, since vector backends should continue to have exact data points. Some renderers of vector file formats, such as Adobe Acrobat, perform pixel snapping when viewed on screen.
Simplification: When plotting really dense plots, many of the points on the line may not actually be visible. This is particularly true of plots representing a noisy waveform. Including these points in the plot increases file size, and may even hit limits on the number of points allowed in the file format. Therefore, any points that lie exactly on the line between their two neighboring points are removed (see Figure 11.8). The determination depends on a threshold based on what would be visible at a given resolution specified by the user.
Figure 11.8: The figure on the right is a close-up of the figure on the left. The circled vertex is automatically removed by the path simplification algorithm, since it lies exactly on the line between its neighboring vertices, and therefore is redundant.
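The missing-data and simplification steps are the ones most visible to users in practice. The short sketch below (with made-up data) shows NaN values producing a pen-up gap in the drawn line, and the rcParams that control path simplification for raster output.

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# A dense, noisy waveform with a block of missing samples.
x = np.linspace(0, 10, 100_000)
y = np.sin(2 * np.pi * x) + 0.01 * np.random.randn(x.size)
y[40_000:45_000] = np.nan      # missing data becomes a MOVETO (pen up) in the path

# Path simplification: drop vertices that would not change the visible line.
mpl.rcParams['path.simplify'] = True
mpl.rcParams['path.simplify_threshold'] = 0.1

fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()
```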
11.6 Math Text
Since the users of matplotlib are often scientists, it is useful to put richly formatted math expressions directly on the plot. Perhaps the most widely used syntax for math expressions is from Donald Knuth's TeX typesetting system. It's a way to turn input in a plain-text language like this:
\sqrt{\frac{\delta x}{\delta y}}
into a properly formatted math expression like this:
$$
\sqrt{\frac{\delta x}{\delta y}}
$$
matplotlib provides two ways to render math expressions. The first, usetex, uses a full copy of TeX on the user's machine to render the math expression. TeX outputs the location of the characters and lines in the expression in its native DVI (device independent) format. matplotlib then parses the DVI file and converts it to a set of drawing commands that one of its output backends then renders directly onto the plot. This approach handles a great deal of obscure math syntax. However, it requires that the user have a full and working installation of TeX. Therefore, matplotlib also includes its own internal math rendering engine, called mathtext.
mathtext is a direct port of the TeX math-rendering engine, glued onto a much simpler parser written using the pyparsing [?] parsing framework. This port was written based on the published copy of the TeX source code [?]. The simple parser builds up a tree of boxes and glue (in TeX nomenclature) that are then laid out by the layout engine. While the complete TeX math rendering engine is included, the large set of third-party TeX and LaTeX math libraries is not. Features in such libraries are ported on an as-needed basis, with an emphasis on frequently used and non-discipline-specific features first. This makes for a nice, lightweight way to render most math expressions.
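For instance, the expression shown above can be placed in any matplotlib text element; by default it is rendered by the internal mathtext engine, and flipping the text.usetex rcParam hands the same string to an external TeX installation instead (assuming one is available).

```python
import matplotlib
import matplotlib.pyplot as plt

# False (the default) uses the internal mathtext engine; True requires a
# working TeX installation on the user's machine.
matplotlib.rcParams['text.usetex'] = False

fig, ax = plt.subplots()
ax.set_title(r'$\sqrt{\frac{\delta x}{\delta y}}$')
ax.plot([0, 1], [0, 1])
plt.show()
```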
11.7 Regression Testing
Historically, matplotlib has not had a large number of low-level unit tests. Occasionally, if a serious bug was reported, a script to reproduce it would be added to a directory of such files in the source tree. The lack of automated tests created all of the usual problems, most importantly regressions in features that previously worked. (We probably don't need to sell you on the idea that automated testing is a good thing.) Of course, with so much code and so many configuration options and interchangeable pieces (e.g., the backends), it is arguable whether low-level unit tests alone would ever be enough; instead we've followed the belief that it is most cost-effective to test all of the pieces working together in concert.
To this end, as a first effort, a script was written that generated a number of plots exercising various features of matplotlib, particularly those that were hard to get right. This made it a little easier to detect when a new change caused inadvertent breakage, but the correctness of the images still needed to be verified by hand. Since this required a lot of manual effort, it wasn't done very often.
As a second pass, this general approach was automated. The current matplotlib testing script
generates a number of plots, but instead of requiring manual intervention, those plots are automatically
compared to baseline images. All of the tests are run inside of the nose testing framework, which
makes it very easy to generate a report of which tests failed.
Complicating matters is that the image comparison cannot be exact. Subtle changes in versions of the Freetype font-rendering library can make the output of text slightly different across different machines. These differences are not enough to be considered wrong, but are enough to throw off any exact bit-for-bit comparison. Instead, the testing framework computes the histogram of both images, and calculates the root-mean-square of their difference. If that difference is greater than a given threshold, the images are considered too different and the comparison test fails. When tests fail, difference images are generated which show where on the plot a change has occurred (see Figure 11.9). The developer can then decide whether the failure is due to an intentional change and update the baseline image to match the new image, or decide the image is in fact incorrect and track down and fix the bug that caused the change.
Figure 11.9: A regression test image comparison. From left to right: a) the expected image, b) the result of broken legend placement, c) the difference between the two images.
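A sketch of what such a test looks like using matplotlib's testing helpers; the baseline name simple_plot is hypothetical and would correspond to images stored in the test suite's baseline directory.

```python
from matplotlib.testing.decorators import image_comparison
import matplotlib.pyplot as plt


# The decorator renders the figure with each listed backend, compares the
# result against the stored baseline, and fails if the difference exceeds
# the tolerance.
@image_comparison(baseline_images=['simple_plot'], extensions=['png', 'pdf', 'svg'])
def test_simple_plot():
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [4, 5, 6])
```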
Since different backends can contribute different bugs, the testing framework tests multiple backends for each plot: PNG, PDF and SVG. For the vector formats, we don't compare the vector information directly, since there are multiple ways to represent something that has the same end result when rasterized. The vector backends should be free to change the specifics of their output to increase efficiency without causing all of the tests to fail. Therefore, for vector backends, the testing framework first renders the file to a raster using an external tool (Ghostscript for PDF and Inkscape for SVG) and then uses those rasters for comparison.
Using this approach, we were able to bootstrap a reasonably effective testing framework from scratch more easily than if we had gone on to write many low-level unit tests. Still, it is not perfect; the code coverage of the tests is not very complete, and it takes a long time to run all of the tests.[2] Therefore, some regressions do still fall through the cracks, but overall the quality of the releases has improved considerably since the testing framework was implemented.
11.8 Lessons Learned
One of the important lessons from the development of matplotlib is, as Le Corbusier said, "Good architects borrow". The early authors of matplotlib were largely scientists, self-taught programmers trying to get their work done, not formally trained computer scientists. Thus we did not get the internal design right on the first try. The decision to implement a user-facing scripting layer largely compatible with the MATLAB API benefited the project in three significant ways: it provided a time-tested interface to create and customize graphics, it made for an easy transition to matplotlib from the large base of MATLAB users, and, most importantly for us in the context of matplotlib architecture, it freed developers to refactor the internal object-oriented API several times with minimal impact to most users because the scripting interface was unchanged. While we have had API users (as opposed to scripting users) from the outset, most of them are power users or developers able to adapt to API changes. The scripting users, on the other hand, can write code once and pretty much assume it is stable for all subsequent releases.
For the internal drawing API, while we did borrow from GDK, we did not spend enough effort determining whether this was the right drawing API, and had to expend considerable effort subsequently, after many backends had been written around this API, to rebuild its functionality around a simpler and more flexible drawing API. We would have been well-served by adopting the PDF drawing specification [?], which itself was developed from decades of experience Adobe had with its PostScript specification; it would have given us mostly out-of-the-box compatibility with PDF itself, the Quartz Core Graphics framework, and the Enthought Enable Kiva drawing kit [?].
One of the curses of Python is that it is such an easy and expressive language that developers often find it easier to re-invent and re-implement functionality that exists in other packages than work to integrate code from other packages. matplotlib could have benefited in early development from expending more effort on integration with existing modules and APIs such as Enthought's Kiva and Enable toolkits, which solve many similar problems, rather than reinventing functionality. Integration with existing functionality is, however, a double-edged sword, as it can make builds and releases more complex and reduce flexibility in internal development.
[2] Around 15 minutes on a 2.33 GHz Intel Core 2 E6550.
MediaWiki
Sumana Harihareswara and Guillaume Paumier
From the start, MediaWiki was developed specifically to be Wikipedia's software. Developers have worked to facilitate reuse by third-party users, but Wikipedia's influence and bias have shaped MediaWiki's architecture throughout its history.

Wikipedia is one of the top ten websites in the world, currently getting about 400 million unique visitors a month. It gets over 100,000 hits per second. Wikipedia isn't commercially supported by ads; it is entirely supported by a non-profit organization, the Wikimedia Foundation, which relies on donations as its primary funding model. This means that MediaWiki must not only run a top-ten website, but also do so on a shoestring budget. To meet these demands, MediaWiki has a heavy bias towards performance, caching and optimization. Expensive features that can't be enabled on Wikipedia are either reverted or disabled through a configuration variable; there is an endless balance between performance and features.
The influence of Wikipedia on MediaWiki's architecture isn't limited to performance. Unlike generic content management systems (CMSes), MediaWiki was originally written for a very specific purpose: supporting a community that creates and curates freely reusable knowledge on an open platform. This means, for example, that MediaWiki doesn't include regular features found in corporate CMSes, like a publication workflow or access control lists, but does offer a variety of tools to handle spam and vandalism.
So, from the start, the needs and actions of a constantly evolving community of Wikipedia participants have affected MediaWiki's development, and vice versa. The architecture of MediaWiki has been driven many times by initiatives started or requested by the community, such as the creation of Wikimedia Commons, or the Flagged Revisions feature. Developers made major architectural changes because the way that MediaWiki was used by Wikipedians made it necessary.

MediaWiki has also gained a solid external user base by being open source software from the beginning. Third-party reusers know that, as long as such a high-profile website as Wikipedia uses MediaWiki, the software will be maintained and improved. MediaWiki used to be really focused on Wikimedia sites, but efforts have been made to make it more generic and better accommodate the needs of these third-party users. For example, MediaWiki now ships with an excellent web-based installer, making the installation process much less painful than when everything had to be done via the command line and the software contained hardcoded paths for Wikipedia.

Still, MediaWiki is and remains Wikipedia's software, and this shows throughout its history and architecture.
This chapter is organized as follows:
Historical Overview gives a short overview of the history of MediaWiki, or rather its prehistory, and the circumstances of its creation.
MediaWiki Code Base and Practices explains the choice of PHP, the importance and implementation of secure code, and how general configuration is handled.
Database and Text Storage dives into the distributed data storage system, and how its structure evolved to accommodate growth.
Requests, Caching and Delivery follows the execution of a web request through the components of MediaWiki it activates. This section includes a description of the different caching layers, and the asset delivery system.
Languages details the pervasive internationalization and localization system, why it matters, and how it is implemented.
Users presents how users are represented in the software, and how user permissions work.
Content details how content is structured, formatted and processed to generate the final HTML. A subsection focuses on how MediaWiki handles media files.
Customizing and Extending MediaWiki explains how JavaScript, CSS, extensions, and skins can be used to customize a wiki, and how they modify its appearance and behavior. A subsection presents the software's machine-readable web API.
12.1 Historical Overview
Phase I: UseModWiki
Wikipedia was launched in January 2001. At the time, it was mostly an experiment to try to boost
the production of content for Nupedia, a free-content, but peer-reviewed, encyclopedia created by
Jimmy Wales. Because it was an experiment, Wikipedia was originally powered by UseModWiki,
an existing GPL wiki engine written in Perl, using CamelCase and storing all pages in individual
text files with no history of changes made.
It soon appeared that CamelCase wasn't really appropriate for naming encyclopedia articles. In late January 2001, UseModWiki developer and Wikipedia participant Clifford Adams added a new
feature to UseModWiki: free links; i.e., the ability to link to pages with a special syntax (double
square brackets), instead of automatic CamelCase linking. A few weeks later, Wikipedia upgraded
to the new version of UseModWiki supporting free links, and enabled them.
While this initial phase isn't about MediaWiki per se, it provides some context and shows that, even before MediaWiki was created, Wikipedia started to shape the features of the software that powered it. UseModWiki also influenced some of MediaWiki's features; for example, its markup language. The Nostalgia Wikipedia (http://nostalgia.wikipedia.org) contains a complete copy of the Wikipedia database from December 2001, when Wikipedia still used UseModWiki.
Phase II: The PHP Script
In 2001, Wikipedia was not yet a top ten website; it was an obscure project sitting in a dark corner of the Interwebs, unknown to most search engines, and hosted on a single server. Still, performance was already an issue, notably because UseModWiki stored its content in a flat file database. At the time, Wikipedians were worried about being inundated with traffic following articles in the New York Times, Slashdot and Wired.
So in summer 2001, Wikipedia participant Magnus Manske (then a university student) started to work on a dedicated Wikipedia wiki engine in his free time. He aimed to improve Wikipedia's performance using a database-driven app, and to develop Wikipedia-specific features that couldn't be provided by a generic wiki engine. Written in PHP and MySQL-backed, the new engine was simply called the "PHP script", "PHP wiki", "Wikipedia software" or "phase II".
The PHP script was made available in August 2001, shared on SourceForge in September, and tested until late 2001. As Wikipedia suffered from recurring performance issues because of increasing traffic, the English language Wikipedia eventually switched from UseModWiki to the PHP script in January 2002. Other language versions also created in 2001 were slowly upgraded as well, although some of them would remain powered by UseModWiki until 2004.

As PHP software using a MySQL database, the PHP script was the first iteration of what would later become MediaWiki. It introduced many critical features still in use today, like namespaces to organize content (including talk pages), skins, and special pages (including maintenance reports, a contributions list and a user watchlist).
Phase III: MediaWiki
Despite the improvements from the PHP script and database backend, the combination of increasing traffic, expensive features and limited hardware continued to cause performance issues on Wikipedia. In 2002, Lee Daniel Crocker rewrote the code again, calling the new software Phase III (http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/2794). Because the site was experiencing frequent difficulties, Lee thought there wasn't much time to sit down and properly architect and develop a solution, so he just reorganized the existing architecture for better performance and hacked all the code. Profiling features were added to track down slow functions.

The Phase III software kept the same basic interface, and was designed to look and behave as much like the Phase II software as possible. A few new features were also added, like a new file upload system, side-by-side diffs of content changes, and interwiki links.
Other features were added over 2002, like new maintenance special pages, and the "edit on double click" option.
2002, administrators had to temporarily disable the view count and site statistics which were
causing two database writes on every page view. They would also occasionally switch the site to
read-only mode to maintain the service for readers, and disable expensive maintenance pages during
high-access times because of table locking problems.
In early 2003, developers discussed whether they should properly re-engineer and re-architect the software from scratch, before the fire-fighting became unmanageable, or continue to tweak and improve the existing code base. They chose the latter solution, mostly because most developers were sufficiently happy with the code base, and confident enough that further iterative improvements would be enough to keep up with the growth of the site.
In June 2003, administrators added a second server, the first database server separate from the web server. (The new machine was also the web server for non-English Wikipedia sites.) Load-balancing between the two servers would be set up later that year. Admins also enabled a new page-caching system that used the file system to cache rendered, ready-to-output pages for anonymous users.
June 2003 is also when Jimmy Wales created the non-profit Wikimedia Foundation to support Wikipedia and manage its infrastructure and day-to-day operations. The Wikipedia software was officially named MediaWiki in July, as wordplay on the Wikimedia Foundation's name. What was thought at the time to be a clever pun would confuse generations of users and developers.

New features were added in July, like the automatically generated table of contents and the ability to edit page sections, both still in use today. The first release under the name MediaWiki happened in August 2003, concluding the long genesis of an application whose overall structure would remain fairly stable from there on.
12.2 MediaWiki Code Base and Practices
PHP
PHP was chosen as the framework for Wikipedia's Phase II software in 2001; MediaWiki has grown organically since then, and is still evolving. Most MediaWiki developers are volunteers contributing in their free time, and there were very few of them in the early years. Some software design decisions or omissions may seem wrong in retrospect, but it's hard to criticize the founders for not implementing some abstraction which is now found to be critical, when the initial code base was so small, and the time taken to develop it so short.
For example, MediaWiki uses unprefixed class names, which can cause conflicts when PHP core and PECL (PHP Extension Community Library) developers add new classes: MediaWiki's Namespace class had to be renamed to MWNamespace to be compatible with PHP 5.3. Consistently using a prefix for all classes (e.g., MW) would have made it easier to embed MediaWiki inside another application or library.
Relying on PHP was probably not the best choice for performance, since it has not benefitted from improvements that some other dynamic languages have seen. Using Java would have been much better for performance, and simplified execution scaling for back-end maintenance tasks. On the other hand, PHP is very popular, which facilitates recruiting new developers.
Even if MediaWiki still contains ugly legacy code, major improvements have been made over
the years, and new architectural elements have been introduced to MediaWiki throughout its history.
They include the Parser, SpecialPage, and Database classes, the Image class and the FileRepo
class hierarchy, ResourceLoader, and the Action hierarchy. MediaWiki started without any of these
things, but all of them support features that have been around since the beginning. Many developers are interested primarily in feature development, and architecture is often left behind, only to catch up later as the cost of working within an inadequate architecture becomes apparent.
Security
Because MediaWiki is the platform for high-profile sites such as Wikipedia, core developers and code reviewers have enforced strict security rules[3]. To make it easier to write secure code, MediaWiki gives developers wrappers around HTML output and database queries to handle escaping. To sanitize user input, a developer uses the WebRequest class, which analyzes data passed in the URL or via a POSTed form. It removes magic quotes and slashes, strips illegal input characters and normalizes Unicode sequences. Cross-site request forgery (CSRF) is avoided by using tokens, and cross-site scripting (XSS) by validating inputs and escaping outputs, usually with PHP's htmlspecialchars() function. MediaWiki also provides (and uses) an XHTML sanitizer with the Sanitizer class, and database functions that prevent SQL injection.
[3] See https://www.mediawiki.org/wiki/Security_for_developers for a detailed guide.
Configuration
MediaWiki offers hundreds of configuration settings, stored in global PHP variables. Their default value is set in DefaultSettings.php, and the system administrator can override them by editing LocalSettings.php.
MediaWiki used to over-depend on global variables, including for configuration and context processing. Globals cause serious security implications with PHP's register_globals setting (which MediaWiki hasn't needed since version 1.2). This system also limits potential abstractions for configuration, and makes it more difficult to optimize the start-up process. Moreover, the configuration namespace is shared with variables used for registration and object context, leading to potential conflicts. From a user perspective, global configuration variables have also made MediaWiki seem difficult to configure and maintain. MediaWiki development has been a story of slowly moving context out of global variables and into objects. Storing processing context in object member variables allows those objects to be reused in a much more flexible way.
12.3 Database and Text Storage
MediaWiki has been using a relational database backend since the Phase II software. The default
(and best-supported) database management system (DBMS) for MediaWiki is MySQL, which is
the one that all Wikimedia sites use, but other DBMSes (such as PostgreSQL, Oracle, and SQLite)
have community-supported implementations. A sysadmin can choose a DBMS while installing
MediaWiki, and MediaWiki provides both a database abstraction and a query abstraction layer that
simplify database access for developers.
The current layout contains dozens of tables. Many are about the wiki's content (e.g., page, revision, category, and recentchanges). Other tables include data about users (user, user_groups), media files (image, filearchive), caching (objectcache, l10n_cache, querycache) and internal tools (job for the job queue), among others[4]. (See Figure 12.1.) Indices and summary tables are used extensively in MediaWiki, since SQL queries that scan huge numbers of rows can be very expensive, particularly on Wikimedia sites. Unindexed queries are usually discouraged.
The database went through dozens of schema changes over the years, the most notable being the
decoupling of text storage and revision tracking in MediaWiki 1.5.
In the 1.4 model, the content was stored in two important tables, cur (containing the text and
metadata of the current revision of the page) and old (containing previous revisions); deleted pages
were kept in archive. When an edit was made, the previously current revision was copied to the
old table, and the new edit was saved to cur. When a page was renamed, the page title had to be
updated in the metadata of all the old revisions, which could be a long operation. When a page was
deleted, its entries in both the cur and old tables had to be copied to the archive table before being
deleted; this meant moving the text of all revisions, which could be very large and thus take time.
In the 1.5 model, revision metadata and revision text were split: the cur and old tables were replaced with page (pages' metadata), revision (metadata for all revisions, old or current) and text (text of all revisions, old, current or deleted). Now, when an edit is made, revision metadata don't need to be copied around tables: inserting a new entry and updating the page_latest pointer is enough. Also, the revision metadata don't include the page title anymore, only its ID: this removes the need for renaming all revisions when a page is renamed.
[4] Complete documentation of the database layout in MediaWiki is available at https://www.mediawiki.org/wiki/Manual:Database_layout.
Figure 12.1: Main content tables in MediaWiki 1.4 and 1.5
The revision table stores metadata for each revision, but not the revision text; instead, each row contains a text ID pointing to the text table, which contains the actual text. When a page is deleted, the text of all revisions of the page stays there and doesn't need to be moved to another table. The text table is composed of a mapping of IDs to text blobs; a flags field indicates if the text blob is gzipped (for space savings) or if the text blob is only a pointer to external text storage. Wikimedia sites use a MySQL-backed external storage cluster with blobs of a few dozen revisions. The first revision of the blob is stored in full, and following revisions to the same page are stored as diffs relative to the previous revision; the blobs are then gzipped. Because the revisions are grouped per page, they tend to be similar, so the diffs are relatively small and gzip works well. The compression ratio achieved on Wikimedia sites nears 98%.
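A toy Python illustration (not MediaWiki code) of why grouping near-identical revisions into one blob compresses so well:

```python
import zlib

# Thirty near-identical "revisions" of the same page, concatenated into one
# blob, roughly mimicking how external storage groups revisions per page.
base = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 100
revisions = [base + f"Minor edit number {i}." for i in range(30)]
blob = "\x00".join(revisions).encode("utf-8")

compressed = zlib.compress(blob, 9)
print(f"compression ratio: {1 - len(compressed) / len(blob):.1%}")
```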
On the hardware side, MediaWiki has built-in support for load balancing, added as early as 2004 in MediaWiki 1.2 (when Wikipedia got its second server, a big deal at the time). The load balancer (MediaWiki's PHP code that decides which server to connect to) is now a critical part of Wikimedia's infrastructure, which explains its influence on some algorithm decisions in the code. The system administrator can specify, in MediaWiki's configuration, that there is one master database server and any number of slave database servers; a weight can be assigned to each server. The load balancer will send all writes to the master, and will balance reads according to the weights. It also keeps track of the replication lag of each slave. If a slave's replication lag exceeds 30 seconds, it will not receive any read queries to allow it to catch up; if all slaves are lagged more than 30 seconds, MediaWiki will automatically put itself in read-only mode.
MediaWiki's "chronology protector" ensures that replication lag never causes a user to see a page that claims an action they've just performed hasn't happened yet: for instance, if a user renames a page, another user may still see the old name, but the one who renamed it will always see the new name, because he's the one who renamed it. This is done by storing the master's position in the user's session if a request they made resulted in a write query. The next time the user makes a read request, the load balancer reads this position from the session, and tries to select a slave that has caught up to that replication position to serve the request. If none is available, it will wait until one is. It may appear to other users as though the action hasn't happened yet, but the chronology remains consistent for each user.
12.4 Requests, Caching and Delivery
Execution Workflow of a Web Request
index.php is the main entry point for MediaWiki, and handles most requests processed by the
application servers (i.e., requests that were not served by the caching infrastructure; see below).
The code executed from index.php performs security checks, loads default configuration settings from includes/DefaultSettings.php, guesses configuration with includes/Setup.php and
then applies site settings contained in LocalSettings.php. Next it instantiates a MediaWiki object
($mediawiki), and creates a Title object ($wgTitle) depending on the title and action parameters
from the request.
index.php can take a variety of action parameters in the URL request; the default action is view, which shows the regular view of an article's content. For example, the request https://en.wikipedia.org/w/index.php?title=Apple&action=view displays the content of the article Apple on the English Wikipedia[5]. Other frequent actions include edit (to open an article for editing), submit (to preview or save an article), history (to show an article's history) and watch (to add an article to the user's watchlist). Administrative actions include delete (to delete an article) and protect (to prevent edits to an article).

[5] View requests are usually prettified with URL rewriting, in this example to https://en.wikipedia.org/wiki/Apple.
MediaWiki::performRequest() is then called to handle most of the URL request. It checks
for bad titles, read restrictions, local interwiki redirects, and redirect loops, and determines whether
the request is for a normal or a special page.
Normal page requests are handed over to MediaWiki::initializeArticle(), to create an
Article object for the page ($wgArticle), and then to MediaWiki::performAction(), which
handles standard actions. Once the action has been completed, MediaWiki::finalCleanup()
finalizes the request by committing database transactions, outputting the HTML and launching
deferred updates through the job queue. MediaWiki::restInPeace() commits the deferred updates
and closes the task gracefully.
If the page requested is a Special page (i.e., not a regular wiki content page, but a special
software-related page such as Statistics), SpecialPageFactory::executePath is called instead
of initializeArticle(); the corresponding PHP script is then called. Special pages can do all
sorts of magical things, and each has a specific purpose, usually independent of any one article or its content. Special pages include various kinds of reports (recent changes, logs, uncategorized pages) and wiki administration tools (user blocks, user rights changes), among others. Their execution workflow depends on their function.
Many functions contain profiling code, which makes it possible to follow the execution workflow for debugging if profiling is enabled. Profiling is done by calling the wfProfileIn and wfProfileOut functions to respectively start and stop profiling a function; both functions take the function's name as a parameter. On Wikimedia sites, profiling is done for a percentage of all requests, to preserve performance. MediaWiki sends UDP packets to a central server that collects them and produces profiling data.
Caching
MediaWiki itself is improved for performance because it plays a central role on Wikimedia sites, but it is also part of a larger operational ecosystem that has influenced its architecture. Wikimedia's caching infrastructure (structured in layers) has imposed limitations in MediaWiki; developers worked around the issues, not by trying to shape Wikimedia's extensively optimized caching infrastructure around MediaWiki, but rather by making MediaWiki more flexible, so it could work within that infrastructure without compromising on performance and caching needs. For example, by default MediaWiki displays the user's IP in the top-right corner of the interface (for left-to-right languages) as a reminder that that's how they're known to the software when they're not logged in. The $wgShowIPinHeader configuration variable allows the system administrator to disable this feature, thus making the page content independent of the user: all anonymous visitors can then be served the exact same version of each page.
The first level of caching (used on Wikimedia sites) consists of reverse caching proxies (Squids) that intercept and serve most requests before they make it to the MediaWiki application servers. Squids contain static versions of entire rendered pages, served for simple reads to users who aren't logged in to the site. MediaWiki natively supports Squid and Varnish, and integrates with this caching layer by, for example, notifying them to purge a page from the cache when it has been changed. For logged-in users, and other requests that can't be served by Squids, Squid forwards the requests to the web server (Apache).
The second level of caching happens when MediaWiki renders and assembles the page from
multiple objects, many of which can be cached to minimize future calls. Such objects include
the page's interface (sidebar, menus, UI text) and the content proper, parsed from wikitext. The
in-memory object cache has been available in MediaWiki since the early 1.1 version (2003), and is
particularly important to avoid re-parsing long and complex pages.
Login session data can also be stored in memcached, which lets sessions work transparently on
multiple front-end web servers in a load-balancing setup (Wikimedia heavily relies on load balancing,
using LVS with PyBal).
Since version 1.16, MediaWiki uses a dedicated object cache for localized UI text; this was added after noticing that a large part of the objects cached in memcached consisted of UI messages localized into the user's language. The system is based on fast fetches of individual messages from constant databases (CDB), e.g., files with key-value pairs. CDBs minimize memory overhead and start-up time in the typical case; they're also used for the interwiki cache.
The last caching layer consists of the PHP opcode cache, commonly enabled to speed up PHP
applications. Compilation can be a lengthy process; to avoid compiling PHP scripts into opcode
every time they're invoked, a PHP accelerator can be used to store the compiled opcode and execute it directly without compilation. MediaWiki will "just work" with many accelerators such as APC, PHP accelerator and eAccelerator.
Because of its Wikimedia bias, MediaWiki is optimized for this complete, multi-layer, distributed
caching infrastructure. Nonetheless, it also natively supports alternate setups for smaller sites. For
example, it offers an optional simplistic file caching system that stores the output of fully rendered pages, like Squid does. Also, MediaWiki's abstract object caching layer lets it store the cached objects in several places, including the file system, the database, or the opcode cache.
ResourceLoader
As in many web applications, MediaWiki's interface has become more interactive and responsive over the years, mostly through the use of JavaScript. Usability efforts initiated in 2008, as well as advanced media handling (e.g., online editing of video files), called for dedicated front-end performance improvements.
To optimize the delivery of JavaScript and CSS assets, the ResourceLoader module was developed. Started in 2009, it was completed in 2011 and has been a core feature of MediaWiki since version 1.17. ResourceLoader works by loading JS and CSS assets on demand, thus reducing loading and parsing time when features are unused, for example by older browsers. It also minifies the code, groups resources to save requests, and can embed images as data URIs[6].
12.5 Languages
Context and Rationale
A central part of effectively contributing and disseminating free knowledge to all is to provide it in as many languages as possible. Wikipedia is available in more than 280 languages, and encyclopedia articles in English represent less than 20% of all articles. Because Wikipedia and its sister sites exist in so many languages, it is important not only to provide the content in the reader's native language, but also to provide a localized interface, and effective input and conversion tools, so that participants can contribute content.
For this reason, localization and internationalization (l10n and i18n) are central components of MediaWiki. The i18n system is pervasive, and impacts many parts of the software; it's also one of the most flexible and feature-rich[7]. Translator convenience is usually preferred to developer convenience, but this is believed to be an acceptable cost.
MediaWiki is currently localized in more than 350 languages, including non-Latin and right-to-left (RTL) languages, with varying levels of completion. The interface and content can be in different languages, and have mixed directionality.
Content Language
MediaWiki originally used per-language encoding, which led to a lot of issues; for example, foreign
scripts could not be used in page titles. UTF-8 was adopted instead. Support for character sets other
than UTF-8 was dropped in 2005, along with the major database schema change in MediaWiki 1.5;
content must now be encoded in UTF-8.
Characters not available on the editor's keyboard can be customized and inserted via MediaWiki's Edittools, an interface message that appears below the edit window; its JavaScript version automatically inserts the character clicked into the edit window. The WikiEditor extension for MediaWiki, developed as part of a usability effort, merges special characters with the edit toolbar.
Another extension, called Narayam, provides additional input methods and key mapping features for
non-ASCII characters.
[6] For more on ResourceLoader, see https://www.mediawiki.org/wiki/ResourceLoader for the official documentation, and the talk "Low Hanging Fruit vs. Micro-optimization: Creative Techniques for Loading Web Pages Faster" given by Trevor Parscal and Roan Kattouw at OSCON 2011.

[7] For an exhaustive guide to internationalization and localization in MediaWiki, see https://www.mediawiki.org/wiki/Localisation.
Interface Language
Interface messages have been stored in PHP arrays of key-value pairs since the Phase III software was created. Each message is identified by a unique key, which is assigned different values across languages. Keys are determined by developers, who are encouraged to use prefixes for extensions; for example, message keys for the UploadWizard extension will start with mwe-upwiz-, where mwe stands for MediaWiki extension.
MediaWiki messages can embed parameters provided by the software, which will often influence the grammar of the message. In order to support virtually any possible language, MediaWiki's localization system has been improved and complexified over time to accommodate languages' specific traits and exceptions, often considered oddities by English speakers.
For example, adjectives are invariable words in English, but languages like French require adjective agreement with nouns. If the user specified their gender in their preferences, the GENDER: switch can be used in interface messages to appropriately address them. Other switches include PLURAL:, for simple plurals and languages like Arabic with dual, trial or paucal numbers, and GRAMMAR:, providing grammatical transformation functions for languages like Finnish whose grammatical cases cause alterations or inflections.
Localizing Messages
Localized interface messages for MediaWiki reside in MessagesXx.php files, where Xx is the ISO 639 code of the language (e.g. MessagesFr.php for French); default messages are in English and stored in MessagesEn.php. MediaWiki extensions use a similar system, or host all localized messages in an <Extension-name>.i18n.php file. Along with translations, Message files also include language-dependent information such as date formats.
Contributing translations used to be done by submitting PHP patches for the MessagesXx.php files. In December 2003, MediaWiki 1.1 introduced database messages, a subset of wiki pages in the MediaWiki namespace containing interface messages. The content of the wiki page MediaWiki:<Message-key> is the message's text, and overrides its value in the PHP file. Localized versions of the message are at MediaWiki:<Message-key>/<language-code>; for example, MediaWiki:Rollbacklink/de.
This feature has allowed power users to translate (and customize) interface messages locally on their wiki, but the process doesn't update i18n files shipping with MediaWiki. In 2006, Niklas Laxström created a special, heavily hacked MediaWiki website (now hosted at http://translatewiki.net) where translators can easily localize interface messages in all languages simply by editing a wiki page. The MessagesXx.php files are then updated in the MediaWiki code repository, where they can be automatically fetched by any wiki, and updated using the LocalisationUpdate extension. On Wikimedia sites, database messages are now only used for customization, and not for localization any more. MediaWiki extensions and some related programs, such as bots, are also localized at translatewiki.net.
To help translators understand the context and meaning of an interface message, it is considered a good practice in MediaWiki to provide documentation for every message. This documentation is stored in a special Message file, with the qqq language code which doesn't correspond to a real language. The documentation for each message is then displayed in the translation interface on translatewiki.net. Another helpful tool is the qqx language code; when used with the &uselang parameter to display a wiki page (e.g., https://en.wikipedia.org/wiki/Special:RecentChanges?uselang=qqx), MediaWiki will display the message keys instead of their values in the user interface; this is very useful to identify which message to translate or change.
Registered users can set their own interface language in their preferences, to override the site's default interface language. MediaWiki also supports fallback languages: if a message isn't available in the chosen language, it will be displayed in the closest possible language, and not necessarily in English. For example, the fallback language for Breton is French.
12.6 Users
Users are represented in the code using instances of the User class, which encapsulates all of the user-specific settings (user id, name, rights, password, email address, etc.). Client classes use accessors to access these fields; they do all the work of determining whether the user is logged in, and whether the requested option can be satisfied from cookies or whether a database query is needed. Most of the settings needed for rendering normal pages are set in the cookie to minimize use of the database.
MediaWiki provides a very granular permissions system, with a user permission for, basically, every possible action. For example, to perform the Rollback action (i.e., to quickly rollback the edits of the last user who edited a particular page), a user needs the rollback permission, included by default in MediaWiki's sysop user group. But it can also be added to other user groups, or have a dedicated user group only providing this permission (this is the case on the English Wikipedia, with the Rollbackers group). Customization of user rights is done by editing the $wgGroupPermissions array in LocalSettings.php; for instance, $wgGroupPermissions['user']['movefile'] = true; allows all registered users to rename files. A user can belong to several groups, and inherits the highest rights associated with each of them.
However, MediaWiki's user permissions system was really designed with Wikipedia in mind: a site whose content is accessible to all, and where only certain actions are restricted to some users. MediaWiki lacks a unified, pervasive permissions concept; it doesn't provide traditional CMS features like restricting read or write access by topic or type of content. A few MediaWiki extensions provide such features to some extent.
12.7 Content
Content Structure
The concept of namespaces was used in the UseModWiki era of Wikipedia, where talk pages were at the title <article name>/Talk. Namespaces were formally introduced in Magnus Manske's first PHP script. They were reimplemented a few times over the years, but have kept the same function: to separate different kinds of content. They consist of a prefix separated from the page title by a colon (e.g. Talk: or File: and Template:); the main content namespace has no prefix. Wikipedia users quickly adopted them, and they provided the community with different spaces to evolve. Namespaces have proven to be an important feature of MediaWiki, as they create the necessary preconditions for a wiki's community and set up meta-level discussions, community processes, portals, user profiles, etc.
The default configuration for MediaWiki's main content namespace is to be flat (no subpages), because it's how Wikipedia works, but it is trivial to enable subpages. They are enabled in other namespaces (e.g., User:, where people can, for instance, work on draft articles) and display breadcrumbs.
Namespaces separate content by type; within the same namespace, pages can be organized by
topic using categories, a pseudo-hierarchical organization scheme introduced in MediaWiki 1.3.
Content Processing: MediaWiki Markup Language and Parser
The user-generated content stored by MediaWiki isnt in HTML, but in a markup language specic
to MediaWiki, sometimes called wikitext. It allows users to make formatting changes (e.g. bold,
italic using quotes), add links (using square brackets), include templates, insert context-dependent
content (like a date or signature), and make an incredible number of other magical things happen
8
.
To display a page, this content needs to be parsed, assembled from all the external or dynamic
pieces it calls, and converted to proper HTML. The parser is one of the most essential parts of
MediaWiki, which makes it difficult to change or improve. Because hundreds of millions of wiki
pages worldwide depend on the parser to continue outputting HTML the way it always has, it has to
remain extremely stable.
The markup language wasn't formally specced from the beginning; it started based on UseModWiki's markup, then morphed and evolved as needs demanded. In the absence of a formal specification, the MediaWiki markup language has become a complex and idiosyncratic language, basically only compatible with MediaWiki's parser; it can't be represented as a formal grammar. The current parser's specification is jokingly referred to as "whatever the parser spits out from wikitext, plus a few hundred test cases".
There have been many attempts at alternative parsers, but none has succeeded so far. In 2004 an
experimental tokenizer was written by Jens Frank to parse wikitext, and enabled on Wikipedia; it had
to be disabled three days later because of the poor performance of PHP array memory allocations.
Since then, most of the parsing has been done with a huge pile of regular expressions, and a ton of
helper functions. The wiki markup, and all the special cases the parser needs to support, have also
become considerably more complex, making future attempts even more difficult.
A notable improvement was Tim Starling's preprocessor rewrite in MediaWiki 1.12, whose
main motivation was to improve the parsing performance on pages with complex templates. The
preprocessor converts wikitext to an XML DOM tree representing parts of the document (template
invocations, parser functions, tag hooks, section headings, and a few other structures), but can skip
dead branches, such as unfollowed #switch cases and unused defaults for template arguments, in
template expansion. The parser then iterates through the DOM structure and converts its content to
HTML.
Recent work on a visual editor for MediaWiki has made it necessary to improve the parsing
process (and make it faster), so work has resumed on the parser and intermediate layers between
MediaWiki markup and final HTML (see Future, below).
Magic Words and Templates
MediaWiki offers magic words that modify the general behavior of the page or include dynamic content into it. They consist of: behavior switches like __NOTOC__ (to hide the automatic table of contents) or __NOINDEX__ (to tell search engines not to index the page); variables like CURRENTTIME or SITENAME; and parser functions, i.e., magic words that can take parameters, like lc:<string> (to output <string> in lowercase). Constructs like GENDER:, PLURAL: and GRAMMAR:, used to localize the UI, are parser functions.

[8] Detailed documentation is available at https://www.mediawiki.org/wiki/Markup_spec and the associated pages.
The most common way to include content from other pages in a MediaWiki page is to use
templates. Templates were really intended to be used to include the same content on different pages,
e.g., navigation panels or maintenance banners on Wikipedia articles; having the ability to create
partial page layouts and reuse them in thousands of articles with central maintenance made a huge
impact on sites like Wikipedia.
However, templates have also been used (and abused) by users for a completely different purpose.
MediaWiki 1.3 made it possible for templates to take parameters that change their output; the ability
to add a default parameter (introduced in MediaWiki 1.6) enabled the construction of a functional
programming language implemented on top of PHP, which was ultimately one of the most costly
features in terms of performance.
Tim Starling then developed additional parser functions (the ParserFunctions extension), as a
stopgap measure against insane constructs created by Wikipedia users with templates. This set
of functions included logical structures like #if and #switch, and other functions like #expr (to
evaluate mathematical expressions) and #time (for time formatting).
Soon enough, Wikipedia users started to create even more complex templates using the new
functions, which considerably degraded the parsing performance on template-heavy pages. The
new preprocessor introduced in MediaWiki 1.12 (a major architectural change) was implemented to
partly remedy this issue. Recently, MediaWiki developers have discussed the possibility of using an
actual scripting language, perhaps Lua, to improve performance.
Media Files
Users upload files through the Special:Upload page; administrators can configure the allowed file types through an extension whitelist. Once uploaded, files are stored in a folder on the file system, and thumbnails in a dedicated thumb directory.
Because of Wikimedia's educational mission, MediaWiki supports file types that may be uncommon in other web applications or CMSes, like SVG vector images, and multipage PDFs and DjVus. They are rendered as PNG files, and can be thumbnailed and displayed inline, as are more common image files like GIFs, JPGs and PNGs.
When a file is uploaded, it is assigned a File: page containing information entered by the uploader; this is free text and usually includes copyright information (author, license) and items describing or classifying the content of the file (description, location, date, categories, etc.). While private wikis may not care much about this information, on media libraries like Wikimedia Commons it is critical to organise the collection and ensure the legality of sharing these files. It has been argued that most of these metadata should, in fact, be stored in a queryable structure like a database table. This would considerably facilitate search, but also attribution and reuse by third parties, for example through the API.
Most Wikimedia sites also allow local uploads to each wiki, but the community tries to store
freely licensed media files in Wikimedia's free media library, Wikimedia Commons. Any Wikimedia
site can display a file hosted on Commons as if it were hosted locally. This custom avoids having to
upload a file to every wiki to use it there.
As a consequence, MediaWiki natively supports foreign media repositories, i.e., the ability to
access media files hosted on another wiki through its API and the ForeignAPIRepo system. Since
version 1.16, any MediaWiki website can easily use files from Wikimedia Commons through the
InstantCommons feature. When using a foreign repository, thumbnails are stored locally to save
bandwidth. However, it is not (yet) possible to upload to a foreign media repository from another
wiki.
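As a rough illustration, enabling InstantCommons on a third-party wiki is a one-line change to its LocalSettings.php; the sketch below assumes a default MediaWiki 1.16+ installation, and $wgUseInstantCommons is the documented setting:

<?php
// LocalSettings.php (excerpt). With this enabled, [[File:...]] references that
// are not found locally are fetched from Wikimedia Commons through its API,
// and the generated thumbnails are cached locally to save bandwidth.
$wgUseInstantCommons = true;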
12.8 Customizing and Extending MediaWiki
Levels
MediaWiki's architecture provides different ways to customize and extend the software. This can be
done at different levels of access:
System administrators can install extensions and skins, and configure the wiki's separate
helper programs (e.g., for image thumbnailing and TeX rendering) and global settings (see
Configuration above).
Wiki sysops (sometimes called administrators too) can edit site-wide gadgets, JavaScript
and CSS settings.
Any registered user can customize their own experience and interface using their preferences
(for existing settings, skins and gadgets) or make their own modifications (using their personal
JS and CSS pages).
External programs can also communicate with MediaWiki through its machine API, if it's
enabled, basically making any feature and data accessible to the user.
JavaScript and CSS
MediaWiki can read and apply site-wide or skin-wide JavaScript and CSS using custom wiki
pages; these pages are in the MediaWiki: namespace, and thus can only be edited by sysops;
for example, JavaScript modifications from MediaWiki:Common.js apply to all skins, CSS from
MediaWiki:Common.css applies to all skins, but MediaWiki:Vector.css only applies to users with
the Vector skin.
Users can do the same types of changes, which will only apply to their own interface, by
editing subpages of their user page (e.g. User:<Username>/common.js for JavaScript on all skins,
User:<Username>/common.css for CSS on all skins, or User:<Username>/vector.css for CSS
modifications that only apply to the Vector skin).
If the Gadgets extension is installed, sysops can also edit gadgets, i.e., snippets of JavaScript code,
providing features that can be turned on and off by users in their preferences. Upcoming developments
on gadgets will make it possible to share gadgets across wikis, thus avoiding duplication.
This set of tools has had a huge impact and greatly increased the democratization of MediaWiki's
software development. Individual users are empowered to add features for themselves; power users
can share them with others, both informally and through globally configurable sysop-controlled
systems. This framework is ideal for small, self-contained modifications, and presents a lower barrier
to entry than heavier code modifications done through hooks and extensions.
Extensions and Skins
When JavaScript and CSS modifications are not enough, MediaWiki provides a system of hooks
that let third-party developers run custom PHP code before, after, or instead of MediaWiki code for
particular events [9]. MediaWiki extensions use hooks to plug into the code.
Before hooks existed in MediaWiki, adding custom PHP code meant modifying the core code,
which was neither easy nor recommended. The first hooks were proposed and added in 2004 by
Evan Prodromou; many more have been added over the years when needed. Using hooks, it is even
possible to extend MediaWiki's wiki markup with additional capabilities using tag extensions.
[9] MediaWiki hooks are referenced at https://www.mediawiki.org/wiki/Manual:Hooks.
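To give a flavour of the mechanism, here is a hedged sketch of how an extension might register a handler for one of these events; the extension name, handler class and abbreviated parameter list are invented for illustration, while the $wgHooks registration array and the ArticleSaveComplete event are the real ones from MediaWiki of this era:

<?php
// Hypothetical extension setup file: run our code after an edit is saved.
$wgHooks['ArticleSaveComplete'][] = 'MyExtensionHooks::onArticleSaveComplete';

class MyExtensionHooks {
    // The real hook passes more parameters; only the first few are used here.
    public static function onArticleSaveComplete( $article, $user, $text, $summary ) {
        wfDebugLog( 'myextension',
            $user->getName() . ' edited ' . $article->getTitle()->getPrefixedText() );
        return true; // let other handlers registered for the same hook run as well
    }
}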
The extension system isn't perfect; extension registration is based on code execution at startup,
rather than cacheable data, which limits abstraction and optimization and hurts MediaWiki's perfor-
mance. But overall, the extension architecture is now a fairly flexible infrastructure that has helped
make specialized code more modular, keeping the core software from expanding (too) much, and
making it easier for third-party users to build custom functionality on top of MediaWiki.
Conversely, it's very difficult to write a new skin for MediaWiki without reinventing the wheel.
In MediaWiki, skins are PHP classes each extending the parent Skin class; they contain functions
that gather the information needed to generate the HTML. The long-lived MonoBook skin was
difficult to customize because it contained a lot of browser-specific CSS to support old browsers;
editing the template or CSS required many subsequent changes to reflect the change for all browsers
and platforms.
API
The other main entry point for MediaWiki, besides index.php, is api.php, used to access its
machine-readable web query API (Application Programming Interface).
Wikipedia users originally created bots that worked by screen scraping the HTML content
served by MediaWiki; this method was very unreliable and broke many times. To improve this
situation, developers introduced a read-only interface (located at query.php), which then evolved into
a full-fledged read and write machine API providing direct, high-level access to the data contained
in the MediaWiki database [10].
Client programs can use the API to log in, get data, and post changes. The API supports thin
web-based JavaScript clients and end-user applications. Almost anything that can be done via the
web interface can basically be done through the API. Client libraries implementing the MediaWiki
API are available in many languages, including Python and .NET.
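As a hedged illustration (the endpoint and page title below are examples, and a production client would use cURL, set a User-Agent, and handle errors), a read-only query against api.php can be as simple as:

<?php
// Fetch basic metadata about one page from a wiki's machine API.
// Assumes allow_url_fopen is enabled in the PHP configuration.
$endpoint = 'https://www.mediawiki.org/w/api.php';
$params = array(
    'action' => 'query',
    'prop'   => 'info',
    'titles' => 'MediaWiki',
    'format' => 'json',
);
$response = file_get_contents($endpoint . '?' . http_build_query($params));
$data = json_decode($response, true);
print_r($data['query']['pages']);  // page id, last revision, length, etc.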
12.9 Future
What started as a summer project done by a single volunteer PHP developer has grown into MediaWiki,
a mature, stable wiki engine powering a top-ten website with a ridiculously small operational
infrastructure. This has been made possible by constant optimization for performance, iterative
architectural changes and a team of awesome developers.
The evolution of web technologies, and the growth of Wikipedia, call for ongoing improvements
and new features, some of which require major changes to MediaWiki's architecture. This is, for
example, the case for the ongoing visual editor project, which has prompted renewed work on the
parser and on the wiki markup language, the DOM and final HTML conversion.
MediaWiki is a tool used for very different purposes. Within Wikimedia projects, for instance,
it's used to create and curate an encyclopedia (Wikipedia), to power a huge media library (Wiki-
media Commons), to transcribe scanned reference texts (Wikisource), and so on. In other contexts,
MediaWiki is used as a corporate CMS, or as a data repository, sometimes combined with a semantic
framework. These specialized uses that weren't planned for will probably continue to drive constant
adjustments to the software's internal structure. As such, MediaWiki's architecture is very much
alive, just like the immense community of users it supports.
[10] Exhaustive documentation of the API is available at https://www.mediawiki.org/wiki/API.
12.10 Further Reading
MediaWiki documentation and support: https://www.mediawiki.org.
Automatically generated MediaWiki documentation: http://svn.wikimedia.org/doc/.
Domas Mituzas, Wikipedia: site internals, configuration, code examples and management
issues, MySQL Users conference, 2007. Full text available at http://dom.as/talks/.
12.11 Acknowledgments
This chapter was created collaboratively. Guillaume Paumier wrote most of the content by organizing
the input provided by MediaWiki users and core developers. Sumana Harihareswara coordinated
the interviews and input-gathering phases. Many thanks to Antoine Musso, Brion Vibber, Chad
Horohoe, Tim Starling, Roan Kattouw, Sam Reed, Siebrand Mazeland, Erik Möller, Magnus Manske,
Rob Lanphier, Amir Aharoni, Federico Leva, Graham Pearce and others for providing input and/or
reviewing the content.
Moodle
Tim Hunt
Moodle is a web application used in educational settings. While this chapter will try to give an
overview of all aspects of how Moodle works, it focuses on those areas where Moodle's design is
particularly interesting:
The way the application is divided into plugins;
The permission system, which controls which users can perform which actions in different
parts of the system;
The way output is generated, so that different themes (skins) can be used to give different
appearances, and so that the interface can be localised.
The database abstraction layer.
Moodle [1] provides a place online where students and teachers can come together to teach and
learn. A Moodle site is divided into courses. A course has users enrolled in it with different roles,
such as Student or Teacher. Each course comprises a number of resources and activities. A resource
might be a PDF file, a page of HTML within Moodle, or a link to something elsewhere on the web.
An activity might be a forum, a quiz or a wiki. Within the course, these resources and activities will
be structured in some way. For example they may be grouped into logical topics, or into weeks on a
calendar.
Figure 13.1: Moodle course
[1] http://moodle.org/
Moodle can be used as a standalone application. Should you wish to teach courses on software
architecture (for example) you could download Moodle to your web host, install it, start creating
courses, and wait for students to come and self-register. Alternatively, if you are a large institution,
Moodle would be just one of the systems you run. You would probably also have the infrastructure
shown in Figure 13.2.
Figure 13.2: Typical university systems architecture
An authentication/identity provider (for example LDAP) to control user accounts across all
your systems.
A student information system; that is, a database of all your students, which program of study
they are on, and hence which courses they need to complete; and their transcript, a high-level
summary of the results of the courses they have completed. This would also deal with other
administrative functions, like tracking whether they have paid their fees.
A document repository (for example, Alfresco); to store files, and track workflow as users
collaborate to create files.
An ePortfolio; this is a place where students can assemble assets, either to build a CV (resume),
or to provide evidence that they have met the requirements of a practice-based course.
A reporting or analytics tool; to generate high-level information about what is going on in your
institution.
Moodle focuses on providing an online space for teaching and learning, rather than any of the
other systems that an educational organisation might need. Moodle provides a basic implementation
of the other functionalities, so that it can function either as a stand-alone system or integrated with
other systems. The role Moodle plays is normally called a virtual learning environment (VLE), or
learning or course management system (LMS, CMS or even LCMS).
Moodle is open source or free software (GPL). It is written in PHP. It will run on most common
web servers, on common platforms. It requires a database, and will work with MySQL, PostgreSQL,
Microsoft SQL Server or Oracle.
The Moodle project was started by Martin Dougiamas in 1999, while he was working at Curtin
University, Australia. Version 1.0 was released in 2002, at which time PHP4.2 and MySQL 3.23
were the technologies available. This limited the kind of architecture that was possible initially, but
much has changed since then. The current release is the Moodle 2.2.x series.
13.1 An Overview of How Moodle Works
A Moodle installation comprises three parts:
1. The code, typically in a folder like /var/www/moodle or ~/htdocs/moodle. This should not
be writable by the web server.
2. The database, managed by one of the supported RDBMSs. In fact, Moodle adds a prefix to all
the table names, so it can share a database with other applications if desired.
3. The moodledata folder. This is a folder where Moodle stores uploaded and generated files,
and so needs to be writable by the web server. For security reasons, this should be outside the
web root.
These can all be on a single server. Alternatively, in a load-balanced set-up, there will be multiple
copies of the code on each web server, but just one shared copy of the database and moodledata,
probably on other servers.
The configuration information about these three parts is stored in a file called config.php in the
root of the moodle folder when Moodle is installed.
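For orientation, a minimal config.php looks roughly like the sketch below; the host names, credentials and paths are of course placeholders:

<?php  // Moodle configuration file (illustrative values only).
unset($CFG);
global $CFG;
$CFG = new stdClass();

$CFG->dbtype    = 'mysqli';       // which of the supported database drivers to use
$CFG->dblibrary = 'native';
$CFG->dbhost    = 'localhost';
$CFG->dbname    = 'moodle';
$CFG->dbuser    = 'moodleuser';
$CFG->dbpass    = 'secret';
$CFG->prefix    = 'mdl_';         // prefix added to every table name

$CFG->wwwroot   = 'http://www.example.com/moodle';  // URL of the site
$CFG->dataroot  = '/var/moodledata';                // the moodledata folder

require_once(dirname(__FILE__) . '/lib/setup.php'); // bootstrap the rest of Moodle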
Request Dispatching
Moodle is a web application, so users interact with it using their web browser. From Moodle's
point of view that means responding to HTTP requests. An important aspect of Moodle's design is,
therefore, the URL namespace, and how URLs get dispatched to different scripts.
Moodle uses the standard PHP approach to this. To view the main page for a course,
the URL would be .../course/view.php?id=123, where 123 is the unique id of the course
in the database. To view a forum discussion, the URL would be something like
.../mod/forum/discuss.php?id=456789. That is, these particular scripts, course/view.php
or mod/forum/discuss.php, would handle these requests.
This is simple for the developer. To understand how Moodle handles a particular request, you
look at the URL and start reading code there. It is ugly from the user's point of view. These URLs
are, however, permanent. The URLs do not change if the course is renamed, or if a moderator moves
a discussion to a different forum [2].
[2] This is a good property for URLs to have, as explained in Tim Berners-Lee's article Cool URIs don't change: http://www.w3.org/Provider/Style/URI.html
The alternative approach one could take is to have a single entry point .../index.php/[extra-
information-to-make-the-request-unique]. The single script index.php would then dispatch
the requests in some way. This approach adds a layer of indirection, which is something software
developers always like to do. The lack of this layer of indirection does not seem to hurt Moodle.
Plugins
Like many successful open source projects, Moodle is built out of many plugins, working together
with the core of the system. This is a good approach because it allows people to change and enhance
Moodle in defined ways. An important advantage of an open source system is that you can tailor
it to your particular needs. Making extensive customisations to the code can, however, lead to big
problems when the time comes to upgrade, even when using a good version control system. By
allowing as many customisations and new features as possible to be implemented as self-contained
plugins that interact with the Moodle core through a defined API, it is easier for people to customise
Moodle to their needs, and to share customisations, while still being able to upgrade the core Moodle
system.
There are various ways a system can be built as a core surrounded by plugins. Moodle has a
relatively fat core, and the plugins are strongly-typed. When I say a fat core, I mean that there is a lot
of functionality in the core. This contrasts with the kind of architecture where just about everything,
except for a small plugin-loader stub, is a plugin.
When I say plugins are strongly typed, I mean that depending on which type of functionality
you want to implement, you have to write a different type of plugin, and implement a different API.
For example, a new Activity module plugin would be very different from a new Authentication
plugin or a new Question type. At the last count there are about 35 different types of plugin [3]. This
contrasts with the kind of architecture where all plugins use basically the same API and then, perhaps,
subscribe to the subset of hooks or events they are interested in.
Generally, the trend in Moodle has been to try to shrink the core, by moving more functionality
into plugins. This effort has only been somewhat successful, however, because an increasing feature-
set tends to expand the core. The other trend has been to try to standardise the different types of
plugin as much as possible, so that in areas of common functionality, like install and upgrade, all
types of plugins work the same way.
A plugin in Moodle takes the form of a folder containing files. The plugin has a type and a name,
which together make up the Frankenstyle component name of the plugin [4]. The plugin type and
name determine the path to the plugin folder. The plugin type gives a prefix, and the folder name is
the plugin name. Here are some examples:
Plugin type              Plugin name    Frankenstyle         Folder
mod (Activity module)    forum          mod_forum            mod/forum
mod (Activity module)    quiz           mod_quiz             mod/quiz
block (Side-block)       navigation     block_navigation     blocks/navigation
qtype (Question type)    shortanswer    qtype_shortanswer    question/type/shortanswer
quiz (Quiz report)       statistics     quiz_statistics      mod/quiz/report/statistics
The last example shows that each activity module is allowed to declare sub-plugin types. At the
moment only activity modules can do this, for two reasons. If all plugins could have sub-plugins that
might cause performance problems. Activity modules are the main educational activities in Moodle,
and so are the most important type of plugin, thus they get special privileges.
An Example Plugin
I will explain a lot of details of the Moodle architecture by considering a specific example plugin.
As is traditional, I have chosen to implement a plugin that displays "Hello world".
This plugin does not really fit naturally into any of the standard Moodle plugin types. It is just a
script, with no connection to anything else, so I will choose to implement it as a local plugin. This
is a catch-all plugin type for miscellaneous functionality that does not fit anywhere better. I will name
my plugin greet, to give a Frankenstyle name of local_greet, and a folder path of local/greet [5].
[3] For a full list of Moodle plugin types see http://docs.moodle.org/dev/Plugins.
[4] The word Frankenstyle arose out of an argument in the developers' Jabber channel, but everyone liked it and it stuck.
[5] The plugin code can be downloaded from https://github.com/timhunt/moodle-local_greet.
Each plugin must contain a file called version.php which defines some basic metadata about
the plugin. This is used by Moodle's plugin installer system to install and upgrade the plugin.
For example, local/greet/version.php contains:
<?php
$plugin->component = 'local_greet';
$plugin->version   = 2011102900;
$plugin->requires  = 2011102700;
$plugin->maturity  = MATURITY_STABLE;
It may seem redundant to include the component name, since this can be deduced from the path,
but the installer uses this to verify that the plugin has been installed in the right place. The version
field is the version of this plugin. Maturity is ALPHA, BETA, RC (release candidate), or STABLE.
Requires is the minimum version of Moodle that this plugin is compatible with. If necessary, one
can also document other plugins that this one depends on.
Here is the main script for this simple plugin (stored in local/greet/index.php):
<?php
require_once(dirname(__FILE__) . '/../../config.php');              // 1
require_login();                                                     // 2
$context = context_system::instance();                               // 3
require_capability('local/greet:begreeted', $context);               // 4

$name = optional_param('name', '', PARAM_TEXT);                      // 5
if (!$name) {
    $name = fullname($USER);                                         // 6
}
add_to_log(SITEID, 'local_greet', 'begreeted',
        'local/greet/index.php?name=' . urlencode($name));           // 7

$PAGE->set_context($context);                                        // 8
$PAGE->set_url(new moodle_url('/local/greet/index.php'),
        array('name' => $name));                                     // 9
$PAGE->set_title(get_string('welcome', 'local_greet'));              // 10

echo $OUTPUT->header();                                              // 11
echo $OUTPUT->box(get_string('greet', 'local_greet',
        format_string($name)));                                      // 12
echo $OUTPUT->footer();                                              // 13
Line 1: Bootstrapping Moodle
require_once(dirname(__FILE__) . '/../../config.php'); // 1
The single line of this script that does the most work is the first. I said above that config.php
contains the details Moodle needs to connect to the database and find the moodledata folder. It ends,
however, with the line require_once('lib/setup.php'). This:
1. loads all the standard Moodle libraries using require_once;
2. starts the session handling;
3. connects to the database; and
4. sets up a number of global variables, which we shall meet later.
Line 2: Checking the User Is Logged In
require_login(); // 2
This line causes Moodle to check that the current user is logged in, using whatever authentication
plugin the administrator has configured. If not, the user will be redirected to the log-in form, and
this function will never return.
A script that was more integrated into Moodle would pass more arguments here, to say which
course or activity this page is part of, and then require_login would also verify that the user is
enrolled in, or otherwise allowed to access this course, and is allowed to see this activity. If not, an
appropriate error would be displayed.
13.2 Moodle's Roles and Permissions System
The next two lines of code show how to check that the user has permission to do something. As you
can see, from the developer's point of view, the API is very simple. Behind the scenes, however,
there is a sophisticated access system which gives the administrator great flexibility to control who
can do what.
Line 3: Getting the Context
$context = context_system::instance(); // 3
In Moodle, users can have different permissions in different places. For example, a user might be
a Teacher in one course, and a Student in another, and so have different permissions in each place.
These places are called contexts. Contexts in Moodle form a hierarchy rather like a folder hierarchy in
a file-system. At the top level is the System context (and, since this script is not very well integrated
into Moodle, it uses that context).
Within the System context are a number of contexts for the different categories that have been
created to organise courses. These can be nested, with one category containing other categories.
Category contexts can also contain Course contexts. Finally, each activity in a course will have its
own Module context.
Line 4: Checking the User Has Permission to Use This Script
require_capability('local/greet:begreeted', $context); // 4
Having got the context (the relevant area of Moodle), the permission can be checked. Each
bit of functionality that a user may or may not have is called a capability. Checking a capability
provides more fine-grained access control than the basic checks performed by require_login. Our
simple example plugin has just one capability: local/greet:begreeted.
The check is done using the require_capability function, which takes the capability name
and the context. Like other require_... functions, it will not return if the user does not have the
capability. It will display an error instead. In other places the non-fatal has_capability function,
which returns a Boolean, would be used, for example, to determine whether to display a link to this
script from another page.
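For instance, a minimal sketch of such a check, reusing the capability and string defined by this plugin (html_writer and moodle_url are standard Moodle helpers):

// On some other page: only show a link to the greeting script to users
// who would actually be allowed to use it.
if (has_capability('local/greet:begreeted', context_system::instance())) {
    echo html_writer::link(
            new moodle_url('/local/greet/index.php'),
            get_string('welcome', 'local_greet'));
}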
How does the administrator configure which user has which permission? Here is the calculation
that has_capability performs (at least conceptually):
Figure 13.3: Contexts
1. Start from the current Context.
2. Get a list of the Roles that the user has in this Context.
3. Then work out what the Permission is for each Role in this Context.
4. Aggregate those permissions to get a nal answer.
Defining Capabilities
As the example shows, a plugin can define new capabilities relating to the particular functionality it
provides. Inside each Moodle plugin there is a sub-folder of the code called db. This contains all the
information required to install or upgrade the plugin. One of those bits of information is a file called
access.php that defines the capabilities. Here is the access.php file for our plugin, which lives in
local/greet/db/access.php:
<?php
$capabilities = array('local/greet:begreeted' => array(
    'captype'      => 'read',
    'contextlevel' => CONTEXT_SYSTEM,
    'archetypes'   => array('guest' => CAP_ALLOW, 'user' => CAP_ALLOW),
));
This gives some metadata about each capability which is used when constructing the permissions
management user interface. It also gives default permissions for common types of role.
Roles
The next part of the Moodle permissions system is roles. A role is really just a named set of
permissions. When you are logged into Moodle, you will have the Authenticated user role in
the System context, and since the System context is the root of the hierarchy, that role will apply
everywhere.
Within a particular course, you may be a Student, and that role assignment will apply in the
Course context and all the Module contexts within it. In another course, however, you may have a
different role. For example, Mr Gradgrind may be Teacher in the "Facts, Facts, Facts" course, but a
Student in the professional development course "Facts Aren't Everything". Finally, a user might be
given the Moderator role in one particular forum (Module context).
Permissions
A role defines a permission for each capability. For example the Teacher role will probably ALLOW
moodle/course:manage, but the Student role will not. However, both Student and Teacher will
allow mod/forum:startdiscussion.
The roles are normally defined globally, but they can be re-defined in each context. For exam-
ple, one particular wiki can be made read-only to students by overriding the permission for the
mod/wiki:edit capability for the Student role in that wiki (Module) context, to PREVENT.
There are four Permissions:
NOT SET/INHERIT (default)
ALLOW
PREVENT
PROHIBIT
In a given context, a role will have one of these four permissions for each capability. One difference
between PROHIBIT and PREVENT is that a PROHIBIT cannot be overridden in sub-contexts.
Permission Aggregation
Finally the permissions for all the roles the user has in this context are aggregated.
If any role gives the permission PROHIBIT for this capability, return false.
Otherwise, if any role gives ALLOW for this capability, return true.
Otherwise return false.
A use case for PROHIBIT is this: Suppose a user has been making abusive posts in a number
of forums, and we want to stop them immediately. We can create a Naughty user role, which sets
mod/forum:post and other such capabilities to PROHIBIT. We can then assign this role to the
abusive user in the System context. That way, we can be sure that the user will not be able to post any
more in any forum. (We would then talk to the student, and having reached a satisfactory outcome,
remove that role assignment so that they may use the system again.)
So, Moodle's permissions system gives administrators a huge amount of flexibility. They can
define whichever roles they like with different permissions for each capability; they can alter the role
definitions in sub-contexts; and then they can assign different roles to users in different contexts.
13.3 Back to Our Example Script
The next part of the script illustrates some miscellaneous points:
Line 5: Get Data From the Request
$name = optional_param('name', '', PARAM_TEXT); // 5
Something that every web application has to do is get data from a request (GET or POST variables)
without being susceptible to SQL injection or cross-site scripting attacks. Moodle provides two ways
to do this.
The simple method is the one shown here. It gets a single variable given the parameter name
(here 'name'), a default value, and the expected type. The expected type is used to clean the input of all
unexpected characters. There are numerous types like PARAM_INT, PARAM_ALPHANUM, PARAM_EMAIL,
and so on.
There is also a similar required_param function, which like other require_... functions stops
execution and displays an error message if the expected parameter is not found.
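A hypothetical contrast between the two (the parameter names here are invented):

// A script that cannot do anything useful without knowing which course it is for:
$courseid = required_param('id', PARAM_INT);           // error page if 'id' is missing
// An optional search term, cleaned as free text and defaulting to the empty string:
$search   = optional_param('search', '', PARAM_TEXT);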
The other mechanism Moodle has for getting data from the request is a fully fledged forms library.
This is a wrapper around the HTML QuickForm library from PEAR [6]. This seemed like a good choice
when it was selected, but is now no longer maintained. At some time in the future we will have
to tackle moving to a new forms library, which many of us look forward to, because QuickForm
has several irritating design issues. For now, however, it is adequate. Forms can be defined as a
collection of fields of various types (e.g. text box, select drop-down, date-selector) with client- and
server-side validation (including use of the same PARAM_... types).
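A hedged sketch of what a form definition looks like with this library; the form class and the 'yourname' field and string are invented for illustration, while moodleform, addElement, setType, addRule and add_action_buttons are the standard formslib calls:

<?php
require_once($CFG->libdir . '/formslib.php');

class local_greet_name_form extends moodleform {
    public function definition() {
        $mform = $this->_form;   // the underlying QuickForm object

        // A text box, cleaned with the same PARAM_... type system.
        $mform->addElement('text', 'name', get_string('yourname', 'local_greet'));
        $mform->setType('name', PARAM_TEXT);
        $mform->addRule('name', null, 'required', null, 'client');

        $this->add_action_buttons();   // standard submit/cancel buttons
    }
}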
Line 6: Global Variables
if (!$name) {
$name = fullname($USER); // 6
}
This snippet shows the first of the global variables Moodle provides. $USER makes accessible
the information about the user accessing this script. Other globals include:
$CFG: holds the commonly used configuration settings.
$DB: the database connection.
$SESSION: a wrapper around the PHP session.
$COURSE: the course the current request relates to.
and several others, some of which we will encounter below.
You may have read the words "global variable" with horror. Note, however, that PHP processes a
single request at a time. Therefore these variables are not as global as all that. In fact, PHP global
variables can be seen as an implementation of the thread-scoped registry pattern [7], and this is the way
in which Moodle uses them. It is very convenient in that it makes commonly used objects available
throughout the code, without requiring them to be passed to every function and method. It is only
infrequently abused.
Nothing is Simple
This line also serves to make a point about the problem domain: nothing is ever simple. To dis-
play a user's name is more complicated than simply concatenating $USER->firstname, ' ', and
$USER->lastname. The school may have policies about showing either of those parts, and different
cultures have different conventions for which order to show names. Therefore, there are several
configuration settings and a function to assemble the full name according to the rules.
[6] For non-PHP programmers, PEAR is PHP's equivalent of CPAN.
[7] See Martin Fowler's Patterns of Enterprise Application Architecture.
Dates are a similar problem. Different users may be in different time-zones. Moodle stores
all dates as Unix time-stamps, which are integers, and so work in all databases. There is then a
userdate function to display the time-stamp to the user using the appropriate timezone and locale
settings.
Line 7: Logging
add_to_log(SITEID, 'local_greet', 'begreeted',
        'local/greet/index.php?name=' . urlencode($name)); // 7
All significant actions in Moodle are logged. Logs are written to a table in the database. This is
a trade-off. It makes sophisticated analysis quite easy, and indeed various reports based on the logs
are included with Moodle. On a large and busy site, however, it is a performance problem. The log
table gets huge, which makes backing up the database more difficult, and makes queries on the log
table slow. There can also be write contention on the log table. These problems can be mitigated in
various ways, for example by batching writes, or archiving or deleting old records to remove them
from the main database.
13.4 Generating Output
Output is mainly handled via two global objects.
Line 8: The $PAGE Global
$PAGE->set_context($context); // 8
$PAGE stores the information about the page to be output. This information is then readily
available to the code that generates the HTML. This script needs to explicitly specify the current
context. (In other situations, this might have been set automatically by require_login.) The URL
for this page must also be set explicitly. This may seem redundant, but the rationale for requiring it is
that you might get to a particular page using any number of different URLs, but the URL passed to
set_url should be the canonical URL for the page, a good permalink, if you like. The page title is
also set. This will end up in the head element of the HTML.
Line 9: Moodle URL
$PAGE->set_url(new moodle_url('/local/greet/index.php'),
        array('name' => $name)); // 9
I just wanted to flag this nice little helper class which makes manipulating URLs much easier. As
an aside, recall that the add_to_log function call above did not use this helper class. Indeed, the log
API cannot accept moodle_url objects. This sort of inconsistency is a typical sign of a code-base as
old as Moodle's.
Line 10: Internationalisation
$PAGE->set_title(get_string('welcome', 'local_greet')); // 10
Moodle uses its own system to allow the interface to be translated into any language. There may
now be good PHP internationalisation libraries, but in 2002 when it was first implemented there was
not one available that was adequate. The system is based around the get_string function. Strings
are identified by a key and the plugin Frankenstyle name. As can be seen on line 12, it is possible to
interpolate values into the string. (Multiple values are handled using PHP arrays or objects.)
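For example, a hypothetical string with two placeholders (the 'greetfull' key is invented; the {$a->...} placeholder syntax is the standard one):

// In the language file:
//     $string['greetfull'] = 'Hello, {$a->firstname} {$a->lastname}!';
// In code:
$a = new stdClass();
$a->firstname = 'Ada';
$a->lastname  = 'Lovelace';
echo get_string('greetfull', 'local_greet', $a);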
The strings are looked up in language files that are just plain PHP arrays. Here is the language
file local/greet/lang/en/local_greet.php for our plugin:
<?php
$string['greet:begreeted'] = 'Be greeted by the hello world example';
$string['welcome'] = 'Welcome';
$string['greet'] = 'Hello, {$a}!';
$string['pluginname'] = 'Hello world example';
Note that, as well as the two strings used in our script, there are also strings to give a name to the
capability, and the name of the plugin as it appears in the user interface.
The different languages are identified by the two-letter country code (en here). Language packs
may derive from other language packs. For example the fr_ca (French Canadian) language pack
declares fr (French) as the parent language, and thus only has to define those strings that differ from
the French. Since Moodle originated in Australia, en means British English, and en_us (American
English) is derived from it.
Again, the simple get_string API for plugin developers hides a lot of complexity, including
working out the current language (which may depend on the current users preferences, or the settings
for the particular course they are currently in), and then searching through all the language packs and
parent language packs to find the string.
Producing the language pack files, and co-ordinating the translation effort is managed at
http://lang.moodle.org/, which is Moodle with a custom plugin [8]. It uses both Git and the database as
a backend to store the language files with full version history.
[8] local_amos, http://docs.moodle.org/22/en/AMOS.
Line 11: Starting Output
echo $OUTPUT->header(); // 11
This is another innocuous-looking line that does much more than it seems. The point is that
before any output can be done, the applicable theme (skin) must be worked out. This may depend on
a combination of the page context and the user's preferences. $PAGE->context was, however, only
set on line 8, so the $OUTPUT global could not have been initialised at the start of the script. In order
to solve this problem, some PHP magic is used to create the proper $OUTPUT object based on the
information in $PAGE the first time any output method is called.
Another thing to consider is that every page in Moodle may contain blocks. These are extra
configurable bits of content that are normally displayed to the left or right of the main content. (They
are a type of plugin.) Again, the exact collection of blocks to display depends, in a flexible way
(that the administrator can control) on the page context and some other aspects of the page identity.
Therefore, another part of preparing for output is a call to $PAGE->blocks->load_blocks().
Once all the necessary information has been worked out, the theme plugin (that controls the
overall look of the page) is called to generate the overall page layout, including whatever standard
header and footer is desired. This call is also responsible for adding the output from the blocks
at the appropriate place in the HTML. In the middle of the layout there will be a div where the
specific content for this page goes. The HTML of this layout is generated, and then split in half after
the start of the main content div. The first half is returned, and the rest is stored to be returned by
$OUTPUT->footer().
Line 12: Outputting the Body of the Page
echo $OUTPUT->box(get_string('greet', 'local_greet',
        format_string($name))); // 12
This line outputs the body of the page. Here it simply displays the greeting in a box. The
greeting is, again, a localised string, this time with a value substituted into a placeholder. The core
renderer $OUTPUT provides many convenience methods like box to describe the required output in
quite high-level terms. Different themes can control what HTML is actually output to make the box.
The content that originally came from the user ($name) is output through the format_string
function. This is the other part of providing XSS protection. It also enables the use of text filters
(another plugin type). An example filter would be the LaTeX filter, which replaces input like
$$x + 1$$ with an image of the equation. I will mention, but not explain, that there are actually
three different functions (s, format_string, and format_text) depending on the particular type
of content being output.
Line 13: Finishing Output
echo $OUTPUT->footer(); // 13
Finally, the footer of the page is output. This example does not show it, but Moodle tracks all the
JavaScript that is required by the page, and outputs all the necessary script tags in the footer. This is
standard good practice. It allows users to see the page without waiting for all the JavaScript to load. A
developer would include JavaScript using API calls like $PAGE->requires->js('/local/greet/
cooleffect.js').
Should This Script Mix Logic and Output?
Obviously, putting the output code directly in index.php, even if at a high level of abstraction, limits
the flexibility that themes have to control the output. This is another sign of the age of the Moodle
code-base. The $OUTPUT global was introduced in 2010 as a stepping stone on the way from the old
code, where the output and controller code were in the same le, to a design where all the view code
was properly separated. This also explains the rather ugly way that the entire page layout is generated,
then split in half, so that any output from the script itself can be placed between the header and the
footer. Once the view code has been separated out of the script, into what Moodle calls a renderer,
the theme can then choose to completely (or partially) override the view code for a given script.
A small refactoring can move all the output code out of our index.php and into a renderer. The
end of index.php (lines 11 to 13) would change to:
$output = $PAGE->get_renderer('local_greet');
echo $output->greeting_page($name);
and there would be a new file local/greet/renderer.php:
<?php
class local_greet_renderer extends plugin_renderer_base {
    public function greeting_page($name) {
        $output = '';
        $output .= $this->header();
        $output .= $this->box(get_string('greet', 'local_greet', $name));
        $output .= $this->footer();
        return $output;
    }
}
If the theme wished to completely change this output, it would define a subclass of this renderer
that overrides the greeting_page method. $PAGE->get_renderer() determines the appropriate
renderer class to instantiate depending on the current theme. Thus, the output (view) code is fully
separated from the controller code in index.php, and the plugin has been refactored from typical
legacy Moodle code to a clean MVC architecture.
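To sketch what such an override might look like (the theme name 'mytheme' and the choice of a heading instead of a box are invented; the theme_<name>_<component>_renderer naming is the convention get_renderer relies on):

<?php
// In theme/mytheme/renderers.php: picked up automatically by
// $PAGE->get_renderer('local_greet') when the 'mytheme' theme is in use.
class theme_mytheme_local_greet_renderer extends local_greet_renderer {
    public function greeting_page($name) {
        // Present the greeting as a heading rather than a box.
        $output = '';
        $output .= $this->header();
        $output .= $this->heading(get_string('greet', 'local_greet', $name));
        $output .= $this->footer();
        return $output;
    }
}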
13.5 Database Abstraction
The "Hello world" script was sufficiently simple that it did not need to access the database, although
several of the Moodle library calls used did do database queries. I will now briefly describe the
Moodle database layer.
Moodle used to use the ADOdb library as the basis of its database abstraction layer, but there were
issues for us, and the extra layer of library code had a noticeable impact on performance. Therefore,
in Moodle 2.0 we switched to our own abstraction layer, which is a thin wrapper around the various
PHP database libraries.
The moodle_database Class
The heart of the library is the moodle_database class. This defines the interface provided by the
$DB global variable, which gives access to the database connection. A typical usage might be:
$course = $DB->get_record('course', array('id' => $courseid));
That translates into the SQL:
SELECT * FROM mdl_course WHERE id = $courseid;
and returns the data as a plain PHP object with public fields, so you could access $course->id,
$course->fullname, etc.
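A few more of the simple methods, sketched with invented values (a real course record needs more fields than are shown here):

// Update an existing row (uses $course->id to build the WHERE clause).
$course->fullname = 'Architecture of Open Source Applications';
$DB->update_record('course', $course);

// Insert a new row; the new id is returned.
$newid = $DB->insert_record('course', (object) array(
    'category'  => 1,
    'shortname' => 'AOSA2',
    'fullname'  => 'AOSA Volume II',
));

// Count rows matching some simple conditions.
$count = $DB->count_records('course', array('category' => 1));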
Simple methods like this deal with basic queries, and simple updates and inserts. Sometimes it
is necessary to do more complex SQL, for example to run reports. In that case, there are methods to
execute arbitrary SQL:
$courseswithactivitycounts = $DB->get_records_sql(
    'SELECT c.id, ' . $DB->sql_concat('shortname', "' '", 'fullname') . ' AS coursename,
            COUNT(1) AS activitycount
       FROM {course} c
       JOIN {course_modules} cm ON cm.course = c.id
      WHERE c.category = :categoryid
   GROUP BY c.id, c.shortname, c.fullname ORDER BY c.shortname, c.fullname',
    array('categoryid' => $category));
Some things to note there:
The table names are wrapped in {} so that the library can find them and prepend the table
name prefix.
The library uses placeholders to insert values into the SQL. In some cases this uses the facilities
of the underlying database driver. In other cases the values have to be escaped and inserted
into the SQL using string manipulation. The library supports both named placeholders (as
above) and anonymous ones, using ? as the placeholder.
For queries to work on all our supported databases a safe subset of standard SQL must be used.
For example, you can see that I have used the AS keyword for column aliases, but not for table
aliases. Both of these usage rules are necessary.
Even so, there are some situations where no subset of standard SQL will work on all our
supported databases; for example, every database has a different way to concatenate strings. In
these cases there are compatibility functions to generate the correct SQL.
Defining the Database Structure
Another area where database management systems differ a lot is in the SQL syntax required to define
tables. To get around this problem, each Moodle plugin (and Moodle core) defines the required
database tables in an XML file. The Moodle install system parses the install.xml files and uses
the information they contain to create the required tables and indexes. There is a developer tool
called XMLDB built into Moodle to help create and edit these install files.
If the database structure needs to change between two releases of Moodle (or of a plugin) then
the developer is responsible for writing code (using an additional database object that provides DDL
methods) to update the database structure, while preserving all the users data. Thus, Moodle will
always self-update from one release to the next, simplifying maintenance for administrators.
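A hedged sketch of what such upgrade code looks like, assuming a hypothetical new table and an invented version number; xmldb_table, the DDL manager returned by $DB->get_manager(), and upgrade_plugin_savepoint are the standard pieces:

<?php
// local/greet/db/upgrade.php
function xmldb_local_greet_upgrade($oldversion) {
    global $DB;
    $dbman = $DB->get_manager();   // the object that provides the DDL methods

    if ($oldversion < 2012030100) {
        // Define and create a new table, preserving all existing data.
        $table = new xmldb_table('local_greet_log');
        $table->add_field('id', XMLDB_TYPE_INTEGER, '10', null, XMLDB_NOTNULL, XMLDB_SEQUENCE);
        $table->add_field('name', XMLDB_TYPE_CHAR, '255', null, XMLDB_NOTNULL);
        $table->add_key('primary', XMLDB_KEY_PRIMARY, array('id'));

        if (!$dbman->table_exists($table)) {
            $dbman->create_table($table);
        }
        upgrade_plugin_savepoint(true, 2012030100, 'local', 'greet');
    }
    return true;
}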
One contentious point, stemming from the fact that Moodle started out using MySQL 3, is that the
Moodle database does not use foreign keys. This allows some buggy behaviour to remain undetected
even though modern databases would be capable of detecting the problem. The difficulty is that
people have been running Moodle sites without foreign keys for years, so there is almost certainly
inconsistent data present. Adding the keys now would be impossible, without a very difficult clean-up
job. Even so, since the XMLDB system was added to Moodle 1.7 (in 2006!) the install.xml files
have contained the definitions of the foreign keys that should exist, and we are still hoping, one day,
to do all the work necessary to allow us to create them during the install process.
13.6 What Has Not Been Covered
I hope I have given you a good overview of how Moodle works. Due to lack of space I have had to
omit several interesting topics, including how authentication, enrolment and grade plugins allow
Moodle to interoperate with student information systems, and the interesting content-addressed way
that Moodle stores uploaded files. Details of these, and other aspects of Moodle's design, can be
found in the developer documentation [9].
[9] http://docs.moodle.org/dev/
13.7 Lessons Learned
One interesting aspect of working on Moodle is that it came out of a research project. Moodle enables
(but does not enforce) a social constructivist pedagogy [10]. That is, we learn best by actually creating
something, and we learn from each other as a community. Martin Dougiamas's PhD question did
not ask whether this was an effective model for education, but rather whether it is an effective model
for running an open source project. That is, can we view the Moodle project as an attempt to learn
how to build and use a VLE, and an attempt to learn that by actually building and using Moodle as a
community where teachers, developers, administrators and students all teach and learn from each
other? I find this a good model for thinking about an open source software development project. The
main place where developers and users learn from each other is in discussions in the Moodle project
forums, and in the bug database.
Perhaps the most important consequence of this learning approach is that you should not be
afraid to start by implementing the simplest possible solution first. For example, early versions of
Moodle had just a few hard-coded roles like Teacher, Student and Administrator. That was enough
for many years, but eventually the limitations had to be addressed. When the time came to design
the Roles system for Moodle 1.7, there was a lot of experience in the community about how people
were using Moodle, and many little feature requests that showed what people needed to be able to
adjust using a more flexible access control system. This all helped design the Roles system to be as
simple as possible, but as complex as necessary. (In fact, the first version of the roles system ended
up slightly too complex, and it was subsequently simplified a little in Moodle 2.0.)
If you take the view that programming is a problem-solving exercise, then you might think that
Moodle got the design wrong the first time, and later had to waste time correcting it. I suggest that
is an unhelpful viewpoint when trying to solve complex real-world problems. At the time Moodle
started, no-one knew enough to design the roles system we now have. If you take the learning
viewpoint, then the various stages Moodle went through to reach the current design were necessary
and inevitable.
For this perspective to work, it must be possible to change almost any aspect of a system's
architecture once you have learned more. I think Moodle shows that this is possible. For example,
we found a way for code to be gradually refactored from legacy scripts to a cleaner MVC architecture.
This requires effort, but it seems that when necessary, the resources to implement these changes can
be found in open source projects. From the users' point of view, the system gradually evolves with
each major release.
[10] http://docs.moodle.org/22/en/Pedagogy
nginx
Andrew Alexeev
nginx (pronounced "engine x") is a free open source web server written by Igor Sysoev, a Russian
software engineer. Since its public launch in 2004, nginx has focused on high performance, high
concurrency and low memory usage. Additional features on top of the web server functionality, like
load balancing, caching, access and bandwidth control, and the ability to integrate efficiently with a
variety of applications, have helped to make nginx a good choice for modern website architectures.
Currently nginx is the second most popular open source web server on the Internet.
14.1 Why Is High Concurrency Important?
These days the Internet is so widespread and ubiquitous it's hard to imagine it wasn't exactly there,
as we know it, a decade ago. It has greatly evolved, from simple HTML producing clickable text,
based on NCSA and then on Apache web servers, to an always-on communication medium used by
more than 2 billion users worldwide. With the proliferation of permanently connected PCs, mobile
devices and recently tablets, the Internet landscape is rapidly changing and entire economies have
become digitally wired. Online services have become much more elaborate with a clear bias towards
instantly available live information and entertainment. Security aspects of running online business
have also significantly changed. Accordingly, websites are now much more complex than before,
and generally require a lot more engineering effort to be robust and scalable.
One of the biggest challenges for a website architect has always been concurrency. Since
the beginning of web services, the level of concurrency has been continuously growing. It's not
uncommon for a popular website to serve hundreds of thousands and even millions of simultaneous
users. A decade ago, the major cause of concurrency was slow clients: users with ADSL or dial-
up connections. Nowadays, concurrency is caused by a combination of mobile clients and newer
application architectures which are typically based on maintaining a persistent connection that
allows the client to be updated with news, tweets, friend feeds, and so on. Another important factor
contributing to increased concurrency is the changed behavior of modern browsers, which open four
to six simultaneous connections to a website to improve page load speed.
To illustrate the problem with slow clients, imagine a simple Apache-based web server which
produces a relatively short 100 KB response: a web page with text or an image. It can take merely a
fraction of a second to generate or retrieve this page, but it takes 10 seconds to transmit it to a client
with a bandwidth of 80 kbps (10 KB/s). Essentially, the web server would relatively quickly pull 100
KB of content, and then it would be busy for 10 seconds slowly sending this content to the client
before freeing its connection. Now imagine that you have 1,000 simultaneously connected clients
who have requested similar content. If only 1 MB of additional memory is allocated per client, it
would result in 1000 MB (about 1 GB) of extra memory devoted to serving just 1000 clients 100 KB
of content. In reality, a typical web server based on Apache commonly allocates more than 1 MB of
additional memory per connection, and regrettably tens of kbps is still often the eective speed of
mobile communications. Although the situation with sending content to a slow client might be, to
some extent, improved by increasing the size of operating system kernel socket buffers, it's not a
general solution to the problem and can have undesirable side effects.
With persistent connections the problem of handling concurrency is even more pronounced,
because to avoid latency associated with establishing new HTTP connections, clients would stay
connected, and for each connected client there's a certain amount of memory allocated by the web
server.
Consequently, to handle the increased workloads associated with growing audiences and hence
higher levels of concurrency, and to be able to continuously do so, a website should be based on a
number of very efficient building blocks. While the other parts of the equation such as hardware (CPU,
memory, disks), network capacity, application and data storage architectures are obviously important,
it is in the web server software that client connections are accepted and processed. Thus, the web
server should be able to scale nonlinearly with the growing number of simultaneous connections and
requests per second.
Isn't Apache Suitable?
Apache, the web server software that still largely dominates the Internet today, has its roots in the
beginning of the 1990s. Originally, its architecture matched the then-existing operating systems
and hardware, but also the state of the Internet, where a website was typically a standalone physical
server running a single instance of Apache. By the beginning of the 2000s it was obvious that the
standalone web server model could not be easily replicated to satisfy the needs of growing web
services. Although Apache provided a solid foundation for future development, it was architected
to spawn a copy of itself for each new connection, which was not suitable for nonlinear scalability
of a website. Eventually Apache became a general purpose web server focusing on having many
different features, a variety of third-party extensions, and universal applicability to practically any
kind of web application development. However, nothing comes without a price and the downside to
having such a rich and universal combination of tools in a single piece of software is less scalability
because of increased CPU and memory usage per connection.
Thus, when server hardware, operating systems and network resources ceased to be major
constraints for website growth, web developers worldwide started to look around for a more efficient
means of running web servers. Around ten years ago, Daniel Kegel, a prominent software engineer,
proclaimed that it's time for web servers to handle ten thousand clients simultaneously [1] and
predicted what we now call Internet cloud services. Kegel's C10K manifesto spurred a number of
attempts to solve the problem of web server optimization to handle a large number of clients at the
same time, and nginx turned out to be one of the most successful ones.
Aimed at solving the C10K problem of 10,000 simultaneous connections, nginx was written
with a different architecture in mind, one which is much more suitable for nonlinear scalability in
both the number of simultaneous connections and requests per second. nginx is event-based, so it
does not follow Apache's style of spawning new processes or threads for each web page request. The
end result is that even as load increases, memory and CPU usage remain manageable. nginx can
now deliver tens of thousands of concurrent connections on a server with typical hardware.
[1] http://www.kegel.com/c10k.html
When the first version of nginx was released, it was meant to be deployed alongside Apache
such that static content like HTML, CSS, JavaScript and images were handled by nginx to offload
concurrency and latency processing from Apache-based application servers. Over the course of
its development, nginx has added integration with applications through the use of FastCGI, uwsgi
or SCGI protocols, and with distributed memory object caching systems like memcached. Other
useful functionality like reverse proxy with load balancing and caching was added as well. These
additional features have shaped nginx into an efficient combination of tools to build a scalable web
infrastructure upon.
In February 2012, the Apache 2.4.x branch was released to the public. Although this latest
release of Apache has added new multi-processing core modules and new proxy modules aimed
at enhancing scalability and performance, it's too soon to tell if its performance, concurrency and
resource utilization are now on par with, or better than, pure event-driven web servers. It would be
very nice to see Apache application servers scale better with the new version, though, as it could
potentially alleviate bottlenecks on the backend side which still often remain unsolved in typical
nginx-plus-Apache web configurations.
Are There More Advantages to Using nginx?
Handling high concurrency with high performance and efficiency has always been the key benefit of
deploying nginx. However, there are now even more interesting benefits.
In the last few years, web architects have embraced the idea of decoupling and separating their
application infrastructure from the web server. However, what would previously exist in the form
of a LAMP (Linux, Apache, MySQL, PHP, Python or Perl)-based website, might now become not
merely a LEMP-based one (E standing for Engine x), but more and more often an exercise in
pushing the web server to the edge of the infrastructure and integrating the same or a revamped set
of applications and database tools around it in a different way.
nginx is very well suited for this, as it provides the key features necessary to conveniently offload
concurrency, latency processing, SSL (secure sockets layer), static content, compression and caching,
connections and requests throttling, and even HTTP media streaming from the application layer to a
much more efficient edge web server layer. It also allows integrating directly with memcached/Redis
or other NoSQL solutions, to boost performance when serving a large number of concurrent users.
With recent flavors of development kits and programming languages gaining wide use, more
and more companies are changing their application development and deployment habits. nginx has
become one of the most important components of these changing paradigms, and it has already
helped many companies start and develop their web services quickly and within their budgets.
The first lines of nginx were written in 2002. In 2004 it was released to the public under the
two-clause BSD license. The number of nginx users has been growing ever since, contributing ideas,
and submitting bug reports, suggestions and observations that have been immensely helpful and
beneficial for the entire community.
The nginx codebase is original and was written entirely from scratch in the C programming
language. nginx has been ported to many architectures and operating systems, including Linux,
FreeBSD, Solaris, Mac OS X, AIX and Microsoft Windows. nginx has its own libraries and with
its standard modules does not use much beyond the system's C library, except for zlib, PCRE and
OpenSSL, which can be optionally excluded from a build if not needed or because of potential license
conflicts.
A few words about the Windows version of nginx. While nginx works in a Windows environment,
the Windows version of nginx is more of a proof-of-concept than a fully functional port.
There are certain limitations in the way the nginx architecture and the Windows kernel architecture
interact at this time. The known issues of the nginx version for Windows include a much lower number
of concurrent connections, decreased performance, no caching and no bandwidth policing. Future
versions of nginx for Windows will match the mainstream functionality more closely.
14.2 Overview of nginx Architecture
Traditional process- or thread-based models of handling concurrent connections involve handling
each connection with a separate process or thread, and blocking on network or input/output operations.
Depending on the application, this approach can be very inefficient in terms of memory and CPU consumption.
Spawning a separate process or thread requires preparation of a new runtime environment, including
allocation of heap and stack memory, and the creation of a new execution context. Additional CPU
time is also spent creating these items, which can eventually lead to poor performance due to thread
thrashing on excessive context switching. All of these complications manifest themselves in older
web server architectures like Apache's. This is a trade-off between offering a rich set of generally
applicable features and optimized usage of server resources.
From the very beginning, nginx was meant to be a specialized tool to achieve more performance,
density and economical use of server resources while enabling dynamic growth of a website, so it
has followed a different model. It was actually inspired by the ongoing development of advanced
event-based mechanisms in a variety of operating systems. What resulted is a modular, event-driven,
asynchronous, single-threaded, non-blocking architecture which became the foundation of nginx
code.
nginx uses multiplexing and event notifications heavily, and dedicates specific tasks to separate
processes. Connections are processed in a highly efficient run-loop in a limited number of single-
threaded processes called workers. Within each worker nginx can handle many thousands of
concurrent connections and requests per second.
Code Structure
The nginx worker code includes the core and the functional modules. The core of nginx is responsible
for maintaining a tight run-loop and executing appropriate sections of modules' code on each stage of
request processing. Modules constitute most of the presentation and application layer functionality.
Modules read from and write to the network and storage, transform content, do outbound filtering,
apply server-side include actions and pass the requests to the upstream servers when proxying is
activated.
nginx's modular architecture generally allows developers to extend the set of web server features
without modifying the nginx core. nginx modules come in slightly different incarnations, namely
core modules, event modules, phase handlers, protocols, variable handlers, filters, upstreams and
load balancers. At this time, nginx doesn't support dynamically loaded modules; i.e., modules are
compiled along with the core at build stage. However, support for loadable modules and ABI is
planned for the future major releases. More detailed information about the roles of different modules
can be found in Section 14.4.
While handling a variety of actions associated with accepting, processing and managing network
connections and content retrieval, nginx uses event notification mechanisms and a number of disk I/O
performance enhancements in Linux, Solaris and BSD-based operating systems, like kqueue, epoll,
and event ports. The goal is to provide as many hints to the operating system as possible, in regards
to obtaining timely asynchronous feedback for inbound and outbound traffic, disk operations, reading
from or writing to sockets, timeouts and so on. The usage of different methods for multiplexing and
advanced I/O operations is heavily optimized for every Unix-based operating system nginx runs on.
A high-level overview of nginx architecture is presented in Figure 14.1.
Figure 14.1: Diagram of nginx's architecture
Workers Model
As previously mentioned, nginx doesn't spawn a process or thread for every connection. Instead,
worker processes accept new requests from a shared listen socket and execute a highly efficient
run-loop inside each worker to process thousands of connections per worker. There's no specialized
arbitration or distribution of connections to the workers in nginx; this work is done by the OS kernel
mechanisms. Upon startup, an initial set of listening sockets is created. Workers then continuously
accept, read from and write to the sockets while processing HTTP requests and responses.
The run-loop is the most complicated part of the nginx worker code. It includes comprehensive
inner calls and relies heavily on the idea of asynchronous task handling. Asynchronous operations
are implemented through modularity, event notifications, extensive use of callback functions and
fine-tuned timers. Overall, the key principle is to be as non-blocking as possible. The only situation
where nginx can still block is when there's not enough disk storage performance for a worker process.
Because nginx does not fork a process or thread per connection, memory usage is very conservative
and extremely efficient in the vast majority of cases. nginx conserves CPU cycles as well because
there's no ongoing create-destroy pattern for processes or threads. What nginx does is check the
state of the network and storage, initialize new connections, add them to the run-loop, and process
asynchronously until completion, at which point the connection is deallocated and removed from the
run-loop. Combined with the careful use of syscalls and an accurate implementation of supporting
interfaces like pool and slab memory allocators, nginx typically achieves moderate-to-low CPU
usage even under extreme workloads.
Because nginx spawns several workers to handle connections, it scales well across multiple
cores. Generally, a separate worker per core allows full utilization of multicore architectures, and
prevents thread thrashing and lock-ups. There's no resource starvation and the resource controlling
mechanisms are isolated within single-threaded worker processes. This model also allows more
scalability across physical storage devices, facilitates more disk utilization and avoids blocking on
disk I/O. As a result, server resources are utilized more efficiently with the workload shared across
several workers.
Depending on disk use and CPU load patterns, the number of nginx workers should be adjusted.
The rules are somewhat basic here, and system administrators should try a couple of configurations
for their workloads. General recommendations might be the following: if the load pattern is CPU
intensive (for instance, handling a lot of TCP/IP, doing SSL, or compression), the number of nginx
workers should match the number of CPU cores; if the load is mostly disk I/O bound (for instance,
serving different sets of content from storage, or heavy proxying), the number of workers might be
one and a half to two times the number of cores. Some engineers choose the number of workers
based on the number of individual storage units instead, though the efficiency of this approach depends
on the type and configuration of disk storage.
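As an illustrative sketch only (the values are placeholders, not recommendations), the worker-related knobs described above map to a handful of top-level directives in nginx.conf:

# Hypothetical tuning for a CPU-bound workload on an 8-core server.
worker_processes  8;              # roughly one worker per CPU core

events {
    worker_connections  4096;     # per-worker connection limit
}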
One major problem that the developers of nginx will be solving in upcoming versions is how to
avoid most of the blocking on disk I/O. At the moment, if there's not enough storage performance to
serve disk operations generated by a particular worker, that worker may still block on reading/writing
from disk. A number of mechanisms and configuration file directives exist to mitigate such disk I/O
blocking scenarios. Most notably, combinations of options like sendfile and AIO typically produce a
lot of headroom for disk performance. An nginx installation should be planned based on the data set,
the amount of memory available for nginx, and the underlying storage architecture.
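To make the sendfile and AIO combination concrete, a location serving large files from disk might be tuned roughly as follows; this is a hedged example rather than a recommendation, since the availability and behavior of aio and directio vary by platform and nginx version:

location /downloads/ {
    sendfile        on;       # kernel-level file-to-socket transfer
    aio             on;       # asynchronous disk reads where supported
    directio        4m;       # bypass the page cache for files larger than 4 MB
    output_buffers  1 512k;   # buffering used when sendfile is not applicable
}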
Another problem with the existing worker model is related to limited support for embedded
scripting. For one, with the standard nginx distribution, only embedding Perl scripts is supported.
There is a simple explanation for that: the key problem is the possibility that an embedded script will
block on some operation or exit unexpectedly. Either type of behavior would immediately lead to a
situation where the worker is hung, affecting many thousands of connections at once. More work is
planned to make embedded scripting with nginx simpler, more reliable and suitable for a broader
range of applications.
nginx Process Roles
nginx runs several processes in memory; there is a single master process and several worker processes.
There are also a couple of special purpose processes, specifically a cache loader and a cache manager.
All processes are single-threaded in version 1.x of nginx. All processes primarily use shared-memory
mechanisms for inter-process communication. The master process is run as the root user. The cache
loader, cache manager and workers run as an unprivileged user.
The master process is responsible for the following tasks:
Reading and validating configuration
Creating, binding and closing sockets
Starting, terminating and maintaining the configured number of worker processes
Reconfiguring without service interruption
Controlling non-stop binary upgrades (starting new binary and rolling back if necessary)
Re-opening log files
Compiling embedded Perl scripts
The worker processes accept, handle and process connections from clients, provide reverse
proxying and filtering functionality and do almost everything else that nginx is capable of. In regards
to monitoring the behavior of an nginx instance, a system administrator should keep an eye on
workers as they are the processes reflecting the actual day-to-day operations of a web server.
The cache loader process is responsible for checking the on-disk cache items and populating
nginx's in-memory database with cache metadata. Essentially, the cache loader prepares nginx
instances to work with files already stored on disk in a specially allocated directory structure. It
traverses the directories, checks cache content metadata, updates the relevant entries in shared
memory and then exits when everything is clean and ready for use.
The cache manager is mostly responsible for cache expiration and invalidation. It stays in memory
during normal nginx operation and it is restarted by the master process in the case of failure.
Brief Overview of nginx Caching
Caching in nginx is implemented in the form of hierarchical data storage on a filesystem. Cache
keys are configurable, and different request-specific parameters can be used to control what gets into
the cache. Cache keys and cache metadata are stored in the shared memory segments, which the
cache loader, cache manager and workers can access. Currently there is not any in-memory caching
of files, other than optimizations implied by the operating system's virtual filesystem mechanisms.
Each cached response is placed in a different file on the filesystem. The hierarchy (levels and naming
details) is controlled through nginx configuration directives. When a response is written to the
cache directory structure, the path and the name of the file are derived from an MD5 hash of the
proxy URL.
The process for placing content in the cache is as follows: When nginx reads the response from
an upstream server, the content is first written to a temporary file outside of the cache directory
structure. When nginx finishes processing the request it renames the temporary file and moves it to
the cache directory. If the temporary files directory for proxying is on another file system, the file
will be copied, thus it's recommended to keep both temporary and cache directories on the same file
system. It is also quite safe to delete files from the cache directory structure when they need to be
explicitly purged. There are third-party extensions for nginx which make it possible to control cached
content remotely, and more work is planned to integrate this functionality in the main distribution.
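A hedged configuration sketch of the caching scheme just described (zone names, sizes and paths are illustrative, and the upstream name backend is assumed to be defined elsewhere):

# Hierarchical on-disk cache with metadata kept in a shared memory zone.
proxy_cache_path  /var/cache/nginx  levels=1:2  keys_zone=app_cache:10m
                  max_size=1g  inactive=60m;
proxy_temp_path   /var/cache/nginx/tmp;    # same filesystem as the cache directory

server {
    location / {
        proxy_pass        http://backend;
        proxy_cache       app_cache;
        proxy_cache_key   $scheme$proxy_host$request_uri;   # input to the MD5 hash
        proxy_cache_valid 200 302 10m;
    }
}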
14.3 nginx Configuration
nginx's configuration system was inspired by Igor Sysoev's experiences with Apache. His main
insight was that a scalable configuration system is essential for a web server. The main scaling
problem was encountered when maintaining large complicated configurations with lots of virtual
servers, directories, locations and datasets. In a relatively big web setup it can be a nightmare if not
done properly both at the application level and by the system engineer himself.
As a result, nginx configuration was designed to simplify day-to-day operations and to provide
an easy means for further expansion of web server configuration.
nginx configuration is kept in a number of plain text files which typically reside in
/usr/local/etc/nginx or /etc/nginx. The main configuration file is usually called nginx.conf. To keep
it uncluttered, parts of the configuration can be put in separate files which can be automatically
included in the main one. However, it should be noted here that nginx does not currently support
Apache-style distributed configurations (i.e., .htaccess files). All of the configuration relevant to
nginx web server behavior should reside in a centralized set of configuration files.
The configuration files are initially read and verified by the master process. A compiled read-only
form of the nginx configuration is available to the worker processes as they are forked from the
master process. Configuration structures are automatically shared by the usual virtual memory
management mechanisms.
nginx configuration has several different contexts for main, http, server, upstream, location
(and also mail for mail proxy) blocks of directives. Contexts never overlap. For instance, there is no
such thing as putting a location block in the main block of directives. Also, to avoid unnecessary
ambiguity there isn't anything like a global web server configuration. nginx configuration is meant
to be clean and logical, allowing users to maintain complicated configuration files that comprise
thousands of directives. In a private conversation, Sysoev said, "Locations, directories, and other
blocks in the global server configuration are the features I never liked in Apache, so this is the reason
why they were never implemented in nginx."
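A minimal, illustrative skeleton of how these contexts nest (all directive values are placeholders):

worker_processes  4;                 # main context

events {
    worker_connections  1024;
}

http {
    server {                         # one virtual server
        listen       80;
        server_name  example.org;

        location / {                 # location within the server
            root  /var/www/html;
        }
    }
}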
Configuration syntax, formatting and definitions follow a so-called C-style convention. This
particular approach to making configuration files is already being used by a variety of open source
and commercial software applications. By design, C-style configuration is well-suited for nested
descriptions, being logical and easy to create, read and maintain, and liked by many engineers.
C-style configuration of nginx can also be easily automated.
While some of the nginx directives resemble certain parts of Apache configuration, setting up an
nginx instance is quite a different experience. For instance, rewrite rules are supported by nginx,
though it would require an administrator to manually adapt a legacy Apache rewrite configuration to
match nginx style. The implementation of the rewrite engine differs too.
In general, nginx settings also provide support for several original mechanisms that can be very
useful as part of a lean web server configuration. It makes sense to briefly mention variables and the
try_files directive, which are somewhat unique to nginx. Variables in nginx were developed to
provide an additional even-more-powerful mechanism to control run-time configuration of a web
server. Variables are optimized for quick evaluation and are internally pre-compiled to indices.
Evaluation is done on demand; i.e., the value of a variable is typically calculated only once and
cached for the lifetime of a particular request. Variables can be used with different configuration
directives, providing additional flexibility for describing conditional request processing behavior.
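For instance, variables can be referenced in log formats and in headers passed to an upstream (the names and paths below are hypothetical), and they are only evaluated for requests that actually use them:

http {
    log_format  timed  '$remote_addr "$request" $status $request_time';

    server {
        access_log  /var/log/nginx/access.log  timed;

        location /api/ {
            proxy_set_header  X-Original-URI  $request_uri;
            proxy_pass        http://backend;    # assumes an upstream named "backend"
        }
    }
}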
The try_files directive was initially meant to gradually replace conditional if configuration
statements in a more proper way, and it was designed to quickly and efficiently try/match against
different URI-to-content mappings. Overall, the try_files directive works well and can be extremely
efficient and useful. It is recommended that the reader thoroughly check the try_files directive
and adopt its use whenever applicable.2
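A common, illustrative use of try_files: serve a matching file if it exists, fall back to a directory, and finally hand the request off to an application entry point (the /index.php URI is a placeholder):

location / {
    try_files  $uri  $uri/  /index.php?q=$uri&$args;
}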
14.4 nginx Internals
As was mentioned before, the nginx codebase consists of a core and a number of modules. The core
of nginx is responsible for providing the foundation of the web server, web and mail reverse proxy
functionalities; it enables the use of underlying network protocols, builds the necessary run-time
environment, and ensures seamless interaction between different modules. However, most of the
protocol- and application-specific features are done by nginx modules, not the core.
2 See http://nginx.org/en/docs/http/ngx_http_core_module.html#try_files for more details.
Internally, nginx processes connections through a pipeline, or chain, of modules. In other words,
for every operation there's a module which is doing the relevant work; e.g., compression, modifying
content, executing server-side includes, communicating to the upstream application servers through
FastCGI or uwsgi protocols, or talking to memcached.
There are a couple of nginx modules that sit somewhere between the core and the real functional
modules. These modules are http and mail. These two modules provide an additional level of
abstraction between the core and lower-level components. In these modules, the handling of the
sequence of events associated with a respective application layer protocol like HTTP, SMTP or IMAP
is implemented. In combination with the nginx core, these upper-level modules are responsible for
maintaining the right order of calls to the respective functional modules. While the HTTP protocol
is currently implemented as part of the http module, there are plans to separate it into a functional
module in the future, due to the need to support other protocols like SPDY.3
The functional modules can be divided into event modules, phase handlers, output filters, variable
handlers, protocols, upstreams and load balancers. Most of these modules complement the HTTP
functionality of nginx, though event modules and protocols are also used for mail. Event modules
provide a particular OS-dependent event notification mechanism like kqueue or epoll. The event
module that nginx uses depends on the operating system capabilities and build configuration. Protocol
modules allow nginx to communicate through HTTPS, TLS/SSL, SMTP, POP3 and IMAP.
A typical HTTP request processing cycle looks like the following:
1. Client sends HTTP request
2. nginx core chooses the appropriate phase handler based on the configured location matching
the request
3. If configured to do so, a load balancer picks an upstream server for proxying
4. Phase handler does its job and passes each output buffer to the first filter
5. First filter passes the output to the second filter
6. Second filter passes the output to the third (and so on)
7. Final response is sent to the client
nginx module invocation is extremely customizable. It is performed through a series of callbacks
using pointers to the executable functions. However, the downside of this is that it may place a big
burden on programmers who would like to write their own modules, because they must define exactly
how and when the module should run. Both the nginx API and developers' documentation are being
improved and made more available to alleviate this.
Some examples of where a module can attach are:
Before the configuration file is read and processed
For each configuration directive for the location and the server where it appears
When the main configuration is initialized
When the server (i.e., host/port) is initialized
When the server configuration is merged with the main configuration
When the location configuration is initialized or merged with its parent server configuration
When the master process starts or exits
When a new worker process starts or exits
When handling a request
When filtering the response header and the body
When picking, initiating and re-initiating a request to an upstream server
3 See "SPDY: An experimental protocol for a faster web" at http://www.chromium.org/spdy/spdy-whitepaper
When processing the response from an upstream server
When finishing an interaction with an upstream server
Inside a worker, the sequence of actions leading to the run-loop where the response is generated
looks like the following:
1. Begin ngx_worker_process_cycle()
2. Process events with OS-specific mechanisms (such as epoll or kqueue)
3. Accept events and dispatch the relevant actions
4. Process/proxy request header and body
5. Generate response content (header, body) and stream it to the client
6. Finalize request
7. Re-initialize timers and events
The run-loop itself (steps 5 and 6) ensures incremental generation of a response and streaming it
to the client.
A more detailed view of processing an HTTP request might look like this:
1. Initialize request processing
2. Process header
3. Process body
4. Call the associated handler
5. Run through the processing phases
Which brings us to the phases. When nginx handles an HTTP request, it passes it through a
number of processing phases. At each phase there are handlers to call. In general, phase handlers
process a request and produce the relevant output. Phase handlers are attached to the locations
defined in the configuration file.
Phase handlers typically do four things: get the location configuration, generate an appropriate
response, send the header, and send the body. A handler has one argument: a specific structure
describing the request. A request structure has a lot of useful information about the client request,
such as the request method, URI, and header.
When the HTTP request header is read, nginx does a lookup of the associated virtual server
configuration. If the virtual server is found, the request goes through six phases:
1. Server rewrite phase
2. Location phase
3. Location rewrite phase (which can bring the request back to the previous phase)
4. Access control phase
5. try_files phase
6. Log phase
In an attempt to generate the necessary content in response to the request, nginx passes the
request to a suitable content handler. Depending on the exact location configuration, nginx may try
so-called unconditional handlers first, like perl, proxy_pass, flv, mp4, etc. If the request does not
match any of the above content handlers, it is picked by one of the following handlers, in this exact
order: random index, index, autoindex, gzip_static, static.
Indexing module details can be found in the nginx documentation, but these are the modules
which handle requests with a trailing slash. If a specialized module like mp4 or autoindex isn't
appropriate, the content is considered to be just a file or directory on disk (that is, static) and is served
by the static content handler. For a directory it would automatically rewrite the URI so that the
trailing slash is always there (and then issue an HTTP redirect).
The content produced by the content handler is then passed to the filters. Filters are also attached to locations,
and there can be several filters configured for a location. Filters do the task of manipulating the
output produced by a handler. The order of filter execution is determined at compile time. For the
out-of-the-box filters it is predefined, and for a third-party filter it can be configured at the build stage.
In the existing nginx implementation, filters can only do outbound changes and there is currently no
mechanism to write and attach filters to do input content transformation. Input filtering will appear
in future versions of nginx.
Filters follow a particular design pattern. A filter gets called, starts working, and calls the next
filter until the final filter in the chain is called. After that, nginx finalizes the response. Filters don't
have to wait for the previous filter to finish. The next filter in a chain can start its own work as soon
as the input from the previous one is available (functionally much like the Unix pipeline). In turn,
the output response being generated can be passed to the client before the entire response from the
upstream server is received.
There are header filters and body filters; nginx feeds the header and the body of the response to
the associated filters separately.
A header filter consists of three basic steps (a minimal code sketch of this pattern follows the list):
1. Decide whether to operate on this response
2. Operate on the response
3. Call the next filter
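The following is a hedged sketch of that three-step shape in the style of an nginx filter module; the module and function names (ngx_http_example_*) are hypothetical, the condition used in step 1 is arbitrary, and the actual header manipulation is omitted:

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Pointer to the next header filter in the chain, saved at init time. */
static ngx_http_output_header_filter_pt  ngx_http_next_header_filter;

static ngx_int_t
ngx_http_example_header_filter(ngx_http_request_t *r)
{
    /* 1. Decide whether to operate on this response. */
    if (r->headers_out.status != NGX_HTTP_OK) {
        return ngx_http_next_header_filter(r);
    }

    /* 2. Operate on the response (e.g., add or change a header; omitted). */

    /* 3. Call the next filter in the chain. */
    return ngx_http_next_header_filter(r);
}

/* Typically installed from the module's postconfiguration hook. */
static ngx_int_t
ngx_http_example_filter_init(ngx_conf_t *cf)
{
    ngx_http_next_header_filter = ngx_http_top_header_filter;
    ngx_http_top_header_filter = ngx_http_example_header_filter;
    return NGX_OK;
}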
Body filters transform the generated content. Examples of body filters, several of which are enabled in the configuration sketch after this list, include:
Server-side includes
XSLT filtering
Image filtering (for instance, resizing images on the fly)
Charset modification
gzip compression
Chunked encoding
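An illustrative configuration enabling some of these body filters (the image_filter directive assumes nginx was built with the optional image filter module; locations and values are placeholders):

location /gallery/ {
    image_filter  resize  300 300;   # resize images on the fly
}

location / {
    ssi      on;        # server-side includes
    gzip     on;        # gzip compression
    charset  utf-8;     # charset modification
}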
After the filter chain, the response is passed to the writer. Along with the writer there are a couple
of additional special purpose filters, namely the copy filter and the postpone filter. The copy filter
is responsible for filling memory buffers with the relevant response content, which might be stored in
a proxy temporary directory. The postpone filter is used for subrequests.
Subrequests are a very important mechanism for request/response processing. Subrequests are
also one of the most powerful aspects of nginx. With subrequests nginx can return the results from
a different URL than the one the client originally requested. Some web frameworks call this an
internal redirect. However, nginx goes further: not only can filters perform multiple subrequests and
combine the outputs into a single response, but subrequests can also be nested and hierarchical. A
subrequest can perform its own sub-subrequest, and a sub-subrequest can initiate sub-sub-subrequests.
Subrequests can map to files on the hard disk, other handlers, or upstream servers. Subrequests are
most useful for inserting additional content based on data from the original response. For example,
the SSI (server-side include) module uses a filter to parse the contents of the returned document,
and then replaces include directives with the contents of specified URLs. Alternatively, a filter could
treat the entire contents of a document as a URL to be retrieved, and then append the new document
to the URL itself.
Upstream and load balancers are also worth describing briefly. Upstreams are used to implement
what can be identified as a content handler which is a reverse proxy (proxy_pass handler). Upstream
modules mostly prepare the request to be sent to an upstream server (or backend) and receive
the response from the upstream server. There are no calls to output filters here. What an upstream
module does exactly is set callbacks to be invoked when the upstream server is ready to be written to
and read from. Callbacks implementing the following functionality exist:
Crafting a request buffer (or a chain of them) to be sent to the upstream server
Re-initializing/resetting the connection to the upstream server (which happens right before
creating the request again)
Processing the first bits of an upstream response and saving pointers to the payload received
from the upstream server
Aborting requests (which happens when the client terminates prematurely)
Finalizing the request when nginx finishes reading from the upstream server
Trimming the response body (e.g. removing a trailer)
Load balancer modules attach to the proxy_pass handler to provide the ability to choose an
upstream server when more than one upstream server is eligible. A load balancer registers an enabling
configuration file directive, provides additional upstream initialization functions (to resolve upstream
names in DNS, etc.), initializes the connection structures, decides where to route the requests, and updates
stats information. Currently nginx supports two standard disciplines for load balancing to upstream
servers: round-robin and ip-hash.
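A minimal, illustrative upstream configuration; the server names are placeholders, round-robin is the default discipline, and uncommenting ip_hash would switch to IP-based stickiness:

upstream backend {
    # ip_hash;                      # switch from round-robin to ip-hash
    server  app1.example.com:8080;
    server  app2.example.com:8080;
}

server {
    location / {
        proxy_pass  http://backend;
    }
}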
Upstream and load balancing handling mechanisms include algorithms to detect failed upstream
servers and to re-route new requests to the remaining ones, though a lot of additional work is
planned to enhance this functionality. In general, more work on load balancers is planned, and in the
next versions of nginx the mechanisms for distributing the load across different upstream servers as
well as health checks will be greatly improved.
There are also a couple of other interesting modules which provide an additional set of variables
for use in the configuration file. While the variables in nginx are created and updated across different
modules, there are two modules that are entirely dedicated to variables: geo and map. The geo
module is used to facilitate tracking of clients based on their IP addresses. This module can create
arbitrary variables that depend on the client's IP address. The other module, map, allows for the
creation of variables from other variables, essentially providing the ability to do flexible mappings
of hostnames and other run-time variables. This kind of module may be called the variable handler.
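A hypothetical sketch of the geo and map variable handlers (the networks, patterns and variable names are made up for illustration):

geo $banned {
    default        0;
    192.0.2.0/24   1;               # mark one client network
}

map $http_user_agent $is_bot {
    default           0;
    ~*(crawl|spider)  1;            # crude user-agent based classification
}

server {
    location / {
        if ($banned) {
            return 403;
        }
        # $is_bot could be used for logging or routing decisions
    }
}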
Memory allocation mechanisms implemented inside a single nginx worker were, to some extent,
inspired by Apache. A high-level description of nginx memory management would be the following:
For each connection, the necessary memory buffers are dynamically allocated, linked, used for
storing and manipulating the header and body of the request and the response, and then freed upon
connection release. It is very important to note that nginx tries to avoid copying data in memory as
much as possible and most of the data is passed along by pointer values, not by calling memcpy.
Going a bit deeper, when the response is generated by a module, the retrieved content is put in a
memory buffer which is then added to a buffer chain link. Subsequent processing works with this
buffer chain link as well. Buffer chains are quite complicated in nginx because there are several
processing scenarios which differ depending on the module type. For instance, it can be quite tricky
to manage the buffers precisely while implementing a body filter module. Such a module can only
operate on one buffer (chain link) at a time and it must decide whether to overwrite the input buffer,
replace the buffer with a newly allocated buffer, or insert a new buffer before or after the buffer in
question. To complicate things, sometimes a module will receive several buffers so that it has an
incomplete buffer chain that it must operate on. However, at this time nginx provides only a low-level
API for manipulating buffer chains, so before doing any actual implementation a third-party module
developer should become really fluent with this arcane part of nginx.
A note on the above approach is that there are memory buffers allocated for the entire life of a
connection, thus for long-lived connections some extra memory is kept. At the same time, on an idle
keepalive connection, nginx spends just 550 bytes of memory. A possible optimization for future
releases of nginx would be to reuse and share memory buffers for long-lived connections.
The task of managing memory allocation is done by the nginx pool allocator. Shared memory
areas are used to hold the accept mutex, cache metadata, the SSL session cache and the information associated
with bandwidth policing and management (limits). There is a slab allocator implemented in nginx to
manage shared memory allocation. To allow simultaneous safe use of shared memory, a number
of locking mechanisms are available (mutexes and semaphores). In order to organize complex data
structures, nginx also provides a red-black tree implementation. Red-black trees are used to keep
cache metadata in shared memory, track non-regex location definitions and for a couple of other
tasks.
Unfortunately, all of the above was never described in a consistent and simple manner, making
the job of developing third-party extensions for nginx quite complicated. Although some good
documents on nginx internals exist (for instance, those produced by Evan Miller), such documents
required a huge reverse engineering effort, and the implementation of nginx modules is still a black
art for many.
Despite certain difficulties associated with third-party module development, the nginx user
community recently saw a lot of useful third-party modules. There is, for instance, an embedded Lua
interpreter module for nginx, additional modules for load balancing, full WebDAV support, advanced
cache control and other interesting third-party work that the authors of this chapter encourage and
will support in the future.
14.5 Lessons Learned
When Igor Sysoev started to write nginx, most of the software enabling the Internet already existed,
and the architecture of such software typically followed definitions of legacy server and network
hardware, operating systems, and old Internet architecture in general. However, this didn't prevent
Igor from thinking he might be able to improve things in the web server area. So, while the first
lesson might seem obvious, it is this: there is always room for improvement.
With the idea of better web software in mind, Igor spent a lot of time developing the initial code
structure and studying different ways of optimizing the code for a variety of operating systems. Ten
years later he is developing a prototype of nginx version 2.0, taking into account the years of active
development on version 1. It is clear that the initial prototype of a new architecture, and the initial
code structure, are vitally important for the future of a software product.
Another point worth mentioning is that development should be focused. The Windows version
of nginx is probably a good example of how it is worth avoiding the dilution of development efforts
on something that is neither the developer's core competence nor the target application. It is equally
applicable to the rewrite engine that appeared during several attempts to enhance nginx with more
features for backward compatibility with existing legacy setups.
Last but not least, it is worth mentioning that despite the fact that the nginx developer community
is not very large, third-party modules and extensions for nginx have always been a very important
part of its popularity. The work done by Evan Miller, Piotr Sikora, Valery Kholodkov, Zhang
Yichun (agentzh) and other talented software engineers has been much appreciated by the nginx user
community and its original developers.
Open MPI
Jeffrey M. Squyres
15.1 Background
Open MPI [?] is an open source software implementation of the Message Passing Interface (MPI)
standard. Before the architecture and innards of Open MPI will make any sense, a little background
on the MPI standard must be discussed.
The Message Passing Interface (MPI)
The MPI standard is created and maintained by the MPI Forum1, an open group consisting of
parallel computing experts from both industry and academia. MPI defines an API that is used for a
specific type of portable, high-performance inter-process communication (IPC): message passing.
Specifically, the MPI document describes the reliable transfer of discrete, typed messages between
MPI processes. Although the definition of an MPI process is subject to interpretation on a given
platform, it usually corresponds to the operating system's concept of a process (e.g., a POSIX
process). MPI is specifically intended to be implemented as middleware, meaning that upper-level
applications call MPI functions to perform message passing.
1 http://www.mpi-forum.org/
MPI defines a high-level API, meaning that it abstracts away whatever underlying transport is
actually used to pass messages between processes. The idea is that sending-process X can effectively
say "take this array of 1,073 double precision values and send them to process Y". The corresponding
receiving-process Y effectively says "receive an array of 1,073 double precision values from process
X". A miracle occurs, and the array of 1,073 double precision values arrives in Y's waiting buffer.
Notice what is absent in this exchange: there is no concept of a connection occurring, no stream
of bytes to interpret, and no network addresses exchanged. MPI abstracts all of that away, not only
to hide such complexity from the upper-level application, but also to make the application portable
across different environments and underlying message passing transports. Specifically, a correct
MPI application is source-compatible across a wide variety of platforms and network types.
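The exchange between processes X and Y above maps onto the standard MPI API roughly as follows; this sketch uses two ranks of MPI_COMM_WORLD as X and Y and omits filling the buffer with meaningful data:

#include <mpi.h>

int main(int argc, char *argv[])
{
    double buf[1073];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                    /* "process X" sends */
        MPI_Send(buf, 1073, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {             /* "process Y" receives */
        MPI_Recv(buf, 1073, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Note that no addresses, sockets, or wire formats appear anywhere; the MPI implementation chooses the transport underneath.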
MPI defines not only point-to-point communication (e.g., send and receive), it also defines other
communication patterns, such as collective communication. Collective operations are where multiple
processes are involved in a single communication action. Reliable broadcast, for example, is where
one process has a message at the beginning of the operation, and at the end of the operation, all
processes in a group have the message. MPI also defines other concepts and communication patterns
that are not described here.2
Uses of MPI
There are many implementations of the MPI standard that support a wide variety of platforms,
operating systems, and network types. Some implementations are open source, some are closed
source. Open MPI, as its name implies, is one of the open source implementations. Typical MPI
transport networks include (but are not limited to): various protocols over Ethernet (e.g., TCP,
iWARP, UDP, raw Ethernet frames, etc.), shared memory, and InfiniBand.
MPI implementations are typically used in so-called high-performance computing (HPC)
environments. MPI essentially provides the IPC for simulation codes, computational algorithms,
and other big number crunching types of applications. The input data sets on which these codes
operate typically represent too much computational work for just one server; MPI jobs are spread out
across tens, hundreds, or even thousands of servers, all working in concert to solve one computational
problem.
That is, the applications using MPI are both parallel in nature and highly compute-intensive.
It is not unusual for all the processor cores in an MPI job to run at 100% utilization. To be clear,
MPI jobs typically run in dedicated environments where the MPI processes are the only application
running on the machine (in addition to bare-bones operating system functionality, of course).
As such, MPI implementations are typically focused on providing extremely high performance,
measured by metrics such as:
Extremely low latency for short message passing. As an example, a 1-byte message can be
sent from a user-level Linux process on one server, through an InfiniBand switch, and received
at the target user-level Linux process on a different server in a little over 1 microsecond (i.e.,
0.000001 second).
Extremely high message network injection rate for short messages. Some vendors have MPI
implementations (paired with specified hardware) that can inject up to 28 million messages
per second into the network.
Quick ramp-up (as a function of message size) to the maximum bandwidth supported by the
underlying transport.
Low resource utilization. Any resources used by MPI (e.g., memory, cache, and bus bandwidth)
cannot be used by the application. MPI implementations therefore try to maintain a balance of
low resource utilization while still providing high performance.
Open MPI
The first version of the MPI standard, MPI-1.0, was published in 1994 [?]. MPI-2.0, a set of additions
on top of MPI-1, was completed in 1996 [?].
In the first decade after MPI-1 was published, a variety of MPI implementations sprung up. Many
were provided by vendors for their proprietary network interconnects. Many other implementations
arose from the research and academic communities. Such implementations were typically research-
quality, meaning that their purpose was to investigate various high-performance networking concepts
and provide proofs-of-concept of their work. However, some were high enough quality that they
gained popularity and a number of users.
2 As of this writing, the most recent version of the MPI standard is MPI-2.2 [?]. Draft versions of the upcoming MPI-3
standard have been published; it may be finalized as early as late 2012.
Open MPI represents the union of four research/academic, open source MPI implementations:
LAM/MPI, LA/MPI (Los Alamos MPI), and FT-MPI (Fault-Tolerant MPI). The members of the
PACX-MPI team joined the Open MPI group shortly after its inception.
The members of these four development teams decided to collaborate when we had the collective
realization that, aside from minor differences in optimizations and features, our software code bases
were quite similar. Each of the four code bases had their own strengths and weaknesses, but on the
whole, they more-or-less did the same things. So why compete? Why not pool our resources, work
together, and make an even better MPI implementation?
After much discussion, the decision was made to abandon our four existing code bases and take
only the best ideas from the prior projects. This decision was mainly predicated upon the following
premises:
Even though many of the underlying algorithms and techniques were similar among the four
code bases, they each had radically different implementation architectures, and would be
incredibly difficult (if not impossible) to merge.
Each of the four also had their own (significant) strengths and (significant) weaknesses. Specifically,
there were features and architecture decisions from each of the four that were desirable
to carry forward. Likewise, there was poorly optimized and badly designed code in each of
the four that was desirable to leave behind.
The members of the four developer groups had not worked directly together before. Starting
with an entirely new code base (rather than advancing one of the existing code bases) put all
developers on equal ground.
Thus, Open MPI was born. Its first Subversion commit was on November 22, 2003.
15.2 Architecture
For a variety of reasons (mostly related to either performance or portability), C and C++ were the only
two possibilities for the primary implementation language. C++ was eventually discarded because
different C++ compilers tend to lay out structs/classes in memory according to different optimization
algorithms, leading to different on-the-wire network representations. C was therefore chosen as the
primary implementation language, which influenced several architectural design decisions.
When Open MPI was started, we knew that it would be a large, complex code base:
In 2003, the current version of the MPI standard, MPI-2.0, defined over 300 API functions.
Each of the four prior projects was large in itself. For example, LAM/MPI had over
1,900 files of source code, comprising over 300,000 lines of code (including comments and
blanks).
We wanted Open MPI to support more features, environments, and networks than all four prior
projects put together.
We therefore spent a good deal of time designing an architecture that focused on three things:
1. Grouping similar functionality together in distinct abstraction layers
2. Using run-time loadable plugins and run-time parameters to choose between multiple different
implementations of the same behavior
3. Not allowing abstraction to get in the way of performance
Figure 15.1: Abstraction layer architectural view of Open MPI showing its three main layers: OPAL, ORTE,
and OMPI
Abstraction Layer Architecture
Open MPI has three main abstraction layers, shown in Figure 15.1:
Open, Portable Access Layer (OPAL): OPAL is the bottom layer of Open MPIs abstractions.
Its abstractions are focused on individual processes (versus parallel jobs). It provides utility
and glue code such as generic linked lists, string manipulation, debugging controls, and other
mundaneyet necessaryfunctionality.
OPAL also provides Open MPIs core portability between dierent operating systems, such as
discovering IP interfaces, sharing memory between processes on the same server, processor
and memory anity, high-precision timers, etc.
Open MPI Run-Time Environment (ORTE)3: An MPI implementation must provide not only the
required message passing API, but also an accompanying run-time system to launch, monitor,
and kill parallel jobs. In Open MPI's case, a parallel job is comprised of one or more processes
that may span multiple operating system instances, and are bound together to act as a single,
cohesive unit.
In simple environments with little or no distributed computational support, ORTE uses rsh
or ssh to launch the individual processes in parallel jobs. More advanced, HPC-dedicated
environments typically have schedulers and resource managers for fairly sharing computational
resources between many users. Such environments usually provide specialized APIs to launch
and regulate processes on compute servers. ORTE supports a wide variety of such managed
environments, such as (but not limited to): Torque/PBS Pro, SLURM, Oracle Grid Engine,
and LSF.
Open MPI (OMPI): The MPI layer is the highest abstraction layer, and is the only one exposed
to applications. The MPI API is implemented in this layer, as are all the message passing
semantics defined by the MPI standard.
Since portability is a primary requirement, the MPI layer supports a wide variety of network
types and underlying protocols. Some networks are similar in their underlying characteristics
and abstractions; some are not.
Although each abstraction is layered on top of the one below it, for performance reasons the
ORTE and OMPI layers can bypass the underlying abstraction layers and interact directly with the
operating system and/or hardware when needed (as depicted in Figure 15.1). For example, the
OMPI layer uses OS-bypass methods to communicate with certain types of NIC hardware to obtain
maximum networking performance.
3 Pronounced "or-tay".
Each layer is built into a standalone library. The ORTE library depends on the OPAL library; the
OMPI library depends on the ORTE library. Separating the layers into their own libraries has acted
as a wonderful tool for preventing abstraction violations. Specifically, applications will fail to link
if one layer incorrectly attempts to use a symbol in a higher layer. Over the years, this abstraction
enforcement mechanism has saved many developers from inadvertently blurring the lines between
the three layers.
Plugin Architecture
Although the initial members of the Open MPI collaboration shared a similar core goal (produce a
portable, high-performance implementation of the MPI standard), our organizational backgrounds,
opinions, and agendas were, and still are, wildly different. We therefore spent a considerable
amount of time designing an architecture that would allow us to be dierent, even while sharing a
common code base.
Run-time loadable components were a natural choice (a.k.a., dynamic shared objects, or DSOs,
or plugins). Components enforce a common API but place few limitations on the implementation of
that API. Specifically: the same interface behavior can be implemented multiple different ways. Users
can then choose, at run time, which plugin(s) to use. This even allows third parties to independently
develop and distribute their own Open MPI plugins outside of the core Open MPI package. Allowing
arbitrary extensibility is quite a liberating policy, both within the immediate set of Open MPI
developers and in the greater Open MPI community.
This run-time flexibility is a key component of the Open MPI design philosophy and is deeply
integrated throughout the entire architecture. Case in point: the Open MPI v1.5 series includes
155 plugins. To list just a few examples, there are plugins for different memcpy() implementations,
plugins for how to launch processes on remote servers, and plugins for how to communicate on
different types of underlying networks.
One of the major benefits of using plugins is that multiple groups of developers have freedom
to experiment with alternate implementations without affecting the core of Open MPI. This was a
critical feature, particularly in the early days of the Open MPI project. Sometimes the developers
didn't know the right way to implement something; sometimes they just disagreed.
In both cases, each party would implement their solution in a component, allowing the rest of the
developer community to easily compare and contrast. Code comparisons can be done without
components, of course, but the component concept helps guarantee that all implementations expose
exactly the same external API, and therefore provide exactly the same required semantics.
As a direct result of the flexibility that it provides, the component concept is utilized heavily
throughout all three layers of Open MPI; in each layer there are many different types of components.
Each type of component is enclosed in a framework. A component belongs to exactly one framework,
and a framework supports exactly one kind of component. Figure 15.2 is a template of Open MPI's
architectural layout; it shows a few of Open MPI's frameworks and some of the components that
they contain. (The rest of Open MPI's frameworks and components are laid out in the same manner.)
Open MPI's set of layers, frameworks, and components is referred to as the Modular Component
Architecture (MCA).
Finally, another major advantage of using frameworks and components is their inherent
composability. With over 40 frameworks in Open MPI v1.5, giving users the ability to mix-n-match different
plugins of different types allows them to create a software stack that is effectively tailored to their
individual system.
Figure 15.2: Framework architectural view of Open MPI, showing just a few of Open MPI's frameworks and
components (i.e., plugins). Each framework contains a base and one or more components. This structure is
replicated in each of the layers shown in Figure 15.1. The sample frameworks listed in this figure are spread
across all three layers: btl and coll are in the OMPI layer, plm is in the ORTE layer, and timer is in the OPAL
layer.
Plugin Frameworks
Each framework is fully self-contained in its own subdirectory in the Open MPI source code tree. The
name of the subdirectory is the same name as the framework; for example, the memory framework is
in the memory directory. Framework directories contain at least the following three items:
1. Component interface definition: A header file named <framework>.h will be located in the
top-level framework directory (e.g., the Memory framework contains memory/memory.h).
This well-known header file defines the interfaces that each component in the framework must
support. This header includes function pointer typedefs for the interface functions, structs for
marshaling these function pointers, and any other necessary types, attribute fields, macros,
declarations, etc.
2. Base code: The base subdirectory contains the glue code that provides the core functionality
of the framework. For example, the memory framework's base directory is memory/base. The
base is typically comprised of logistical grunt work such as finding and opening components
at run-time, common utility functionality that may be utilized by multiple components, etc.
3. Components: All other subdirectories in the framework directory are assumed to be components.
Just like the framework, the names of the components are the same names as their
subdirectories (e.g., the memory/posix subdirectory contains the POSIX component in the
Memory framework).
Similar to how each framework defines the interfaces to which its components must adhere,
frameworks also define other operational aspects, such as how they bootstrap themselves, how they
pick components to use, and how they are shut down. Two common examples of how frameworks
differ in their setup are many-of-many versus one-of-many frameworks, and static versus dynamic
frameworks.
Many-of-many frameworks. Some frameworks have functionality that can be implemented
multiple different ways in the same process. For example, Open MPI's point-to-point network
framework will load multiple driver plugins to allow a single process to send and receive messages
on multiple network types.
Such frameworks will typically open all components that they can find and then query each
component, effectively asking, "Do you want to run?" The components determine whether they want
to run by examining the system on which they are running. For example, a point-to-point network
component will look to see if the network type it supports is both available and active on the system.
If it is not, the component will reply "No, I do not want to run", causing the framework to close and
unload that component. If that network type is available, the component will reply "Yes, I want to
run", causing the framework to keep the component open for further use.
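A hypothetical, simplified sketch of this selection loop (this is not Open MPI's actual component API; the types and function names here are invented for illustration):

#include <stddef.h>

/* Minimal stand-in for a component: a name plus query/close callbacks. */
typedef struct component {
    const char *name;
    int  (*query)(void);    /* returns nonzero if the component wants to run */
    void (*close)(void);    /* release resources if the component is rejected */
} component_t;

/* Keep the components that answer "yes"; close and drop the others. */
size_t select_components(component_t *all[], size_t n, component_t *selected[])
{
    size_t n_selected = 0;

    for (size_t i = 0; i < n; i++) {
        if (all[i]->query()) {
            selected[n_selected++] = all[i];   /* "Yes, I want to run" */
        } else {
            all[i]->close();                   /* "No" -> close and unload */
        }
    }
    return n_selected;
}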
One-of-many frameworks. Other frameworks provide functionality for which it does not make
sense to have more than one implementation available at run-time. For example, the creation of
a consistent checkpoint of a parallel job (meaning that the job is effectively frozen and can be
arbitrarily resumed later) must be performed using the same back-end checkpointing system for
each process in the job. The plugin that interfaces to the desired back-end checkpointing system is
the only checkpoint plugin that must be loaded in each process; all others are unnecessary.
Dynamic frameworks. Most frameworks allow their components to be loaded at run-time via
DSOs. This is the most flexible method of finding and loading components; it allows features such
as explicitly not loading certain components, loading third-party components that were not included
in the main-line Open MPI distribution, etc.
Static frameworks. Some one-of-many frameworks have additional constraints that force their
one-and-only-one component to be selected at compile time (versus run time). Statically linking one-
of-many components allows direct invocation of its member functions (versus invocation via function
pointer), which may be important in highly performance-sensitive functionality. One example is the
memcpy framework, which provides platform-optimized memcpy() implementations.
Additionally, some frameworks provide functionality that may need to be utilized before Open
MPI is fully initialized. For example, the use of some network stacks requires complicated memory
registration models, which, in turn, require replacing the C library's default memory management
routines. Since memory management is intrinsic to an entire process, replacing the default scheme
can only be done pre-main. Therefore, such components must be statically linked into Open MPI
processes so that they can be available for pre-main hooks, long before MPI has even been initialized.
Plugin Components
Open MPI plugins are divided into two parts: a component struct and a module struct. The component
struct and the functions to which it refers are typically collectively referred to as the component.
Similarly, the module collectively refers to the module struct and its functions. The division
is somewhat analogous to C++ classes and objects. There is only one component per process; it
describes the overall plugin with some fields that are common to all components (regardless of
framework). If the component elects to run, it is used to generate one or more modules, which
typically perform the bulk of the functionality required by the framework.
Throughout the next few sections, we'll build up the structures necessary for the TCP component in the BTL (byte transfer layer) framework. The BTL framework effects point-to-point message transfers; the TCP component, not surprisingly, uses TCP as its underlying transport for message passing.
Component struct. Regardless of framework, each component contains a well-known, statically
allocated and initialized component struct. The struct must be named according to the template
mca_<framework>_<component>_component. For example, the TCP network driver component's struct in the BTL framework is named mca_btl_tcp_component.
Having templated component symbols both guarantees that there will be no name collisions between components, and allows the MCA core to find any arbitrary component struct via dlsym(2) (or the appropriate equivalent in each supported operating system).
The base component struct contains some logistical information, such as the component's formal
name, version, framework version adherence, etc. This data is used for debugging purposes, inventory
listing, and run-time compliance and compatibility checking.
struct mca_base_component_2_0_0_t {
    /* Component struct version number */
    int mca_major_version, mca_minor_version, mca_release_version;

    /* The string name of the framework that this component belongs to,
       and the framework's API version that this component adheres to */
    char mca_type_name[MCA_BASE_MAX_TYPE_NAME_LEN + 1];
    int mca_type_major_version, mca_type_minor_version,
        mca_type_release_version;

    /* This component's name and version number */
    char mca_component_name[MCA_BASE_MAX_COMPONENT_NAME_LEN + 1];
    int mca_component_major_version, mca_component_minor_version,
        mca_component_release_version;

    /* Function pointers */
    mca_base_open_component_1_0_0_fn_t mca_open_component;
    mca_base_close_component_1_0_0_fn_t mca_close_component;
    mca_base_query_component_2_0_0_fn_t mca_query_component;
    mca_base_register_component_params_2_0_0_fn_t mca_register_component_params;
};
The base component struct is the core of the TCP BTL component; it contains the following
function pointers:
Open. The open call is the initial query function invoked on a component. It allows a component
to initialize itself, look around the system where it is running, and determine whether it wants
to run. If a component can always be run, it can provide a NULL open function pointer.
The TCP BTL component open function mainly initializes some data structures and ensures
that invalid parameters were not set by the user.
Close. When a framework decides that a component is no longer needed, it calls the close
function to allow the component to release any resources that it has allocated. The close
function is invoked on all remaining components when processes are shutting down. However,
close can also be invoked on components that are rejected at run time so that they can be closed
and ignored for the duration of the process.
The TCP BTL component close function closes listening sockets and frees resources (e.g.,
receiving buffers).
Query. This call is a generalized "Do you want to run?" function. Not all frameworks utilize this specific call; some need more specialized query functions.
The BTL framework does not use the generic query function (it defines its own; see below), so the TCP BTL does not fill it in.
Parameter registration. This function is typically the first function called on a component. It
allows the component to register any relevant run-time, user-settable parameters. Run-time
parameters are discussed further below.
The TCP BTL component register function creates a variety of user-settable run-time parame-
ters, such as one which allows the user to specify which IP interface(s) to use.
The component structure can also be extended on a per-framework and/or per-component basis.
Frameworks typically create a new component struct with the component base struct as the first member. This nesting allows frameworks to add their own attributes and function pointers. For example, a framework that needs a more specialized query function (as compared to the query function provided on the basic component) can add a function pointer in its framework-specific
component struct.
The MPI btl framework, which provides point-to-point MPI messaging functionality, uses this
technique.
struct mca_btl_base_component_2_0_0_t {
    /* Base component struct */
    mca_base_component_t btl_version;

    /* Base component data block */
    mca_base_component_data_t btl_data;

    /* btl-framework specific query functions */
    mca_btl_base_component_init_fn_t btl_init;
    mca_btl_base_component_progress_fn_t btl_progress;
};
As an example of these btl-framework-specific query functions, the TCP BTL component's btl_init function does several things:
Creates a listening socket for each "up" IPv4 and IPv6 interface
Creates a module for each "up" IP interface
Registers the tuple (IP address, port) for each "up" IP interface with a central repository so that other MPI processes know how to contact it
Similarly, plugins can extend the framework-specific component struct with their own members.
The tcp component in the btl framework does this; it caches many data members in its component
struct.
struct mca_btl_tcp_component_t {
    /* btl framework-specific component struct */
    mca_btl_base_component_2_0_0_t super;

    /* Some of the TCP BTL component's specific data members */
    /* Number of TCP interfaces on this server */
    uint32_t tcp_addr_count;

    /* IPv4 listening socket descriptor */
    int tcp_listen_sd;

    /* ...and many more not shown here */
};
This struct-nesting technique is effectively a simple emulation of C++ single inheritance: a pointer to an instance of a struct mca_btl_tcp_component_t can be cast to any of the three types such that it can be used by an abstraction layer that does not understand the derived types.
That being said, casting is generally frowned upon in Open MPI because it can lead to incredibly subtle, difficult-to-find bugs. An exception was made for this C++-emulation technique because it has well-defined behaviors and helps enforce abstraction barriers.
Module struct. Module structs are individually defined by each framework; there is little com-
monality between them. Depending on the framework, components generate one or more module
struct instances to indicate that they want to be used.
For example, in the BTL framework, one module usually corresponds to a single network device. If an MPI process is running on a Linux server with three "up" Ethernet devices, the TCP BTL component will generate three TCP BTL modules, one corresponding to each Linux Ethernet device.
Each module will then be wholly responsible for all sending and receiving to and from its Ethernet
device.
Tying it all together. Figure 15.3 shows the nesting of the structures in the TCP BTL component,
and how it generates one module for each of the three Ethernet devices.
Figure 15.3: The left side shows the nesting of structures in the TCP BTL component. The right side shows
how the component generates one module struct for each up Ethernet interface.
Composing BTL modules this way allows the upper-layer MPI progression engine both to treat
all network devices equally, and to perform user-level channel bonding.
For example, consider sending a large message across the three-device configuration described above. Assume that each of the three Ethernet devices can be used to reach the intended receiver (reachability is determined by TCP networks and netmasks, and some well-defined heuristics). In this case, the sender will split the large message into multiple fragments. Each fragment will be assigned, in a round-robin fashion, to one of the TCP BTL modules (each module will therefore be assigned roughly one third of the fragments). Each module then sends its fragments over its corresponding Ethernet device.
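The striping itself is just modular arithmetic over the list of modules, as the following sketch shows. Again, this is an illustration only; the real BTL code is C, and the types here are invented.

import java.util.List;

final class RoundRobinStriping {
    // Stand-in for a per-device BTL module.
    interface TcpModule {
        void send(byte[] fragment);
    }

    // Fragment i goes to module i mod N, so each of the N modules sends
    // roughly 1/N of the message over its own Ethernet device.
    static void sendFragments(List<byte[]> fragments, List<TcpModule> modules) {
        for (int i = 0; i < fragments.size(); i++) {
            modules.get(i % modules.size()).send(fragments.get(i));
        }
    }
}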
This may seem like a complex scheme, but it is surprisingly effective. By pipelining the sends
of a large message across the multiple TCP BTL modules, typical HPC environments (e.g., where
each Ethernet device is on a separate PCI bus) can sustain nearly maximum bandwidth speeds across
multiple Ethernet devices.
Run-Time Parameters
Developers commonly make decisions when writing code, such as:
"Should I use algorithm A or algorithm B?"
"How large of a buffer should I preallocate?"
"How long should the timeout be?"
"At what message size should I change network protocols?"
...and so on.
Users tend to assume that the developers will answer such questions in a way that is generally
suitable for most types of systems. However, the HPC community is full of scientist and engineer
power users who want to aggressively tweak their hardware and software stacks to eke out every
possible compute cycle. Although these users typically do not want to tinker with the actual code
of their MPI implementation, they do want to tinker by selecting different internal algorithms, choosing different resource consumption patterns, or forcing specific network protocols in different
circumstances.
Therefore, the MCA parameter system was included when designing Open MPI; the system is
a flexible mechanism that allows users to change internal Open MPI parameter values at run time. Specifically, developers register string and integer MCA parameters throughout the Open MPI code base, along with an associated default value and descriptive string defining what the parameter is
and how it is used. The general rule of thumb is that rather than hard-coding constants, developers
use run-time-settable MCA parameters, thereby allowing power users to tweak run-time behavior.
There are a number of MCA parameters in the base code of the three abstraction layers, but the
bulk of Open MPI's MCA parameters are located in individual components. For example, the TCP BTL plugin has a parameter that specifies whether only TCPv4 interfaces, only TCPv6 interfaces, or
both types of interfaces should be used. Alternatively, another TCP BTL parameter can be set to
specify exactly which Ethernet devices to use.
Users can discover what parameters are available via a user-level command line tool (ompi_info).
Parameter values can be set in multiple ways: on the command line, via environment variables, via
the Windows registry, or in system- or user-level INI-style files.
The MCA parameter system complements the idea of run-time plugin selection flexibility, and has proved to be quite valuable to users. Although Open MPI developers try hard to choose reasonable defaults for a wide variety of situations, every HPC environment is different. There are inevitably environments where Open MPI's default parameter values will be unsuitable, and possibly even detrimental to performance. The MCA parameter system allows users to be proactive and tweak Open MPI's behavior for their environment. Not only does this alleviate many upstream requests for changes and/or bug reports, it allows users to experiment with the parameter space to find the best configuration for their specific system.
15.3 Lessons Learned
With such a varied group of core Open MPI members, it is inevitable that we would each learn
something, and that as a group, we would learn many things. The following list describes just a few
of these lessons.
Performance
Message-passing performance and resource utilization are the king and queen of high-performance
computing. Open MPI was specifically designed in such a way that it could operate at the very bleeding edge of high performance: incredibly low latencies for sending short messages, extremely high short message injection rates on supported networks, fast ramp-ups to maximum bandwidth for large messages, etc. Abstraction is good (for many reasons), but it must be designed with care so that it does not get in the way of performance. Or, put differently: carefully choose abstractions that
lend themselves to shallow, performant call stacks (versus deep, feature-rich API call stacks).
That being said, we also had to accept that in some cases, abstraction (not architecture) must be thrown out the window. Case in point: Open MPI has hand-coded assembly for some of its most performance-critical operations, such as shared memory locking and atomic operations.
It is worth noting that Figures 15.1 and 15.2 show two different architectural views of Open MPI.
They do not represent the run-time call stacks or calling invocation layering for the high performance
code sections.
Lesson learned: It is acceptable (albeit undesirable) and unfortunately sometimes necessary to
have gross, complex code in the name of performance (e.g., the aforementioned assembly code).
However, it is always preferable to spend time trying to figure out how to have good abstractions to
discretize and hide complexity whenever possible. A few weeks of design can save literally hundreds
or thousands of developer-hours of maintenance on tangled, subtle, spaghetti code.
Standing on the Shoulders of Giants
We actively tried to avoid re-inventing code in Open MPI that someone else has already written (when
such code is compatible with Open MPI's BSD licensing). Specifically, we have no compunctions about either directly re-using or interfacing to someone else's code.
There is no place for the "not invented here" religion when trying to solve highly complex
engineering problems; it only makes good logistical sense to re-use external code whenever possible.
Such re-use frees developers to focus on the problems unique to Open MPI; there is no sense
re-solving a problem that someone else has solved already.
A good example of this kind of code re-use is the GNU Libtool Libltdl package. Libltdl is a
small library that provides a portable API for opening DSOs and finding symbols in them. Libltdl is supported on a wide variety of operating systems and environments, including Microsoft Windows.
Open MPI could have provided this functionality itself, but why? Libltdl is a fine piece of software, is actively maintained, is compatible with Open MPI's license, and provides exactly the
functionality that was needed. Given these points, there is no realistic gain for Open MPI developers
to re-write this functionality.
Lesson learned: When a suitable solution exists elsewhere, do not hesitate to integrate it and stop
wasting time trying to re-invent it.
Optimize for the Common Case
Another guiding architectural principle has been to optimize for the common case. For example,
emphasis is placed on splitting many operations into two phases: setup and repeated action. The
assumption is that setup may be expensive (meaning: slow). So do it once and get it over with.
Optimize for the much more common case: repeated operation.
For example, malloc() can be slow, especially if pages need to be allocated from the operating
system. So instead of allocating just enough bytes for a single incoming network message, allocate
enough space for a bunch of incoming messages, divide the result up into individual message buffers, and set up a freelist to maintain them. In this way, the first request for a message buffer may be slow, but successive requests will be much faster because they will just be de-queues from a freelist.
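The freelist idea can be sketched in a few lines. The class below is written in Java purely for illustration (Open MPI's freelists are, of course, C), with invented names; the point is only that the expensive allocation happens once, and the common case is a cheap de-queue.

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Illustration of the "allocate once, reuse many times" freelist pattern.
final class MessageBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    MessageBufferPool(int count, int bufferSize) {
        this.bufferSize = bufferSize;
        // Expensive setup, done once: allocate a single slab and slice it
        // into fixed-size message buffers.
        ByteBuffer slab = ByteBuffer.allocateDirect(count * bufferSize);
        for (int i = 0; i < count; i++) {
            slab.limit((i + 1) * bufferSize);
            slab.position(i * bufferSize);
            free.push(slab.slice());
        }
    }

    // The common case: a cheap de-queue instead of a fresh allocation.
    ByteBuffer acquire() {
        ByteBuffer buffer = free.poll();
        return (buffer != null) ? buffer : ByteBuffer.allocateDirect(bufferSize);
    }

    void release(ByteBuffer buffer) {
        buffer.clear();
        free.push(buffer);
    }
}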
Lesson learned: Split common operations into (at least) two phases: setup and repeated action.
Not only will the code perform better, it may be easier to maintain over time because the distinct
actions are separated.
Miscellaneous
There are too many more lessons learned to describe in detail here; the following are a few more
lessons that can be summed up briefly:
We were fortunate to draw upon 15+ years of HPC research and make designs that have (mostly) successfully carried us for more than eight years. When embarking on a new software project, look to the past. Be sure to understand what has already been done, why it was done, and what its strengths and weaknesses were.
The concept of components (allowing multiple different implementations of the same functionality) has saved us many times, both technically and politically. Plugins are good.
Similarly, we continually add and remove frameworks as necessary. When developers start arguing about the "right" way to implement a new feature, add a framework that fronts components that implement that feature. Or when newer ideas come along that obsolete older frameworks, don't hesitate to delete such kruft.
Conclusion
If we had to list the three most important things that we've learned from the Open MPI project, I think they would be as follows:
One size does not fit all (users). The run-time plugin and companion MCA parameter system allow users flexibility that is necessary in the world of portable software. Complex software systems cannot (always) magically adapt to a given system; providing user-level controls allows a human to figure out, and override, when the software behaves sub-optimally.
Differences are good. Developer disagreements are good. Embrace challenges to the status quo; do not get complacent. A plucky grad student saying "Hey, check this out..." can lead to the basis of a whole new feature or a major evolution of the product.
Although outside the scope of this book, people and community matter. A lot.
[chapter16]
OSCAR
Jennifer Ruttan
Since their initial adoption, EMR (electronic medical record) systems have attempted to bridge the
gap between the physical and digital worlds of patient care. Governments in countries around the
world have attempted to come up with a solution that enables better care for patients at a lower cost
while reducing the paper trail that medicine typically generates. Many governments have been very
successful in their attempts to create such a system; some, like that of the Canadian province of Ontario, have not (some may remember the so-called "eHealth Scandal" in Ontario that, according
to the Auditor General, cost taxpayers $1 billion CAD).
An EMR permits the digitization of a patient chart, and when used properly should make it
easier for a physician to deliver care. A good system should provide a physician a bird's-eye view of a patient's current and ongoing conditions, their prescription history, their recent lab results, history of their previous visits, and so on. OSCAR (Open Source Clinical Application Resource), an approximately ten-year-old project of McMaster University in Hamilton, Ontario, Canada, is the open source community's attempt to provide such a system to physicians at low or no cost.
OSCAR has many subsystems that provide functionality on a component-by-component basis.
For example, oscarEncounter provides an interface for interacting with a patient's chart directly; Rx3
is a prescription module that checks for allergies and drug interactions automatically and allows a
physician to directly fax a prescription to a pharmacy from the UI; the Integrator is a component to
enable data sharing between multiple compatible EMRs. All of these separate components come
together to build the typical OSCAR user experience.
OSCAR won't be for every physician; for example, a specialist may not find all the features of the system useful, and it is not easily customizable. However, it offers a complete set of features for
a general physician interacting with patients on a day-to-day basis.
In addition, OSCAR is CMS 3.0 certified (and has applied for CMS 4.0 certification), which allows physicians to receive funding for installing it in their clinic¹. Receiving CMS certification involves passing a set of requirements from the Government of Ontario and paying a fee.
This chapter will discuss the architecture of OSCAR in fairly general terms, describing the
hierarchy, major components, and most importantly the impact that past decisions have made on the
project. To wrap up, there will be a discussion of how OSCAR might have been designed if it were being built today.
¹See https://www.emradvisor.ca/ for details.
16.1 System Hierarchy
As a Tomcat web application, OSCAR generally follows the typical model-view-controller design
pattern. This means that the model code (Data Access Objects, or DAOs) is separate from the
controller code (servlets) and those are separated from the views (Java Server Pages, or JSPs). The
most significant difference between the two is that servlets are classes and JSPs are HTML pages
marked up with Java code. Data gets placed into memory when a servlet executes and the JSP reads
that same data, usually done via reads and writes to the attributes of the request object. Just about
every JSP page in OSCAR has this kind of design.
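A minimal sketch of that servlet-to-JSP handoff is shown below; the class name, attribute name, and JSP path are invented for illustration and are not actual OSCAR code.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Controller: look up some data, stash it on the request, forward to the view.
public class DemographicListServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // In OSCAR this would come from a DAO; hard-coded here for brevity.
        String patientName = "Jane Doe";
        request.setAttribute("patientName", patientName);
        // The JSP (the view) reads the same attribute, for example via
        // ${patientName} or request.getAttribute("patientName").
        request.getRequestDispatcher("/demographicList.jsp")
               .forward(request, response);
    }
}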
16.2 Past Decision Making
I mentioned that OSCAR is a fairly old project. This has implications for how effectively the MVC pattern has been applied. In short, there are sections of the code that completely disregard the pattern, as they were written before tighter enforcement of the MVC pattern began. Some of the most common features are written this way; for example, many actions related to demographics (patient records) are performed via the demographiccontrol.jsp file; this includes creating patients and updating their data.
OSCAR's age is a hurdle for tackling many of the problems that are facing the source tree today. Indeed, there has been significant effort made to improve the situation, including enforcing design
rules via a code review process. This is an approach that the community at present has decided will
allow better collaboration in the future, and will prevent poor code from becoming part of the code
base, which has been a problem in the past.
This is by no means a restriction on how we could design parts of the system now; it does,
however, make it more complicated when deciding to fix bugs in a dated part of OSCAR. Do you, as somebody tasked to fix a bug in the Demographic Creation function, fix the bug with code in the same style as it currently exists? Or do you re-write the module completely so that it closely follows
the MVC design pattern?
As developers we must carefully weigh our options in situations like those. There is no guarantee
that if you re-architect a part of the system you will not create new bugs, and when patient data is on
the line, we must make the decision carefully.
16.3 Version Control
A CVS repository was used for much of OSCAR's life. Commits weren't often checked for consistency and it was possible to commit code that could break the build. It was tough for developers to keep up with changes, especially new developers joining the project late in its lifecycle. A new developer could see something that they would want to change, make the change, and get it into the source branch several weeks before anybody would notice that something significant had been modified (this was especially prevalent during long holidays, such as Christmas break, when not many people
were watching the source tree).
Things have changed; OSCAR's source tree is now controlled by git. Any commits to the main branch have to pass code-style checking and unit testing, successfully compile, and be code reviewed by the developers (much of this is handled by the combination of Hudson² and Gerrit³). The project has become much more tightly controlled, and most, if not all, of the issues caused by poor handling of the source tree have been solved.
²A continuous integration server: http://hudson-ci.org/
³A code review tool: http://code.google.com/p/gerrit/
16.4 Data Models/DAOs
When looking through the OSCAR source, you may notice that there are many different ways to access the database: you can use a direct connection to the database via a class called DBHandler, use a legacy Hibernate model, or use a generic JPA model. As new and easier database access models became available, they were integrated into OSCAR. The result is that there is now a slightly noisy picture of how OSCAR interacts with data in MySQL, and the differences between the three types of
data access methods are best described with examples.
EForms (DBHandler)
The EForm system allows users to create their own forms to attach to patient records; this feature is usually used to replace a paper-based form with a digital version. On each creation of a form of a particular type, the form's template file is loaded; then the data in the form is stored in the database for each instance. Each instance is attached to a patient record.
EForms allow you to pull in certain types of data from a patient chart or other area of the system via free-form SQL queries (which are defined in a file called apconfig.xml). This can be extremely useful, as a form can load and then immediately be populated with demographic or other relevant information without intervention from the user; for example, you wouldn't have to type in a patient's name, age, date of birth, hometown, phone number, or the last note that was recorded for that patient.
A design decision was made, when originally developing the EForm module, to use raw database
queries to populate a POJO (plain-old Java object) called EForm in the controller that is then passed
to the view layer to display data on the screen, sort of like a JavaBean. Using a POJO in this case is
actually closer in design to the Hibernate or JPA architecture, as I'll discuss in the next sections.
All of the functionality regarding saving EForm instances and templates is done via raw SQL queries run through the DBHandler class. Ultimately, DBHandler is a wrapper for a simple JDBC object and does not scrutinize a query before sending it to the SQL server. It should be added here that DBHandler is a potential security flaw as it allows unchecked SQL to be sent to the server. Any class that uses DBHandler must implement its own checking to make sure that SQL injection doesn't
occur.
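To make the risk concrete, compare a concatenated query of the kind DBHandler will happily execute with a parameterized one. This is plain JDBC with an invented table name, not OSCAR's actual classes.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class EformQueryExample {
    // Vulnerable: if formName is "x' OR '1'='1", the WHERE clause is bypassed.
    static ResultSet unsafeLookup(Connection conn, String formName) throws SQLException {
        Statement st = conn.createStatement();
        return st.executeQuery(
            "SELECT * FROM eform WHERE form_name = '" + formName + "'");
    }

    // Safer: the driver treats formName strictly as data, never as SQL.
    static ResultSet safeLookup(Connection conn, String formName) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT * FROM eform WHERE form_name = ?");
        ps.setString(1, formName);
        return ps.executeQuery();
    }
}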
Depending on the type of application you're writing, direct access of a database is sometimes fine. In certain cases, it can even speed development up. Using this method to access the database doesn't conform to the model-view-controller design pattern, though: if you're going to change your database structure (the model), you have to change the SQL query elsewhere (in the controller). Sometimes, adding certain columns or changing their type in OSCAR's database tables requires this kind of invasive procedure just to implement small features.
It may not surprise you to find out that the DBHandler object is one of the oldest pieces of code still intact in the source. I personally don't know where it originated from, but I consider it to be the most primitive of database access types that exist in the OSCAR source. No new code is permitted
to use this class, and if code is committed that uses it, the commit will be rejected automatically.
Demographic Records (Hibernate)
A demographic record contains general metadata about a patient; for example, their name, age,
address, language, and sex; consider it to be the result of an intake form that a patient fills out during their first visit to a doctor. All of this data is retrieved and displayed as part of OSCAR's Master Record for a specific demographic.
Using Hibernate to access the database is far safer than using DBHandler. For one, you have to explicitly define which columns match to which fields in your model object (in this case, the Demographic class). If you want to perform complex joins, they have to be done as prepared
statements. Finally, you will only ever receive an object of the type you ask for when performing a
query, which is very convenient.
The process of working with a Hibernate-style DAO and Model pair is quite simple. In the case
of the Demographic object, there's a file called Demographic.hbm.xml that describes the mapping between object field and database column. The file describes which table to look at and what type of object to return. When OSCAR starts, this file will be read and a sanity check occurs to make sure that this kind of mapping can actually be made (server startup fails if it can't). Once running, you grab an instance of the DemographicDao object and run queries against it.
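In practice, querying through such a DAO looks roughly like the sketch below. This is a simplified example against the plain Hibernate API, assuming the Demographic class mapped by Demographic.hbm.xml; the method names are invented and this is not OSCAR's actual DemographicDao.

import org.hibernate.Session;
import org.hibernate.SessionFactory;

// Simplified Hibernate-style DAO; not OSCAR's real implementation.
public class DemographicDaoSketch {
    private final SessionFactory sessionFactory;

    public DemographicDaoSketch(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    // Load one mapped Demographic by primary key; Hibernate builds and runs
    // the SQL itself as a prepared statement.
    public Demographic getDemographic(Integer demographicNo) {
        Session session = sessionFactory.openSession();
        try {
            return (Demographic) session.get(Demographic.class, demographicNo);
        } finally {
            session.close();
        }
    }
}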
The best part about using Hibernate over DBHandler is that all of the queries to the server are
prepared statements. This restricts you from running free-form SQL during runtime, but it also
prevents any type of SQL injection attack. Hibernate will often build large queries to grab the data, and it doesn't always perform in an extremely efficient way.
In the previous section I mentioned an example of the EForm module using DBHandler to populate a POJO. The Hibernate approach is the next logical step toward preventing that kind of code from being written. If the model has to change, only the .hbm.xml file and the model class have to change (a new field and getter/setter for the new column), and doing so won't impact the rest of the application.
While newer than DBHandler, the Hibernate method is also starting to show its age. It's not always convenient to use and requires a big configuration file for each table you want to access. Setting up a new object pair takes time, and if you do it incorrectly OSCAR won't even start. For this
reason, nobody should be writing new code that uses pure Hibernate, either. Instead, generic JPA is
being embraced in new development.
Integrator Consent (JPA)
The newest form of database access is done via generic JPA. If the OSCAR project decided to switch
from Hibernate to another database access API, conforming to the JPA standard for DAOs and Model
objects would make it very easy to migrate. Unfortunately, because this is so new to the OSCAR
project, there are almost no areas of the system that actually use this method to get data.
In any case, let me explain how it works. Instead of a .hbm.xml file, you add annotations to your Model and DAO objects. These annotations describe the table to look in, column mappings for fields, and join queries. Everything is contained inside the two files and nothing else is necessary for their
operation. Hibernate still runs behind the scenes, though, in actually retrieving the data from the
database.
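For comparison with the .hbm.xml approach, a JPA-annotated model in this style looks roughly like the class below. The entity, table, and column names are invented for illustration and are not the Integrator's actual schema.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;

// Everything the mapping needs lives in the annotations; no XML file required.
@Entity
@Table(name = "integrator_consent")
public class IntegratorConsentSketch {
    @Id
    @GeneratedValue
    private Integer id;

    @Column(name = "demographic_no")
    private Integer demographicNo;

    @Column(name = "consent_given")
    private boolean consentGiven;

    public Integer getId() { return id; }
    public Integer getDemographicNo() { return demographicNo; }
    public void setDemographicNo(Integer demographicNo) { this.demographicNo = demographicNo; }
    public boolean isConsentGiven() { return consentGiven; }
    public void setConsentGiven(boolean consentGiven) { this.consentGiven = consentGiven; }
}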
All of the Integrator's models are written using JPA, and they are good examples of the new style of database access; they also demonstrate that, as a technology newly introduced into OSCAR, it hasn't been used in very many places yet. The Integrator is a relatively new addition
to the source. It makes quite a lot of sense to use this new data access model as opposed to Hibernate.
Touching on a now-common theme in this section of the chapter, the annotated POJOs that JPA
uses allow for a far more streamlined experience. For example, during the Integrator's build process, an SQL file is created that sets up all of the tables for you, an enormously useful thing to have. With that ability, it's impossible to create mismatching tables and model objects (as you can do with any other type of database access method) and you never have to worry about naming of columns and tables. There are no direct SQL queries, so it's not possible to create SQL injection attacks. In short, it "just works."
The way that JPA works can be considered to be fairly similar to the way that ActiveRecord works in Ruby on Rails. The model class defines the data type and the database stores it; what happens in between that (getting data in and out) is not up to the user.
Issues with Hibernate and JPA
Both Hibernate and JPA offer some significant benefits in typical use cases. For simple retrieval and storage, they really cut time out of development and debugging.
However, that doesn't mean that their implementation into OSCAR has been without issue. Because the user doesn't define the SQL between the database and the POJO referencing a specific row, Hibernate gets to choose the best way to do it. The "best way" can manifest itself in a couple of ways: Hibernate can choose to just retrieve the simple data from the row, or it can perform a join and
retrieve a lot of information at once. Sometimes these joins get out of hand.
Here's another example: the casemgmt_note table stores all patient notes. Each note object stores lots of metadata about the note, but it also stores a list of all of the issues that the note deals with (issues can be things like "smoking cessation" or "diabetes", which describe the contents of the note). The list of issues is represented in the note object as a List<CaseManagementIssue>. In order to get that list, the casemgmt_note table is joined with the casemgmt_issue_notes table (which acts as a mapping table) and finally the casemgmt_issue table.
When you want to write a custom query in Hibernate, which this situation requires, you don't write standard SQL; you write HQL (Hibernate Query Language) that is then translated to SQL (by inserting internal column names for all the fields to be selected) before parameters are inserted and the query is sent to the database server. In this specific case, the query was written with basic joins with no join columns, meaning that when the query was eventually translated to SQL, it was so large that it wasn't immediately obvious what the query was gathering. Additionally, in almost all cases, this never created a large enough temporary table for it to matter. For most users, this query actually runs quickly enough that it's not noticeable. However, this query is unbelievably inefficient.
Let's step back for a second. When you perform a join on two tables, the server has to create a temporary table in memory. In the most generic type of joins, the number of rows is equal to the number of rows in the first table multiplied by the number of rows in the second table. So if your table has 500,000 rows, and you join it with a table that has 10,000,000 rows, you've just created a 5×10^12-row temporary table in memory, which the select statement is then run against and that temporary table is discarded.
In one extreme case that we ran into, the join across three tables caused a temporary table to be created that was around 7×10^12 rows in length, of which about 1000 rows were eventually selected. This operation took about 5 minutes and locked the casemgmt_note table while it was running.
The problem was solved, eventually, through the use of a prepared statement that restricted the scope of the first table before joining with the other two. The newer, far more efficient query brought the number of rows to select down to a very manageable 300,000 and enormously improved performance of the notes retrieval operation (down to about 0.1 seconds to perform the same select statement).
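In Hibernate terms, the difference looks roughly like the two queries below. The entity is the CaseManagementNote class mentioned earlier, but the property names (issues, demographicNo) and the exact HQL are simplified assumptions, not OSCAR's actual query.

import java.util.List;
import org.hibernate.Session;

public class NoteQueryExample {
    // Fetching the issue list for every note at once: the three-way join
    // runs over the entire casemgmt_note table.
    @SuppressWarnings("unchecked")
    static List<CaseManagementNote> unrestricted(Session session) {
        return session.createQuery(
            "select distinct n from CaseManagementNote n "
          + "left join fetch n.issues").list();
    }

    // Restricting the scope of the first table (one patient's notes) before
    // the join keeps the intermediate result small.
    @SuppressWarnings("unchecked")
    static List<CaseManagementNote> restricted(Session session, Integer demographicNo) {
        return session.createQuery(
            "select distinct n from CaseManagementNote n "
          + "left join fetch n.issues "
          + "where n.demographicNo = :demoNo")
            .setParameter("demoNo", demographicNo)
            .list();
    }
}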
The moral of the story is simply that while Hibernate does a fairly good job, unless the join is very explicitly defined and controlled (either in the .hbm.xml file or a join annotation in the object class for a JPA model), it can very quickly get out of control. Dealing with objects instead of SQL queries requires you to leave the actual implementation of the query up to the database access library and only really allows you to control the definition. Unless you're careful with how you define things, it can all fall apart under extreme conditions. Furthermore, if you're a database programmer with lots of SQL knowledge, it won't really help much when designing a JPA-enabled class, and it removes some of the control that you would have if you were writing an SQL statement manually. Ultimately, a good knowledge of both SQL and JPA annotations and how they affect queries is required.
16.5 Permissions
CAISI (Client Access to Integrated Services and Information) was originally a standalone product (a fork of OSCAR) to help manage homeless shelters in Toronto. A decision was eventually made to merge the code from CAISI into the main source branch. The original CAISI project may no longer exist, but what it gave to OSCAR is very important: its permission model.
The permissions model in OSCAR is extremely powerful and can be used to create just about as many roles and permission sets as needed. Providers belong to programs (as staff) where they have a specific role. Each program takes place at a facility. Each role has a description (for example, "doctor", "nurse", "social worker", and so on) and a set of attached global permissions. The permissions are written in a format that makes them very easy to understand: "read nurse notes" may be a permission that a doctor role may have, but the nurse role may not have the "read doctor notes" permission.
This format may be easy to understand, but under the hood it requires quite a bit of heavy lifting
to actually check for these types of permissions. The name of the role that the current provider has is
checked against its list of permissions for a match with the action that they are trying to perform. For example, a provider attempting to read a doctor's notes would cause "read doctor notes" to be checked for each and every note written by a doctor.
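Conceptually, the check reduces to assembling a permission string and looking it up in the role's permission set, along the lines of the sketch below (names invented; OSCAR's real CAISI checks go through its program and role tables).

import java.util.Set;

// Illustrative only; not OSCAR's actual permission-checking code.
public class PermissionCheckSketch {
    // True if the provider's role grants reading notes written by authorRole,
    // e.g. rolePermissions = {"read nurse notes", "write nurse notes"}.
    static boolean canReadNote(Set<String> rolePermissions, String authorRole) {
        return rolePermissions.contains("read " + authorRole + " notes");
    }
}

The English words baked into that string are also the source of the localization problem described next.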
Another problem is the reliance on English for permission definition. Anybody using OSCAR in a language other than English would still need to write their permissions in a format such as "read [role] notes", using the English words "read", "write", "notes", and so on.
CAISI's permission model is a significant part of OSCAR, but it's not the only model in place. Before CAISI was implemented, another role-based (but not program-based) system was developed and is still in use in many parts of the system today.
For this second system, providers are assigned one or many roles (for example, doctor, nurse, admin, and so on). They can be assigned as many roles as necessary; the roles' permissions stack on top of each other. These permissions are generally used for restricting access to parts of the system, as opposed to CAISI's permissions, which restrict access to certain pieces of data on a patient's chart. For example, a user has to have the "_admin" read permission on a role that they have assigned to them to be able to access the Admin panel. Having the read permission alone won't allow them to perform administrative tasks, however; they'll need the write permission as well for that.
Both of these systems accomplish roughly the same goal; it's due to CAISI's merge later in the project lifecycle that they both exist. They don't always exist happily together, so in reality it can be a lot easier to just focus on using one for day-to-day operations of OSCAR. You can generally date code in OSCAR by knowing which permissions model preceded which other permissions model: Provider Type, then Provider Roles, then CAISI Programs/Roles.
The oldest type of permissions model, Provider Type, is so dated that it's actually not used in most parts of the system and is in fact defaulted to "doctor" during new provider creation; having it as any other value (such as "receptionist") causes significant issues throughout the system. It's easier and more fine-grained to control permissions via Provider Roles instead.
16.6 Integrator
OSCAR's Integrator component is a separate web application that independent OSCAR instances use to exchange patient, program, and provider information over a secure link. It can be optionally installed as a component for an installation in an environment such as an LHN (Local Health Network)
or a hospital. The easiest way to describe the Integrator is as a temporary storage facility.
Consider the following use case and argument for use of the Integrator: in Hospital X, there is
an ENT (ear, nose, and throat) clinic as well as an endocrinology clinic. If an ENT doctor refers
their patient to an endocrinologist upstairs, they may be required to send along patient history and
records. This is inconvenient and generates more paper than is necessary; perhaps the patient is only seeing the endocrinologist once. By using the Integrator, the patient's data can be accessed on the endocrinologist's EMR, and access to the contents of the patient's chart can be revoked after the visit.
A more extreme example: if an unconscious man shows up in an ER with nothing but his health card, because the home clinic and the hospital's system are connected via the Integrator, the man's record can be pulled and it can be very quickly realized that he has been prescribed the blood thinner
warfarin. Ultimately, information retrieval like this is what an EMR like OSCAR paired with the
Integrator can achieve.
Technical Details
The Integrator is available in source code form only, which requires the user to retrieve and build it
manually. Like OSCAR, it runs on a standard installation of Tomcat with MySQL.
When the URL where the Integrator lives is accessed, it doesn't appear to display anything useful.
This component is almost purely a web service; OSCAR communicates via POST and GET requests
to the Integrator URL.
As an independently developed project (initially as part of the CAISI project), the Integrator is
fairly strict in adhering to the MVC design pattern. The original developers have done an excellent
job of setting it up with very clearly defined lines between the models, views, and controllers. The most recently implemented type of database access layer that I mentioned earlier (generic JPA) is the only such layer in the project. (As an interesting side note: because the entire project is properly set up with JPA annotations on all the model classes, an SQL script is created at build time that can be used to initialize the structure of the database; the Integrator, therefore, doesn't ship with a
stand-alone SQL script.)
Communication is handled via web service calls described in WSDL XML files that are available on the server. A client could query the Integrator to find out what kind of functions are available and adapt to it. This really means that the Integrator is compatible with any kind of EMR that somebody decides to write a client for; the data format is generic enough that it could easily be mapped to local types.
For OSCAR, though, a client library is built and included in the main source tree, for simplicity's sake. That library only ever needs to be updated if new functions become available on the Integrator. A bug fix on the Integrator doesn't require an update of that file.
Design
Data for the Integrator comes in from all of the connected EMRs at scheduled times and, once
there, another EMR can request that data. None of the data on the Integrator is stored permanently, though; its database could be erased and it could be rebuilt from the client data.
The dataset sent is configured individually at each OSCAR instance which is connected to a particular Integrator, and except in situations where the entire patient database has to be sent to the Integrator server, only patient records that have been viewed since the previous push to the server are sent. The process isn't exactly like delta patching, but it's close.
Figure 16.1: Data exchange between OSCARs and Integrator
Let me go into a little more detail about how the Integrator works with an example: a remote clinic seeing another clinic's patient. When that clinic wants to access the patient's record, the clinics first have to have been connected to the same Integrator server. The receptionist can search the Integrator for the remote patient (by name and optionally date of birth or sex) and find their record stored on the server. They initiate the copy of a limited set of the patient's demographic information and then double-check with the patient to make sure that they consent to the retrieval of their record by completing a consent form. Once completed, the Integrator server will deliver whatever information the Integrator knows about that patient: notes, prescriptions, allergies, vaccinations, documents, and so on. This data is cached locally so that the local OSCAR doesn't have to send a request to the Integrator every time it wants to see this data, but the local cache expires every hour.
After the initial setup of a remote patient by copying their demographic data to the local OSCAR, that patient is set up as any other on the system. All of the remote data that is retrieved from the Integrator is marked as such (and the clinic from which it came is noted), but it's only temporarily cached on the local OSCAR. Any local data that is recorded is recorded just like any other patient data (to the patient record, and sent to the Integrator) but not permanently stored on any remote machine.
This has a very important implication, especially for patient consent and how that factors into the design of the Integrator. Let's say that a patient sees a remote physician and is fine with them having access to their record, but only temporarily. After their visit, they can revoke the consent for that clinic to view their record, and the next time that clinic opens the patient's chart there won't be any data there (with the exception of any data that was locally recorded). This ultimately gives control over how and when a record is viewed directly to the patient, and is similar to walking into a clinic carrying a copy of your paper chart. They can see the chart while they're interacting with you, but you take it home with you when you leave.
Figure 16.2: The Demographic information and associated data is sent to the Integrator during a data push from the home clinic. The record on the Integrator may not be a representation of the complete record from the home clinic, as the OSCAR can choose not to send all patient data.
Figure 16.3: A remote OSCAR requests data from the Integrator by asking for a specific patient record. The Integrator server sends only the demographic information, which is stored permanently on the remote OSCAR.
Figure 16.4: A remote clinic can see the contents of a patient chart by asking for the data; if the appropriate
consent is present, the data is sent. The data is never stored permanently on the remote OSCAR.
Another very important ability is for physicians to decide what kinds of data they want to share
with the other connected clinics via their Integrator server. A clinic can choose to share all of a
demographic record or only parts of it, such as notes but not documents, allergies but not prescriptions,
and so on. Ultimately it's up to the group of physicians who set up the Integrator server to decide what kinds of data they're comfortable with sharing with each other.
As I mentioned before, the Integrator is only a temporary storage warehouse and no data is ever stored permanently there. This is another very important decision that was made during development; it allows clinics to back out of sharing any and all data via the Integrator very easily, and in fact if necessary the entire Integrator database can be wiped. If the database is wiped, no user of a client will ever notice, because the data will be accurately reconstructed from the original data on all of the various connected clients. An implication is that the OSCAR provider needs to trust the Integrator provider to have wiped the database when they say so; it is therefore best to deploy an Integrator to a group of physicians already in a legal organization such as a Family Health Organization or Family Health Team; the Integrator server would be housed at one of these physicians' clinics.
Data Format
The Integrator's client libraries are built via wsdl2java, which creates a set of classes representing the appropriate data types the web service communicates in. There are classes for each data type as well as classes representing keys for each of these data types.
It's outside the scope of this chapter to describe how to build the Integrator's client library. What's important to know is that once the library is built, it must be included with the rest of the JARs in OSCAR. This JAR contains everything necessary to set up the Integrator connection and access all of the data types that the Integrator server will return to OSCAR, such as CachedDemographic, CachedDemographicNote, and CachedProvider, among many others. In addition to the data types that are returned, there are WS classes that are used for the retrieval of such lists of data in the first place, the most frequently used being DemographicWs.
Dealing with the Integrator data can sometimes be a little tricky. OSCAR doesn't have anything truly built-in to handle this kind of data, so what usually happens is that when retrieving a certain kind of patient data (for example, notes for a patient's chart) the Integrator client is asked to retrieve data from the server. That data is then manually transformed into a local class representing that data (in the case of notes, it's a CaseManagementNote). A Boolean flag is set inside the class to indicate that it's a piece of remote content, and that is used to change how the data is displayed to the user on the screen. On the opposite end, CaisiIntegratorUpdateTask handles taking local OSCAR data, converting it into the Integrator's data format, and then sending that data to the Integrator server.
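The conversion step is essentially a field-by-field copy plus a flag, along the lines of the sketch below; the getter and setter names are assumptions for illustration, not the actual OSCAR or Integrator client API.

// Illustrative conversion of a remote Integrator note into the local class;
// accessor names are invented.
public class RemoteNoteMapper {
    static CaseManagementNote toLocalNote(CachedDemographicNote remote,
                                          String remoteClinicName) {
        CaseManagementNote note = new CaseManagementNote();
        note.setNote(remote.getNote());
        note.setObservationDate(remote.getObservationDate());
        // Mark the note as remote so the view can render it differently and
        // show which clinic it came from; it is only cached, never stored.
        note.setRemote(true);
        note.setRemoteFacilityName(remoteClinicName);
        return note;
    }
}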
This design may not be as efficient or as clean as possible, but it does enable older parts of the system to become compatible with Integrator-delivered data without much modification. In addition, keeping the view as simple as possible by referring to only one type of class improves the readability of the JSP file and makes it easier to debug in the event of an error.
16.7 Lessons Learned
As you can probably imagine, OSCAR has its share of issues when it comes to overall design. It does,
however, provide a complete feature set that most users will find no issues with. That's ultimately the goal of the project: provide a good solution that works in most situations.
I can't speak for the entire OSCAR community, so this section will be highly subjective and from
my point of view. I feel that there are some important takeaways from an architectural discussion
about the project.
First, it's clear that poor source control in the past has caused the architecture of the system to become highly chaotic in parts, especially in areas where the controllers and the views blend together. The way that the project was run in the past didn't prevent this from happening, but the process has changed since, and hopefully we won't have to deal with such a problem again.
Next, because the project is so old, it's difficult to upgrade (or even change) libraries without causing significant disruption throughout the code base. That's exactly what has happened, though. I often find it difficult to figure out what's necessary and what isn't when I'm looking in the library folder. In addition to that, sometimes when libraries undergo major upgrades they break backwards compatibility (changing package names is a common offense). There are often several libraries included with OSCAR that all accomplish the same task; this goes back to poor source control, but also the fact that there has been no list or documentation describing which library is required by which component.
Additionally, OSCAR is a little inflexible when it comes to adding new features to existing subsystems. For example, if you want to add a new box to the E-Chart, you'll have to create a new JSP page and a new servlet, modify the layout of the E-Chart (in a few places), and modify the configuration file of the application so that your servlet can load.
Next, due to the lack of documentation, sometimes it is nearly impossible to figure out how a part of the system works (the original contributor may not even be part of the project anymore), and often the only tool you have to figure it out is a debugger. For a project of this age, this is costing the community the potential for new contributors to get involved. However, it's something that, as a collaborative effort, the community is working on.
Finally, OSCAR is a repository for medical information and its security is compromised by the
inclusion of the DBHandler class (discussed in a previous section). I personally feel that freeform
database queries that accept parameters should never be acceptable in an EMR because it's so easy to perform SQL injection attacks. While it's good that no new code is permitted that uses this class,
it should be a priority of the development team to remove all instances of its use.
All of that may sound like some harsh criticism of the project. In the past, all of these problems have been significant and, as I said, have prevented the community from growing because the barrier to entry is so high. This is something that is changing, so in the future these issues won't be so much of a hindrance.
In looking back over the project's history (and especially over the past few versions) we can come up with a better design for how the application would be built. The system still has to provide a base level of functionality (mandated by the Ontario government for certification as an EMR), so that all has to be baked in by default. But if OSCAR were to be redesigned today, it should be designed in a truly modular fashion that would allow modules to be treated as plugins; if you didn't like the default E-Form module, you could write your own (or even another module entirely). It should be able to speak to more systems (or more systems should be able to speak to it), including the medical hardware that you see in increasing use throughout the industry, such as devices for measuring visual acuity. This also means that it would be easy to adapt OSCAR to the requirements of local and federal governments around the world for storing medical data. Since every region has a different set
of laws and requirements, this kind of design would be crucial for making sure that OSCAR develops
a worldwide userbase.
I also believe that security should be the most important feature of all. An EMR is only as secure
as its least secure component, so there should be focus on abstracting away as much data access
as possible from the application so that it stores and retrieves data in a sandbox-style environment
through a main data access layer API that has been audited by a third party and found to be adequate for storing medical information. Other EMRs can hide behind obscurity and proprietary code as a security measure (which isn't really a security measure at all), but being open source, OSCAR
should lead the charge with better data protection.
I stand firmly as a believer in the OSCAR project. We have hundreds of users that we know about (and the many hundreds that we don't), and we receive valuable feedback from the physicians who
are interacting with our project on a daily basis. Through the development of new processes and
new features, we hope to grow the installed base and to support users from other regions. It is our
intention to make sure that what we deliver is something that improves the lives of the physicians who
use OSCAR as well as the lives of their patients, by creating better tools to help manage healthcare.
[chapter17]
Processing.js
Mike Kamermans
Originally developed by Ben Fry and Casey Reas, the Processing programming language started
as an open source programming language (based on Java) to help the electronic arts and visual
design communities learn the basics of computer programming in a visual context. Oering a highly
simplied model for 2D and 3D graphics compared to most programming languages, it quickly
became well-suited for a wide range of activities, from teaching programming through writing small
visualisations to creating multi-wall art installations, and became able to perform a wide variety of
tasks, from simply reading in a sequence of strings to acting as the de facto IDE for programming
and operating the popular Arduino open source hardware prototyping boards. Continuing to gain
popularity, Processing has rmly taken its place as an easy to learn, widely used programming
language for all things visual, and so much more.
The basic Processing program, called a "sketch", consists of two functions: setup and draw. The first is the main program entry point, and can contain any amount of initialization instructions. After finishing setup, Processing programs can do one of two things: 1) call draw, and schedule another call to draw at a fixed interval upon completion; or 2) call draw, and wait for input events from the user. By default, Processing does the former; calling noLoop results in the latter. This allows for two modes to present sketches, namely a fixed-framerate graphical environment, and an interactive, event-based updating graphical environment. In both cases, user events are monitored
and can be handled either in their own event handlers, or for certain events that set persistent global
values, directly in the draw function.
Processing.js is a sister project of Processing, designed to bring it to the web without the need
for Java or plugins. It started as an attempt by John Resig to see if the Processing language could be
ported to the web, by using the HTML5 <canvas> element, which was brand new at the time, as a
graphical context, with a proof of concept library released to the public in 2008. Written with the
idea in mind that "your code should just work", Processing.js has been refined over the years to make
data visualisations, digital art, interactive animations, educational graphs, video games, etc. work
using web standards and without any plugins. You write code using the Processing language, either
in the Processing IDE or your favourite editor, include it on a web page using a <canvas>
element, and Processing.js does the rest, rendering everything in the <canvas> element and letting
users interact with the graphics in the same way they would with a normal standalone Processing
program.
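To give a concrete sense of what this looks like on a page, here is a minimal sketch of how a sketch might be attached to a <canvas> element from plain page JavaScript. It assumes the Processing(canvas, source) constructor that Processing.js exposes; the file name and element id are made up for illustration.

// Fetch a Processing sketch and hand it to Processing.js (illustrative).
window.addEventListener("load", function() {
  var canvas = document.getElementById("sketchcanvas");
  var request = new XMLHttpRequest();
  request.open("GET", "mysketch.pde", true);
  request.onload = function() {
    // Processing.js converts the Processing source and starts the sketch
    // inside the given canvas.
    new Processing(canvas, request.responseText);
  };
  request.send();
});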
17.1 How Does It Work?
Processing.js is a bit unusual as an open source project, in that the code base is a single file called
processing.js which contains the code for Processing, the single object that makes up the entire
library. In terms of how the code is structured, we constantly shuffle things around inside this object
as we try to clean it up a little bit with every release. Its design is relatively straightforward, and its
function can be described in a single sentence: it rewrites Processing source code into pure JavaScript
source code, and every Processing API function call is mapped to a corresponding function in the
JavaScript Processing object, which effects the same thing on a <canvas> element as the Processing
call would effect on a Java Applet canvas.
For speed, we have two separate code paths for 2D and 3D functions, and when a sketch is loaded,
either one or the other is used for resolving function wrappers so that we don't add bloat to running
instances. However, in terms of data structures and code flow, knowing JavaScript means you can
read processing.js, with the possible exception of the syntax parser.
Unifying Java and JavaScript
Rewriting Processing source code into JavaScript source code means that you can simply tell the
browser to execute the rewritten source, and if you rewrote it correctly, things just work. But, making
sure the rewrite is correct has taken, and still occasionally takes, quite a bit of effort. Processing
syntax is based on Java, which means that Processing.js has to essentially transform Java source code
into JavaScript source code. Initially, this was achieved by treating the Java source code as a string,
and iteratively replacing substrings of Java with their JavaScript equivalents.1 For a small syntax set,
this is fine, but as time went on and complexity added to complexity, this approach started to break
down. Consequently, the parser was completely rewritten to build an Abstract Syntax Tree (AST)
instead, first breaking down the Java source code into functional blocks, and then mapping each of
those blocks to their corresponding JavaScript syntax. The result is that, at the cost of readability,2
Processing.js now effectively contains an on-the-fly Java-to-JavaScript transcompiler.
Here is the code for a Processing sketch:
void setup() {
  size(200,200);
  noCursor();
  noStroke();
  smooth(); }
void draw() {
  fill(255,10);
  rect(-1,-1,width+1,height+1);
  float f = frameCount*PI/frameRate;
  float d = 10+abs(60*sin(f));
  fill(0,100,0,50);
  ellipse(mouseX, mouseY, d,d); }
1 For those interested in an early incarnation of the parser, it can be found at https://github.com/jeresig/processing-js/blob/51d28c516c53cd9e6353176dfa14746e6b2/processing.js, running from line 37 to line 266.
2 Readers are welcome to peruse https://github.com/jeresig/processing-js/blob/v1.3.0/processing.js#L17649, up to line 19217.
And here is its Processing.js conversion:
function($p) {
  function setup() {
    $p.size(200, 200);
    $p.noCursor();
    $p.noStroke();
    $p.smooth(); }
  $p.setup = setup;
  function draw() {
    $p.fill(255, 10);
    $p.rect(-1, -1, $p.width + 1, $p.height + 1);
    var f = $p.frameCount * $p.PI / $p.__frameRate;
    var d = 10 + $p.abs(60 * $p.sin(f));
    $p.fill(0, 100, 0, 50);
    $p.ellipse($p.mouseX, $p.mouseY, d, d); }
  $p.draw = draw; }
This sounds like a great thing, but there are a few problems when converting Java syntax to
JavaScript syntax:
1. Java programs are isolated entities. JavaScript programs share the world with a web page.
2. Java is strongly typed. JavaScript is not.
3. Java is a class/instance based object-oriented language. JavaScript is not.
4. Java has distinct variables and methods. JavaScript does not.
5. Java allows method overloading. JavaScript does not.
6. Java allows importing compiled code. JavaScript has no idea what that even means.
Dealing with these problems has been a tradeoff between what users need, and what we can do
given web technologies. The following sections will discuss each of these issues in greater detail.
17.2 Significant Differences
Java programs have their own threads; JavaScript can lock up your browser.
Java programs are isolated entities, running in their own thread in the greater pool of applications
on your system. JavaScript programs, on the other hand, live inside a browser, and compete with
each other in a way that desktop applications don't. When a Java program loads a file, the program
waits until the resource is done loading, and operation resumes as intended. In a setting where the
program is an isolated entity on its own, this is fine. The operating system stays responsive because
it's responsible for thread scheduling, and even if the program takes an hour to load all its data, you
can still use your computer. On a web page, this is not how things work. If you have a JavaScript
program waiting for a resource to be done loading, it will lock its process until that resource is
available. If you're using a browser that uses one process per tab, it will lock up your tab, and the rest
of the browser is still usable. If you're using a browser that doesn't, your entire browser will seem
frozen. So, regardless of what the process represents, the page the script runs on won't be usable
until the resource is done loading, and it's entirely possible that your JavaScript will lock up the
entire browser.
This is unacceptable on the modern web, where resources are transferred asynchronously, and
the page is expected to function normally while resources are loaded in the background. While
this is great for traditional web pages, for web applications this is a real brain twister: how do you
make JavaScript idle, waiting for a resource to load, when there is no explicit mechanism to make
JavaScript idle? While there is no explicit threading in JavaScript, there is an event model, and there
is an XMLHTTPRequest object for requesting arbitrary (not just XML or HTML) data from arbitrary
URLs. This object comes with several different status events, and we can use it to asynchronously
get data while the browser stays responsive. This is great in programs in which you control the
source code: you make it simply stop after scheduling the data request, and make it pick up execution
when the data is available. However, this is near impossible for code that was written based on the
idea of synchronous resource loading. Injecting idling in programs that are supposed to run at a
fixed framerate is not an option, so we have to come up with alternative approaches.
For some things, we decided to force synchronous waiting anyway. Loading a file with strings,
for instance, uses a synchronous XMLHTTPRequest, and will halt execution of the page until the data
is available. For other things, we had to get creative. Loading images, for instance, uses the browser's
built-in mechanism for loading images; we build a new Image in JavaScript, set its src attribute to
the image URL, and the browser does the rest, notifying us that the image is ready through the onload
event. This doesn't even rely on an XMLHTTPRequest; it simply exploits the browser's capabilities.
To make matters easier when you already know which images you are loading, we added preload
directives so that the sketch does not start execution until preloading is complete. A user can indicate
any number of images to preload via a comment block at the start of the sketch; Processing.js then
tracks outstanding image loading. The onload event for an image tells us that it is done transferring
and is considered ready to be rendered (rather than simply having been downloaded but not decoded
to a pixel array in memory yet), after which we can populate the corresponding Processing PImage
object with the correct values (width, height, pixel data, etc.) and clear the image from the list.
Once the list is empty, the sketch gets executed, and images used during its lifetime will not require
waiting.
Here is an example of preload directives:
/* @pjs preload="./worldmap.jpg"; */
PImage img;
void setup() {
size(640,480);
noLoop();
img = loadImage("worldmap.jpg"); }
void draw() {
image(img,0,0); }
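The browser-driven mechanism described above can be sketched in a few lines of plain JavaScript. The counter and callback names below are illustrative, not the actual Processing.js internals.

// Track outstanding image preloads using the browser's own loader.
var pendingImages = 0;
function preloadImage(url, onAllLoaded) {
  pendingImages++;
  var img = new Image();
  img.onload = function() {
    // The image is downloaded and decoded; its width, height and pixels
    // can now be copied into a PImage-style object.
    pendingImages--;
    if (pendingImages === 0) {
      onAllLoaded(); // every preload is done, the sketch may start
    }
  };
  img.src = url; // assigning src triggers the browser's loader
}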
For other things, we've had to build more complicated "wait for me" systems. Fonts, unlike
images, do not have built-in browser loading (or at least not a system as functional as image loading).
While it is possible to load a font using a CSS @font-face rule and rely on the browser to make it all
happen, there are no JavaScript events that can be used to determine that a font finished loading. We
are slowly seeing events getting added to browsers to generate JavaScript events for font download
completion, but these events come too early, as the browser may need anywhere from a few to a
few hundred more milliseconds to actually parse the font for use on the page after download. Thus,
acting on these events will still lead to either no font being applied, or the wrong font being applied
if there is a known fallback font. Rather than relying on these events, we embed a tiny TrueType
font that only contains the letter "A" with impossibly small metrics, and instruct the browser to load
this font via an @font-face rule with a data URI that contains the font's bytecode as a BASE64
string. This font is so small that we can rely on it being immediately available. For any other font
load instruction we compare text metrics between the desired font and this tiny font. A hidden <div>
is set up with text styled using the desired font, with our tiny font as fallback. As long as the text in
that <div> is impossibly small, we know the desired font is not available yet, and we simply poll at
set intervals until the text has sensible metrics.
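In outline, that polling approach looks something like the JavaScript below; the fallback font name, probe text and size threshold are illustrative rather than the exact values used in Processing.js.

// Poll a hidden element until the requested font produces sensible metrics.
function waitForFont(fontName, onReady) {
  var probe = document.createElement("div");
  probe.style.cssText = "position:absolute; visibility:hidden; font-size:100px;";
  // Fall back to the tiny embedded font so the probe text stays
  // impossibly small until the real font has been parsed.
  probe.style.fontFamily = '"' + fontName + '", "PjsTinyFont"';
  probe.textContent = "AAAAAAAA";
  document.body.appendChild(probe);
  var timer = setInterval(function() {
    if (probe.offsetWidth > 10) { // sensible metrics: the real font is active
      clearInterval(timer);
      document.body.removeChild(probe);
      onReady();
    }
  }, 10);
}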
Java is strongly typed; JavaScript is not.
In Java, the number 2 and the number 2.0 are different values, and they will do different things
during mathematical operations. For instance, the code i = 1/2 will result in i being 0, because the
numbers are treated as integers, whereas i = 1/2.0, i = 1.0/2, and even i = 1.0/2.0 will all result
in i being 0.5, because the numbers are considered decimal fractions with a non-zero integer part,
and a zero fractional part. Even if the intended data type is a floating point number, if the arithmetic
uses only integers, the result will be an integer. This lets you write fairly creative math statements in
Java, and consequently in Processing, but these will generate potentially wildly different results when
ported to Processing.js, as JavaScript only knows "numbers". As far as JavaScript is concerned, 2
and 2.0 are the same number, and this can give rise to very interesting bugs when running a sketch
using Processing.js.
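A two-line comparison makes the difference concrete. The snippet below is plain JavaScript, with the Java behaviour shown in comments; the truncation idioms are simply what a sketch author could reach for if Java-style integer division is really wanted.

// Java / Processing:  int i = 1/2;   // i == 0, integer division
// JavaScript:         var i = 1/2;   // i == 0.5, there is only one number type
var i = Math.floor(1 / 2);  // 0: explicit truncation
var j = (7 / 2) | 0;        // 3: truncation via a bitwise trick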
This might sound like a big issue, and at first we were convinced it would be, but you can't argue
with real world feedback: it turns out this is almost never an issue for people who put their sketches
online using Processing.js. Rather than solving this in some cool and creative way, the resolution of
this problem was actually remarkably straightforward; we didn't solve it, and as a design choice, we
don't intend to ever revisit that decision. Short of adding a symbol table with strong typing so that
we can fake types in JavaScript and switch functionality based on type, this incompatibility cannot
properly be solved without leaving much harder-to-find edge case bugs, and so rather than adding
bulk to the code and slowdown to execution, we left this quirk in. It is a well-documented quirk, and
good code won't try to take advantage of Java's implicit number type casting. That said, sometimes
you will forget, and the result can be quite interesting.
Java is a class/instance-based object-oriented language, with separate variable and method
spaces; JavaScript is not.
JavaScript uses prototype objects, and the inheritance model that comes with it. This means all
objects are essentially key/value pairs where each key is a string, and values are either primitives,
arrays, objects, or functions. On the inheritance side, prototypes can extend other prototypes,
but there is no real concept of "superclass" and "subclass". In order to make proper Java-style
object-oriented code work, we had to implement classical inheritance for JavaScript in Processing.js,
without making it super slow (we think we succeeded in that respect). We also had to come up
with a way to prevent variable names and function names from stepping on each other. Because of
the key/value nature of JavaScript objects, defining a variable called line, followed by a function
like line(x1,y1,x2,y2) will leave you with an object that uses whatever was declared last for a
key. JavaScript first sets object.line = "some value" for you, and then sets object.line =
function(x1,y1,x2,y2){...}, overriding what you thought your variable line was.
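The collision is easy to reproduce in plain JavaScript; the object and values below are purely illustrative.

var obj = {};
obj.line = "some value";                  // the "variable" line
obj.line = function(x1, y1, x2, y2) {     // the "function" line clobbers it
  return [x1, y1, x2, y2];
};
// typeof obj.line is now "function"; the original value is gone.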
It would have slowed down the library a lot to create separate administration for variables and
methods/functions, so again the documentation explains that it's a bad idea to use variables and
functions with the same name. If everyone wrote proper code, this wouldn't be much of a problem,
as you want to name variables and functions based on what they're for, or what they do, but the
real world does things differently. Sometimes your code won't work, and it's because we decided
that having your code break due to a naming conflict is preferable to your code always working, but
always being slow. A second reason for not implementing variable and function separation was that
this could break JavaScript code used inside Processing sketches. Closures and the scope chain for
JavaScript rely on the key/value nature of objects, so driving a wedge in that by writing our own
administration would have also severely impacted performance in terms of Just-In-Time compilation
and compression based on functional closures.
Java allows method overloading; JavaScript does not.
One of Java's more powerful features is that you can define a function, let's say add(int,int),
and then define another function with the same name, but a different number of arguments, e.g.
add(int,int,int), or with different argument types, e.g. add(ComplexNumber,ComplexNumber).
Calling add with two or three integer arguments will automatically call the appropriate function,
and calling add with floats or Car objects will generate an error. JavaScript, on the other hand, does
not support this. In JavaScript, a function is a property, and you can dereference it (in which case
JavaScript will give you a value based on type coercion, which in this case returns true when the
property points to a function definition, or false when it doesn't), or you can call it as a function using
the execution operators (which you will know as parentheses with zero or more arguments between
them). If you define a function as add(x,y) and then call it as add(1,2,3,4,5,6), JavaScript is
okay with that. It will set x to 1 and y to 2 and simply ignore the rest of the arguments. In order
to make overloading work, we rewrite functions with the same name but different argument count
to a numbered function, so that function(a,b,c) in the source becomes function$3(a,b,c) in
the rewritten code, and function(a,b,c,d) becomes function$4(a,b,c,d), ensuring the correct
code paths.
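The effect of that renaming can be sketched as follows. The $N suffix matches the convention described above, but the dispatching wrapper is only an illustration, not the actual output of the Processing.js rewriter.

function add$2(a, b)    { return a + b; }
function add$3(a, b, c) { return a + b + c; }
// Illustrative dispatch by argument count:
function add() {
  if (arguments.length === 3) {
    return add$3(arguments[0], arguments[1], arguments[2]);
  }
  return add$2(arguments[0], arguments[1]);
}
// add(1, 2)    -> 3
// add(1, 2, 3) -> 6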
We also mostly solved overloading of functions with the same number but differently typed
arguments, as long as the argument types can be seen as different by JavaScript. JavaScript can
tell the functional type of properties using the typeof operator, which will return either number,
string, object or function depending on what a property represents. Declaring var x = 3 followed
by x = "6" will cause typeof x to report number after the initial declaration, and string after
reassignment. As long as functions with the same argument count differ in argument type, we rename
them and switch based on the result of the typeof operation. This does not work when the functions
take arguments of type object, so for these functions we have an additional check involving the
instanceof operator (which tests which constructor function was used to create the object) to
make function overloading work. In fact, the only place where we cannot successfully transcompile
overloaded functions is where the argument count is the same between functions, and the argument
types are different numerical types. As JavaScript only has one numerical type, declaring functions
such as add(int x, int y), add(float x, float y) and add(double x, double y) will clash.
Everything else, however, will work just fine.
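A type-based switch of this kind can be sketched in a few lines of JavaScript; the function names here are made up for illustration.

function describe$string(s) { return "text: " + s; }
function describe$number(n) { return "value: " + n; }
// Illustrative dispatch on typeof for same-arity overloads:
function describe(arg) {
  if (typeof arg === "string") {
    return describe$string(arg);
  }
  return describe$number(arg);
}
// describe("hello") -> "text: hello"
// describe(42)      -> "value: 42"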
Java allows importing compiled code.
Sometimes, plain Processing is not enough, and additional functionality is introduced in the form
of a Processing library. These take the form of a .jar archive with compiled Java code, and offer
things like networking, audio, video, hardware interfacing and other exotic functions not covered by
Processing itself.
This is a problem, because compiled Java code is Java byte code. This has given us many
headaches: how do we support library imports without writing a Java byte code decompiler? After
about a year of discussions, we settled on what may seem the simplest solution. Rather than trying
to also cover Processing libraries, we decided to support the import keyword in sketches, and
create a Processing.js Library API, so that library developers can write a JavaScript version of
their library (where feasible, given the web's nature), so that if they write a package that is used
via import processing.video, native Processing will pick the .jar archive, and Processing.js will
instead pick processing.video.js, thus ensuring that things just work. This functionality is slated for
Processing.js 1.4, and library imports are the last major feature that is still missing from Processing.js
(we currently support the import keyword only in the sense that it is removed from the source code
before conversion), and will be the last major step towards parity.
Why Pick JavaScript if It Can't Do Java?
This is not an unreasonable question, and it has multiple answers. The most obvious one is that
JavaScript comes with the browser. You don't install JavaScript yourself, there's no plugin to
download first; it's just there. If you want to port something to the web, you're stuck with JavaScript.
Although, given the flexibility of JavaScript, "stuck with" is really not doing justice to how powerful
the language is. So, one reason to pick JavaScript is because it's already there. Pretty much every
device that is of interest comes with a JavaScript-capable browser these days. The same cannot be
said for Java, which is being offered less and less as a preinstalled technology, if it is available at all.
However, the proper answer is that it's not really true that JavaScript can't do the things that
Java does; it can, it would just be slower. Even though out of the box JavaScript can't do some of the
things Java does, it's still a Turing-complete programming language and it can be made to emulate any
other programming language, at the cost of speed. We could, technically, write a full Java interpreter,
with a String heap, separate variable and method models, class/instance object-orientation with
rigid class hierarchies, and everything else under the Sun (or, these days, Oracle), but that's not what
we're in it for: Processing.js is about offering a Processing-to-the-web conversion, in as little code as
is necessary for that. This means that even though we decided not to make it do certain Java things,
our library has one huge benefit: it can cope with embedded JavaScript really, really well.
In fact, during a meeting between the Processing.js and Processing people at Bocoup in Boston,
in 2010, Ben Fry asked John Resig why he used regular expression replacement and only partial
conversion instead of doing a proper parser and compiler. John's response was that it was important
to him that people be able to mix Processing syntax (Java) and JavaScript without having to choose
between them. That initial choice has been crucial in shaping the philosophy of Processing.js ever
since. We've worked hard to keep it true in our code, and we can see a clear payoff when we look
at all the purely web users of Processing.js, who never used Processing, and will happily mix
Processing and JavaScript syntax without a problem.
The following example shows how JavaScript and Processing work together.
// JavaScript (would throw an error in native Processing)
var cs = { x: 5,
y: 0,
label: "my label",
rotate: function(theta) {
var nx = this.x*cos(theta) - this.y*sin(theta);
var ny = this.x*sin(theta) + this.y*cos(theta);
this.x = nx; this.y = ny; }};
// Processing
float angle = 0;
void setup() {
size(200,200);
strokeWeight(15); }
void draw() {
translate(width/2,height/2);
angle += PI/frameRate;
while(angle>2*PI) { angle-=2*PI; }
jQuery("#log").text(angle); // JavaScript (error in native Processing)
cs.rotate(angle); // legal JavaScript as well as Processing
stroke(random(255));
point(cs.x, cs.y); }
A lot of things in Java are promises: strong typing is a content promise to the compiler, visibility
is a promise on who will call methods and reference variables, interfaces are promises that instances
contain the methods the interface describes, etc. Break those promises and the compiler complains.
But, if you don't (and this is one of the most important thoughts for Processing.js), then you don't
need the additional code for those promises in order for a program to work. If you stick a number in
a variable, and your code treats that variable as if it has a number in it, then at the end of the day var
varname is just as good as int varname. Do you need typing? In Java, you do; in JavaScript, you
don't, so why force it in? The same goes for other code promises. If the Processing compiler doesn't
complain about your code, then we can strip all the explicit syntax for your promises and it'll still
work the same.
This has made Processing.js a ridiculously useful library for data visualisation, media presentation
and even entertainment. Sketches in native Processing work, but sketches that mix Java and JavaScript
also work just fine, as do sketches that use pure JavaScript by treating Processing.js as a glorified
canvas drawing framework. In an effort to reach parity with native Processing, without forcing
Java-only syntax, the project has been taken in by an audience as wide as the web itself. We've seen
activity all over the web using Processing.js. Everyone from IBM to Google has built visualisations,
presentations and even games with Processing.js; Processing.js is making a difference.
Another great thing about converting Java syntax to JavaScript while leaving JavaScript untouched
is that we've enabled something we hadn't even thought about ourselves: Processing.js will work with
anything that will work with JavaScript. One of the really interesting things that we're now seeing,
for instance, is that people are using CoffeeScript (a wonderfully simple, Ruby-like programming
language that transcompiles to JavaScript) in combination with Processing.js, with really cool results.
Even though we set out to build Processing for the web based on parsing Processing syntax, people
took what we did and used it with brand new syntaxes. They could never have done that if we had
made Processing.js simply be a Java interpreter. By sticking with code conversion rather than writing
a code interpreter, Processing.js has given Processing a reach on the web far beyond what it would
have had if it had stayed Java-only, or even if it had kept a Java-only syntax, with execution on the
web taken care of by JavaScript. The uptake of our code not just by end users, but also by people
who try to integrate it with their own technologies, has been both amazing and inspiring. Clearly
we're doing something right, and the web seems happy with what we're doing.
The Result
As we are coming up to Processing.js 1.4.0, our work has resulted in a library that will run any
sketch you give it, provided it does not rely on compiled Java library imports. If you can write it in
Processing, and it runs, you can put it on a webpage and it will just run. Due to the differences in
hardware access and low-level implementations of different parts of the rendering pipeline there will
be timing differences, but in general a sketch that runs at 60 frames per second in the Processing
IDE will run at 60 frames per second on a modern computer, with a modern browser. We have
reached a point where bug reports have started to die down, and most work is no longer about adding
feature support, but more about bug fixing and code optimization.
Thanks to the efforts of many developers working to resolve over 1800 bug reports, Processing
sketches run using Processing.js just work. Even sketches that rely on library imports can be made
to work, provided that the library code is at hand. Under favourable circumstances, the library is
written in a way that lets you rewrite it to pure Processing code with a few search-replace operations.
In this case the code can be made to work online virtually immediately. When the library does things
that cannot be implemented in pure Processing, but can be implemented using plain JavaScript, more
work is required to effectively emulate the library using JavaScript code, but porting is still possible.
The only instances of Processing code that cannot be ported are those that rely on functionality that
is inherently unavailable to browsers, such as interfacing directly with hardware devices (such as
webcams or Arduino boards) or performing unattended disk writes, though even this is changing.
Browsers are constantly adding functionality to allow for more elaborate applications, and limiting
factors today may disappear a year from now, so that hopefully in the not too distant future, even
sketches that are currently impossible to run online will become portable.
17.3 The Code Components
Processing.js is presented and developed as a large, single file, but architecturally it represents three
different components: 1) the launcher, responsible for converting Processing source to Processing.js
flavoured JavaScript and executing it, 2) static functionality that can be used by all sketches, and 3)
sketch functionality that has to be tied to individual instances.
The Launcher
The launcher component takes care of three things: code preprocessing, code conversion, and sketch
execution.
Preprocessing
In the preprocessing step, Processing.js directives are split off from the code, and acted upon. These
directives come in two flavours: settings and load instructions. There is a small number of directives,
keeping with the "it should just work" philosophy, and the only settings that sketch authors can change
are related to page interaction. By default a sketch will keep running if the page is not in focus, but the
pauseOnBlur = true directive sets up a sketch in such a way that it will halt execution when the page
the sketch is running on is not in focus, resuming execution when the page is in focus again. Also by
default, keyboard input is only routed to a sketch when it is focussed. This is especially important
when people run multiple sketches on the same page, as keyboard input intended for one sketch
should not be processed by another. However, this functionality can be disabled, routing keyboard
events to every sketch that is running on a page, using the globalKeyEvents = true directive.
Load instructions take the form of the aforementioned image preloading and font preloading.
Because images and fonts can be used by multiple sketches, they are loaded and tracked globally, so
that different sketches don't attempt multiple loads for the same resource.
Code Conversion
The code conversion component decomposes the source code into AST nodes, such as statements
and expressions, methods, variables, classes, etc. This AST is then expanded to JavaScript source code
that builds a sketch-equivalent program when executed. This converted source code makes heavy use
of the Processing.js instance framework for setting up class relations, where classes in the Processing
source code become JavaScript prototypes with special functions for determining superclasses and
bindings for superclass functions and variables.
Sketch Execution
The final step in the launch process is sketch execution, which consists of determining whether or
not all preloading has finished, and if it has, adding the sketch to the list of running instances and
triggering its JavaScript onLoad event so that any sketch listeners can take the appropriate action.
After this the Processing chain is run through: setup, then draw, and if the sketch is a looping sketch,
setting up an interval call to draw with an interval length that gets closest to the desired framerate for
the sketch.
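In outline, that looping behaviour amounts to something like the JavaScript below; the property and function names on the sketch object are assumptions for illustration, not the actual Processing.js scheduling code.

function startSketch(sketch, frameRate) {
  sketch.setup();
  sketch.draw();
  if (sketch.looping) {
    // Schedule draw at the interval closest to the desired framerate.
    setInterval(function() { sketch.draw(); }, 1000 / frameRate);
  }
}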
Static Library
Much of Processing.js falls under the "static library" heading, representing constants, universal
functions, and universal data types. A lot of these actually do double duty, being defined as global
properties, but also getting aliased by instances for quicker code paths. Global constants such as key
codes and color mappings are housed in the Processing object itself, set up once, and then referenced
when instances are built via the Processing constructor. The same applies to self-contained helper
functions, which lets us keep the code as close to "write once, run anywhere" as we can without
sacrificing performance.
Processing.js has to support a large number of complex data types, not just in order to support
the data types used in Processing, but also for its internal workings. These, too, are dened in the
Processing constructor:
Char, an internal object used to overcome some of the behavioural quirks of Java's char datatype.
PShape, which represents shape objects.
PShapeSVG, an extension for PShape objects, which is built from and represents SVG XML.
For PShapeSVG, we implemented our own SVG-to-<canvas>-instructions code. Since
Processing does not implement full SVG support, the code we saved by not relying on an external
SVG library means that we can account for every line of code relating to SVG imports. It
only parses what it has to, and doesn't waste space with code that follows the spec, but is
unused because native Processing does not support it.
XMLElement, an XML document object.
For XMLElement, too, we implemented our own code, relying on the browser to first load the
XML element into a Node-based structure, then traveling the node structure to build a leaner
object. Again, this means we don't have any dead code sitting in Processing.js, taking up
space and potentially causing bugs because a patch accidentally makes use of a function that
shouldn't be there.
PMatrix2D and PMatrix3D, which perform matrix operations in 2D and 3D mode.
PImage, which represents an image resource.
This is effectively a wrapper of the Image object, with some additional functions and properties
so that its API matches the Processing API.
PFont, which represents a font resource.
There is no Font object defined for JavaScript (at least for now), so rather than actually storing
the font as an object, our PFont implementation loads a font via the browser, computes its
metrics based on how the browser renders text with it, and then caches the resultant PFont
object. For speed, PFonts have a reference to the canvas that was used to determine the font
properties, in case textWidth must be calculated, but because we track PFont objects based
on name/size pair, if a sketch uses a lot of distinct text sizes, or fonts in general, this will
consume too much memory. As such, PFonts will clear their cached canvas and instead call
a generic textWidth computation function when the cache grows too large. As a secondary
memory preservation strategy, if the font cache continues to grow after clearing the cached
canvas for each PFont, font caching is disabled entirely, and font changes in the sketch simply
build new throwaway PFont objects for every change in font name, text size or text leading.
DrawingShared, Drawing2D, and Drawing3D, which house all the graphics functions.
The DrawingShared object is actually the biggest speed trap in Processing.js. It determines
if a sketch is launching in 2D or 3D mode, and then rebinds all graphics functions to either
the Drawing2D or Drawing3D object. This ensures a short code path for graphics instructions,
as 2D Processing sketches cannot use 3D functions, and vice versa. By only binding one of
the two sets of graphics functions, we gain speed from not having to switch on the graphics
mode in every function to determine the code path, and we save space by not binding the
graphics functions that are guaranteed not to be used.
ArrayList, a container that emulates Java's ArrayList.
HashMap, a container that emulates Java's HashMap.
ArrayList, and HashMap in particular, are special data structures because of how Java
implements them. These containers rely on the Java concepts of equality and hashing, and all
objects in Java have an equals and a hashCode method that allow them to be stored in lists
and maps.
For non-hashing containers, objects are resolved based on equality rather than identity.
Thus, list.remove(myobject) iterates through the list looking for an element for which
element.equals(myobject), rather than element == myobject, is true. Because all objects
must have an equals method, we implemented a virtual equals function on the JavaScript
side of things. This function takes two objects as arguments, checks whether either of them
implements their own equals function, and if so, falls through to that function. If they don't,
and the passed objects are primitives, primitive equality is checked. If they're not, then there
is no equality.
For hashing containers, things are even more interesting, as hashing containers act as shortcut
trees. The container actually wraps a variable number of lists, each tied to a specific hash
code. Objects are found based on first finding the container that matches their hash code,
in which the object is then searched for based on equality evaluation. As all objects in Java
have a hashCode method, we also wrote a virtual hashcode function, which takes a single
object as an argument. The function checks whether the object implements its own hashCode
function, and if so falls through to that function. If it doesn't, the hash code is computed
based on the same hashing algorithm that is used in Java.
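In rough outline, the virtual equality helper behaves like the JavaScript below; the function name and exact checks are illustrative rather than the real Processing.js internals.

function virtEquals(a, b) {
  if (a !== null && typeof a === "object" && typeof a.equals === "function") {
    return a.equals(b);              // use the object's own equals
  }
  if (b !== null && typeof b === "object" && typeof b.equals === "function") {
    return b.equals(a);
  }
  if (typeof a !== "object" && typeof b !== "object") {
    return a === b;                  // primitives: plain equality
  }
  return false;                      // otherwise: no equality
}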
Administration
The final piece of functionality in the static code library is the instance list of all sketches that are
currently running on the page. This instance list stores sketches based on the canvas they have been
loaded in, so that users can call Processing.getInstanceById(canvasid) and get a reference
to their sketch for page interaction purposes.
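Typical page-side usage looks like this (the canvas id is made up):

var sketch = Processing.getInstanceById("mysketchcanvas");
if (sketch) {
  sketch.noLoop();  // pause the animation from ordinary page JavaScript
}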
Instance Code
Instance code takes the form of p.functor = function(arg, ...) definitions for the Processing
API, and p.constant = ... for sketch state variables (where p is our reference to the sketch being
set up). Neither of these is located in dedicated code blocks. Rather, the code is organized based on
function, so that instance code relating to PShape operations is defined near the PShape object, and
instance code for graphics functions is defined near, or in, the Drawing2D and Drawing3D objects.
In order to keep things fast, a lot of code that could be written as static code with an instance
wrapper is actually implemented as purely instance code. For instance, the lerpColor(c1, c2, ratio)
function, which determines the color corresponding to the linear interpolation of two colors, is
defined as an instance function. Rather than having p.lerpColor(c1, c2, ratio) acting as a
wrapper for some static function Processing.lerpColor(c1, c2, ratio), the fact that nothing else
in Processing.js relies on lerpColor means that code execution is faster if we write it as a pure
instance function. While this does bloat the instance object, most functions for which we insist on
an instance function rather than a wrapper to the static library are small. Thus, at the expense of
memory we create really fast code paths. While the full Processing object will take up a one-time
memory slice worth around 5 MB when initially set up, the prerequisite code for individual sketches
only takes up about 500 KB.
17.4 Developing Processing.js
Processing.js is worked on intensively, which we can only do because our development approach
sticks to a few basic rules. As these rules influence the architecture of Processing.js, it's worth having
a brief look at them before closing this chapter.
Make It Work
Writing code that works sounds like a tautological premise; you write code, and by the time you're
done your code either works, because that's what you set out to do, or it doesn't, and you're not done
yet. However, "make it work" comes with a corollary: make it work, and when you're done, prove it.
If there is one thing above all other things that has allowed Processing.js to grow at the pace it
has, it is the presence of tests. Any ticket that requires touching the code, be it either by writing new
code or rewriting old code, cannot be marked as resolved until there is a unit or reference test that
allows others to verify not only that the code works the way it should, but also that it breaks when it
should. For most code, this typically involves a unit test, a short bit of code that calls a function and
simply tests whether the function returns the correct values, for both legal and illegal function calls.
Not only does this allow us to test code contributions, it also lets us perform regression tests.
Before any code is accepted and merged into our stable development branch, the modified
Processing.js library is validated against an ever-growing battery of unit tests. Big fixes and performance
tests in particular are prone to passing their own unit tests, but breaking parts that worked fine before
the rewrite. Having tests for every function in the API, as well as internal functions, means that
as Processing.js grows, we don't accidentally break compatibility with previous versions. Barring
destructive API changes, if none of the tests failed before a code contribution or modification, none
of the tests are allowed to fail with the new code in.
The following is an example of a unit test verifying inline object creation.
interface I {
int getX();
void test(); }
I i = new I() {
int x = 5;
public int getX() {
return x; }
public void test() {
x++; }};
i.test();
_checkEqual(i.getX(), 6);
_checkEqual(i instanceof I, true);
_checkEqual(i instanceof Object, true);
In addition to regular code unit tests, we also have visual reference (or ref) tests. As Processing.js
is a port of a visual programming language, some tests cannot be performed using just unit tests.
Testing to see whether an ellipse gets drawn on the correct pixels, or whether a single-pixel-wide
vertical line is drawn crisp or smoothed cannot be determined without a visual reference. Because all
mainstream browsers implement the <canvas> element and Canvas2D API with subtle differences,
these things can only be tested by running code in a browser and verifying that the resulting sketch
looks the same as what native Processing generates. To make life easier for developers, we use an
automated test suite for this, where new test cases are run through Processing, generating "what it
should look like" data to be used for pixel comparison. This data is then stored as a comment inside
the sketch that generated it, forming a test, and these tests are then run by Processing.js on a visual
reference test page which executes each test and performs pixel comparisons between "what it should
look like" and what it looks like. If the pixels are off, the test fails, and the developer is presented
with three images: what it should look like, how Processing.js rendered it, and the difference between
the two, marking problem areas as red pixels, and correct areas as white. Much like unit tests, these
tests must pass before any code contribution can be accepted.
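The pixel comparison at the heart of a ref test can be sketched with the standard Canvas2D API; the tolerance value and the idea of simply counting failing pixels are illustrative simplifications of the actual test harness.

function pixelDiff(referenceCanvas, testCanvas, tolerance) {
  var w = testCanvas.width, h = testCanvas.height;
  var ref = referenceCanvas.getContext("2d").getImageData(0, 0, w, h).data;
  var out = testCanvas.getContext("2d").getImageData(0, 0, w, h).data;
  var failing = 0;
  for (var i = 0; i < ref.length; i += 4) {
    // Compare the RGB channels; alpha is ignored for simplicity.
    if (Math.abs(ref[i]     - out[i])     > tolerance ||
        Math.abs(ref[i + 1] - out[i + 1]) > tolerance ||
        Math.abs(ref[i + 2] - out[i + 2]) > tolerance) {
      failing++;
    }
  }
  return failing;  // 0 means the rendering matches the reference
}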
Make It Fast
In an open source project, making things work is only the first step in the life of a function. Once
things work, you want to make sure things work fast. Based on the "if you can't measure it, you can't
improve it" principle, most functions in Processing.js don't just come with unit or ref tests, but also
with performance (or perf) tests. Small bits of code that simply call a function, without testing the
correctness of the function, are run several hundred times in a row, and their run time is recorded on
a special performance test web page. This lets us quantify how well (or not!) Processing.js performs
in browsers that support HTML5's <canvas> element. Every time an optimization patch passes
unit and ref testing, it is run through our performance test page. JavaScript is a curious beast, and
beautiful code can, in fact, run several orders of magnitude slower than code that contains the same
lines several times over, with inline code rather than function calls. This makes performance testing
crucial. We have been able to speed up certain parts of the library by three orders of magnitude
simply by discovering hot loops during perf testing, reducing the number of function calls by inlining
code, and by making functions return the moment they know what their return value should be, rather
than having only a single return at the very end of the function.
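The early-return point mentioned above is easiest to see side by side; both functions below are equivalent, but the second stops the moment the answer is known.

function contains(list, value) {
  var found = false;
  for (var i = 0; i < list.length; i++) {
    if (list[i] === value) { found = true; }
  }
  return found;                 // single exit, but the loop always runs to the end
}
function containsEarlyReturn(list, value) {
  for (var i = 0; i < list.length; i++) {
    if (list[i] === value) { return true; }  // return as soon as we know
  }
  return false;
}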
Another way in which we try to make Processing.js fast is by looking at what runs it. As
Processing.js is highly dependent on the efficiency of JavaScript engines, it makes sense to also look
at which features various engines offer to speed things up. Especially now that browsers are starting to
support hardware accelerated graphics, instant speed boosts are possible when engines offer new and
more efficient data types and functions to perform the low-level operations that Processing.js depends
on. For instance, JavaScript technically has no static typing, but graphics hardware programming
environments do. By exposing the data structures used to talk to the hardware directly to JavaScript,
it is possible to significantly speed up sections of code if we know that they will only use specific
values.
Make It Small
There are two ways to make code small. First, write compact code. If you're manipulating a variable
multiple times, compact it to a single manipulation (if possible). If you access an object variable
multiple times, cache it. If you call a function multiple times, cache the result. Return once you
have all the information you need, and generally apply all the tricks a code optimiser would apply
yourself. JavaScript is a particularly nice language for this, since it comes with an incredible amount
of flexibility. For example, rather than using:
if ((result = functionresult()) !== null) {
  value = result;
} else {
  value = defaultValue;
}
in JavaScript this becomes:
value = functionresult() || defaultValue;
There is also another form of small code, and that's in terms of runtime code. Because JavaScript
lets you change function bindings on the fly, running code becomes much smaller if you can say
"bind the function for line2D to the function call for line" once you know that a program runs in 2D
rather than 3D mode, so that you don't have to perform:
if (mode === "2D") { line2D(); } else { line3D(); }
for every function call that might be either in 2D or 3D mode.
Finally, there is the process of minification. There are a number of good systems that let you
compress your JavaScript code by renaming variables, stripping whitespace, and applying certain
code optimisations that are hard to do by hand while still keeping the code readable. Examples of
these are the YUI minifier and Google's Closure Compiler. We use these technologies in Processing.js
to offer end users bandwidth convenience: minification after stripping comments can shrink the
library by as much as 50%, and taking advantage of modern browser/server interaction for gzipped
content, we can oer the entire Processing.js library in gzipped form in 65 KB.
If All Else Fails, Tell People
Not everything that can currently be done in Processing can be done in the browser. Security models
prevent certain things like saving files to the hard disk and performing USB or serial port I/O, and
a lack of typing in JavaScript can have unexpected consequences (such as all math being floating
point math). Sometimes we're faced with the choice between adding an incredible amount of code
to enable an edge case, or marking the ticket as a "wontfix" issue. In such cases, a new ticket gets filed,
typically titled "Add documentation that explains why ...".
In order to make sure these things aren't lost, we have documentation for people who start
using Processing.js with a Processing background, and for people who start using Processing.js
with a JavaScript background, covering the differences between what is expected, and what actually
happens. Certain things just deserve special mention, because no matter how much work we put
into Processing.js, there are certain things we cannot add without sacrificing usability. A good
architecture doesn't just cover the way things are, it also covers why; without that, you'll just end up
having the same discussions about what the code looks like and whether it should be different every
time the team changes.
17.5 Lessons Learned
The most important lesson we learned while writing Processing.js is that when porting a language,
what matters is that the result is correct, not whether or not the code used in your port is similar to
the original. Even though Java and JavaScript syntax are fairly similar, and modifying Java code
to legal JavaScript code is fairly easy, it often pays to look at what JavaScript can natively do and
exploit that to get the same functional result. Taking advantage of the lack of typing by recycling
variables, using certain built-in functions that are fast in JavaScript but slow in Java, or avoiding
patterns that are fast in Java but slow in JavaScript means your code may look radically different, but
has the exact same effect. You often hear people say not to reinvent the wheel, but that only applies
to working with a single programming language. When you're porting, reinvent as many wheels as
you need to obtain the performance you require.
Another important lesson is to return early, return often, and branch as little as possible. An if/then
statement followed by a return can be made (sometimes drastically) faster by using an if-return/return
construction instead, using the return statement as a conditional shortcut. While it's conceptually
pretty to aggregate your entire function state before calling the ultimate return statement for that
function, it also means your code path may traverse code that is entirely unrelated to what you will
be returning. Don't waste cycles; return when you have all the information you need.
A third lesson concerns testing your code. In Processing.js we had the benefit of starting with
very good documentation outlining how Processing was supposed to work, and a large set of test
cases, most of which started out as "known fail". This allowed us to do two things: 1) write code
against tests, and 2) create tests before writing code. The usual process, in which code is written and
then test cases are written for that code, actually creates biased tests. Rather than testing whether or
not your code does what it should do, according to the specication, you are only testing whether
your code is bug-free. In Processing.js, we instead start by creating test cases based on what the
functional requirements for some function or set of functions is, based on the documentation for it.
With these unbiased tests, we can then write code that is functionally complete, rather than simply
bug-free but possibly deficient.
The last lesson is also the most general one: apply the rules of agile development to individual
fixes as well. No one benefits from you retreating into "dev mode" and not being heard from for three
days straight while you write the perfect solution. Rather, get your solutions to the point where they
work, and not even necessarily for all test cases, then ask for feedback. Working alone, with a test
suite for catching errors, is no guarantee of good or complete code. No amount of automated testing
is going to point out that you forgot to write tests for certain edge cases, or that there is a better
algorithm than the one you picked, or that you could have reordered your statements to make the
code better suited for JIT compilation. Treat fixes like releases: present fixes early, update often, and
work feedback into your improvements.
[chapter18]
Puppet
Luke Kanies
18.1 Introduction
Puppet is an open source IT management tool written in Ruby, used for datacenter automation
and server management at Google, Twitter, the New York Stock Exchange, and many others. It is
primarily maintained by Puppet Labs, which also founded the project. Puppet can manage as few as
2 machines and as many as 50,000, on teams with one system administrator or hundreds.
Puppet is a tool for configuring and maintaining your computers; in its simple configuration
language, you explain to Puppet how you want your machines configured, and it changes them as
needed to match your specification. As you change that specification over time, such as with package
updates, new users, or configuration updates, Puppet will automatically update your machines to
match. If they are already configured as desired, then Puppet does nothing.
In general, Puppet does everything it can to use existing system features to do its work; e.g., on
Red Hat it will use yum for packages and init.d for services, but on OS X it will use dmg for packages
and launchd for services. One of the guiding goals in Puppet is to have the work it does make sense
whether you are looking at Puppet code or the system itself, so following system standards is critical.
Puppet comes from multiple traditions of other tools. In the open source world, it is most
influenced by CFEngine, which was the first open source general-purpose configuration tool, and
ISconf, whose use of make for all work inspired the focus on explicit dependencies throughout the
system. In the commercial world, Puppet is a response to BladeLogic and Opsware (both since
acquired by larger companies), each of which was successful in the market when Puppet was begun,
but each of which was focused on selling to executives at large companies rather than building great
tools directly for system administrators. Puppet is meant to solve similar problems to these tools, but
it is focused on a very different user.
For a simple example of how to use Puppet, here is a snippet of code that will make sure the
secure shell service (SSH) is installed and configured properly:
class ssh {
  package { ssh: ensure => installed }
  file { "/etc/ssh/sshd_config":
    source => 'puppet:///modules/ssh/sshd_config',
    ensure => present,
    require => Package[ssh]
  }
  service { sshd:
    ensure => running,
    require => [File["/etc/ssh/sshd_config"], Package[ssh]]
  }
}
This makes sure the package is installed, the file is in place, and the service is running. Note that
we've specified dependencies between the resources, so that we always perform any work in the right
order. This class could then be associated with any host to apply this configuration to it. Notice that
the building blocks of a Puppet configuration are structured objects, in this case package, file, and
service. We call these objects resources in Puppet, and everything in a Puppet configuration comes
down to these resources and the dependencies between them.
A normal Puppet site will have tens or even hundreds of these code snippets, which we call
classes; we store these classes on disk in files called manifests, and collect them in related groups
called modules. For instance, you might have an ssh module with this ssh class plus any other
related classes, along with modules for mysql, apache, and sudo.
Most Puppet interactions are via the command line or long-running HTTP services, but there
are graphical interfaces for some things such as report processing. Puppet Labs also produces
commercial products around Puppet, which tend more toward graphical web-based interfaces.
Puppet's first prototype was written in the summer of 2004, and it was turned into a full-time
focus in February of 2005. It was initially designed and written by Luke Kanies, a sysadmin who
had a lot of experience writing small tools, but none writing tools greater than 10,000 lines of code.
In essence, Luke learned to be a programmer while writing Puppet, and that shows in its architecture
in both positive and negative ways.
Puppet was first and foremost built to be a tool for sysadmins, to make their lives easier and
allow them to work faster, more efficiently, and with fewer errors. The first key innovation meant to
deliver on this was the resources mentioned above, which are Puppet's primitives; they would both
be portable across most operating systems and also abstract away implementation detail, allowing the
user to focus on outcomes rather than how to achieve them. This set of primitives was implemented
in Puppet's Resource Abstraction Layer.
Puppet resources must be unique on a given host. You can only have one package named ssh,
one service named sshd, and one file named /etc/ssh/sshd_config. This prevents different parts of
your configurations from conflicting with each other, and you find out about those conflicts very early
in the configuration process. We refer to these resources by their type and title; e.g., Package[ssh]
and Service[sshd]. You can have a package and a service with the same name because they are
different types, but not two packages or services with the same name.
The second key innovation in Puppet provides the ability to directly specify dependencies between
resources. Previous tools focused on the individual work to be done, rather than how the various bits
of work were related; Puppet was the first tool to explicitly say that dependencies are a first-class
part of your configurations and must be modeled that way. It builds a graph of resources and their
dependencies as one of the core data types, and essentially everything in Puppet hangs off of this
graph (called a Catalog) and its vertices and edges.
The last major component in Puppet is its configuration language. This language is declarative,
and is meant to be more configuration data than full programming; it most resembles Nagios's
configuration format, but is also heavily influenced by CFEngine and Ruby.
Beyond the functional components, Puppet has had two guiding principles throughout its develop-
ment: it should be as simple as possible, always preferring usability even at the expense of capability;
and it should be built as a framework first and application second, so that others could build their
own applications on Puppet's internals as desired. It was understood that Puppet's framework needed
a killer application to be adopted widely, but the framework was always the focus, not the application.
Most people think of Puppet as being that application, rather than the framework behind it.
When Puppet's prototype was first built, Luke was essentially a decent Perl programmer with a
lot of shell experience and some C experience, mostly working in CFEngine. The odd thing is he
had experience building parsers for simple languages, having built two as part of smaller tools and
also having rewritten CFEngine's parser from scratch in an effort to make it more maintainable (this
code was never submitted to the project, because of small incompatibilities).
A dynamic language was easily decided on for Puppet's implementation, based on much higher
developer productivity and time to market, but choosing the language proved difficult. Initial
prototypes in Perl went nowhere, so other languages were sought for experimentation. Python was
tried, but Luke found the language quite at odds with how he thought about the world. Based on
what amounted to a rumor of utility heard from a friend, Luke tried Ruby, and in four hours had built
a usable prototype. When Puppet became a full-time effort in 2005 Ruby was a complete unknown,
so the decision to stick with it was a big risk, but again programmer productivity was deemed the
primary driver in language choice. The major distinguishing feature in Ruby, at least as opposed to
Perl, was how easy it was to build non-hierarchical class relationships, but it also mapped very well
to Lukes brain, which turned out to be critical.
18.2 Architectural Overview
This chapter is primarily about the architecture of Puppet's implementation (that is, the code that
we've used to make Puppet do the things it's supposed to do), but it's worth briefly discussing its
application architecture (that is, how the parts communicate), so that the implementation makes
some sense.
Puppet has been built with two modes in mind: a client/server mode with a central server
and agents running on separate hosts, or a serverless mode where a single process does all of the
work. To ensure consistency between these modes, Puppet has always had network transparency
internally, so that the two modes use the same code paths whether or not they go over the network.
Each executable can configure local or remote service access as appropriate, but otherwise they
behave identically. Note also that you can use the serverless mode in what amounts to a client/server
configuration, by pulling all configuration files to each client and having it parse them directly.
This section will focus on the client/server mode, because it is more easily understood as separate
components, but keep in mind that all of this is true of the serverless mode, too.
One of the defining choices in Puppet's application architecture is that clients should not get
access to raw Puppet modules; instead, they get a configuration compiled just for them. This provides
multiple benefits. First, you follow the principle of least privilege, in that each host knows only exactly
what it needs to know (how it should be configured), but does not know how any other servers are
configured. Second, you can completely separate the rights needed to compile a configuration (which
might include access to central data stores) from the rights needed to apply that configuration. Third, you can
run hosts in a disconnected mode where they repeatedly apply a configuration with no contact with
a central server, which means you remain in compliance even if the server is down or the client is
disconnected (such as would be the case in a mobile installation, or when the clients are in a DMZ).
Given this choice, the workflow becomes relatively straightforward:
1. The Puppet agent process collects information about the host it is running on, which it passes
to the server.
2. The parser uses that system information and the Puppet modules on local disk to compile a
configuration for that particular host and returns it to the agent.
3. The agent applies that configuration locally, thus affecting the local state of the host, and files
the resulting report with the server.
Figure 18.1: Puppet data flow
Thus, the agent has access to its own system information, its configuration, and each report it
generates. The server has copies of all of this data, plus access to all of the Puppet modules, and to any
back-end databases and services that might be needed to compile the configuration.
Beyond the components that go into this workflow, which we'll address next, there are many data
types that Puppet uses for internal communication. These data types are critical, because they're how
all communication is done, and they're public types which any other tool can consume or produce.
The most important data types are:
Facts: System data collected on each machine and used to compile configurations.
Manifest: Files containing Puppet code, generally organized into collections called modules.
Catalog: A graph of a given host's resources to be managed and the dependencies between them.
Report: The collection of all events generated during application of a given Catalog.
Beyond Facts, Manifests, Catalogs, and Reports, Puppet supports data types for files, certificates
(which it uses for authentication), and others.
Figure 18.2: Orchestration of data flow between Puppet processes and components
18.3 Component Analysis
Agent
The first component encountered in a Puppet run is the agent process. This was traditionally a
separate executable called puppetd, but in version 2.6 we reduced everything to a single executable, so
the agent is now invoked with puppet agent, akin to how Git works. The agent has little functionality of its own;
it is primarily configuration and code that implements the client-side aspects of the workflow
described above.
Facter
The next component after the agent is an external tool called Facter, which is a very simple tool used
to discover information about the host it is running on. This is data like the operating system, IP
address, and host name, but Facter is easily extensible, so many organizations add their own plugins
to discover custom data. The agent sends the data discovered by Facter to the server, which takes
over the workflow from that point.
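As a rough illustration, the facts for a host amount to a flat set of key/value pairs. The sketch below is a hypothetical Python rendering of that idea; the fact names and values are examples only, not an authoritative list of what Facter reports.

# Hypothetical example of the kind of key/value data Facter produces for a host.
facts = {
    "hostname": "web01",
    "operatingsystem": "Debian",   # example fact name; real names vary by Facter version
    "ipaddress": "192.0.2.10",
    "memorysize": "4.00 GB",
    "datacenter": "atlanta",       # a custom fact added by a site-specific plugin
}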
External Node Classifier
On the server, the first component encountered is what we call the External Node Classifier, or
ENC. The ENC accepts the host name and returns a simple data structure containing the high-level
configuration for that host. The ENC is generally a separate service or application: either another
open source project, such as Puppet Dashboard or Foreman, or an integration with existing data stores,
such as LDAP. The purpose of the ENC is to specify what functional classes a given host belongs to,
and what parameters should be used to configure those classes. For example, a given host might be
in the debian and webserver classes, and have the parameter datacenter set to atlanta.
Note that as of Puppet 2.7, the ENC is not a required component; users can instead directly
specify node configurations in Puppet code. Support for an ENC was added about two years after
Puppet was launched, because we realized that classifying hosts is fundamentally different from
configuring them, and it made more sense to split these problems into separate tools than to extend
the language to support both facilities. The ENC is always recommended, and at some point soon
it will become a required component (at which point Puppet will ship with a sufficiently useful one
that the requirement will not be a burden).
Once the server receives classification information from the ENC and system information from
Facter (via the agent), it bundles all of the information into a Node object and passes it on to the
Compiler.
Compiler
As mentioned above, Puppet has a custom language built for specifying system configurations. Its
compiler is really three chunks: a Yacc-style parser generator and a custom lexer; a group of classes
used to create our Abstract Syntax Tree (AST); and the Compiler class, which handles the interactions
of all of these classes and also functions as the API to this part of the system.
The most complicated thing about the compiler is the fact that most Puppet configuration code
is lazily loaded on first reference (to reduce both load times and irrelevant logging about missing-
but-unneeded dependencies), which means there aren't really explicit calls to load and parse the
code.
Puppet's parser uses a normal Yacc-style parser generator (http://dinosaur.compilertools.net/),
built using the open source Racc tool (https://github.com/tenderlove/racc). Unfortunately,
there were no open source lexer generators when Puppet was begun, so it uses a custom lexer.
Because we use an AST in Puppet, every statement in the Puppet grammar evaluates to an instance
of a Puppet AST class (e.g., Puppet::Parser::AST::Statement), rather than taking action directly,
and these AST instances are collected into a tree as the grammar tree is reduced. This AST provides
a performance benefit when a single server is compiling configurations for many different nodes,
because we can parse once but compile many times. It also gives us the opportunity to perform some
introspection of the AST, which provides information and capabilities we wouldn't have if parsing
took action directly.
Very few approachable AST examples were available when Puppet was begun, so there has been
a lot of evolution in it, and we've arrived at what seems to be a relatively unique formulation. Rather than
creating a single AST for the entire configuration, we create many small ASTs, keyed off their name.
For instance, this code:
class ssh {
  package { ssh: ensure => present }
}
creates a new AST containing a single Puppet::Parser::AST::Resource instance, and stores that
AST by the name ssh in the hash of all classes for this particular environment. (I've left out details
about other constructs akin to classes, but they are unnecessary for this discussion.)
Given the AST and a Node object (from the ENC), the compiler takes the classes specified in the
node object (if there are any), looks them up, and evaluates them. In the course of this evaluation, the
compiler builds up a tree of variable scopes; every class gets its own scope, which is attached to
the creating scope. This amounts to dynamic scoping in Puppet: if one class includes another class,
then the included class can look up variables directly in the including class. This has always been a
nightmare, and we have been on the path to getting rid of this capability.
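A minimal Python sketch of that scope-chain idea follows; Puppet's actual implementation is in Ruby, and the class and method names here are invented for illustration.

class Scope:
    """A variable scope attached to the scope that created it."""
    def __init__(self, parent=None):
        self.parent = parent
        self.variables = {}

    def lookup(self, name):
        # Dynamic scoping: if the variable isn't set here, ask the creating scope.
        if name in self.variables:
            return self.variables[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise KeyError(name)

# The including class's scope becomes the parent of the included class's scope,
# which is why the included class can see the includer's variables.
including = Scope()
including.variables["conf"] = "LISA11"
included = Scope(parent=including)
print(included.lookup("conf"))   # "LISA11"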
The Scope tree is temporary and is discarded once compiling is done, but the artifact of compiling
is also built up gradually over the course of the compilation. We call this artifact a Catalog, but it
is just a graph of resources and their relationships. Nothing of the variables, control structures, or
function calls survives into the catalog; it's plain data, and can be trivially converted to JSON, YAML,
or just about anything else.
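As a rough illustration of "plain data", a compiled catalog could be serialized as something like the following. This is a hypothetical, heavily trimmed Python sketch, not Puppet's exact wire format.

import json

# A hypothetical, heavily trimmed catalog: just resources and their relationships.
catalog = {
    "name": "web01.example.com",
    "resources": [
        {"type": "Package", "title": "ssh", "parameters": {"ensure": "present"}},
        {"type": "Service", "title": "sshd",
         "parameters": {"ensure": "running", "require": "Package[ssh]"}},
    ],
}
print(json.dumps(catalog, indent=2))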
During compilation, we create containment relationships; a class contains all of the resources
that come with that class (e.g., the ssh package above is contained by the ssh class). A class might
contain a definition, which itself contains either yet more definitions or individual resources. A
catalog tends to be a very horizontal, disconnected graph: many classes, each no more than a couple
of levels deep.
One of the awkward aspects of this graph is that it also contains dependency relationships, such
as a service requiring a package (maybe because the package installation actually creates the service),
but these dependency relationships are actually specified as parameter values on the resources, rather
than as edges in the structure of the graph. Our graph class (called SimpleGraph, for historical
reasons) does not support having both containment and dependency edges in the same graph, so we
have to convert between them for various purposes.
Transaction
Once the catalog is entirely constructed (assuming there is no failure), it is passed on to the Transaction.
In a system with a separate client and server, the Transaction runs on the client, which pulls the
Catalog down via HTTP as in Figure 18.2.
Puppet's transaction class provides the framework for actually affecting the system, whereas
everything else we've discussed just builds up and passes around objects. Unlike transactions in
more common systems such as databases, Puppet transactions do not have behaviors like atomicity.
The transaction performs a relatively straightforward task: walk the graph in the order specified
by the various relationships, and make sure each resource is in sync. As mentioned above, it
has to convert the graph from containment edges (e.g., Class[ssh] contains Package[ssh] and
Service[sshd]) to dependency edges (e.g., Service[sshd] depends on Package[ssh]), and then
it does a standard topological sort of the graph, selecting each resource in turn.
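A minimal Python sketch of that ordering step, assuming a catalog already expressed as dependency edges (the resource names and edges below are hypothetical, and Puppet's real SimpleGraph is Ruby and does considerably more):

from graphlib import TopologicalSorter  # Python 3.9+

# Dependency edges: each resource maps to the resources it depends on.
dependencies = {
    "Package[ssh]": set(),
    "File[/etc/ssh/sshd_config]": {"Package[ssh]"},
    "Service[sshd]": {"Package[ssh]", "File[/etc/ssh/sshd_config]"},
}

# Walk resources so that every dependency is handled before its dependents.
for resource in TopologicalSorter(dependencies).static_order():
    print("syncing", resource)
# Package[ssh], then File[/etc/ssh/sshd_config], then Service[sshd]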
For a given resource, we perform a simple three-step process: retrieve the current state of that
resource, compare it to the desired state, and make any changes necessary to fix discrepancies. For
instance, given this code:
file { "/etc/motd":
  ensure  => file,
  content => "Welcome to the machine",
  mode    => 644
}
the transaction checks the content and mode of /etc/motd, and if they don't match the specified state,
it will fix either or both of them. If /etc/motd is somehow a directory, then it will back up all of the
files in that directory, remove it, and replace it with a file that has the appropriate content and mode.
This process of making changes is actually handled by a simple ResourceHarness class that defines
the entire interface between Transaction and Resource. This reduces the number of connections
between the classes, and makes it easier to make changes to either independently.
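A minimal Python sketch of that retrieve/compare/fix loop, with made-up names standing in for Puppet's Ruby ResourceHarness and provider methods (the "_set" setter-naming convention here is invented purely for the sketch):

def apply(resource, provider, noop=False):
    """Bring one resource in sync: retrieve current state, compare, then fix."""
    changes = []
    for prop, desired in resource.desired_state.items():  # e.g. {"content": ..., "mode": ...}
        current = getattr(provider, prop)()                # retrieve via the provider's getter
        if current != desired:
            changes.append((prop, current, desired))
            if not noop:                                   # simulation mode never touches the system
                getattr(provider, prop + "_set")(desired)  # fix via the provider's setter
    return changes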
Resource Abstraction Layer
The Transaction class is the heart of getting work done with Puppet, but all of the work is actually
done by the Resource Abstraction Layer (RAL), which also happens to be the most interesting
component in Puppet, architecturally speaking.
The RAL was the first component created in Puppet and, other than the language, it most clearly
defines what the user can do. The job of the RAL is to define what it means to be a resource and how
resources can get work done on the system, and Puppet's language is specifically built to specify
resources as modeled by the RAL. Because of this, it is also the most important component in the
system, and the hardest to change. There are plenty of things we would like to fix in the RAL, and
we've made a lot of critical improvements to it over the years (the most crucial being the addition of
Providers), but there is still a lot of work to do on the RAL in the long term.
In the Compiler subsystem, we model resources and resource types with separate classes (named,
conveniently, Puppet::Resource and Puppet::Resource::Type). Our goal is to have these classes
also form the heart of the RAL, but for now these two behaviors (resource and type) are modeled
within a single class, Puppet::Type. (The class is named poorly because it significantly predates
our use of the term Resource, and at the time we were directly serializing memory structures when
communicating between hosts, so it was actually quite complicated to change class names.)
When Puppet::Type was first created, it seemed reasonable to put resource and resource type
behaviors in the same class; after all, resources are just instances of resource types. Over time,
however, it became clear that the relationship between a resource and its resource type isn't modeled
well by a traditional inheritance structure. For instance, resource types define what parameters a resource can
have, but not whether it accepts parameters at all (they all do). Thus, our base class of
Puppet::Type has class-level behaviors that determine how resource types behave, and instance-
level behaviors that determine how resource instances behave. It additionally has the responsibility
of managing registration and retrieval of resource types; if you want the user type, you call
Puppet::Type.type(:user).
This mix of behaviors makes Puppet::Type quite difficult to maintain. The whole class is less
than 2,000 lines of code, but working at three levels (resource, resource type, and resource type
manager) makes it convoluted. This is obviously why it's a major target for refactoring, but
it's more plumbing than user-facing, so it has always been hard to justify effort here rather than directly
on features.
Beyond Puppet::Type, there are two major kinds of classes in the RAL, the most interesting of
which are what we call Providers. When the RAL was first developed, each resource type mixed
the definition of a parameter with the code that knew how to manage it. For instance, we would define
the content parameter, and then provide a method that could read the content of a file, and another
method that could change the content:
Puppet::Type.newtype(:file) do
  ...
  newproperty(:content) do
    def retrieve
      File.read(@resource[:name])
    end

    def sync
      File.open(@resource[:name], "w") { |f| f.print @resource[:content] }
    end
  end
end
This example is simplified considerably (e.g., we use checksums internally, rather than the full
content strings), but you get the idea.
This became impossible to manage as we needed to support multiple varieties of a given resource
type. Puppet now supports more than 30 kinds of package management, and it would have been
impossible to support all of those within a single Package resource type. Instead, we provide a clean
interface that separates the definition of the resource type (essentially, what the name of the resource
type is and what properties it supports) from how you manage that type of resource. Providers
define getter and setter methods for all of a resource type's properties, named in obvious ways. For
example, this is how a provider of the above property would look:
Puppet::Type.newtype(:file) do
  newproperty(:content)
end

Puppet::Type.type(:file).provide(:posix) do
  def content
    File.read(@resource[:name])
  end

  def content=(str)
    File.open(@resource[:name], "w") { |f| f.print(str) }
  end
end
This is a touch more code in the simplest cases, but it is much easier to understand and maintain,
especially as either the number of properties or the number of providers increases.
I said at the beginning of this section that the Transaction doesn't actually affect the system
directly, and instead relies on the RAL for that. Now it's clear that it's the providers that do the
actual work. In fact, in general the providers are the only part of Puppet that actually touches the
system. The transaction asks for a file's content, and the provider collects it; the transaction specifies
that a file's content should be changed, and the provider changes it. Note, however, that the provider
never decides to affect the system; the Transaction owns the decisions, and the provider does the
work. This gives the Transaction complete control without requiring that it understand anything
about files, users, or packages, and this separation is what enables Puppet to have a full simulation
mode where we can largely guarantee the system won't be affected.
The second major class type in the RAL is responsible for the parameters themselves. We actually
support three kinds of parameters: metaparameters, which affect all resource types (e.g., whether
you should run in simulation mode); parameters, which are values that aren't reflected on disk (e.g.,
whether you should follow links in files); and properties, which model aspects of the resource that
you can change on disk (e.g., a file's content, or whether a service is running). The difference between
properties and parameters is especially confusing to people, but if you just think of properties as
having getter and setter methods in the providers, it's relatively straightforward.
Reporting
As the transaction walks the graph and uses the RAL to change the system's configuration, it
progressively builds a report. This report largely consists of the events generated by changes to the
system. These events, in turn, are comprehensive reflections of what work was done: they retain
a timestamp for when the resource changed, the previous value, the new value, any message generated, and
whether the change succeeded or failed (or was in simulation mode).
The events are wrapped in a ResourceStatus object, one per resource. Thus, for a given
Transaction, you know all of the resources that were run, and you know any changes that happened, along
with all of the metadata you might need about those changes.
Once the transaction is complete, some basic metrics are calculated and stored in the report, and
then it is sent off to the server (if so configured). With the report sent, the configuration process is
complete, and the agent goes back to sleep or the process simply ends.
18.4 Infrastructure
Now that we have a thorough understanding of what Puppet does and how, it's worth spending a
little time on the pieces that don't show up as capabilities but are still critical to getting the job done.
Plugins
One of the great things about Puppet is that it is very extensible. There are at least 12 different
kinds of extensibility in Puppet, and most of these are meant to be usable by just about anyone. For
example, you can create custom plugins for these areas:
resource types and custom providers
report handlers, such as for storing reports in a custom database
Indirector plugins for interacting with existing data stores
facts for discovering extra information about your hosts
However, Puppet's distributed nature means that agents need a way to retrieve and load new
plugins. Thus, at the start of every Puppet run, the first thing we do is download all plugins that the
server has available. These might include new resource types or providers, new facts, or even new
report processors.
This makes it possible to heavily upgrade Puppet agents without ever changing the core Puppet
packages. This is especially useful for highly customized Puppet installations.
Indirector
You've probably detected by now that we have a tradition of bad class names in Puppet, and according
to most people, this one takes the cake. The Indirector is a relatively standard Inversion of Control
framework with significant extensibility. Inversion of Control systems allow you to separate development
of functionality from how you control which functionality you use. In Puppet's case, this allows
us to have many plugins that provide very different functionality, such as reaching the compiler via
HTTP or loading it in-process, and to switch between them with a small configuration change rather
than a code change. In other words, Puppet's Indirector is basically an implementation of a service
locator, as described on the Wikipedia page for Inversion of Control. All of the hand-offs from
one class to another go through the Indirector, via a standard REST-like interface (e.g., we support
find, search, save, and destroy as methods), and switching Puppet from serverless to client/server is
largely a question of configuring the agent to use an HTTP endpoint for retrieving catalogs, rather
than using a compiler endpoint.
Because it is an Inversion of Control framework where configuration is stringently separated
from the code paths, this class can also be difficult to understand, especially when you're debugging
why a given code path was used.
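A minimal Python sketch of the service-locator idea behind the Indirector follows; the class and method names are invented for illustration, and Puppet's real Indirector is Ruby and far more capable.

class LocalCompiler:
    """Terminus that would compile the catalog in-process (stubbed here)."""
    def find(self, key):
        return {"host": key, "source": "compiled locally"}

class RestTerminus:
    """Terminus that would fetch the catalog from a server over HTTP (stubbed here)."""
    def find(self, key):
        return {"host": key, "source": "fetched via REST"}

TERMINI = {"compiler": LocalCompiler, "rest": RestTerminus}

class Indirection:
    """Route find/search/save/destroy-style calls to whichever terminus is configured."""
    def __init__(self, terminus_name):
        self.terminus = TERMINI[terminus_name]()

    def find(self, key):
        return self.terminus.find(key)

# Switching between serverless and client/server is just a configuration change:
print(Indirection("compiler").find("web01"))
print(Indirection("rest").find("web01"))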
Networking
Puppet's prototype was written in the summer of 2004, when the big networking question was
whether to use XMLRPC or SOAP. We chose XMLRPC, and it worked fine but had most of the
problems everyone else had: it didn't encourage standard interfaces between components, and it
tended to get overcomplicated very quickly as a result. We also had significant memory problems,
because the encoding needed for XMLRPC resulted in every object appearing at least twice in
memory, which quickly gets expensive for large files.
For our 0.25 release (begun in 2008), we began the process of switching all networking to a
REST-like model, but we chose a much more complicated route than just swapping out the networking.
We developed the Indirector as the standard framework for inter-component communication, and
built REST endpoints as just one option. It took two releases to fully support REST, and we have
not quite finished converting to JSON (instead of YAML) for all serialization. We undertook
the switch to JSON for two major reasons: first, YAML processing in Ruby is painfully slow, while pure-Ruby
processing of JSON is a lot faster; second, most of the web seems to be moving to JSON, and
it tends to be implemented more portably than YAML. Certainly in the case of Puppet, the first use
of YAML was not portable across languages, and was often not portable across different versions of
Puppet, because it was essentially a serialization of internal Ruby objects.
Our next major release of Puppet will finally remove all of the XMLRPC support.
18.5 Lessons Learned
In terms of implementation, we're proudest of the various kinds of separation that exist in Puppet:
the language is completely separate from the RAL, the Transaction cannot directly touch the system,
and the RAL can't decide to do work on its own. This gives the application developer a lot of control
over application workflow, along with a lot of access to information about what is happening and
why.
Puppet's extensibility and configurability are also major assets, because anyone can build on top
of Puppet quite easily without having to hack the core. We've always built our own capabilities on
the same interfaces we recommend our users use.
Puppet's simplicity and ease of use have always been its major draw. It's still too difficult to get
running, but it's miles easier than any of the other tools on the market. This simplicity comes with a
lot of engineering costs, especially in the form of maintenance and extra design work, but it's worth
it to allow users to focus on their problems instead of the tool.
Puppet's configurability is a real feature, but we took it a bit too far. There are too many ways
you can wire Puppet together, and it's too easy to build a workflow on top of Puppet that will make
you miserable. One of our major near-term goals is to dramatically reduce the number of knobs you can turn in
a Puppet configuration, so the user cannot so easily configure it poorly, and so we can more easily
upgrade it over time without worrying about obscure edge cases.
We also just generally changed too slowly. There are major refactors we've been wanting to do
for years but have never quite tackled. This has meant a more stable system for our users in the short
term, but also a more difficult-to-maintain system, and one that's much harder to contribute to.
Lastly, it took us too long to realize that our goals of simplicity were best expressed in the
language of design. Once we began speaking about design rather than just simplicity, we acquired a
much better framework for making decisions about adding or removing features, with a better means
of communicating the reasoning behind those decisions.
18.6 Conclusion
Puppet is both a simple system and a complex one. It has many moving parts, but they're wired
together quite loosely, and each of them has changed pretty dramatically since the project's founding in 2005.
It is a framework that can be used for all manner of configuration problems, but as an application it
is simple and approachable.
Our future success rests on that framework becoming more solid and more simple, and on that
application staying approachable while it gains capability.
[chapter19]
PyPy
Benjamin Peterson
PyPy is a Python implementation and a dynamic language implementation framework.
This chapter assumes familiarity with some basic interpreter and compiler concepts like bytecode
and constant folding.
19.1 A Little History
Python is a high-level, dynamic programming language. It was invented by the Dutch programmer
Guido van Rossum in the late 1980s. Guido's original implementation is a traditional bytecode
interpreter written in C, and consequently known as CPython. There are now many other Python
implementations. Among the most notable are Jython, which is written in Java and allows for
interfacing with Java code; IronPython, which is written in C# and interfaces with Microsoft's
.NET framework; and PyPy, the subject of this chapter. CPython is still the most widely used
implementation and currently the only one to support Python 3, the next generation of the Python
language. This chapter will explain the design decisions in PyPy that make it different from other
Python implementations and indeed from any other dynamic language implementation.
19.2 Overview of PyPy
PyPy, except for a negligible number of C stubs, is written completely in Python. The PyPy source
tree contains two major components: the Python interpreter and the RPython translation toolchain.
The Python interpreter is the programmer-facing runtime that people using PyPy as a Python
implementation invoke. It is actually written in a subset of Python called Restricted Python (usually
abbreviated RPython). The purpose of writing the Python interpreter in RPython is so the interpreter
can be fed to the second major part of PyPy, the RPython translation toolchain. The RPython
translator takes RPython code and converts it to a chosen lower-level language, most commonly
C. This allows PyPy to be a self-hosting implementation, meaning it is written in the language it
implements. As we shall see throughout this chapter, the RPython translator also makes PyPy a
general dynamic language implementation framework.
PyPy's powerful abstractions make it the most flexible Python implementation. It has nearly
200 configuration options, which vary from selecting different garbage collector implementations to
altering parameters of various translation optimizations.
19.3 The Python Interpreter
Since RPython is a strict subset of Python, the PyPy Python interpreter can be run on top of another
Python implementation untranslated. This is, of course, extremely slow, but it makes it possible to
quickly test changes in the interpreter. It also enables normal Python debugging tools to be used to
debug the interpreter. Most of PyPy's interpreter tests can be run both on the untranslated interpreter
and the translated interpreter. This allows quick testing during development as well as assurance that
the translated interpreter behaves the same as the untranslated one.
For the most part, the details of the PyPy Python interpreter are quite similar to those of CPython;
PyPy and CPython use nearly identical bytecode and data structures during interpretation. The
primary difference between the two is that PyPy has a clever abstraction called object spaces (or objspaces
for short). An objspace encapsulates all the knowledge needed to represent and manipulate Python
data types. For example, performing a binary operation on two Python objects or fetching an attribute
of an object is handled completely by the objspace. This frees the interpreter from having to know
anything about the implementation details of Python objects. The bytecode interpreter treats Python
objects as black boxes and calls objspace methods whenever it needs to manipulate them. For example,
here is a rough implementation of the BINARY_ADD opcode, which is called when two objects are
combined with the + operator. Notice how the operands are not inspected by the interpreter; all
handling is delegated immediately to the objspace.
def BINARY_ADD(space, frame):
    object1 = frame.pop()                 # pop left operand off stack
    object2 = frame.pop()                 # pop right operand off stack
    result = space.add(object1, object2)  # perform operation
    frame.push(result)                    # record result on stack
The objspace abstraction has numerous advantages. It allows new data type implementations to
be swapped in and out without modifying the interpreter. Also, since the sole way to manipulate
objects is through the objspace, the objspace can intercept, proxy, or record operations on objects.
Using the powerful abstraction of objspaces, PyPy has experimented with thunking, where results
can be lazily but completely transparently computed on demand, and tainting, where any operation
on an object will raise an exception (useful for passing sensitive data through untrusted code). The
most important application of objspaces, however, will be discussed in Section 19.4.
The objspace used in a vanilla PyPy interpreter is called the standard objspace (std objspace for
short). In addition to the abstraction provided by the objspace system, the standard objspace provides
another level of indirection: a single data type may have multiple implementations. Operations on data
types are then dispatched using multimethods. This allows picking the most efficient representation
for a given piece of data. For example, the Python long type (ostensibly a bigint data type) can
be represented as a standard machine-word-sized integer when it is small enough. The more memory-
and computationally expensive arbitrary-precision long implementation need only be used
when necessary. There's even an implementation of Python integers available using tagged pointers.
Container types can also be specialized to certain data types. For example, PyPy has a dictionary
(Python's hash table data type) implementation specialized for string keys. The fact that the same data
type can be represented by different implementations is completely transparent to application-level
code; a dictionary specialized to strings is identical to a generic dictionary and will degenerate
gracefully if non-string keys are put into it.
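The following toy Python sketch illustrates the general idea of multiple implementations behind one type, with dispatch based on the representation; it is not PyPy's actual multimethod machinery, and all names and thresholds are invented.

class SmallIntImpl:
    def __init__(self, value):
        self.value = value          # fits comfortably in a machine word
    def add(self, other):
        return wrap(self.value + other.value)

class BigIntImpl:
    def __init__(self, value):
        self.value = value          # arbitrary precision
    def add(self, other):
        return wrap(self.value + other.value)

def wrap(value):
    # Pick the cheapest representation that can hold the value.
    if -2**62 <= value < 2**62:
        return SmallIntImpl(value)
    return BigIntImpl(value)

x = wrap(5)
y = wrap(2**100)
print(type(x.add(y)).__name__)      # BigIntImpl: promoted transparently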
PyPy distinguishes between interpreter-level (interp-level) and application-level (app-level) code.
Interp-level code, which most of the interpreter is written in, must be in RPython and is translated.
It works directly with the objspace and wrapped Python objects. App-level code is always run by
the PyPy bytecode interpreter. As simple as interp-level RPython code is compared to C or Java,
PyPy developers have found it easiest to use pure app-level code for some parts of the interpreter.
Consequently, PyPy has support for embedding app-level code in the interpreter. For example, the
functionality of the Python print statement, which writes objects to standard output, is implemented
in app-level Python. Builtin modules can also be written partially in interp-level code and partially
in app-level code.
19.4 The RPython Translator
The RPython translator is a toolchain of several lowering phases that rewrite RPython to a target
language, typically C. The higher-level phases of translation are shown in Figure 19.1. The translator
is itself written in (unrestricted) Python and intimately linked to the PyPy Python interpreter, for
reasons that will be illuminated shortly.
Figure 19.1: Translation steps
The first thing the translator does is load the RPython program into its process. (This is done
with the normal Python module loading support.) RPython imposes a set of restrictions on normal,
dynamic Python. For example, functions cannot be created at runtime, and a single variable cannot
have the possibility of holding incompatible types, such as an integer and an object instance. When
the program is initially loaded by the translator, though, it is running on a normal Python interpreter
and can use all of Python's dynamic features. PyPy's Python interpreter, a huge RPython program,
makes heavy use of this feature for metaprogramming. For example, it generates code for standard
objspace multimethod dispatch. The only requirement is that the program be valid RPython by the
time the translator starts the next phase of translation.
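For instance, something like the following is ordinary Python but would be rejected by the translator, because the variable x would need to hold both an integer and a string. This is a simplified, hypothetical illustration of the kind of restriction described above.

def describe(flag):
    # Fine in normal Python, but not valid RPython: `x` cannot hold
    # an int on one path and a string on the other.
    if flag:
        x = 42
    else:
        x = "forty-two"
    return x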
The translator builds flow graphs of the RPython program through a process called abstract
interpretation. Abstract interpretation reuses the PyPy Python interpreter to interpret RPython
programs with a special objspace called the flow objspace. Recall that the Python interpreter treats
objects in a program like black boxes, calling out to the objspace to perform any operation. The flow
objspace, instead of the standard set of Python objects, has only two objects: variables and constants.
Variables represent values not known during translation, and constants, not surprisingly, represent
immutable values that are known. The flow objspace has a basic facility for constant folding; if it is
asked to do an operation where all the arguments are constants, it will statically evaluate it. What
is immutable and must be constant in RPython is broader than in standard Python. For example,
modules, which are emphatically mutable in Python, are constants in the flow objspace because
they don't exist in RPython and must be constant-folded out by the flow objspace. As the Python
interpreter interprets the bytecode of RPython functions, the flow objspace records the operations it
is asked to perform. It takes care to record all branches of conditional control flow constructs. The
end result of abstract interpretation for a function is a flow-graph consisting of linked blocks, where
each block has one or more operations.
An example of the flow-graph generating process is in order. Consider a simple factorial function:
def factorial(n):
    if n == 1:
        return 1
    return n * factorial(n - 1)
The flow-graph for the function looks like Figure 19.2.
Figure 19.2: Flow-graph of factorial
The factorial function has been divided into blocks containing the operations the flow objspace
recorded. Each block has input arguments and a list of operations on the variables and constants.
The first block has an exit switch at the end, which determines which block control flow will pass to
after the first block is run. The exit switch can be based on the value of some variable or on whether
an exception occurred in the last operation of the block. Control flow follows the lines between the
blocks.
The flow-graph generated in the flow objspace is in static single assignment form, or SSA, an
intermediate representation commonly used in compilers. The key feature of SSA is that every
variable is assigned only once. This property simplifies the implementation of many compiler
transformations and optimizations.
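As a tiny, hypothetical illustration of the single-assignment property, a reassignment in the source becomes a fresh variable in SSA form:

a, b = 3, 4

# Ordinary code: x is assigned twice.
x = a + b
x = x * 2

# The same computation rewritten in SSA style: every value gets its own name.
x1 = a + b
x2 = x1 * 2
assert x == x2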
After a function's graph is generated, the annotation phase begins. The annotator assigns a type
to the results and arguments of each operation. For example, the factorial function above will be
annotated to accept and return an integer.
The next phase is called RTyping. RTyping uses type information from the annotator to expand
each high-level flow-graph operation into low-level ones. It is the first part of translation where the
target backend matters. The backend chooses a type system for the RTyper to specialize the program
to. The RTyper currently has two type systems: a low-level type system for backends like C, and a
higher-level type system with classes. High-level Python operations and types are transformed
into the level of the type system. For example, an add operation with operands annotated as integers
will generate an int_add operation in the low-level type system. More complicated operations like
hash table lookups generate function calls.
After RTyping, some optimizations are performed on the low-level flow-graph. They are mostly
of the traditional compiler variety, like constant folding, store sinking, and dead code removal.
Python code typically has frequent dynamic memory allocations. RPython, being a Python
derivative, inherits this allocation-intensive pattern. In many cases, though, allocations are temporary
and local to a function. Malloc removal is an optimization that addresses these cases. Malloc removal
removes these allocations by flattening the previously dynamically allocated object into its component
scalars when possible.
To see how malloc removal works, consider the following function that computes the Euclidean
distance between two points on the plane in a roundabout fashion:
import math

def distance(x1, y1, x2, y2):
    p1 = (x1, y1)
    p2 = (x2, y2)
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])
When initially RTyped, the body of the function has the following operations:
v60 = malloc((GcStruct tuple2))
v61 = setfield(v60, (item0), x1_1)
v62 = setfield(v60, (item1), y1_1)
v63 = malloc((GcStruct tuple2))
v64 = setfield(v63, (item0), x2_1)
v65 = setfield(v63, (item1), y2_1)
v66 = getfield(v60, (item0))
v67 = getfield(v63, (item0))
v68 = int_sub(v66, v67)
v69 = getfield(v60, (item1))
v70 = getfield(v63, (item1))
v71 = int_sub(v69, v70)
v72 = cast_int_to_float(v68)
v73 = cast_int_to_float(v71)
v74 = direct_call(math_hypot, v72, v73)
This code is suboptimal in several ways. Two tuples that never escape the function are allocated.
Additionally, there is unnecessary indirection in accessing the tuple fields.
Running malloc removal produces the following concise code:
v53 = int_sub(x1_0, x2_0)
v56 = int_sub(y1_0, y2_0)
v57 = cast_int_to_float(v53)
v58 = cast_int_to_float(v56)
v59 = direct_call(math_hypot, v57, v58)
The tuple allocations have been completely removed and the indirections flattened out. Later, we
will see how a technique similar to malloc removal is used on application-level Python in the PyPy
JIT (Section 19.5).
PyPy also does function inlining. As in lower-level languages, inlining improves performance
in RPython. Somewhat surprisingly, it also reduces the size of the final binary. This is because it
allows more constant folding and malloc removal to take place, which reduces overall code size.
The program, now in optimized, low-level flow-graphs, is passed to the backend to generate
sources. Before it can generate C code, the C backend must perform some additional transformations.
One of these is exception transformation, where exception handling is rewritten to use manual stack
unwinding. Another is the insertion of stack depth checks. These raise an exception at runtime if the
recursion is too deep. Places where stack depth checks are needed are found by computing cycles in
the call graph of the program.
Another one of the transformations performed by the C backend is adding garbage collection
(GC). RPython, like Python, is a garbage-collected language, but C is not, so a garbage collector has
to be added. To do this, a garbage collection transformer converts the flow-graphs of the program into
a garbage-collected program. PyPy's GC transformers provide an excellent demonstration of how
translation abstracts away mundane details. In CPython, which uses reference counting, the C code
of the interpreter must carefully keep track of references to the Python objects it is manipulating. This
not only hardcodes the garbage collection scheme in the entire codebase but is prone to subtle human
errors. PyPy's GC transformer solves both problems; it allows different garbage collection schemes
to be swapped in and out seamlessly. It is trivial to evaluate a garbage collector implementation (of
which PyPy has many), simply by tweaking a configuration option at translation. Modulo transformer
bugs, the GC transformer also never makes reference mistakes or forgets to inform the GC when
an object is no longer in use. The power of the GC abstraction allows GC implementations that
would be practically impossible to hardcode in an interpreter. For example, several of PyPy's GC
implementations require a write barrier. A write barrier is a check which must be performed every
time a GC-managed object is placed in another GC-managed array or structure. The process of
inserting write barriers would be laborious and fraught with mistakes if done manually, but is trivial
when done automatically by the GC transformer.
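As a rough, hypothetical Python sketch of what an automatically inserted write barrier amounts to (the real barriers are emitted as low-level operations, and the exact bookkeeping depends on the specific collector; everything below is a toy stand-in):

class ToyGenerationalGC:
    """A toy stand-in for a generational collector's bookkeeping."""
    def __init__(self):
        self.old_objects = set()
        self.remembered = set()     # old objects that may point to young ones

    def write_barrier(self, container_id):
        if container_id in self.old_objects:
            self.remembered.add(container_id)

def setfield_with_barrier(collector, container, field, value):
    # The GC transformer inserts a call like this before every store of a
    # pointer into a GC-managed structure.
    collector.write_barrier(id(container))
    container[field] = value

collector = ToyGenerationalGC()
obj = {}
collector.old_objects.add(id(obj))
setfield_with_barrier(collector, obj, "child", {"young": True})
print(id(obj) in collector.remembered)     # True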
The C backend can finally emit C source code. The generated C code, being generated from
low-level flow-graphs, is an ugly mess of gotos and obscurely named variables. An advantage of
writing C is that the C compiler can do most of the complicated static transformation work required
to make a final binary, like loop optimizations and register allocation.
19.5 The PyPy JIT
Python, like most dynamic languages, has traditionally traded efficiency for flexibility. The archi-
tecture of PyPy, being especially rich in flexibility and abstraction, makes very fast interpretation
difficult. The powerful objspace and multimethod abstractions in the std objspace do not come with-
out a cost. Consequently, the vanilla PyPy interpreter performs up to 4 times slower than CPython.
To remedy not only this but also Python's reputation as a sluggish language, PyPy has a just-in-time
compiler (commonly written JIT). The JIT compiles frequently used code paths into assembly during
the runtime of the program.
The PyPy JIT takes advantage of PyPy's unique translation architecture described in Section 19.4.
PyPy actually has no Python-specific JIT; it has a JIT generator. JIT generation is implemented as
simply another optional pass during translation. An interpreter desiring JIT generation need only
make two special function calls called jit hints.
PyPy's JIT is a tracing JIT. This means it detects hot (meaning frequently run) loops to
optimize by compiling them to assembly. When the JIT has decided it is going to compile a loop, it records
the operations of one iteration of the loop, a process called tracing. These operations are subsequently
compiled to machine code.
As mentioned above, the JIT generator requires only two hints in the interpreter to generate a JIT:
merge_point and can_enter_jit. can_enter_jit tells the JIT where in the interpreter a loop
starts. In the Python interpreter, this is the end of the JUMP_ABSOLUTE bytecode. (JUMP_ABSOLUTE
makes the interpreter jump to the head of the app-level loop.) merge_point tells the JIT where it is
safe to return to the interpreter from the JIT. This is the beginning of the bytecode dispatch loop in
the Python interpreter.
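For a small interpreter, wiring in those hints looks roughly like the sketch below. It assumes the JitDriver helper from PyPy's RPython JIT library; the exact module path and argument details have varied between PyPy versions, and execute and is_backward_jump are hypothetical helpers, so treat this as an approximation rather than the definitive API.

# Assumed import; in some PyPy versions this has lived under pypy.rlib.jit instead.
from rpython.rlib.jit import JitDriver

# "green" variables identify a position in the user's program; "red" ones are
# everything else the loop touches.
driver = JitDriver(greens=['pc', 'bytecode'], reds=['frame'])

def interpret(bytecode, frame):
    pc = 0
    while pc < len(bytecode):
        # Safe point to return from JIT-compiled code to the interpreter.
        driver.jit_merge_point(pc=pc, bytecode=bytecode, frame=frame)
        opcode = bytecode[pc]
        pc = execute(opcode, frame, pc)     # hypothetical dispatch function
        if is_backward_jump(opcode):        # hypothetical predicate
            # A loop in the interpreted program closes here; the JIT may
            # decide this loop is hot and start tracing.
            driver.can_enter_jit(pc=pc, bytecode=bytecode, frame=frame)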
The JIT generator is invoked after the RTyping phase of translation. Recall that at this point,
the program's flow-graphs consist of low-level operations nearly ready for target code generation.
The JIT generator locates the hints mentioned above in the interpreter and replaces them with calls
to invoke the JIT during runtime. The JIT generator then writes a serialized representation of the
flow-graphs of every function that the interpreter wants jitted. These serialized flow-graphs are called
jitcodes. The entire interpreter is thus described in terms of low-level RPython operations. The
jitcodes are saved in the final binary for use at runtime.
At runtime, the JIT maintains a counter for every loop that is executed in the program. When a
loop's counter exceeds a configurable threshold, the JIT is invoked and tracing begins. The key object
in tracing is the meta-interpreter. The meta-interpreter executes the jitcodes created during translation.
It is thus interpreting the main interpreter, hence the name. As it traces the loop, it creates a list
of the operations it is executing and records them in JIT intermediate representation (IR), another
operation format. This list is called the trace of the loop. When the meta-interpreter encounters a
call to a jitted function (one for which jitcode exists), the meta-interpreter enters it and records its
operations in the original trace. Thus, tracing has the effect of flattening out the call stack; the only
calls in the trace are to interpreter functions that are outside the knowledge of the JIT.
The meta-interpreter is forced to specialize the trace to the properties of the loop iteration it is
tracing. For example, when the meta-interpreter encounters a conditional in the jitcode, it naturally
must choose one path based on the state of the program. When it makes a choice based on runtime
information, the meta-interpreter records an IR operation called a guard. In the case of a conditional,
this will be a guard_true or guard_false operation on the condition variable. Most arithmetic
operations also have guards, which ensure the operation did not overflow. Essentially, guards codify
assumptions the meta-interpreter is making as it traces. When assembly is generated, the guards
protect the assembly from being run in a context it is not specialized for. Tracing ends when the
meta-interpreter reaches the same can_enter_jit operation with which it started tracing. The loop
IR can now be passed to the optimizer.
The JIT optimizer features a few classical compiler optimizations and many optimizations special-
ized for dynamic languages. Among the most important of the latter are virtuals and virtualizables.
Virtuals are objects which are known not to escape the trace, meaning they are not passed as
arguments to external, non-jitted function calls. Structures and constant-length arrays can be virtuals.
Virtuals do not have to be allocated, and their data can be stored directly in registers and on the
stack. (This is much like the static malloc removal phase described in the section about translation
backend optimizations.) The virtuals optimization strips away the indirection and memory allocation
inefficiencies in the Python interpreter. For example, by becoming virtual, boxed Python integer
objects are unboxed into simple word-sized integers and can be stored directly in machine registers.
A virtualizable acts much like a virtual but may escape the trace (that is, be passed to non-
jitted functions). In the Python interpreter, the frame object, which holds variable values and the
instruction pointer, is marked virtualizable. This allows stack manipulations and other operations on
the frame to be optimized out. Although virtuals and virtualizables are similar, they share nothing in
implementation. Virtualizables are handled during tracing by the meta-interpreter, unlike
virtuals, which are handled during trace optimization. The reason for this is that virtualizables require
special treatment, since they may escape the trace. Specifically, the meta-interpreter has to ensure
that non-jitted functions that may use the virtualizable don't actually try to fetch its fields. This is
because in jitted code, the fields of a virtualizable are stored in the stack and registers, so the actual
virtualizable object may be out of date with respect to its current values in the jitted code. During JIT
generation, code which accesses a virtualizable is rewritten to check whether jitted assembly is running. If
it is, the JIT is asked to update the fields from the data in assembly. Additionally, when the external call
returns to jitted code, execution bails back to the interpreter.
After optimization, the trace is ready to be assembled. Since the JIT IR is already quite low-level,
assembly generation is not too difficult. Most IR operations correspond to only a few x86 assembly
operations. The register allocator is a simple linear algorithm. At the moment, the increased time that
would be spent in the backend with a more sophisticated register allocation algorithm, in exchange for
generating slightly better code, has not been justified. The trickiest portions of assembly generation
are garbage collector integration and guard recovery. The GC has to be made aware of stack roots in
the generated JIT code. This is accomplished by special support in the GC for dynamic root maps.
Figure 19.3: Bailing back to the interpreter on guard failure
When a guard fails, the compiled assembly is no longer valid and control must return to the
bytecode interpreter. This bailing out is one of the most difficult parts of JIT implementation, since
the interpreter state has to be reconstructed from the register and stack state at the point the guard
failed. For each guard, the assembler writes a compact description of where all the values needed to
reconstruct the interpreter state are located. At guard failure, execution jumps to a function which decodes
this description and passes the recovery values to a higher level to be reconstructed. The failing guard
may be in the middle of the execution of a complicated opcode, so the interpreter cannot just start
with the next opcode. To solve this, PyPy uses a blackhole interpreter. The blackhole interpreter
executes jitcodes starting from the point of guard failure until the next merge point is reached.
There, the real interpreter can resume. The blackhole interpreter is so named because, unlike the
meta-interpreter, it doesn't record any of the operations it executes. The process of guard failure is
depicted in Figure 19.3.
As described up to this point, the JIT would be essentially useless on any loop with a frequently
changing condition, because a guard failure would prevent the assembly from running very many
iterations. To address this, every guard has a failure counter. After the failure count has passed a certain threshold,
the JIT starts tracing from the point of guard failure instead of bailing back to the interpreter. This
new sub-trace is called a bridge. When the tracing reaches the end of the loop, the bridge is optimized
and compiled, and the original loop is patched at the guard to jump to the new bridge instead of the
failure code. This way, loops with dynamic conditions can be jitted.
How successful have the techniques used in the PyPy JIT proven? At the time of this writing, PyPy
is a geometric average of five times faster than CPython on a comprehensive suite of benchmarks.
With the JIT, app-level Python has the possibility of being faster than interp-level code. PyPy
developers have recently had the excellent problem of having to write interp-level loops in app-level
Python for performance.
Most importantly, the fact that the JIT is not specific to Python means it can be applied to any
interpreter written within the PyPy framework. This need not necessarily be a language interpreter.
For example, the JIT is used for Python's regular expression engine. NumPy is a powerful array
module for Python used in numerical computing and scientific research. PyPy has an experimental
reimplementation of NumPy. It harnesses the power of the PyPy JIT to speed up operations on
arrays. While the NumPy implementation is still in its early stages, initial performance results look
promising.
19.6 Design Drawbacks
While it beats C any day, writing in RPython can be a frustrating experience. Its implicit typing
is difficult to get used to at first. Not all Python language features are supported, and others are
arbitrarily restricted. RPython is not specified formally anywhere, and what the translator accepts
can vary from day to day as RPython is adapted to PyPy's needs. The author of this chapter often
manages to create programs that churn in the translator for half an hour, only to fail with an obscure
error.
The fact that the RPython translator is a whole-program analyzer creates some practical problems.
The smallest change anywhere in translated code requires retranslating the entire interpreter. That
currently takes about 40 minutes on a fast, modern system. The delay is especially annoying for
testing how changes affect the JIT, since measuring performance requires a translated interpreter. The
requirement that the whole program be present at translation means modules containing RPython
cannot be built and loaded separately from the core interpreter.
The levels of abstraction in PyPy are not always as clear-cut as in theory. While technically the
JIT generator should be able to produce an excellent JIT for a language given only the two hints
mentioned above, the reality is that it behaves better on some code than on others. The Python interpreter
has seen a lot of work towards making it more JIT-friendly, including many more JIT hints and
even new data structures optimized for the JIT.
The many layers of PyPy can make tracking down bugs a laborious process. A Python interpreter
bug could be directly in the interpreter source or buried somewhere in the semantics of RPython and
the translation toolchain. Especially when a bug cannot be reproduced on the untranslated interpreter,
debugging is difficult. It typically involves running GDB on the nearly unreadable generated C
sources.
Translating even a restricted subset of Python to a much lower-level language like C is not an
easy task. The lowering passes described in Section 19.4 are not really independent. Functions
are being annotated and rtyped throughout translation, and the annotator has some knowledge of
low-level types. The RPython translator is thus a tangled web of cross-dependencies. The translator
could do with cleaning up in several places, but doing it is neither easy nor much fun.
19.7 A Note on Process
In part to combat its own complexity (see Section 19.6), PyPy has adopted several so-called agile
development methodologies. By far the most important of these is test-driven development. All
new features and bug fixes are required to have tests to verify their correctness. The PyPy Python
interpreter is also run against CPython's regression test suite. PyPy's test driver, py.test, was spun off
and is now used in many other projects. PyPy also has a continuous integration system that runs
the test suite and translates the interpreter on a variety of platforms. Binaries for all platforms are
produced daily and the benchmark suite is run. All these tests ensure that the various components
are behaving, no matter what change is made in the complicated architecture.
There is a strong culture of experimentation in the PyPy project. Developers are encouraged
to make branches in the Mercurial repository. There, ideas in development can be refined without
destabilizing the main branch. Branches are not always successful, and some are abandoned. If
anything, though, PyPy developers are tenacious. Most famously, the current PyPy JIT is the fifth
attempt to add a JIT to PyPy!
Figure 19.4: The jitviewer showing Python bytecode and associated JIT IR operations
The PyPy project also prides itself on its visualization tools. The ow-graph charts in Section 19.4
are one example. PyPy also has tools to show invocation of the garbage collector over time and view
the parse trees of regular expressions. Of special interest is jitviewer, a program that allows one to
visually peel back the layers of a jitted function, from Python bytecode to JIT IR to assembly. (The
jitviewer is shown in Figure 19.4.) Visualization tools help developers understand how PyPy's many
layers interact with each other.
19.8 Summary
The Python interpreter treats Python objects as black boxes and leaves all behavior to be defined by
the objspace. Individual objspaces can provide special extended behavior to Python objects. The
objspace approach also enables the abstract interpretation technique used in translation.
The RPython translator allows details like garbage collection and exception handling to be
abstracted from the language interpreter. It also opens up the possibility of running PyPy on many
different runtime platforms by using different backends.
One of the most important uses of the translation architecture is the JIT generator. The generality
of the JIT generator allows JITs for new languages and sub-languages like regular expressions to be
added. PyPy is the fastest Python implementation today because of its JIT generator.
While most of PyPy's development effort has gone into the Python interpreter, PyPy can be used
for the implementation of any dynamic language. Over the years, partial interpreters for JavaScript,
Prolog, Scheme, and IO have been written with PyPy.
19.9 Lessons Learned
Finally, some of the lessons to take away from the PyPy project:
Repeated refactoring is often a necessary process. For example, it was originally envisioned that
the C backend for the translator would be able to work off the high-level flow graphs! It took several
iterations for the current multi-phase translation process to be born.
The most important lesson of PyPy is the power of abstraction. In PyPy, abstractions separate
implementation concerns. For example, RPython's automatic garbage collection allows a developer
working on the interpreter to not worry about memory management. At the same time, abstractions have
a mental cost. Working on the translation chain involves juggling the various phases of translation
at once in one's head. What layer a bug resides in can also be clouded by abstractions; abstraction
leakage, where swapping low-level components that should be interchangeable breaks higher-level
code, is a perennial problem. It is important that tests are used to verify that all parts of the system are
working, so a change in one system does not break a different one. More concretely, abstractions can
slow a program down by creating too much indirection.
The flexibility of (R)Python as an implementation language makes experimenting with new
Python language features (or even new languages) easy. Because of its unique architecture, PyPy
will play a large role in the future of Python and dynamic language implementation.
SQLAlchemy
Michael Bayer
SQLAlchemy is a database toolkit and object-relational mapping (ORM) system for the Python
programming language, first introduced in 2005. From the beginning, it has sought to provide an
end-to-end system for working with relational databases in Python, using the Python Database API
(DBAPI) for database interactivity. Even in its earliest releases, SQLAlchemy's capabilities attracted
a lot of attention. Key features include a great deal of fluency in dealing with complex SQL queries
and object mappings, as well as an implementation of the unit of work pattern, which provides for
a highly automated system of persisting data to a database.
Starting from a small, roughly implemented concept, SQLAlchemy quickly progressed through
a series of transformations and reworkings, turning over new iterations of its internal architectures as
well as its public API as the userbase continued to grow. By the time version 0.5 was introduced in
January of 2009, SQLAlchemy had begun to assume a stable form that was already proving itself
in a wide variety of production deployments. Throughout 0.6 (April, 2010) and 0.7 (May, 2011),
architectural and API enhancements continued the process of producing the most efficient and stable
library possible. As of this writing, SQLAlchemy is used by a large number of organizations in a
variety of fields, and is considered by many to be the de facto standard for working with relational
databases in Python.
20.1 The Challenge of Database Abstraction
The term database abstraction is often assumed to mean a system of database communication
which conceals the majority of details of how data is stored and queried. The term is sometimes
taken to the extreme, in that such a system should not only conceal the specifics of the relational
database in use, but also the details of the relational structures themselves and even whether or not
the underlying storage is relational.
The most common critiques of ORMs center on the assumption that this is the primary purpose
of such a tool: to hide the usage of a relational database, taking over the task of constructing an
interaction with the database and reducing it to an implementation detail. Central to this approach
of concealment is that the ability to design and query relational structures is taken away from the
developer and instead handled by an opaque library.
Those who work heavily with relational databases know that this approach is entirely impractical.
Relational structures and SQL queries are vastly functional, and comprise the core of an application's
design. How these structures should be designed, organized, and manipulated in queries varies not
just on what data is desired, but also on the structure of information. If this utility is concealed,
there's little point in using a relational database in the first place.
The issue of reconciling applications that seek concealment of an underlying relational database
with the fact that relational databases require great specificity is often referred to as the object-
relational impedance mismatch problem. SQLAlchemy takes a somewhat novel approach to this
problem.
SQLAlchemy's Approach to Database Abstraction
SQLAlchemy takes the position that the developer must be willing to consider the relational form of
his or her data. A system which pre-determines and conceals schema and query design decisions
marginalizes the usefulness of using a relational database, leading to all of the classic problems of
impedance mismatch.
At the same time, the implementation of these decisions can and should be executed through
high-level patterns as much as possible. Relating an object model to a schema and persisting it via
SQL queries is a highly repetitive task. Allowing tools to automate these tasks allows the development
of an application that's more succinct, capable, and efficient, and can be created in a fraction of the
time it would take to develop these operations manually.
To this end, SQLAlchemy refers to itself as a toolkit, to emphasize the role of the developer
as the designer/builder of all relational structures and linkages between those structures and the
application, not as a passive consumer of decisions made by a library. By exposing relational
concepts, SQLAlchemy embraces the idea of leaky abstraction, encouraging the developer to tailor
a custom, yet fully automated, interaction layer between the application and the relational database.
SQLAlchemy's innovation is the extent to which it allows a high degree of automation with little to
no sacrice in control over the relational database.
20.2 The Core/ORM Dichotomy
Central to SQLAlchemy's goal of providing a toolkit approach is that it exposes every layer of
database interaction as a rich API, dividing the task into two main categories known as Core and
ORM. The Core includes Python Database API (DBAPI) interaction, rendering of textual SQL
statements understood by the database, and schema management. These features are all presented
as public APIs. The ORM, or object-relational mapper, is then a specic library built on top of
the Core. The ORM provided with SQLAlchemy is only one of any number of possible object
abstraction layers that could be built upon the Core, and many developers and organizations build
their applications on top of the Core directly.
The Core/ORM separation has always been SQLAlchemy's most defining feature, and it has both
pros and cons. The explicit Core present in SQLAlchemy leads the ORM to relate database-mapped
class attributes to a structure known as a Table, rather than directly to their string column names as
expressed in the database; to produce a SELECT query using a structure called select, rather than
piecing together object attributes directly into a string statement; and to receive result rows through a
facade called ResultProxy, which transparently maps the select to each result row, rather than
transferring data directly from a database cursor to a user-dened object.
Core elements may not be visible in a very simple ORM-centric application. However, as the
Core is carefully integrated into the ORM to allow fluid transition between ORM and Core constructs,
a more complex ORM-centric application can move down a level or two in order to deal with the
Figure 20.1: SQLAlchemy layer diagram
database in a more specific and finely tuned manner, as the situation requires. As SQLAlchemy has
matured, the Core API has become less explicit in regular use as the ORM continues to provide
more sophisticated and comprehensive patterns. However, the availability of the Core was also a
contributor to SQLAlchemy's early success, as it allowed early users to accomplish much more than
would have been possible when the ORM was still being developed.
The downside to the ORM/Core approach is that instructions must travel through more steps.
Python's traditional C implementation has a significant overhead penalty for individual function
calls, which are the primary cause of slowness in the runtime. Traditional methods of ameliorating
this include shortening call chains through rearrangement and inlining, and replacing performance-
critical areas with C code. SQLAlchemy has spent many years using both of these methods to
improve performance. However, the growing acceptance of the PyPy interpreter for Python may
promise to squash the remaining performance problems without the need to replace the majority of
SQLAlchemy's internals with C code, as PyPy vastly reduces the impact of long call chains through
just-in-time inlining and compilation.
20.3 Taming the DBAPI
At the base of SQLAlchemy is a system for interacting with the database via the DBAPI. The
DBAPI itself is not an actual library, only a specification. Therefore, implementations of the DBAPI
are available for a particular target database, such as MySQL or PostgreSQL, or alternatively for
particular non-DBAPI database adapters, such as ODBC and JDBC.
The DBAPI presents two challenges. The first is to provide an easy-to-use yet full-featured facade
around the DBAPI's rudimentary usage patterns. The second is to handle the extremely variable
nature of specific DBAPI implementations as well as the underlying database engines.
The Dialect System
The interface described by the DBAPI is extremely simple. Its core components are the DBAPI
module itself, the connection object, and the cursor object; a cursor in database parlance represents
the context of a particular statement and its associated results. A simple interaction with these objects
to connect and retrieve data from a database is as follows:
connection = dbapi.connect(user="user", pw="pw", host="host")
cursor = connection.cursor()
cursor.execute("select * from user_table where name=?", ("jack",))
print "Columns in result:", [desc[0] for desc in cursor.description]
for row in cursor.fetchall():
    print "Row:", row
cursor.close()
connection.close()
SQLAlchemy creates a facade around the classical DBAPI conversation. The point of entry
to this facade is the create_engine call, from which connection and configuration information is
assembled. An instance of Engine is produced as the result. This object then represents the gateway
to the DBAPI, which itself is never exposed directly.
For simple statement executions, Engine offers what's known as an implicit execution interface.
The work of acquiring and closing both a DBAPI connection and cursor are handled behind the
scenes:
engine = create_engine("postgresql://user:pw@host/dbname")
result = engine.execute("select * from table")
print result.fetchall()
When SQLAlchemy 0.2 was introduced, the Connection object was added, providing the ability
to explicitly maintain the scope of the DBAPI connection:
conn = engine.connect()
result = conn.execute("select * from table")
print result.fetchall()
conn.close()
The result returned by the execute method of Engine or Connection is called a ResultProxy,
which offers an interface similar to the DBAPI cursor but with richer behavior. The Engine,
Connection, and ResultProxy correspond to the DBAPI module, an instance of a specific DBAPI
connection, and an instance of a specific DBAPI cursor, respectively.
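As a brief illustration of that correspondence (a minimal sketch re-using the engine from the earlier examples; the table and column names here are assumptions), a ResultProxy row can be addressed both like a DBAPI row tuple and by column name:

conn = engine.connect()
result = conn.execute("select id, name from user_table")

row = result.fetchone()
print "by name:", row["name"]     # richer than a raw DBAPI cursor row
print "by position:", row[1]      # positional access still works

result.close()
conn.close()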
Behind the scenes, the Engine references an object called a Dialect. The Dialect is an
abstract class for which many implementations exist, each one targeted at a specific DBAPI/database
combination. A Connection created on behalf of the Engine will refer to this Dialect for all
decisions, which may have varied behaviors depending on the target DBAPI and database in use.
The Connection, when created, will procure and maintain an actual DBAPI connection from
a repository known as a Pool that's also associated with the Engine. The Pool is responsible for
creating new DBAPI connections and, usually, maintaining them in an in-memory pool for frequent
re-use.
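The Pool's behavior is configured through arguments to create_engine. A minimal sketch, with the connection URL and the numbers chosen purely for illustration:

from sqlalchemy import create_engine

# keep up to five connections pooled, allow ten more to be opened under
# load, and recycle connections after an hour to avoid stale ones
engine = create_engine(
    "postgresql://user:pw@host/dbname",
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,
)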
During a statement execution, an additional object called an ExecutionContext is created
by the Connection. The object lasts from the point of execution throughout the lifespan of the
ResultProxy. It may also be available as a specific subclass for some DBAPI/database combinations.
Figure 20.2 illustrates all of these objects and their relationships to each other as well as to the
DBAPI components.
Figure 20.2: Engine, Connection, ResultProxy API
Dealing with DBAPI Variability
For the task of managing variability in DBAPI behavior, first we'll consider the scope of the problem.
The DBAPI specification, currently at version two, is written as a series of API definitions which
allow for a wide degree of variability in behavior, and leave a good number of areas undefined.
As a result, real-life DBAPIs exhibit a great degree of variability in several areas, including when
Python unicode strings are acceptable and when they are not; how the last inserted id (that is, an
autogenerated primary key) may be acquired after an INSERT statement; and how bound parameter
values may be specied and interpreted. They also have a large number of idiosyncratic type-oriented
behaviors, including the handling of binary, precision numeric, date, Boolean, and unicode data.
SQLAlchemy approaches this by allowing variability in both Dialect and ExecutionContext
via multi-level subclassing. Figure 20.3 illustrates the relationship between Dialect and
ExecutionContext when used with the psycopg2 dialect. The PGDialect class provides behaviors
that are specific to the usage of the PostgreSQL database, such as the ARRAY datatype and schema
catalogs; the PGDialect_psycopg2 class then provides behaviors specific to the psycopg2 DBAPI,
including unicode data handlers and server-side cursor behavior.
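Which subclass is chosen is driven by the create_engine URL; a small sketch (the URLs are placeholders):

from sqlalchemy import create_engine

# "postgresql://..." picks PostgreSQL's default DBAPI, psycopg2
engine = create_engine("postgresql://user:pw@host/dbname")

# the "dialect+driver" form names the DBAPI explicitly, selecting the
# psycopg2-specific Dialect and ExecutionContext classes
engine = create_engine("postgresql+psycopg2://user:pw@host/dbname")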
Figure 20.3: Simple Dialect/ExecutionContext hierarchy
A variant on the above pattern presents itself when dealing with a DBAPI that supports multiple
databases. Examples of this include pyodbc, which deals with any number of database backends via
ODBC, and zxjdbc, a Jython-only driver which deals with JDBC. The above relationship is augmented
by the use of a mixin class from the sqlalchemy.connectors package which provides DBAPI
behavior that is common to multiple backends. Figure 20.4 illustrates the common functionality of
sqlalchemy.connectors.pyodbc shared among pyodbc-specific dialects for MySQL and Microsoft
SQL Server.
The Dialect and ExecutionContext objects provide a means to define every interaction with
the database and DBAPI, including how connection arguments are formatted and how special quirks
during statement execution are handled. The Dialect is also a factory for SQL compilation constructs
that render SQL correctly for the target database, and type objects which define how Python data
should be marshaled to and from the target DBAPI and database.
20.4 Schema Denition
With database connectivity and interactivity established, the next task is to provide for the creation
and manipulation of backend-agnostic SQL statements. To achieve this, we need to define first
how we will refer to the tables and columns present in a database: the so-called schema. Tables
and columns represent how data is organized, and most SQL statements consist of expressions and
commands referring to these structures.
Figure 20.4: Common DBAPI behavior shared among dialect hierarchies
An ORM or data access layer needs to provide programmatic access to the SQL language; at the
base is a programmatic system of describing tables and columns. This is where SQLAlchemy offers
the first strong division of Core and ORM, by offering the Table and Column constructs that describe
the structure of the database independently of a user's model class definition. The rationale behind
the division of schema definition from object relational mapping is that the relational schema can be
designed unambiguously in terms of the relational database, including platform-specific details if
necessary, without being muddled by object-relational concepts; these remain a separate concern.
Being independent of the ORM component also means the schema description system is just as
useful for any other kind of object-relational system which may be built on the Core.
The Table and Column model falls under the scope of what's referred to as metadata, offering
a collection object called MetaData to represent a collection of Table objects. The structure is
derived mostly from Martin Fowler's description of Metadata Mapping in Patterns of Enterprise
Application Architecture. Figure 20.5 illustrates some key elements of the sqlalchemy.schema
package.
Table represents the name and other attributes of an actual table present in a target schema.
Its collection of Column objects represents naming and typing information about individual table
columns. A full array of objects describing constraints, indexes, and sequences is provided to fill in
many more details, some of which impact the behavior of the engine and SQL construction system.
In particular, ForeignKeyConstraint is central to determining how two tables should be joined.
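For illustration, a small sketch of such a schema description, with table and column names invented for the example:

from sqlalchemy import MetaData, Table, Column, Integer, String, ForeignKey

metadata = MetaData()

user = Table('user', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(50), nullable=False),
)

address = Table('address', metadata,
    Column('id', Integer, primary_key=True),
    # ForeignKey produces a ForeignKeyConstraint, which later tells the
    # SQL expression system how user and address may be joined
    Column('user_id', Integer, ForeignKey('user.id')),
    Column('email', String(100)),
)

# emit CREATE TABLE statements for everything collected on the MetaData;
# 'engine' is assumed from the earlier create_engine examples
metadata.create_all(engine)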
Table and Column in the schema package are unique versus the rest of the package in that they are
dual-inheriting, both from the sqlalchemy.schema package and the sqlalchemy.sql.expression
package, serving not just as schema-level constructs, but also as core syntactical units in the SQL
expression language. This relationship is illustrated in Figure 20.6.
In Figure 20.6 we can see that Table and Column inherit from the SQL world as specific forms of
things you can select from, known as a FromClause, and things you can use in a SQL expression,
known as a ColumnElement.
Figure 20.5: Basic sqlalchemy.schema objects
Figure 20.6: The dual lives of Table and Column
20.5 SQL Expressions
During SQLAlchemy's creation, the approach to SQL generation wasn't clear. A textual language
might have been a likely candidate; this is a common approach which is at the core of well-known
object-relational tools like Hibernate's HQL. For Python, however, a more intriguing choice was
available: using Python objects and expressions to generatively construct expression tree structures,
even re-purposing Python operators so that operators could be given SQL statement behavior.
While it may not have been the first tool to do so, full credit goes to the SQLBuilder library
included in Ian Bicking's SQLObject as the inspiration for the system of Python objects and operators
used by SQLAlchemy's expression language. In this approach, Python objects represent lexical
portions of a SQL expression. Methods on those objects, as well as overloaded operators, generate
new lexical constructs derived from them. The most common object is the Column object.
SQLObject would represent these on an ORM-mapped class using a namespace accessed via the .q
attribute; SQLAlchemy named the attribute .c. The .c attribute remains today on Core selectable
elements, such as those representing tables and select statements.
Expression Trees
A SQLAlchemy SQL expression construct is very much the kind of structure you'd create if you were
parsing a SQL statement; it's a parse tree, except the developer creates the parse tree directly, rather
than deriving it from a string. The core type of node in this parse tree is called ClauseElement, and
Figure 20.7 illustrates the relationship of ClauseElement to some key classes.
Figure 20.7: Basic expression hierarchy
Through the use of constructor functions, methods, and overloaded Python operator functions, a
structure for a statement like:
SELECT id FROM user WHERE name = ?
might be constructed in Python like:
from sqlalchemy.sql import table, column, select
user = table('user', column('id'), column('name'))
stmt = select([user.c.id]).where(user.c.name == 'ed')
The structure of the above select construct is shown in Figure 20.8. Note the representation of
the literal value 'ed' is contained within the _BindParam construct, thus causing it to be rendered
as a bound parameter marker in the SQL string using a question mark.
Figure 20.8: Example expression tree
From the tree diagram, one can see that a simple descending traversal through the nodes can
quickly create a rendered SQL statement, as we'll see in greater detail in the section on statement
compilation.
Python Operator Approach
In SQLAlchemy, an expression like this:
column('a') == 2
produces neither True nor False, but instead a SQL expression construct. The key to this is to
overload operators using the Python special operator functions: e.g., methods like __eq__, __ne__,
__le__, __lt__, __add__, __mul__. Column-oriented expression nodes provide overloaded Python
operator behavior through the usage of a mixin called ColumnOperators. Using operator overloading,
an expression column('a') == 2 is equivalent to:
from sqlalchemy.sql.expression import _BinaryExpression
from sqlalchemy.sql import column, bindparam
from sqlalchemy.operators import eq
_BinaryExpression(
    left=column('a'),
    right=bindparam('a', value=2, unique=True),
    operator=eq
)
The eq construct is actually a function originating from the Python operator built-in. Representing
operators as an object (i.e., operator.eq) rather than a string (i.e., "=") allows the string representation
to be defined at statement compilation time, when database dialect information is known.
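The mechanism can be sketched independently of SQLAlchemy's actual classes; the names below are simplified stand-ins for ColumnClause, _BindParam, and _BinaryExpression, not the library's own implementation:

import operator

class BinaryExpression(object):
    def __init__(self, left, right, op):
        self.left, self.right, self.op = left, right, op

class BindParam(object):
    def __init__(self, value):
        self.value = value

class ColumnClause(object):
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # comparison builds a tree node instead of returning True/False;
        # the operator is kept as operator.eq, not as the string "=",
        # so its rendering can be decided at compile time
        return BinaryExpression(self, BindParam(other), operator.eq)

expr = ColumnClause('a') == 2
assert isinstance(expr, BinaryExpression)
assert expr.op is operator.eq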
Compilation
The central class responsible for rendering SQL expression trees into textual SQL is the Compiled
class. This class has two primary subclasses, SQLCompiler and DDLCompiler. SQLCompiler handles
SQL rendering operations for SELECT, INSERT, UPDATE, and DELETE statements, collectively
classified as DQL (data query language) and DML (data manipulation language), while DDLCompiler
handles various CREATE and DROP statements, classified as DDL (data definition language).
There is an additional class hierarchy focused around string representations of types, starting at
TypeCompiler. Individual dialects then provide their own subclasses of all three compiler types to
define SQL language aspects specific to the target database. Figure 20.9 provides an overview of this
class hierarchy with respect to the PostgreSQL dialect.
Figure 20.9: Compiler hierarchy, including PostgreSQL-specific implementation
The Compiled subclasses define a series of visit methods, each one referred to by a particular
subclass of ClauseElement. A hierarchy of ClauseElement nodes is walked and a statement is
constructed by recursively concatenating the string output of each visit function. As this proceeds,
the Compiled object maintains state regarding anonymous identifier names, bound parameter names,
and nesting of subqueries, among other things, all of which aim for the production of a string SQL
statement as well as a final collection of bound parameters with default values. Figure 20.10 illustrates
the process of visit methods resulting in textual units.
Figure 20.10: Call hierarchy of a statement compilation
A completed Compiled structure contains the full SQL string and collection of bound values.
These are coerced by an ExecutionContext into the format expected by the DBAPI's execute
method, which includes such considerations as the treatment of a unicode statement object, the type
of collection used to store bound values, as well as specifics on how the bound values themselves
should be coerced into representations appropriate to the DBAPI and target database.
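As a quick sketch of this end state, the select construct built earlier can be compiled against a particular dialect; the exact string output shown in the comments is approximate:

from sqlalchemy.sql import table, column, select
from sqlalchemy.dialects import sqlite

user = table('user', column('id'), column('name'))
stmt = select([user.c.id]).where(user.c.name == 'ed')

# the SQLite dialect uses the qmark paramstyle, so the bound parameter
# renders as "?"; the params dict holds its default value
compiled = stmt.compile(dialect=sqlite.dialect())
print compiled          # roughly: SELECT user.id FROM user WHERE user.name = ?
print compiled.params   # {'name_1': 'ed'}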
20.6 Class Mapping with the ORM
We now shift our attention to the ORM. The first goal is to use the system of table metadata we've
defined to allow mapping of a user-defined class to a collection of columns in a database table.
The second goal is to allow the definition of relationships between user-defined classes, based on
relationships between tables in a database.
SQLAlchemy refers to this as mapping, following the well-known Data Mapper pattern de-
scribed in Fowler's Patterns of Enterprise Application Architecture. Overall, the SQLAlchemy ORM
draws heavily from the practices detailed by Fowler. It's also heavily influenced by the famous Java
relational mapper Hibernate and Ian Bicking's SQLObject product for Python.
Classical vs. Declarative
We use the term classical mapping to refer to SQLAlchemy's system of applying an object-relational
data mapping to an existing user class. This form considers the Table object and the user-defined class
to be two individually defined entities which are joined together via a function called mapper. Once
mapper has been applied to a user-defined class, the class takes on new attributes that correspond to
columns in the table:
class User(object):
    pass

mapper(User, user_table)

# now User has an ".id" attribute
User.id
mapper can also affix other kinds of attributes to the class, including attributes which correspond to
references to other kinds of objects, as well as arbitrary SQL expressions. The process of affixing
arbitrary attributes to a class is known in the Python world as monkeypatching; however, since we
are doing it in a data-driven and non-arbitrary way, the spirit of the operation is better expressed
with the term class instrumentation.
Modern usage of SQLAlchemy centers around the Declarative extension, which is a configura-
tional system that resembles the common active-record-like class declaration system used by many
other object-relational tools. In this system, the end user explicitly defines attributes inline with the
class definition, each representing an attribute on the class that is to be mapped. The Table object,
in most cases, is not mentioned explicitly, nor is the mapper function; only the class, the Column
objects, and other ORM-related attributes are named:
class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
It may appear, above, that the class instrumentation is being achieved directly by our placement of
id = Column(), but this is not the case. The Declarative extension uses a Python metaclass, which
is a handy way to run a series of operations each time a new class is first declared, to generate a new
Table object from what's been declared, and to pass it to the mapper function along with the class.
The mapper function then does its job in exactly the same way, patching its own attributes onto the
class, in this case towards the id attribute, and replacing what was there previously. By the time the
metaclass initialization is complete (that is, when the flow of execution leaves the block delineated
by User), the Column object marked by id has been moved into a new Table, and User.id has been
replaced by a new attribute specific to the mapping.
It was always intended that SQLAlchemy would have a shorthand, declarative form of configura-
tion. However, the creation of Declarative was delayed in favor of continued work solidifying the
mechanics of classical mapping. An interim extension called ActiveMapper, which later became the
Elixir project, existed early on. It redefines mapping constructs in a higher-level declaration system.
Declarative's goal was to reverse the direction of Elixir's heavily abstracted approach by establishing
a system that preserved SQLAlchemy classical mapping concepts almost exactly, only reorganizing
how they are used to be less verbose and more amenable to class-level extensions than a classical
mapping would be.
Whether classical or declarative mapping is used, a mapped class takes on new behaviors that
allow it to express SQL constructs in terms of its attributes. SQLAlchemy originally followed
SQLObject's behavior of using a special attribute as the source of SQL column expressions, referred
to by SQLAlchemy as .c, as in this example:
result = session.query(User).filter(User.c.username == 'ed').all()
In version 0.4, however, SQLAlchemy moved the functionality into the mapped attributes them-
selves:
result = session.query(User).filter(User.username == 'ed').all()
This change in attribute access proved to be a great improvement, as it allowed the column-
like objects present on the class to gain additional class-specific capabilities not present on those
originating directly from the underlying Table object. It also allowed usage integration between
different kinds of class attributes, such as attributes which refer to table columns directly, attributes
that refer to SQL expressions derived from those columns, and attributes that refer to a related class.
Finally, it provided a symmetry between a mapped class and an instance of that mapped class, in
that the same attribute could take on different behavior depending on the type of parent. Class-bound
attributes return SQL expressions while instance-bound attributes return actual data.
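A small sketch of that symmetry, assuming the mapped User class and an active session from the surrounding examples; the rendered string in the comment is approximate:

# class-bound access: produces a SQL expression construct
criterion = (User.username == 'ed')
print criterion              # roughly: "user".username = :username_1

# instance-bound access: returns the actual data for that row
some_user = session.query(User).filter(criterion).first()
print some_user.username     # 'ed'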
Anatomy of a Mapping
The id attribute that's been attached to our User class is a type of object known in Python as a
descriptor, an object that has __get__, __set__, and __del__ methods, which the Python runtime
defers to for all class and instance operations involving this attribute. SQLAlchemy's implementation
is known as an InstrumentedAttribute, and we'll illustrate the world behind this facade with
another example. Starting with a Table and a user-defined class, we set up a mapping that has just
one mapped column, as well as a relationship, which defines a reference to a related class:
user_table = Table("user", metadata,
    Column('id', Integer, primary_key=True),
)

class User(object):
    pass

mapper(User, user_table, properties={
    'related': relationship(Address)
})
When the mapping is complete, the structure of objects related to the class is detailed in Fig-
ure 20.11.
Figure 20.11: Anatomy of a mapping
The figure illustrates a SQLAlchemy mapping defined as two separate layers of interaction
between the user-defined class and the table metadata to which it is mapped. Class instrumentation
is pictured towards the left, while SQL and database functionality is pictured towards the right.
The general pattern at play is that object composition is used to isolate behavioral roles, and object
inheritance is used to distinguish amongst behavioral variances within a particular role.
Within the realm of class instrumentation, the ClassManager is linked to the mapped class,
while its collection of InstrumentedAttribute objects are linked to each attribute mapped on the
class. InstrumentedAttribute is also the public-facing Python descriptor mentioned previously,
and produces SQL expressions when used in a class-based expression (e.g., User.id==5). When
dealing with an instance of User, InstrumentedAttribute delegates the behavior of the attribute
to an AttributeImpl object, which is one of several varieties tailored towards the type of data being
represented.
Towards the mapping side, the Mapper represents the linkage of a user-defined class and a
selectable unit, most typically Table. Mapper maintains a collection of per-attribute objects known
as MapperProperty, which deals with the SQL representation of a particular attribute. The most
common variants of MapperProperty are ColumnProperty, representing a mapped column or SQL
expression, and RelationshipProperty, representing a linkage to another mapper.
MapperProperty delegates attribute loading behavior, including how the attribute renders in a
SQL statement and how it is populated from a result row, to a LoaderStrategy object, of which
there are several varieties. Different LoaderStrategies determine if the loading behavior of an
attribute is deferred, eager, or immediate. A default version is chosen at mapper configuration
time, with the option to use an alternate strategy at query time. RelationshipProperty also
references a DependencyProcessor, which handles how inter-mapper dependencies and attribute
synchronization should proceed at flush time. The choice of DependencyProcessor is based on the
relational geometry of the parent and target selectables linked to the relationship.
The Mapper/RelationshipProperty structure forms a graph, where Mapper objects are nodes
and RelationshipProperty objects are directed edges. Once the full set of mappers have been
declared by an application, a deferred initialization step known as the configuration proceeds. It is
used mainly by each RelationshipProperty to solidify the details between its parent and target
mappers, including choice of AttributeImpl as well as DependencyProcessor. This graph is a
key data structure used throughout the operation of the ORM. It participates in operations such as
the so-called cascade behavior that defines how operations should propagate along object paths, in
query operations where related objects and collections are eagerly loaded at once, as well as on
the object flushing side where a dependency graph of all objects is established before firing off a
series of persistence steps.
20.7 Query and Loading Behavior
SQLAlchemy initiates all object loading behavior via an object called Query. The basic state Query
starts with includes the entities, which is the list of mapped classes and/or individual SQL expressions
to be queried. It also has a reference to the Session, which represents connectivity to one or more
databases, as well as a cache of data that's been accumulated with respect to transactions on those
connections. Below is a rudimentary usage example:
from sqlalchemy.orm import Session
session = Session(engine)
query = session.query(User)
We create a Query that will yield instances of User, relative to a new Session we've created.
Query provides a generative builder pattern in the same way as the select construct discussed
previously, where additional criteria and modifiers are associated with a statement construct one
method call at a time. When an iterative operation is called on the Query, it constructs a SQL
expression construct representing a SELECT, emits it to the database, and then interprets the result
set rows as ORM-oriented results corresponding to the initial set of entities being requested.
Query makes a hard distinction between the SQL rendering and the data loading portions of the
operation. The former refers to the construction of a SELECT statement, the latter to the interpretation
of SQL result rows into ORM-mapped constructs. Data loading can, in fact, proceed without a SQL
rendering step, as the Query may be asked to interpret results from a textual query hand-composed
by the user.
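A short sketch of both modes, assuming the mapped User class from the earlier examples:

# generative construction: each call returns a new Query with more criteria;
# SQL rendering and data loading both happen when .all() is invoked
users = (session.query(User).
             filter(User.username == 'ed').
             order_by(User.id).
             all())

# data loading without SQL rendering: a hand-written statement is emitted
# as-is, and Query only interprets the result rows as User objects
users = (session.query(User).
             from_statement("SELECT * FROM user WHERE username = :name").
             params(name='ed').
             all())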
Both SQL rendering and data loading utilize a recursive descent through the graph formed by the
series of lead Mapper objects, considering each column- or SQL-expression-holding ColumnProperty
as a leaf node and each RelationshipProperty which is to be included in the query via a so-called
eager-load as an edge leading to another Mapper node. The traversal and action to take at each
node is ultimately the job of each LoaderStrategy associated with every MapperProperty, adding
columns and joins to the SELECT statement being built in the SQL rendering phase, and producing
Python functions that process result rows in the data loading phase.
The Python functions produced in the data loading phase each receive a database row as they
are fetched, and produce a possible change in the state of a mapped attribute in memory as a result.
They are produced for a particular attribute conditionally, based on examination of the first incoming
row in the result set, as well as on loading options. If a load of the attribute is not to proceed, no
callable function is produced.
Figure 20.12 illustrates the traversal of several LoaderStrategy objects in a joined eager load-
ing scenario, illustrating their connection to a rendered SQL statement which occurs during the
_compile_context method of Query. It also shows generation of row population functions which
receive result rows and populate individual object attributes, a process which occurs within the
instances method of Query.
Figure 20.12: Traversal of loader strategies including a joined eager load
SQLAlchemy's early approach to populating results used a traditional traversal of fixed object
methods associated with each strategy to receive each row and act accordingly. The loader callable
system, first introduced in version 0.5, represented a dramatic leap in performance, as many decisions
regarding row handling could be made just once up front instead of for each row, and a significant
number of function calls with no net effect could be eliminated.
20.8 Session/Identity Map
In SQLAlchemy, the Session object presents the public interface for the actual usage of the ORM;
that is, loading and persisting data. It provides the starting point for queries and persistence operations
for a given database connection.
The Session, in addition to serving as the gateway for database connectivity, maintains an active
reference to the set of all mapped entities which are present in memory relative to that Session. It's
in this way that the Session implements a facade for the identity map and unit of work patterns,
both identified by Fowler. The identity map maintains a database-identity-unique mapping of all
objects for a particular Session, eliminating the problems introduced by duplicate identities. The
unit of work builds on the identity map to provide a system of automating the process of persisting
all changes in state to the database in the most effective manner possible. The actual persistence step
is known as a flush, and in modern SQLAlchemy this step is usually automatic.
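Both patterns can be seen in a few lines, again assuming the mapped User class and the engine from earlier examples:

session = Session(engine)

# identity map: the same primary key yields the same in-memory object
u1 = session.query(User).get(1)
u2 = session.query(User).get(1)
assert u1 is u2

# unit of work: changes accumulate and are flushed as a batch
u1.username = 'edward'
newcomer = User()
newcomer.username = 'jack'
session.add(newcomer)
session.commit()    # the UPDATE and the INSERT are emitted during the flush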
Development History
The Session started out as a mostly concealed system responsible for the single task of emitting a
flush. The flush process involves emitting SQL statements to the database, corresponding to changes
in the state of objects tracked by the unit of work system and thereby synchronizing the current
state of the database with what's in memory. The flush has always been one of the most complex
operations performed by SQLAlchemy.
The invocation of flush started out in very early versions behind a method called commit, and
it was a method present on an implicit, thread-local object called objectstore. When one used
SQLAlchemy 0.1, there was no need to call Session.add, nor was there any concept of an explicit
Session at all. The only user-facing steps were to create mappers, create new objects, modify
existing objects loaded through queries (where the queries themselves were invoked directly from
each Mapper object), and then persist all changes via the objectstore.commit command. The pool
of objects for a set of operations was unconditionally module-global and unconditionally thread-local.
The objectstore.commit model was an immediate hit with the first group of users, but the
rigidity of this model quickly ran into a wall. Users new to modern SQLAlchemy sometimes lament
the need to define a factory, and possibly a registry, for Session objects, as well as the need to keep
their objects organized into just one Session at a time, but this is far preferable to the early days
when the entire system was completely implicit. The convenience of the 0.1 usage pattern is still
largely present in modern SQLAlchemy, which features a session registry normally configured to
use thread local scoping.
The Session itself was only introduced in version 0.2 of SQLAlchemy, modeled loosely after
the Session object present in Hibernate. This version featured integrated transactional control,
where the Session could be placed into a transaction via the begin method, and completed via
the commit method. The objectstore.commit method was renamed to objectstore.flush, and
new Session objects could be created at any time. The Session itself was broken o from another
object called UnitOfWork, which remains as a private object responsible for executing the actual
flush operation.
While the flush process started as a method explicitly invoked by the user, the 0.4 series of
SQLAlchemy introduced the concept of autoflush, which meant that a flush was emitted immediately
before each query. The advantage of autoflush is that the SQL statement emitted by a query always
has access on the relational side to the exact state that is present in memory, as all changes have been
sent over. Early versions of SQLAlchemy couldn't include this feature, because the most common
pattern of usage was that the flush statement would also commit the changes permanently. But when
autoflush was introduced, it was accompanied by another feature called the transactional Session,
which provided a Session that would start out automatically in a transaction that remained until
the user called commit explicitly. With the introduction of this feature, the flush method no longer
committed the data that it flushed, and could safely be called on an automated basis. The Session
could now provide a step-by-step synchronization between in-memory state and SQL query state by
flushing as needed, with nothing permanently persisted until the explicit commit step. This behavior
is, in fact, exactly the same in Hibernate for Java. However, SQLAlchemy embraced this style of
usage based on the same behavior in the Storm ORM for Python, introduced when SQLAlchemy
was in version 0.3.
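The effect of autoflush can be seen in a sketch like the following, assuming a mapped User class and an existing row whose username is 'jack':

user = session.query(User).filter_by(username='jack').one()
user.username = 'ed'

# autoflush emits the pending UPDATE before this SELECT runs, so the
# database-side filter already sees 'ed'; nothing has been committed yet
assert session.query(User).filter_by(username='ed').count() == 1

session.rollback()   # the flushed-but-uncommitted change is discarded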
Version 0.5 brought more transaction integration when post-transaction expiration was intro-
duced; after each commit or rollback, by default all states within the Session are expired (erased),
to be populated again when subsequent SQL statements re-select the data, or when the attributes on
the remaining set of expired objects are accessed in the context of the new transaction. Originally,
SQLAlchemy was constructed around the assumption that SELECT statements should be emitted as
little as possible, unconditionally. The expire-on-commit behavior was slow in coming for this reason;
however, it entirely solved the issue of the Session which contained stale data post-transaction with
no simple way to load newer data without rebuilding the full set of objects already loaded. Early on,
it seemed that this problem couldn't be reasonably solved, as it wasn't apparent when the Session
should consider the current state to be stale, and thus produce an expensive new set of SELECT
statements on the next access. However, once the Session moved to an always-in-a-transaction
model, the point of transaction end became apparent as the natural point of data expiration, as
the nature of a transaction with a high degree of isolation is that it cannot see new data until it's
committed or rolled back anyway. Different databases and configurations, of course, have varied
degrees of transaction isolation, including no transactions at all. These modes of usage are entirely
acceptable with SQLAlchemy's expiration model; the developer only needs to be aware that a lower
isolation level may expose un-isolated changes within a Session if multiple Sessions share the same
rows. This is not at all different from what can occur when using two database connections directly.
Session Overview
Figure 20.13 illustrates a Session and the primary structures it deals with.
The public-facing portions above are the Session itself and the collection of user objects, each
of which is an instance of a mapped class. Here we see that mapped objects keep a reference to a
SQLAlchemy construct called InstanceState, which tracks ORM state for an individual instance
including pending attribute changes and attribute expiration status. InstanceState is the instance-
level side of the attribute instrumentation discussed in the preceding section, Anatomy of a Mapping,
corresponding to the ClassManager at the class level, and maintaining the state of the mapped
object's dictionary (i.e., the Python __dict__ attribute) on behalf of the AttributeImpl objects
associated with the class.
Figure 20.13: Session overview
State Tracking
The IdentityMap is a mapping of database identities to InstanceState objects, for those objects
which have a database identity, which are referred to as persistent. The default implementation
of IdentityMap works with InstanceState to self-manage its size by removing user-mapped
instances once all strong references to them have been removed; in this way it works in the same
way as Python's WeakValueDictionary. The Session protects the set of all objects marked as
dirty or deleted, as well as pending objects marked new, from garbage collection, by creating strong
references to those objects with pending changes. All strong references are then discarded after the
flush.
InstanceState also performs the critical task of maintaining what's changed for the attributes
of a particular object, using a move-on-change system that stores the previous value of a particular
attribute in a dictionary called committed_state before assigning the incoming value to the object's
current dictionary. At flush time, the contents of committed_state and the __dict__ associated
with the object are compared to produce the set of net changes on each object.
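That bookkeeping can be observed through the attribute history API; a sketch, with the attribute name and the previously committed value assumed:

from sqlalchemy.orm.attributes import get_history

user = session.query(User).get(1)
user.username = 'edward'

# the history is computed from committed_state versus the current __dict__
history = get_history(user, 'username')
print history.added      # ['edward']
print history.deleted    # the previously committed value, e.g. ['ed']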
In the case of collections, a separate collections package coordinates with the
InstrumentedAttribute/InstanceState system to maintain a collection of net changes to a par-
ticular mapped collection of objects. Common Python classes such as set, list and dict are
subclassed before use and augmented with history-tracking mutator methods. The collection system
was reworked in 0.4 to be open ended and usable for any collection-like object.
Transactional Control
Session, in its default state of usage, maintains an open transaction for all operations which is
completed when commit or rollback is called. The SessionTransaction maintains a set of
zero or more Connection objects, each representing an open transaction on a particular database.
SessionTransaction is a lazy-initializing object that begins with no database state present. As a
particular backend is required to participate in a statement execution, a Connection corresponding
to that database is added to SessionTransactions list of connections. While a single connection
at a time is common, the multiple connection scenario is supported where the specic connection
used for a particular operation is determined based on configurations associated with the Table,
Mapper, or SQL construct itself involved in the operation. Multiple connections can also coordinate
the transaction using two-phase behavior, for those DBAPIs which provide it.
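The usual shape of that transactional conversation, again assuming the mapped User class from earlier, looks like this:

session = Session(engine)
try:
    someone = User()
    someone.username = 'wendy'
    session.add(someone)
    session.flush()      # SQL is emitted on the transaction held by SessionTransaction
    session.commit()     # every participating Connection's transaction is committed
except Exception:
    session.rollback()   # all of them are rolled back and the connections released
    raise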
20.9 Unit of Work
The flush method provided by Session turns over its work to a separate module called unitofwork.
As mentioned earlier, the flush process is probably the most complex function of SQLAlchemy.
The job of the unit of work is to move all of the pending state present in a particular Session out
to the database, emptying out the new, dirty, and deleted collections maintained by the Session.
Once completed, the in-memory state of the Session and what's present in the current transaction
match. The primary challenge is to determine the correct series of persistence steps, and then to
perform them in the correct order. This includes determining the list of INSERT, UPDATE, and
DELETE statements, including those resulting from the cascade of a related row being deleted or
otherwise moved; ensuring that UPDATE statements contain only those columns which were actually
modified; establishing synchronization operations that will copy the state of primary key columns
over to referencing foreign key columns, at the point at which newly generated primary key identifiers
are available; ensuring that INSERTs occur in the order in which objects were added to the Session
and as efficiently as possible; and ensuring that UPDATE and DELETE statements occur within a
deterministic ordering so as to reduce the chance of deadlocks.
History
The unit of work implementation began as a tangled system of structures that was written in an ad hoc
way; its development can be compared to finding the way out of a forest without a map. Early bugs
and missing behaviors were solved with bolted-on fixes, and while several refactorings improved
matters through version 0.5, it was not until version 0.6 that the unit of work (by that time stable,
well-understood, and covered by hundreds of tests) could be rewritten entirely from scratch. After
many weeks of considering a new approach that would be driven by consistent data structures, the
process of rewriting it to use this new model took only a few days, as the idea was by this time well
understood. It was also greatly helped by the fact that the new implementation's behavior could be
carefully cross-checked against the existing version. This process shows how the first iteration of
something, however awful, is still valuable as long as it provides a working model. It further shows
how a total rewrite of a subsystem is often not only appropriate, but an integral part of development
for hard-to-develop systems.
Topological Sort
The key paradigm behind the unit of work is that of assembling the full list of actions to be taken into
a data structure, with each node representing a single step; this is known in design patterns parlance
as the command pattern. The series of commands within this structure is then organized into a
specic ordering using a topological sort. A topological sort is a process that sorts items based on
a partial ordering, that is, only certain elements must precede others. Figure 20.14 illustrates the
behavior of the topological sort.
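The sort itself is a standard algorithm; a generic sketch (not SQLAlchemy's internal implementation), using the partial ordering shown in Figure 20.14:

from collections import defaultdict, deque

def topological_sort(nodes, edges):
    """Order nodes so that for every (a, b) in edges, a precedes b."""
    indegree = dict((node, 0) for node in nodes)
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # start with the nodes that nothing must precede
    ready = deque(node for node in nodes if indegree[node] == 0)
    ordered = []
    while ready:
        node = ready.popleft()
        ordered.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(ordered) != len(nodes):
        raise ValueError("a cycle is present; no total ordering exists")
    return ordered

# "A" before "C", "B" before "C", "A" before "D", as in Figure 20.14
print topological_sort(
    ['A', 'B', 'C', 'D'],
    [('A', 'C'), ('B', 'C'), ('A', 'D')])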
The unit of work constructs a partial ordering based on those persistence commands which must
precede others. The commands are then topologically sorted and invoked in order. The determination
of which commands precede which is derived primarily from the presence of a relationship that
Figure 20.14: Topological sort
bridges two Mapper objects; generally, one Mapper is considered to be dependent on the other, as
the relationship implies that one Mapper has a foreign key dependency on the other. Similar rules
exist for many-to-many association tables, but here we focus on the case of one-to-many/many-to-one
relationships. Foreign key dependencies are resolved in order to prevent constraint violations from
occurring, with no reliance on needing to mark constraints as deferred. But just as importantly,
the ordering allows primary key identifiers, which on many platforms are only generated when an
INSERT actually occurs, to be populated from a just-executed INSERT statement's result into the
parameter list of a dependent row that's about to be inserted. For deletes, the same ordering is used
in reverse; dependent rows are deleted before those on which they depend, as these rows cannot be
present without the referent of their foreign key being present.
The unit of work features a system where the topological sort is performed at two different levels,
based on the structure of dependencies present. The first level organizes persistence steps into buckets
based on the dependencies between mappers, that is, full buckets of objects corresponding to a
particular class. The second level breaks up zero or more of these buckets into smaller batches,
to handle the case of reference cycles or self-referring tables. Figure 20.15 illustrates the buckets
generated to insert a set of User objects, then a set of Address objects, where an intermediary
step copies newly generated User primary key values into the user_id foreign key column of each
Address object.
In the per-mapper sorting situation, any number of User and Address objects can be flushed
with no impact on the complexity of steps or how many dependencies must be considered.
The second level of sorting organizes persistence steps based on direct dependencies between
individual objects within the scope of a single mapper. The simplest example of when this occurs
is a table which contains a foreign key constraint to itself; a particular row in the table needs to be
inserted before another row in the same table which refers to it. Another is when a series of tables
have a reference cycle: table A references table B, which references table C, which then references
table A. Some A objects must be inserted before others so as to allow the B and C objects to also be
Figure 20.15: Organizing objects by mapper
inserted. A table that refers to itself is a special case of a reference cycle.
To determine which operations can remain in their aggregated, per-Mapper buckets, and which
will be broken into a larger set of per-object commands, a cycle detection algorithm is applied to
the set of dependencies that exist between mappers, using a modified version of a cycle detection
algorithm found on Guido van Rossum's blog (http://neopythonic.blogspot.com/29/1/detecting-cycles-in-directed-graph.html). Those buckets involved in cycles are then broken
up into per-object operations and mixed into the collection of per-mapper buckets through the addition
of new dependency rules from the per-object buckets back to the per-mapper buckets. Figure 20.16
illustrates the bucket of User objects being broken up into individual per-object commands, resulting
from the addition of a new relationship from User to itself called contact.
The rationale behind the bucket structure is that it allows batching of common statements as
much as possible, both reducing the number of steps required in Python and making possible more
efficient interactions with the DBAPI, which can sometimes execute thousands of statements within
a single Python method call. Only when a reference cycle exists between mappers does the more
expensive per-object-dependency pattern kick in, and even then it only occurs for those portions of
the object graph which require it.
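The details of that algorithm are beyond this chapter, but the core idea can be sketched generically (this is not the routine SQLAlchemy ships): a node participates in a cycle exactly when it can reach itself by following dependency edges, so a simple, if inefficient, reachability check is enough to decide which buckets must be broken apart:

def nodes_in_cycles(edges):
    """edges: (before, after) pairs between mappers; returns the set of
    nodes that lie on at least one cycle."""
    graph = {}
    for before, after in edges:
        graph.setdefault(before, set()).add(after)
        graph.setdefault(after, set())

    def reaches(start, target):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return set(node for node in graph if reaches(node, node))

# A -> B -> C -> A form a cycle; D only depends on A and stays batched.
print nodes_in_cycles([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")])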
Figure 20.16: Organizing reference cycles into individual steps
20.10 Conclusion
SQLAlchemy has aimed very high since its inception, with the goal of being the most feature-rich and
versatile database product possible. It has done so while maintaining its focus on relational databases,
recognizing that supporting the usefulness of relational databases in a deep and comprehensive way
is a major undertaking; and even now, the scope of the undertaking continues to reveal itself as larger
than previously perceived.
The component-based approach is intended to extract the most value possible from each area of
functionality, providing many different units that applications can use alone or in combination. This
system has been challenging to create, maintain, and deliver.
The development course was intended to be slow, based on the theory that a methodical, broad-
based construction of solid functionality is ultimately more valuable than fast delivery of features
without foundation. It has taken a long time for SQLAlchemy to construct a consistent and well-
documented user story, but throughout the process, the underlying architecture was always a step
ahead, leading in some cases to the time machine effect where features can be added almost before
users request them.
The Python language has been a reliable host (if a little finicky, particularly in the area of
performance). The language's consistency and tremendously open run-time model have allowed
SQLAlchemy to provide a nicer experience than that offered by similar products written in other
languages.
It is the hope of the SQLAlchemy project that Python gain ever-deeper acceptance into as wide a
variety of fields and industries as possible, and that the use of relational databases remains vibrant
and progressive. The goal of SQLAlchemy is to demonstrate that relational databases, Python, and
well-considered object models are all very much worthwhile development tools.
[chapter21]
Twisted
Jessica McKellar
Twisted is an event-driven networking engine in Python. It was born in the early 2000s, when the
writers of networked games had few scalable and no cross-platform libraries, in any language, at
their disposal. The authors of Twisted tried to develop games in the existing networking landscape,
struggled, saw a clear need for a scalable, event-driven, cross-platform networking framework and
decided to make one happen, learning from the mistakes and hardships of past game and networked
application writers.
Twisted supports many common transport and application layer protocols, including TCP, UDP,
SSL/TLS, HTTP, IMAP, SSH, IRC, and FTP. Like the language in which it is written, it is batteries-
included; Twisted comes with client and server implementations for all of its protocols, as well as
utilities that make it easy to configure and deploy production-grade Twisted applications from the
command line.
21.1 Why Twisted?
In 2000, glyph, the creator of Twisted, was working on a text-based multiplayer game called Twisted
Reality. It was a big mess of threads, 3 per connection, in Java. There was a thread for input that
would block on reads, a thread for output that would block on some kind of write, and a logic thread
that would sleep while waiting for timers to expire or events to queue. As players moved through the
virtual landscape and interacted, threads were deadlocking, caches were getting corrupted, and the
locking logic was never quite right; the use of threads had made the software complicated, buggy,
and hard to scale.
Seeking alternatives, he discovered Python, and in particular Python's select module for
multiplexing I/O from stream objects like sockets and pipes (the select API is described in the Single
UNIX Specification, Version 3, SUSv3); at the time, Java didn't expose the operating system's select
interface or any other asynchronous I/O API (the java.nio package for non-blocking I/O was added
in J2SE 1.4, released in 2002). A quick prototype of the game in Python using select immediately
proved less complex and more reliable than the threaded version.
An instant convert to Python, select, and event-driven programming, glyph wrote a client and
server for the game in Python using the select API. But then he wanted to do more. Fundamentally,
he wanted to be able to turn network activity into method calls on objects in the game. What if you
could receive email in the game, like the Nethack mailer daemon? What if every player in the game
had a home page? Glyph found himself needing good Python IMAP and HTTP clients and servers
that used select.
He first turned to Medusa, a platform developed in the mid-90s for writing networking servers
in Python based on the asyncore module (see http://www.nightmare.com/medusa/). asyncore is an
asynchronous socket handler that builds a dispatcher and callback interface on top of the operating
system's select API.
This was an inspiring find for glyph, but Medusa had two drawbacks:
1. It was on its way towards being unmaintained by 2001, when glyph started working on Twisted
Reality.
2. asyncore is such a thin wrapper around sockets that application programmers are still re-
quired to manipulate sockets directly. This means portability is still the responsibility of the
programmer. Additionally, at the time, asyncore's Windows support was buggy, and glyph
knew that he wanted to run a GUI client on Windows.
Glyph was facing the prospect of implementing a networking platform himself and realized that
Twisted Reality had opened the door to a problem that was just as interesting as his game.
Over time, Twisted Reality the game became Twisted the networking platform, which would do
what existing networking platforms in Python didn't:
• Use event-driven programming instead of multi-threaded programming.
• Be cross-platform: provide a uniform interface to the event notification systems exposed by
major operating systems.
• Be batteries-included: provide implementations of popular application-layer protocols out
of the box, so that Twisted is immediately useful to developers.
• Conform to RFCs, and prove conformance with a robust test suite.
• Make it easy to use multiple networking protocols together.
• Be extensible.
21.2 The Architecture of Twisted
Twisted is an event-driven networking engine. Event-driven programming is so integral to Twisted's
design philosophy that it is worth taking a moment to review what exactly event-driven programming
means.
Event-driven programming is a programming paradigm in which program flow is determined by
external events. It is characterized by an event loop and the use of callbacks to trigger actions when
events happen. Two other common programming paradigms are (single-threaded) synchronous and
multi-threaded programming.
Let's compare and contrast single-threaded, multi-threaded, and event-driven programming
models with an example. Figure 21.1 shows the work done by a program over time under these three
models. The program has three tasks to complete, each of which blocks while waiting for I/O to
finish. Time spent blocking on I/O is greyed out.
In the single-threaded synchronous version of the program, tasks are performed serially. If
one task blocks for a while on I/O, all of the other tasks have to wait until it finishes and they are
executed in turn. This definite order and serial processing are easy to reason about, but the program
is unnecessarily slow if the tasks don't depend on each other, yet still have to wait for each other.
In the threaded version of the program, the three tasks that block while doing work are performed
in separate threads of control. These threads are managed by the operating system and may run
Figure 21.1: Threading models
concurrently on multiple processors or interleaved on a single processor. This allows progress to
be made by some threads while others are blocking on resources. This is often more time-efficient
than the analogous synchronous program, but one has to write code to protect shared resources that
could be accessed concurrently from multiple threads. Multi-threaded programs can be harder to
reason about because one now has to worry about thread safety via process serialization (locking),
reentrancy, thread-local storage, or other mechanisms, which when implemented improperly can
lead to subtle and painful bugs.
The event-driven version of the program interleaves the execution of the three tasks, but in a
single thread of control. When performing I/O or other expensive operations, a callback is registered
with an event loop, and then execution continues while the I/O completes. The callback describes
how to handle an event once it has completed. The event loop polls for events and dispatches them
as they arrive, to the callbacks that are waiting for them. This allows the program to make progress
when it can without the use of additional threads. Event-driven programs can be easier to reason
about than multi-threaded programs because the programmer doesn't have to worry about thread
safety.
The event-driven model is often a good choice when there are:
1. many tasks, that are...
2. largely independent (so they don't have to communicate with or wait on each other), and...
3. some of these tasks block while waiting on events.
It is also a good choice when an application has to share mutable data between tasks, because no
synchronization has to be performed.
Networking applications often have exactly these properties, which is what makes them such a
good fit for the event-driven programming model.
Reusing Existing Applications
Many popular clients and servers for various networking protocols already existed when Twisted was
created. Why did glyph not just use Apache, IRCd, BIND, OpenSSH, or any of the other pre-existing
applications whose clients and servers would have to get re-implemented from scratch for Twisted?
The problem is that all of these server implementations have networking code written from scratch,
typically in C, with application code coupled directly to the networking layer. This makes them very
difficult to use as libraries. They have to be treated as black boxes when used together, giving a
developer no chance to reuse code if he or she wanted to expose the same data over multiple protocols.
Additionally, the server and client implementations are often separate applications that don't share
code. Extending these applications and maintaining cross-platform client-server compatibility is
harder than it needs to be.
With Twisted, the clients and servers are written in Python using a consistent interface. This
makes it easy to write new clients and servers, to share code between clients and servers, to share
application logic between protocols, and to test one's code.
The Reactor
Twisted implements the reactor design pattern, which describes demultiplexing and dispatching
events from multiple sources to their handlers in a single-threaded environment.
The core of Twisted is the reactor event loop. The reactor knows about network, file system, and
timer events. It waits on and then handles these events, abstracting away platform-specific behavior
and presenting interfaces to make responding to events anywhere in the network stack easy.
The reactor essentially accomplishes:
while True:
    timeout = time_until_next_timed_event()
    events = wait_for_events(timeout)
    events += timed_events_until(now())
    for event in events:
        event.process()
A reactor based on the poll API (described in the Single UNIX Specification, Version 3, SUSv3)
is the current default on all platforms. Twisted additionally supports a number of platform-specific
high-volume multiplexing APIs. Platform-specific reactors include the KQueue reactor based on
FreeBSD's kqueue mechanism, an epoll-based reactor for systems supporting the epoll interface
(currently Linux 2.6), and an IOCP reactor based on Windows Input/Output Completion Ports.
Examples of polling implementation-dependent details that Twisted takes care of include:
• Network and filesystem limits.
• Buffering behavior.
• How to detect a dropped connection.
• The values returned in error cases.
Twisted's reactor implementation also takes care of using the underlying non-blocking APIs
correctly and handling obscure edge cases. Python doesn't expose the IOCP API at all, so
Twisted maintains its own implementation.
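As a minimal illustration of the reactor's timer events (this sketch touches no network code at all), one can schedule calls with callLater and let the loop run until something stops it:

from twisted.internet import reactor

def announce(message):
    print message

# Two timed events: the first prints, the second stops the reactor,
# which makes reactor.run() return.
reactor.callLater(1, announce, "one second has passed")
reactor.callLater(2, reactor.stop)
reactor.run()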
Managing Callback Chains
Callbacks are a fundamental part of event-driven programming and are the way that the reactor
indicates to an application that events have completed. As event-driven programs grow, handling
both the success and error cases for the events in one's application becomes increasingly complex.
Failing to register an appropriate callback can leave a program blocking on event processing that
will never happen, and errors might have to propagate up a chain of callbacks from the networking
stack through the layers of an application.
Let's examine some of the pitfalls of event-driven programs by comparing synchronous and
asynchronous versions of a toy URL fetching utility in Python-like pseudo-code:
Synchronous URL fetcher:

import getPage

def processPage(page):
    print page

def logError(error):
    print error

def finishProcessing(value=None):
    print "Shutting down..."
    exit()

url = "http://google.com"
try:
    page = getPage(url)
    processPage(page)
except Error, e:
    logError(e)
finally:
    finishProcessing()
Asynchronous URL fetcher:

from twisted.internet import reactor
import getPage

def processPage(page):
    print page
    finishProcessing()

def logError(error):
    print error
    finishProcessing()

def finishProcessing(value=None):
    print "Shutting down..."
    reactor.stop()

url = "http://google.com"
# getPage takes: url,
# success callback, error callback
getPage(url, processPage, logError)

reactor.run()
In the asynchronous URL fetcher, reactor.run() starts the reactor event loop. In both the syn-
chronous and asynchronous versions, a hypothetical getPage function does the work of page retrieval.
processPage is invoked if the retrieval is successful, and logError is invoked if an Exception is
raised while attempting to retrieve the page. In either case, finishProcessing is called afterwards.
The callback to logError in the asynchronous version mirrors the except part of the try/except
block in the synchronous version. The callback to processPage mirrors else, and the unconditional
callback to finishProcessing mirrors finally.
In the synchronous version, by virtue of the structure of a try/except block, exactly one of
logError and processPage is called, and finishProcessing is always called once; in the asyn-
chronous version it is the programmer's responsibility to invoke the correct chain of success and
error callbacks. If, through programming error, the call to finishProcessing were left out of
processPage or logError along their respective callback chains, the reactor would never get
stopped and the program would run forever.
This toy example hints at the complexity frustrating programmers during the first few years
of Twisted's development. Twisted responded to this complexity by growing an object called a
Deferred.
Deferreds
The Deferred object is an abstraction of the idea of a result that doesn't exist yet. It also helps
manage the callback chains for this result. When returned by a function, a Deferred is a promise
that the function will have a result at some point. That single returned Deferred contains references
to all of the callbacks registered for an event, so only this one object needs to be passed between
functions, which is much simpler to keep track of than managing callbacks individually.
Deferreds have a pair of callback chains, one for success (callbacks) and one for errors (errbacks).
Deferreds start out with two empty chains. One adds pairs of callbacks and errbacks to handle
successes and failures at each point in the event processing. When an asynchronous result arrives,
the Deferred is fired and the appropriate callbacks or errbacks are invoked in the order in which
they were added.
Here is a version of the asynchronous URL fetcher pseudo-code which uses Deferreds:
from twisted.internet import reactor
import getPage

def processPage(page):
    print page

def logError(error):
    print error

def finishProcessing(value=None):
    print "Shutting down..."
    reactor.stop()

url = "http://google.com"
deferred = getPage(url)  # getPage returns a Deferred
deferred.addCallbacks(processPage, logError)
deferred.addBoth(finishProcessing)

reactor.run()
In this version, the same event handlers are invoked, but they are all registered with a single
Deferred object instead of spread out in the code and passed as arguments to getPage.
The Deferred is created with two stages of callbacks. First, addCallbacks adds the processPage
callback and logError errback to the first stage of their respective chains. Then addBoth adds
finishProcessing to the second stage of both chains. Diagrammatically, the callback chains look
like Figure 21.2.
Figure 21.2: Callback chains
Deferreds can only be fired once; attempting to re-fire them will raise an Exception. This gives
Deferreds semantics closer to those of the try/except blocks of their synchronous cousins, which
makes processing the asynchronous events easier to reason about and avoids subtle bugs caused by
callbacks being invoked more or less than once for a single event.
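A small sketch of these semantics, using twisted.internet.defer directly (the handler names here are invented for the example):

from twisted.internet import defer

def onResult(result):
    print "success:", result
    return result   # whatever is returned flows on to the next callback

def onFailure(failure):
    print "failure:", failure.getErrorMessage()

d = defer.Deferred()
d.addCallbacks(onResult, onFailure)
d.callback("page contents")          # fires the success chain

try:
    d.callback("page contents, again")
except defer.AlreadyCalledError:
    print "a Deferred can only be fired once"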
Understanding Deferreds is an important part of understanding the flow of Twisted programs.
However, when using the high-level abstractions Twisted provides for networking protocols, one
often doesn't have to use Deferreds directly at all.
The Deferred abstraction is powerful and has been borrowed by many other event-driven
platforms, including jQuery, Dojo, and Mochikit.
Transports
Transports represent the connection between two endpoints communicating over a network. Trans-
ports are responsible for describing connection details, like being stream- or datagram-oriented, flow
control, and reliability. TCP, UDP, and Unix sockets are examples of transports. They are designed
to be minimally functional units that are maximally reusable and are decoupled from protocol
implementations, allowing for many protocols to utilize the same type of transport. Transports
implement the ITransport interface, which has the following methods:
write Write some data to the physical connection, in sequence, in a non-blocking fashion.
writeSequence Write a list of strings to the physical connection.
loseConnection Write all pending data and then close the connection.
getPeer Get the remote address of this connection.
getHost Get the address of this side of the connection.
Decoupling transports from protocols also makes testing the two layers easier. A mock transport
can simply write data to a string for inspection.
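Twisted ships test helpers for exactly this purpose, but the idea is simple enough to sketch by hand; the class below is a made-up stand-in that implements the ITransport methods against an in-memory string:

class FakeStringTransport(object):
    """A toy transport: everything written is appended to self.data."""
    def __init__(self, host="server", peer="client"):
        self.data = ""
        self.connected = True
        self._host, self._peer = host, peer

    def write(self, data):
        self.data += data

    def writeSequence(self, sequence):
        self.data += "".join(sequence)

    def loseConnection(self):
        self.connected = False

    def getPeer(self):
        return self._peer

    def getHost(self):
        return self._host

transport = FakeStringTransport()
transport.write("hello, ")
transport.writeSequence(["wor", "ld!"])
assert transport.data == "hello, world!"

A protocol under test can be wired to such an object instead of a real socket, and its output examined directly.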
Protocols
Protocols describe how to process network events asynchronously. HTTP, DNS, and IMAP are
examples of application protocols. Protocols implement the IProtocol interface, which has the
following methods:
makeConnection Make a connection to a transport and a server.
connectionMade Called when a connection is made.
dataReceived Called whenever data is received.
connectionLost Called when the connection is shut down.
The relationship between the reactor, protocols, and transports is best illustrated with an example.
Here are complete implementations of an echo server and client, first the server:
from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        # As soon as any data is received, write it back
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()

reactor.listenTCP(8000, EchoFactory())
reactor.run()
And the client:
from twisted.internet import reactor, protocol

class EchoClient(protocol.Protocol):
    def connectionMade(self):
        self.transport.write("hello, world!")

    def dataReceived(self, data):
        print "Server said:", data
        self.transport.loseConnection()

    def connectionLost(self, reason):
        print "connection lost"

class EchoFactory(protocol.ClientFactory):
    def buildProtocol(self, addr):
        return EchoClient()

    def clientConnectionFailed(self, connector, reason):
        print "Connection failed - goodbye!"
        reactor.stop()

    def clientConnectionLost(self, connector, reason):
        print "Connection lost - goodbye!"
        reactor.stop()

reactor.connectTCP("localhost", 8000, EchoFactory())
reactor.run()
Running the server script starts a TCP server listening for connections on port 8000. The server
uses the Echo protocol, and data is written out over a TCP transport. Running the client makes a TCP
connection to the server, echoes the server response, and then terminates the connection and stops
the reactor. Factories are used to produce instances of protocols for both sides of the connection.
The communication is asynchronous on both sides; connectTCP takes care of registering callbacks
with the reactor to get notified when data is available to read from a socket.
Applications
Twisted is an engine for producing scalable, cross-platform network servers and clients. Making it
easy to deploy these applications in a standardized fashion in production environments is an important
part of a platform like this getting wide-scale adoption.
To that end, Twisted developed the Twisted application infrastructure, a re-usable and configurable
way to deploy a Twisted application. It allows a programmer to avoid boilerplate code by hooking an
application into existing tools for customizing the way it is run, including daemonization, logging,
using a custom reactor, profiling code, and more.
The application infrastructure has four main parts: Services, Applications, configuration management
(via TAC files and plugins), and the twistd command-line utility. To illustrate this infrastructure,
we'll turn the echo server from the previous section into an Application.
Service
A Service is anything that can be started and stopped and which adheres to the IService interface.
Twisted comes with service implementations for TCP, FTP, HTTP, SSH, DNS, and many other
protocols. Many Services can register with a single application.
The core of the IService interface is:
startService Start the service. This might include loading configuration data, setting up database
connections, or listening on a port.
stopService Shut down the service. This might include saving state to disk, closing database
connections, or stopping listening on a port.
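For behavior that does not fit one of the stock implementations, one can subclass the generic service.Service base class and override these two methods; the heartbeat name and logging below are invented for illustration:

from twisted.application import service

class HeartbeatService(service.Service):
    """A hypothetical service that only announces its own start and stop."""
    def startService(self):
        service.Service.startService(self)
        print "heartbeat service started"

    def stopService(self):
        print "heartbeat service stopping"
        return service.Service.stopService(self)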
Our echo service uses TCP, so we can use Twisted's default TCPServer implementation of this
IService interface.
Application
An Application is the top-level service that represents the entire Twisted application. Services register
themselves with an Application, and the twistd deployment utility described below searches for
and runs Applications.
We'll create an echo Application with which the echo Service can register.
TAC Files
When managing Twisted applications in a regular Python file, the developer is responsible for
writing code to start and stop the reactor and to configure the application. Under the Twisted
application infrastructure, protocol implementations live in a module, Services using those protocols
are registered in a Twisted Application Configuration (TAC) file, and the reactor and configuration
are managed by an external utility.
To turn our echo server into an echo application, we can follow a simple algorithm:
1. Move the Protocol parts of the echo server into their own module.
2. Inside a TAC file:
(a) Create an echo Application.
(b) Create an instance of the TCPServer Service which will use our EchoFactory, and
register it with the Application.
The code for managing the reactor will be taken care of by twistd, discussed below. The
application code ends up looking like this:
The echo.py file:
from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()
The echo_server.tac file:
from twisted.application import internet, service
from echo import EchoFactory

application = service.Application("echo")
echoService = internet.TCPServer(8000, EchoFactory())
echoService.setServiceParent(application)
twistd
twistd (pronounced twist-dee) is a cross-platform utility for deploying Twisted applications. It
runs TAC files and handles starting and stopping an application. As part of Twisted's batteries-
included approach to network programming, twistd comes with a number of useful configuration
flags, including daemonizing the application, the location of log files, dropping privileges, running
in a chroot, running under a non-default reactor, or even running the application under a profiler.
We can run our echo server Application with:
$ twistd -y echo_server.tac
In this simplest case, twistd starts a daemonized instance of the application, logging to
twistd.log. After starting and stopping the application, the log looks like this:
2011-11-19 22:23:07-0500 [-] Log opened.
2011-11-19 22:23:07-0500 [-] twistd 11.0.0 (/usr/bin/python 2.7.1) starting up.
2011-11-19 22:23:07-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-11-19 22:23:07-0500 [-] echo.EchoFactory starting on 8000
2011-11-19 22:23:07-0500 [-] Starting factory <echo.EchoFactory instance at 0x12d867>
2011-11-19 22:23:20-0500 [-] Received SIGTERM, shutting down.
2011-11-19 22:23:20-0500 [-] (TCP Port 8000 Closed)
2011-11-19 22:23:20-0500 [-] Stopping factory <echo.EchoFactory instance at 0x12d867>
2011-11-19 22:23:20-0500 [-] Main loop terminated.
2011-11-19 22:23:20-0500 [-] Server Shut Down.
Running a service using the Twisted application infrastructure allows developers to skip writing
boilerplate code for common service functionalities like logging and daemonization. It also
establishes a standard command line interface for deploying applications.
Plugins
An alternative to the TAC-based system for running Twisted applications is the plugin system.
While the TAC system makes it easy to register simple hierarchies of pre-defined services within
an application configuration file, the plugin system makes it easy to register custom services as
subcommands of the twistd utility, and to extend the command-line interface to an application.
Using this system:
1. Only the plugin API is required to remain stable, which makes it easy for third-party developers
to extend the software.
2. Plugin discoverability is codified. Plugins can be loaded and saved when a program is first run,
re-discovered each time the program starts up, or polled for repeatedly at runtime, allowing
the discovery of new plugins installed after the program has started.
To extend a program using the Twisted plugin system, all one has to do is create objects which
implement the IPlugin interface and put them in a particular location where the plugin system
knows to look for them.
Having already converted our echo server to a Twisted application, transformation into a Twisted
plugin is straightforward. Alongside the echo module from before, which contains the Echo protocol
and EchoFactory definitions, we add a directory called twisted, containing a subdirectory called
plugins, containing our echo plugin definition. This plugin will allow us to start an echo server and
specify the port to use as arguments to the twistd utility:
from zope.interface import implements

from twisted.python import usage
from twisted.plugin import IPlugin
from twisted.application.service import IServiceMaker
from twisted.application import internet

from echo import EchoFactory

class Options(usage.Options):
    optParameters = [["port", "p", 8000, "The port number to listen on."]]

class EchoServiceMaker(object):
    implements(IServiceMaker, IPlugin)
    tapname = "echo"
    description = "A TCP-based echo server."
    options = Options

    def makeService(self, options):
        """
        Construct a TCPServer from a factory defined in echo.
        """
        return internet.TCPServer(int(options["port"]), EchoFactory())

serviceMaker = EchoServiceMaker()
Our echo server will now show up as a server option in the output of twistd --help, and running
twistd echo --port=1235 will start an echo server on port 1235.
Twisted comes with a pluggable authentication system for servers called twisted.cred, and a
common use of the plugin system is to add an authentication pattern to an application. One can use
twisted.cred's AuthOptionMixin to add command-line support for various kinds of authentication
off the shelf, or to add a new kind. For example, one could add authentication via a local Unix
password database or an LDAP server using the plugin system.
twistd comes with plugins for many of Twisted's supported protocols, which turns the work of
spinning up a server into a single command. Here are some examples of twistd servers that ship
with Twisted:
twistd web --port 8080 --path .
Run an HTTP server on port 8080, serving both static and dynamic content out of the current
working directory.
twistd dns -p 5553 --hosts-file=hosts
Run a DNS server on port 5553, resolving domains out of a file called hosts in the format of
/etc/hosts.
sudo twistd conch -p tcp:2222
Run an ssh server on port 2222. ssh keys must be set up independently.
twistd mail -E -H localhost -d localhost=emails
Run an ESMTP POP3 server, accepting email for localhost and saving it to the emails
directory.
twistd makes it easy to spin up a server for testing clients, but it is also pluggable, production-
grade code.
In that respect, Twisted's application deployment mechanisms via TAC files, plugins, and twistd
have been a success. However, anecdotally, most large Twisted deployments end up having to rewrite
some of these management and monitoring facilities; the architecture does not quite expose what
system administrators need. This is a reflection of the fact that Twisted has not historically had
much architectural input from system administrators, the people who are experts at deploying and
maintaining applications.
Twisted would be well-served to more aggressively solicit feedback from expert end users when
making future architectural decisions in this space.
21.3 Retrospective and Lessons Learned
Twisted recently celebrated its 10th anniversary. Since its inception, inspired by the networked game
landscape of the early 2000s, it has largely achieved its goal of being an extensible, cross-platform,
event-driven networking engine. Twisted is used in production environments at companies from
Google and Lucasfilm to Justin.TV and the Launchpad software collaboration platform. Server
implementations in Twisted are the core of numerous other open source applications, including
BuildBot, BitTorrent, and Tahoe-LAFS.
Twisted has had few major architectural changes since its initial development. The one crucial
addition was Deferred, as discussed above, for managing pending results and their callback chains.
There was one important removal, which has almost no footprint in the current implementation:
Twisted Application Persistence.
Twisted Application Persistence
Twisted Application Persistence (TAP) was a way of keeping an application's configuration and state
in a pickle. Running an application using this scheme was a two-step process:
1. Create the pickle that represents an Application, using the now defunct mktap utility.
2. Use twistd to unpickle and run the Application.
This process was inspired by Smalltalk images, an aversion to the proliferation of seemingly ad
hoc configuration languages that were hard to script, and a desire to express configuration details in
Python.
TAP files immediately introduced unwanted complexity. Classes would change in Twisted without
instances of those classes getting changed in the pickle. Trying to use class methods or attributes
from a newer version of Twisted on the pickled object would crash the application. The notion of
upgraders that would upgrade pickles to new API versions was introduced, but then a matrix of
upgraders, pickle versions, and unit tests had to be maintained to cover all possible upgrade paths,
and comprehensively accounting for all interface changes was still hard and error-prone.
TAPs and their associated utilities were abandoned and then eventually removed from Twisted
and replaced with TAC les and plugins. TAP was backronymed to Twisted Application Plugin, and
few traces of the failed pickling system exist in Twisted today.
The lesson learned from the TAP fiasco was that to have reasonable maintainability, persistent
data needs an explicit schema. More generally, it was a lesson about adding complexity to a project:
when considering introducing a novel system for solving a problem, make sure the complexity of that
solution is well understood and tested and that the benets are clearly worth the added complexity
before committing the project to it.
web2: a lesson on rewrites
While not primarily an architectural decision, a project management decision about rewriting the
Twisted Web implementation has had long-term ramifications for Twisted's image and the maintainers'
ability to make architectural improvements to other parts of the code base, and it deserves a short
discussion.
In the mid-2000s, the Twisted developers decided to do a full rewrite of the twisted.web APIs as
a separate project in the Twisted code base called web2. web2 would contain numerous improvements
over twisted.web, including full HTTP 1.1 support and a streaming data API.
web2 was labelled as experimental, but ended up getting used by major projects anyway and
was even accidentally released and packaged by Debian. Development on web and web2 continued
concurrently for years, and new users were perennially frustrated by the side-by-side existence of
both projects and a lack of clear messaging about which project to use. The switchover to web2 never
happened, and in 2011 web2 was finally removed from the code base and the website. Some of the
improvements from web2 are slowly getting ported back to web.
Partially because of web2, Twisted developed a reputation for being hard to navigate and structurally
confusing to newcomers. Years later, the Twisted community still works hard to combat this
image.
The lesson learned from web2 was that rewriting a project from scratch is often a bad idea, but if
it has to happen make sure that the developer community understands the long-term plan, and that
the user community has one clear choice of implementation to use during the rewrite.
If Twisted could go back and do web2 again, the developers would have done a series of backwards-
compatible changes and deprecations to twisted.web instead of a rewrite.
Keeping Up with the Internet
The way that we use the Internet continues to evolve. The decision to implement many protocols
as part of the core software burdens Twisted with maintaining code for all of those protocols.
Implementations have to evolve with changing standards and the adoption of new protocols while
maintaining a strict backwards-compatibility policy.
Twisted is primarily a volunteer-driven project, and the limiting factor for development is not
community enthusiasm, but rather volunteer time. For example, RFC 2616 defining HTTP 1.1 was
released in 1999, work began on adding HTTP 1.1 support to Twisted's HTTP protocol implementations
in 2005, and the work was completed in 2009. Support for IPv6, defined in RFC 2460 in 1998,
is in progress but unmerged as of 2011.
Implementations also have to evolve as the interfaces exposed by supported operating systems
change. For example, the epoll event notification facility was added to Linux 2.5.44 in 2002, and
Twisted grew an epoll-based reactor to take advantage of this new API. In 2007, Apple released
OS X 10.5 Leopard with a poll implementation that didn't support devices, which was buggy enough
behavior for Apple to not expose select.poll in its build of Python (see
http://twistedmatrix.com/trac/ticket/4173). Twisted has had to work around this issue and document it
for users ever since.
Sometimes, Twisted development doesn't keep up with the changing networking landscape,
and enhancements are moved to libraries outside of the core software. For example, the Wokkel
project (http://wokkel.ik.nu/), a collection of enhancements to Twisted's Jabber/XMPP support, has
lived as a to-be-merged independent project for years without a champion to oversee the merge. An
attempt was
made to add WebSockets to Twisted as browsers began to adopt support for the new protocol in 2009,
but development moved to external projects after a decision not to include the protocol until it moved
from an IETF draft to a standard.
All of this being said, the proliferation of libraries and add-ons is a testament to Twisted's flexibility
and extensibility. A strict test-driven development policy and accompanying documentation and
coding standards help the project avoid regressions and preserve backwards compatibility while
maintaining a large matrix of supported protocols and platforms. It is a mature, stable project that
continues to have very active development and adoption.
Twisted looks forward to being the engine of your Internet for another ten years.
[chapter22]
Yesod
Michael Snoyman
Yesod is a web framework written in the Haskell programming language. While many popular web
frameworks exploit the dynamic nature of their host languages, Yesod exploits the static nature of
Haskell to produce safer, faster code.
Development began about two years ago and has been going strong ever since. Yesod cut its
teeth on real life projects, with all of its initial features born out of an actual, real-life need. At first,
development was almost entirely a one-man show. After about a year of development the community
efforts kicked in, and Yesod has since blossomed into a thriving open source project.
During the embryonic phase, when Yesod was incredibly ephemeral and ill-defined, it would
have been counter-productive to try and get a team to work on it. By the time it stabilized enough to
be useful to others, it was the right time to find out the downsides to some of the decisions that had
been made. Since then, we have made major changes to the user-facing API to make it more useful,
and are quickly solidifying a 1.0 release.
The question you may ask is: Why another web framework? Let's instead redirect to a different
question: Why use Haskell? It seems that most of the world is happy with one of two styles of
language:
• Statically typed languages, like Java, C# and C++. These languages provide speed and type
safety, but are more cumbersome to program with.
• Dynamically typed languages, like Ruby and Python. These languages greatly increase produc-
tivity (at least in the short run), but run slowly and have very little support from the compiler
to ensure correctness. (The solution to this last point is unit testing. We'll get to that later.)
This is a false dichotomy. There's no reason why statically typed languages need to be so clumsy.
Haskell is able to capture a huge amount of the expressivity of Ruby and Python, while remaining a
strongly typed language. In fact, Haskells type system catches many more bugs than Java and its ilk.
Null pointer exceptions are completely eliminated; immutable data structures simplify reasoning
about your code and simplify parallel and concurrent programming.
So why Haskell? It is an efficient, developer-friendly language which provides many compile-time
checks of program correctness.
The goal of Yesod is to extend Haskell's strengths into web development. Yesod strives to
make your code as concise as possible. As much as possible, every line of your code is checked for
correctness at compile time. Instead of requiring large libraries of unit tests to test basic properties,
the compiler does it all for you. Under the surface, Yesod uses as many advanced performance
techniques as we can muster to make your high-level code fly.
22.1 Compared to Other Frameworks
In general terms, Yesod is more similar to than different from the leading frameworks such as Rails
and Django. It generally follows the Model-View-Controller (MVC) paradigm, has a templating
system that separates view from logic, provides an Object-Relational Mapping (ORM) system, and
has a front controller approach to routing.
The devil is in the details. Yesod strives to push as much error catching to the compile phase
instead of runtime, and to automatically catch both bugs and security flaws through the type system.
While Yesod tries to maintain a user-friendly, high-level API, it uses a number of newer techniques
from the functional programming world to achieve high performance, and is not afraid to expose
these internals to developers.
The main architectural challenge in Yesod is balancing these two seemingly conflicting goals. For
example, there is nothing revolutionary about Yesod's approach to routing (called type-safe URLs;
see http://www.yesodweb.com/blog/212/1/aosa-chapter#file1414-routes).
Historically, implementing such a solution was a tedious, error-prone process. Yesod's innovation is
to use Template Haskell (a form of code generation) to automate the boilerplate required to bootstrap
the process. Similarly, type-safe HTML has been around for a long while; Yesod tries to keep the
developer-friendly aspect of common template languages while keeping the power of type safety.
22.2 Web Application Interface
A web application needs some way to communicate with a server. One possible approach is to bake
the server directly into the framework, but doing so necessarily limits your options for deployment
and leads to poor interfaces. Many languages have created standard interfaces to address this issue:
Python has WSGI and Ruby has Rack. In Haskell, we have WAI: Web Application Interface.
WAI is not intended to be a high-level interface. It has two specic goals: generality and
performance. By staying general, WAI has been able to support backends for everything from
standalone servers to old school CGI and even works directly with Webkit to produce faux desktop
applications. The performance side will introduce us to a number of the cool features of Haskell.
Datatypes
One of the biggest advantages of Haskell, and one of the things we make the most use of in Yesod,
is strong static typing. Before we begin to write the code for how to solve something, we need to
think about what the data will look like. WAI is a perfect example of this paradigm. The core concept
we want to express is that of an application. An application's most basic expression is a function that
takes a request and returns a response. In Haskell lingo:
type Application = Request -> Response
This just raises the question: what do Request and Response look like? A request has a number
of pieces of information, but the most basic are the requested path, query string, request headers,
and request body. And a response has just three components: a status code, response headers and
response body.
How do we represent something like a query string? Haskell keeps a strict separation between
binary and textual data. The former is represented by ByteString, the latter by Text. Both are
highly optimized datatypes that provide a high-level, safe API. In the case of a query string we store
the raw bytes transferred over the wire as a ByteString and the parsed, decoded values as Text.

Figure 22.1: Overall structure of a Yesod application
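To make the shape of these values concrete, here is a deliberately simplified sketch of what such records could look like; the real wai package has more fields and somewhat different types (for instance, the status is a dedicated Status type rather than a bare Int):

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString (ByteString)
import Data.Text (Text)

-- A simplified stand-in for WAI's request and response records.
data Request = Request
    { requestPath    :: [Text]                     -- decoded path segments
    , rawQueryString :: ByteString                 -- raw bytes as sent on the wire
    , queryItems     :: [(Text, Maybe Text)]       -- parsed, decoded pairs
    , requestHeaders :: [(ByteString, ByteString)]
    }

data Response = Response
    { responseStatus  :: Int
    , responseHeaders :: [(ByteString, ByteString)]
    , responseBody    :: ByteString                -- a naive single buffer; see Streaming below
    }

main :: IO ()
main = print (responseStatus (Response 200 [] "PONG"))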
Streaming
A ByteString represents a single memory buffer. If we were to naively use a plain ByteString for
holding the entire request or response bodies, our applications could never scale to large requests
or responses. Instead, we use a technique called enumerators, very similar in concept to generators
in Python. Our Application becomes a consumer of a stream of ByteStrings representing the
incoming request body, and a producer of a separate stream for the response.
We now need to slightly revise our definition of an Application. An Application will take a
Request value, containing headers, query string, etc., and will consume a stream of ByteStrings,
producing a Response. So the revised definition of an Application is:

type Application = Request -> Iteratee ByteString IO Response

The IO simply explains what types of side effects an application can perform. In the case of IO, it
can perform any kind of interaction with the outside world, an obvious necessity for the vast majority
of web applications.
Builder
The trick in our arsenal is how we produce our response buffers. We have two competing desires here:
minimizing system calls, and minimizing buffer copies. On the one hand, we want to minimize system
calls for sending data over the socket. To do this we need to store outgoing data in a buffer. However,
if we make this buffer too large, we will exhaust our memory and slow down the application's
response time. On the other hand, we want to minimize the number of times data is copied between
buffers, preferably copying just once from the source to destination buffer.
Haskell's solution is the builder. A builder is an instruction for how to fill a memory buffer, such
as: place the five bytes "hello" in the next open position. Instead of passing a stream of memory
buffers to the server, a WAI application passes a stream of these instructions. The server takes the
stream and uses it to fill up optimally sized memory buffers. As each buffer is filled, the server makes
a system call to send the data over the wire and then starts filling up the next buffer.
(The optimal size for a buffer will depend on many factors such as cache size. The underlying
blaze-builder library underwent significant performance testing to determine the best trade-off.)
In theory, this kind of optimization could be performed in the application itself. However, by
encoding this approach in the interface, we are able to simply prepend the response headers to the
response body. The result is that, for small to medium-sized responses, the entire response can be
sent with a single system call and memory is copied only once.
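A tiny sketch of the builder idea using the blaze-builder package (treat the imports as illustrative, since the module layout has shifted over time): each piece is an instruction rather than a buffer, and nothing is copied until the whole response is rendered at the end:

{-# LANGUAGE OverloadedStrings #-}
import Blaze.ByteString.Builder (Builder, fromByteString, toLazyByteString)
import Data.Monoid (mconcat)
import qualified Data.ByteString.Lazy.Char8 as L

-- Headers and body are composed as instructions; the server (or here,
-- toLazyByteString) decides how to fill actual buffers.
response :: Builder
response = mconcat
    [ fromByteString "HTTP/1.1 200 OK\r\n"
    , fromByteString "Content-Type: text/plain\r\n\r\n"
    , fromByteString "PONG"
    ]

main :: IO ()
main = L.putStr (toLazyByteString response)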
Handlers
Now that we have an application, we need some way to run it. In WAI parlance, this is a handler. WAI
has some basic, standard handlers, such as the standalone server Warp (discussed below), FastCGI,
SCGI and CGI. This spectrum allows WAI applications to be run on anything from dedicated servers
to shared hosting. But in addition to these, WAI has some more interesting backends:
Webkit: This backend embeds a Warp server and calls out to QtWebkit. By launching a server, then
launching a new standalone browser window, we have faux desktop applications.
Launch: This is a slight variant on Webkit. Having to deploy the Qt and Webkit libraries can be a
bit burdensome, so instead we just launch the user's default browser.
Test: Even testing counts as a handler. After all, testing is simply the act of running an application
and inspecting the responses.
Most developers will likely use Warp. It is lightweight enough to be used for testing. It requires
no config files, no folder hierarchy and no long-running, administrator-owned process. It's a simple
library that gets compiled into your application or run via the Haskell interpreter. Warp is an
incredibly fast server, with protection from all kinds of attack vectors, such as Slowloris and infinite
headers. Warp can be the only web server you need, though it is also quite happy to sit behind a
reverse HTTP proxy.
The PONG benchmark measures the requests per second of various servers for the 4-byte response
body "PONG". In the graph shown in Figure 22.2, Yesod is measured as a framework on top of
Warp. As can be seen, the Haskell servers (Warp, Happstack and Snap) lead the pack.
Most of the reasons for Warp's speed have already been spelled out in the overall description of
WAI: enumerators, builders and packed datatypes. The last piece in the puzzle is from the Glasgow
Haskell Compiler's (GHC's) multithreaded runtime. GHC, Haskell's flagship compiler, has light-
weight green threads. Unlike system threads, it is possible to spin up thousands of these without
serious performance hits. Therefore, in Warp each connection is handled by its own green thread.
The next trick is asynchronous I/O. Any web server hoping to scale to tens of thousands of
requests per second will need some type of asynchronous communication. In most languages, this
involves complicated programming involving callbacks. GHC lets us cheat: we program as if we're
using a synchronous API, and GHC automatically switches between different green threads waiting
for activity.
Under the surface, GHC uses whatever system is provided by the host operating system, such as
kqueue, epoll and select. This gives us all the performance of an event-based I/O system, without
worrying about cross-platform issues or writing in a callback-oriented way.
Figure 22.2: Warp PONG benchmark
Middleware
In between handlers and applications, we have middleware. Technically, middleware is an application
transformer: it takes one Application, and returns a new one. This is defined as:
type Middleware = Application -> Application
The best way to understand the purpose of middleware is to look at some common examples:
• gzip automatically compresses the response from an application.
• jsonp automatically converts JSON responses to JSON-P responses when the client provided
a callback parameter.
• autohead will generate appropriate HEAD responses based on the GET response of an
application.
• debug will print debug information to the console or a log on each request.
The idea here is to factor out common code from applications and let it be shared easily. Note
that, based on the definition of middleware, we can easily stack these things up. The general workflow
of middleware is:
1. Take the request value and apply some modifications.
2. Pass the modied request to the application and receive a response.
3. Modify the response and return it to the handler.
In the case of stacked middleware, instead of passing to the application or handler, the in-between
middleware will actually be passing to the inner and outer middleware, respectively.
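Because a middleware really is just a function from Application to Application, stacking falls out of ordinary function composition. The following self-contained sketch uses a deliberately simplified Application type (a stand-in for the iteratee-based one above), so none of these names come from the wai package itself:

-- A simplified model of the types, just to show how middleware stacks.
type Request     = [(String, String)]        -- request headers, as a stand-in
type Response    = String
type Application = Request -> IO Response
type Middleware  = Application -> Application

-- Hypothetical middleware: log each request before passing it on.
logRequests :: Middleware
logRequests app = \req -> do
    putStrLn ("handling a request with " ++ show (length req) ++ " headers")
    app req

-- Hypothetical middleware: decorate the response on the way back out.
addBanner :: Middleware
addBanner app = \req -> do
    response <- app req
    return ("X-Sketch: middleware was here\n" ++ response)

hello :: Application
hello _ = return "hello"

main :: IO ()
main = (logRequests . addBanner) hello [("Host", "example.com")] >>= putStrLn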
Wai-test
No amount of static typing will obviate the need for testing. We all know that automated testing is a
necessity for any serious application. wai-test is the recommended approach to testing a WAI
application. Since requests and responses are simple datatypes, it is easy to mock up a fake request,
pass it to an application, and test properties about the response. wai-test simply provides some
convenience functions for testing common properties like the presence of a header or a status code.
22.3 Templates
In the typical Model-View-Controller (MVC) paradigm, one of the goals is to separate logic from the
view. Part of this separation is achieved through the use of a template language. However, there are
many different ways to approach this issue. At one end of the spectrum, for example, PHP/ASP/JSP
will allow you to embed any arbitrary code within your template. At the other end, you have systems
like StringTemplate and QuickSilver, which are passed some arguments and have no other way of
interacting with the rest of the program.
Each system has its pros and cons. Having a more powerful template system can be a huge
convenience. Need to show the contents of a database table? No problem, pull it in with the template.
However, such an approach can quickly lead to convoluted code, interspersing database cursor
updates with HTML generation. This can be commonly seen in a poorly written ASP project.
While weak template systems make for simple code, they also tend towards a lot of redundant work.
You will often need to not only keep your original values in datatypes, but also create dictionaries of
values to pass to the template. Maintaining such code is not easy, and usually there is no way for a
compiler to help you out.
Yesod's family of template languages, the Shakespearean languages, strive for a middle ground.
By leveraging Haskell's standard referential transparency, we can be assured that our templates
produce no side effects. However, they still have full access to all the variables and functions available
in your Haskell code. Also, since they are fully checked for well-formedness, variable resolution
and type safety at compile time, typos are much less likely to have you searching through your code
trying to pin down a bug.
Why the Name Shakespeare?
The HTML language, Hamlet, was the first language written, and originally based its syntax on
Haml. Since it was at the time a "reduced" Haml, Hamlet seemed appropriate. As we added
CSS and Javascript options, we decided to keep the naming theme with Cassius and Julius. At
this point, Hamlet looks nothing like Haml, but the name stuck anyway.
Types
One of the overarching themes in Yesod is proper use of types to make developers' lives easier. In
Yesod templates, we have two main examples:
1. All content embedded into a Hamlet template must have a type of Html. As we'll see later,
this forces us to properly escape dangerous HTML when necessary, while avoiding accidental
double-escaping as well.
2. Instead of concatenating URLs directly in our template, we have datatypes, known as type-
safe URLs, which represent the routes in our application.
As a real-life example, suppose that a user submits his/her name to an application via a form.
This data would be represented with the Text datatype. Now we would like to display this variable,
336 Yesod
called name, in a page. The type system, at compile time, prevents it from being simply stuck into
a Hamlet template, since it's not of type Html. Instead we must convert it somehow. For this, there
are two conversion functions:
1. toHtml will automatically escape any entities. So if a user submits the string <script
src="http://example.com/evil.js"></script>, the less-than signs will automatically
be converted to &lt;.
2. preEscapedText, on the other hand, will leave the content precisely as it is now.
So in the case of untrusted input from a possibly nefarious user, toHtml would be our recommended
approach. On the other hand, let us say we have some static HTML stored on our server that
we would like to insert into some pages verbatim. In that case, we could load it into a Text value
and then apply preEscapedText, thereby avoiding any double-escaping.
By default, Hamlet will use the toHtml function on any content you try to interpolate. Therefore,
you only need to explicitly perform a conversion if you want to avoid escaping. This follows the
dictum of erring on the side of caution.
name <- runInputPost $ ireq textField "name"
snippet <- readFile "mysnippet.html"
return [hamlet|
<p>Welcome #{name}, you are on my site!
<div .copyright>#{preEscapedText snippet}
|]
The first step in type-safe URLs is creating a datatype that represents all the routes in your site.
Let us say you have a site for displaying Fibonacci numbers. The site will have a separate page for
each number in the sequence, plus the homepage. This could be modeled with the Haskell datatype:
data FibRoute = Home | Fib Int
We could then create a page like so:
<p>You are currently viewing number #{show index} in the sequence. Its value is #{fib index}.
<p>
<a href=@{Fib (index + 1)}>Next number
<p>
<a href=@{Home}>Homepage
Then all we need is some function to convert a type-safe URL into a string representation. In our
case, that could look something like this:
render :: FibRoute -> Text
render Home = "/home"
render (Fib i) = pack ("/fib/" ++ show i)
Fortunately, all of the boilerplate of defining and rendering type-safe URL datatypes is handled
for the developer automatically by Yesod. We will cover that in more depth later.
The Other Languages
In addition to Hamlet, there are three other languages: Julius, Cassius and Lucius. Julius is used for
Javascript; however, it's a simple pass-through language, just allowing for interpolation. In other
words, barring accidental use of the interpolation syntax, any piece of Javascript could be dropped
into Julius and be valid. For example, to test the performance of Julius, jQuery was run through the
language without an issue.
The other two languages are alternate CSS syntaxes. Those familiar with the difference between
Sass and Less will recognize this immediately: Cassius is whitespace delimited, while Lucius uses
braces. Lucius is in fact a superset of CSS, meaning all valid CSS files are valid Lucius files. In
addition to allowing text interpolation, there are some helper datatypes provided to model unit sizes
and colors. Also, type-safe URLs work in these languages, making it convenient for specifying
background images.
Aside from the type safety and compile-time checks mentioned above, having specialized languages
for CSS and Javascript gives us a few other advantages:
- For production, all the CSS and Javascript is compiled into the final executable, increasing
performance (by avoiding file I/O) and simplifying deployment.
- By being based around the efficient builder construct described earlier, the templates can be
rendered very quickly.
- There is built-in support for automatically including these in final webpages. We will get into
this in more detail when describing widgets below.
22.4 Persistent
Most web applications will want to store information in a database. Traditionally, this has meant
some kind of SQL database. In that regard, Yesod continues a long tradition, with PostgreSQL as
our most commonly used backend. But as we have been seeing in recent years, SQL isn't always
the answer to the persistence question. Therefore, Yesod was designed to work well with NoSQL
databases as well, and ships with a MongoDB backend as a first-class citizen.
The result of this design decision is Persistent, Yesod's preferred storage option. There are really
two guiding lights for Persistent: make it as back-end-agnostic as possible, and let user code be
completely type-checked.
At the same time, we fully recognize that it is impossible to completely shield the user from all
details of the backend. Therefore, we provide two types of escape routes:
- Back-end-specific functionality as necessary. For example, Persistent provides features for
SQL joins and MongoDB lists and hashes. Proper portability warnings will apply, but if you
want this functionality, it's there.
- Easy access to performing raw queries. We don't believe it's possible for any abstraction to
cover every use case of the underlying library. If you just have to write a 5-table, correlated
subquery in SQL, go right ahead.
Terminology
The most primitive datatype in Persistent is the PersistValue. This represents any raw data that can
appear within the database, such as a number, a date, or a string. Of course, sometimes you'll have
some more user-friendly datatypes you want to store, like HTML. For that, we have the PersistField
class. Internally, a PersistField expresses itself to the database in terms of a PersistValue.
All of this is very nice, but we will want to combine different fields together into a larger picture.
For this, we have a PersistEntity, which is basically a collection of PersistFields. And finally,
we have a PersistBackend that describes how to create, read, update and delete these entities.
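To make that relationship a little more concrete, here is a deliberately simplified sketch of the idea; the real PersistField class in the persistent package carries more methods and constraints than shown here, and PersistValue is assumed to come from that package.

import Data.Text (Text)

-- Simplified sketch: a user-facing type knows how to express itself as a raw
-- PersistValue and how to read itself back, possibly failing with an error.
class PersistField a where
    toPersistValue   :: a -> PersistValue
    fromPersistValue :: PersistValue -> Either Text a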
As a practical example, consider storing a person in a database. We want to store the person's
name, birthday, and a profile image (a PNG file). We create a new entity Person with three fields:
a Text, a Day and a PNG. Each of those gets stored in the database using a different PersistValue
constructor: PersistText, PersistDay and PersistByteString, respectively.
There is nothing surprising about the first two mappings, but the last one is interesting. There is
no specific constructor for storing PNG content in a database, so instead we use a more generic type
(a ByteString, which is just a sequence of bytes). We could use the same mechanism to store other
types of arbitrary data.
(The commonly held best practice for storing images is to keep the data on the filesystem and
just keep a path to the image in the database. We do not advocate against using that approach, but
are rather using database-stored images as an illustrative example.)
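As a sketch, the Person entity described here might be declared with Persistent's entity syntax roughly as follows. The Template Haskell wrappers come from the persistent-template package, the exact quasiquoter name varies between versions of the library, and migrateAll is simply the name we choose for the generated migration.

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Person
    name Text
    birthday Day
    picture ByteString
|]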
How is all this represented in the database? Consider SQL as an example: the Person entity
becomes a table with three columns (name, birthday, and picture). Each field is stored as a different
SQL type: Text becomes a VARCHAR, Day becomes a Date and PNG becomes a BLOB (or BYTEA).
The story for MongoDB is very similar. Person becomes its own document, and its three fields
each become a MongoDB field. There is no need for datatypes or creation of a schema in MongoDB.
Persistent      SQL            MongoDB
PersistEntity   Table          Document
PersistField    Column         Field
PersistValue    Column type    N/A
Type Safety
Persistent handles all of the data marshaling concerns behind the scenes. As a user of Persistent, you
get to completely ignore the fact that a Text becomes a VARCHAR. You are able to simply declare
your datatypes and use them.
Every interaction with Persistent is strongly typed. This prevents you from accidentally putting
a number in the date fields; the compiler will not accept it. Entire classes of subtle bugs simply
disappear at this point.
Nowhere is the power of strong typing more pronounced than in refactoring. Let's say you have
been storing users' ages in the database, and you realize that you really wanted to store birthdays
instead. You are able to make a single-line change to your entities declaration file, hit compile, and
automatically find every single line of code that needs to be updated.
In most dynamically-typed languages, and their web frameworks, the recommended approach to
solving this issue is writing unit tests. If you have full test coverage, then running your tests will
immediately reveal what code needs to be updated. This is all well and good, but it is a weaker
solution than true types:
- It is all predicated on having full test coverage. This takes extra time, and worse, is boilerplate
code that the compiler should be able to do for you.
- You might be a perfect developer who never forgets to write a test, but can you say the same
for every person who will touch your codebase?
- Even 100% test coverage doesn't guarantee that you really have tested every case. All it's done
is prove that you've tested every line of code.
Cross-Database Syntax
Creating an SQL schema that works for multiple SQL engines can be tricky enough. How do you
create a schema that will also work with a non-SQL database like MongoDB?
Persistent allows you to define your entities in a high-level syntax, and will automatically create
the SQL schema for you. In the case of MongoDB, we currently use a schema-less approach. This
also allows Persistent to ensure that your Haskell datatypes match perfectly with the database's
definitions.
Additionally, having all this information gives Persistent the ability to perform more advanced
functions, such as migrations, for you automatically.
Migrations
Persistent not only creates schema files as necessary, but will also automatically apply database
migrations if possible. Database modification is one of the less-developed pieces of the SQL standard,
and thus each engine has a different take on the process. As such, each Persistent backend defines its
own set of migration rules. In PostgreSQL, which has a rich set of ALTER TABLE rules, we use those
extensively. Since SQLite lacks much of that functionality, we are reduced to creating temporary
tables and copying rows. MongoDB's schema-less approach means no migration support is required.
This feature is purposely limited to prevent any kind of data loss. It will not remove any columns
automatically; instead, it will give you an error message, telling you the unsafe operations that are
necessary in order to continue. You will then have the option of either manually running the SQL it
provides you, or changing your data model to avoid the dangerous behavior.
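As a hedged sketch of what this looks like in practice, assuming the SQLite backend, a database file named demo.sqlite3, and a migrateAll value generated from an entity block like the one sketched earlier:

{-# LANGUAGE OverloadedStrings #-}
import Database.Persist.Sqlite (runSqlite, runMigration)

main :: IO ()
main = runSqlite "demo.sqlite3" $
    -- Applies only safe schema changes; anything potentially destructive
    -- aborts with a message describing the SQL that would be required.
    runMigration migrateAll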
Relations
Persistent is non-relational in nature, meaning it has no requirement for backends to support relations.
However, in many use cases, we may want to use relations. In those cases, developers will have full
access to them.
Assume we want to now store a list of skills with each user. If we were writing a MongoDB-specific
app, we could go ahead and just store that list as a new field in the original Person entity.
But that approach would not work in SQL. In SQL, we call this kind of relationship a one-to-many
relationship.
The idea is to store a reference to the "one" entity (person) with each "many" entity (skill). Then
if we want to find all the skills a person has, we simply find all skills that reference that person. For
this reference, every entity has an ID. And as you might expect by now, these IDs are completely
type-safe. The datatype for a Person ID is PersonId. So to add our new skill, we would just add the
following to our entity definition:
Skill
person PersonId
name Text
description Text
UniqueSkill person name
This ID datatype concept comes up throughout Persistent and Yesod. You can dispatch based on
an ID. In such a case, Yesod will automatically marshal the textual representation of the ID to the
internal one, catching any parse errors along the way. These IDs are used for lookup and deletion
with the get and delete functions, and are returned by the insertion and query functions insert
and selectList.
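A sketch of how these pieces might fit together, assuming the Person and Skill entities above; SkillPerson and SkillName are the field constructors Persistent generates from the Skill declaration, and Entity is the key/value pair that selectList returns in recent versions of the library.

-- Fetch a person by its type-safe ID, aborting with a 404 if the ID does not
-- resolve to a row, then find every Skill row that references that PersonId.
personWithSkills :: PersonId -> Handler (Person, [Entity Skill])
personWithSkills personId = do
    person <- runDB $ get404 personId
    skills <- runDB $ selectList [SkillPerson ==. personId] [Asc SkillName]
    return (person, skills)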
22.5 Yesod
If we are looking at the typical Model-View-Controller (MVC) paradigm, Persistent is the model and
Shakespeare is the view. This would leave Yesod as the controller.
The most basic feature of Yesod is routing. It features a declarative syntax and type-safe dispatch.
Layered on top of this, Yesod provides many other features: streaming content generation, widgets,
i18n, static files, forms and authentication. But the core feature added by Yesod is really routing.
This layered approach makes it simpler for users to swap different components of the system.
Some people are not interested in using Persistent. For them, nothing in the core system even mentions
Persistent. Likewise, while they are commonly used features, not everyone needs authentication or
static file serving.
On the other hand, many users will want to integrate all of these features. And doing so, while
enabling all the optimizations available in Yesod, is not always straightforward. To simplify the
process, Yesod also provides a scaffolding tool that sets up a basic site with the most commonly used
features.
Routes
Given that routing is really the main function of Yesod, let's start there. The routing syntax is very
simple: a resource pattern, a name, and request methods. For example, a simple blog site might look
like:
/ HomepageR GET
/add-entry AddEntryR GET POST
/entry/#EntryId EntryR GET
The first line defines the homepage. This says "I respond to the root path of the domain, I'm
called HomepageR, and I answer GET requests." (The trailing "R" on the resource names is simply a
convention; it doesn't hold any special meaning besides giving a cue to the developer that something
is a route.)
The second line defines the add-entry page. This time, we answer both GET and POST requests.
You might be wondering why Yesod, as opposed to most frameworks, requires you to explicitly state
your request methods. The reason is that Yesod tries to adhere to RESTful principles as much as
possible, and GET and POST requests really have very different meanings. Not only do you state
these two methods separately, but later you will define their handler functions separately. (This is
actually an optional feature in Yesod. If you want, you can leave off the list of methods and your
handler function will deal with all methods.)
The third line is a bit more interesting. After the second slash we have #EntryId. This defines
a parameter of type EntryId. We already alluded to this feature in the Persistent section: Yesod
will now automatically marshal the path component into the relevant ID value. Assuming an SQL
backend (Mongo is addressed later), if a user requests /entry/5, the handler function will get called
with an argument EntryId 5. But if the user requests /entry/some-blog-post, Yesod will return
a 404.
This is obviously possible in most other web frameworks as well. The approach taken by Django,
for instance, would use a regular expression for matching the routes, e.g. r"/entry/(\d+)". The
Yesod approach, however, provides some advantages:
Typing "EntryId" is much more semantic/developer-friendly than a regular expression.
Regular expressions cannot express everything (or at least, cant do so succinctly). We can use
/calendar/#Day in Yesod; do you want to type a regex to match dates in your routes?
Yesod also automatically marshals the data for us. In our calendar case, our handler function
would receive a Day value. In the Django equivalent, the function would receive a piece of
text which it would then have to marshal itself. This is tedious, repetitive and inecient.
So far weve assumed that a database ID is just a string of digits. But what if its more
complicated? MongoDB uses GUIDs, for example. In Yesod, your #EntryId will still work,
and the type system will instruct Yesod how to parse the route. In a regex system, you would
have to go through all of your routes and change the \d+ to whatever monstrosity of regex is
needed to match GUIDs.
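In practice, the routes block shown earlier is usually wired into a site with a small amount of Template Haskell. A minimal sketch, where MySite is a hypothetical foundation type and mkYesod and parseRoutes come from Yesod itself:

mkYesod "MySite" [parseRoutes|
/ HomepageR GET
/add-entry AddEntryR GET POST
/entry/#EntryId EntryR GET
|]

Behind the scenes this generates both the route datatype described in the next section and the dispatch code that maps incoming request paths onto it.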
Type-Safe URLs
This approach to routing gives birth to one of Yesod's most powerful features: type-safe URLs.
Instead of just splicing together pieces of text to refer to a route, every route in your application can
be represented by a Haskell value. This immediately eliminates a large number of 404 Not Found
errors: it is simply not possible to produce an invalid URL. (It is still possible to produce a URL that
would lead to a 404 error, such as by referring to a blog post that does not exist. However, all URLs
will be formed correctly.)
So how does this magic work? Each site has a route datatype, and each resource pattern gets its
own constructor. In our previous example, we would get something that looks like:
data MySiteRoute = HomepageR
| AddEntryR
| EntryR EntryId
If you want to link to the homepage, you use HomepageR. To link to a specic entry, you would
use the EntryR constructor with an EntryId parameter. For example, to create a new entry and
redirect to it, you could write:
entryId <- insert (Entry "My Entry" "Some content")
redirect RedirectTemporary (EntryR entryId)
Hamlet, Lucius and Julius all include built-in support for these type-safe URLs. Inside a Hamlet
template you can easily create a link to the add-entry page:
<a href=@{AddEntryR}>Create a new entry.
The best part? Just like Persistent entities, the compiler will keep you honest. If you change any
of your routes (e.g., you want to include the year and month in your entry routes), Yesod will force
you to update every single reference throughout your codebase.
Handlers
Once you define your routes, you need to tell Yesod how you want to respond to requests. This is
where handler functions come into play. The setup is simple: for each resource (e.g., HomepageR)
and request method, create a function named methodResourceR. For our previous example, we
would need four functions: getHomepageR, getAddEntryR, postAddEntryR, and getEntryR.
All of the parameters collected from the route are passed in as arguments to the handler function.
getEntryR will take a first argument of type EntryId, while all the other functions will take no
arguments.
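For the example above, the four handler type signatures might look something like this; RepHtml was the conventional return type for HTML-producing handlers in this generation of Yesod.

getHomepageR  :: Handler RepHtml
getAddEntryR  :: Handler RepHtml
postAddEntryR :: Handler RepHtml
getEntryR     :: EntryId -> Handler RepHtml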
The handler functions live in a Handler monad, which provides a great deal of functionality,
such as redirecting, accessing sessions, and running database queries. For the last one, a typical way
to start off the getEntryR function would be:
getEntryR entryId = do
entry <- runDB $ get404 entryId
This will run a database action that will get the entry associated with the given ID from the
database. If there is no such entry, it will return a 404 response.
Each handler function will return some value, which must be an instance of HasReps. This is
another RESTful feature at play: instead of just returning some HTML or some JSON, you can return
a value that will return either one, depending on the HTTP Accept request header. In other words, in
Yesod, a resource is a specific piece of data, and it can be returned in one of many representations.
Widgets
Assume you want to include a navbar on a few different pages of your site. This navbar will load up
the five most recent blog posts (stored in your database), generate some HTML, and then need some
CSS and Javascript to style and enhance it.
Without a higher-level interface to tie these components together, this could be a pain to implement.
You could add the CSS to the site-wide CSS file, but that's adding extra declarations you don't always
need. Likewise with the Javascript, though a bit worse: having that extra Javascript might cause
problems on a page it was not intended to live on. You will also be breaking modularity by having to
generate the database results from multiple handler functions.
In Yesod, we have a very simple solution: widgets. A widget is a piece of code that ties together
HTML, CSS and Javascript, allows you to add content to both the head and body, and can run any
arbitrary code that belongs in a handler. For example, to implement our navbar:
-- Get last five blog posts. The "lift" says to run this code like we're in the handler.
entries <- lift $ runDB $ selectList [] [LimitTo 5, Desc EntryPosted]
toWidget [hamlet|
<ul .navbar>
$forall entry <- entries
<li>#{entryTitle entry}
|]
toWidget [lucius| .navbar { color: red } |]
toWidget [julius|alert("Some special Javascript to play with my navbar");|]
But there is even more power at work here. When you produce a page in Yesod, the standard
approach is to combine a number of widgets together into a single widget containing all your page
content, and then apply defaultLayout. This function is defined per site, and applies the standard
site layout.
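A sketch of that standard approach, assuming the navbar code above has been packaged as a widget named navbar (a hypothetical name) and that the handler produces an HTML page:

getHomepageR :: Handler RepHtml
getHomepageR = defaultLayout $ do
    setTitle "My Site"
    navbar                             -- the widget assembled above
    [whamlet|<h1>Welcome to my site!|]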
There are two out-of-the-box approaches to handling where the CSS and Javascript go:
1. Concatenate them and place them into style and script tags, respectively, within your
HTML.
2. Place them in external files and refer to them with link and script tags, respectively.
In addition, your Javascript can be automatically minified. Option 2 is the preferred approach,
since it allows a few extra optimizations:
1. The files are created with names based on a hash of their contents. This means you can set
cache expiration dates far in the future without worrying that users will receive stale content.
2. Your Javascript can be asynchronously loaded.
The second point requires a bit of elaboration. Widgets not only contain raw Javascript, they also
contain a list of Javascript dependencies. For example, many sites will refer to the jQuery library
and then add some Javascript that uses it. Yesod is able to automatically turn all of that into an
asynchronous load via yepnope.js.
In other words, widgets allow you to create modular, composable code that will result in incredibly
efficient serving of your static resources.
Subsites
Many websites share common areas of functionality. Perhaps the two most common examples of
this are serving static files and authentication. In Yesod, you can easily drop in this code using a
subsite. All you need to do is add an extra line to your routes. For example, to add the static subsite,
you would write:
/static StaticR Static getStatic
The first argument tells where in the site the subsite starts. The static subsite is usually used
at /static, but you could use whatever you want. StaticR is the name of the route; this is also
entirely up to you, but convention is to use StaticR. Static is the name of the static subsite; this is
one you do not have control over. getStatic is a function that returns the settings for the static site,
such as where the static files are located.
Like all of your handlers, the subsite handlers also have access to the defaultLayout function.
This means that a well-designed subsite will automatically use your site skin without any extra
intervention on your part.
22.6 Lessons Learned
Yesod has been a very rewarding project to work on. It has given me an opportunity to work on a
large system with a diverse group of developers. One of the things that has truly shocked me is how
different the end product has become from what I had originally intended. I started off Yesod by
creating a list of goals. Very few of the main features we currently tout in Yesod are in that list, and a
good portion of that list is no longer something I plan to implement. The first lesson is:
You will have a better idea of the system you need after you start working on it. Do not tie yourself
down to your initial ideas.
As this was my first major piece of Haskell code, I learned a lot about the language during
Yesod's development. I'm sure others can relate to the feeling of "How did I ever write code like
this?" Even though that initial code was not of the same caliber as the code we have in Yesod at this
point, it was solid enough to kick-start the project. The second lesson is:
Don't be deterred by a supposed lack of mastery of the tools at hand. Write the best code you can,
and keep improving it.
One of the most difficult steps in Yesod's development was moving from a single-person team,
just me, to collaborating with others. It started off simply, with merging pull requests on GitHub,
and eventually moved to having a number of core maintainers. I had established some of my own
development patterns, which were nowhere explained or documented. As a result, contributors found
it difficult to pull my latest unreleased changes and play around with them. This hindered others both
when contributing and testing.
When Greg Weber came aboard as another lead on Yesod, he put in place a lot of the coding
standards that were sorely lacking. To compound the problems, there were some inherent difficulties
playing with the Haskell development toolchain, specifically in dealing with Yesod's large number
of packages. One of the goals of the entire Yesod team has since been to create standard scripts and
tools to automate building. Many of these tools are making their way back into the general Haskell
community. The final lesson is:
Consider early on how to make your project approachable for others.
[chapter23]
Yocto
Elizabeth Flanagan
The Yocto Project is an open source project that provides a common starting point for developers of
embedded Linux systems to create customized distributions for embedded products in a hardware-
agnostic setting. Sponsored by the Linux Foundation, Yocto is more than a build system. It provides
tools, processes, templates and methods so developers can rapidly create and deploy products for the
embedded market. One of the core components of Yocto is the Poky Build system. As Poky is a
large and complex system, we will be focusing on one of its core components, BitBake. BitBake is a
Gentoo-Portage-inspired build tool, used by both the Yocto Project and OpenEmbedded communities
to utilize metadata to create Linux images from source.
In 2001, Sharp Corporation introduced the SL-5000 PDA, named Zaurus, which ran an embedded
Linux distribution, Lineo. Not long after the Zaurus's introduction, Chris Larson founded the
OpenZaurus Project, a replacement Linux distribution for the SharpROM, based on a build system
called buildroot. With the founding of the project, people began contributing many more software
packages, as well as targets for other devices, and it wasn't long before the build system for OpenZaurus
began to show fragility. In January 2003, the community began discussing a new build system to
incorporate the community usage model of a generic build system for embedded Linux distributions.
This would eventually become OpenEmbedded. Chris Larson, Michael Lauer, and Holger Schurig
began work on OpenEmbedded by porting hundreds of OpenZaurus packages over to the new build
system.
The Yocto Project springs from this work. At the project's core is the Poky build system, created
by Richard Purdie. It began as a stabilized branch of OpenEmbedded using a core subset of the
thousands of OpenEmbedded recipes, across a limited set of architectures. Over time, it slowly
coalesced into more than just an embedded build system, but into a complete software development
platform, with an Eclipse plugin, a fakeroot replacement and QEMU-based images. Around November
2010, the Linux Foundation announced that this work would all continue under the heading of the
Yocto Project as a Linux Foundation-sponsored project. It was then established that Yocto and
OpenEmbedded would coordinate on a core set of package metadata called OE-Core, combining the
best of both Poky and OpenEmbedded with an increased use of layering for additional components.
23.1 Introduction to the Poky Build System
The Poky build system is the core of the Yocto Project. In Poky's default configuration, it can provide
a starting image footprint that ranges from a shell-accessible minimal image all the way up to a Linux
Standard Base-compliant image with a GNOME Mobile and Embedded (GMAE) based reference
user interface called Sato. From these base image types, metadata layers can be added to extend
functionality; layers can provide an additional software stack for an image type, add a board support
package (BSP) for additional hardware or even represent a new image type. Using the 1.1 release
of Poky, named edison, we will show how BitBake uses these recipes and configuration files to
generate an embedded image.
From a very high level, the build process starts out by setting up the shell environment for the
build run. This is done by sourcing a file, oe-init-build-env, that exists in the root of the Poky
source tree. This sets up the shell environment, creates an initial customizable set of configuration
files and wraps the BitBake runtime with a shell script that Poky uses to determine if the minimal
system requirements have been met.
For example, one of the things it will look for is the existence of Pseudo, a fakeroot replacement
contributed to the Yocto Project by Wind River Systems. At this point, bitbake core-image-minimal,
for example, should be able to create a fully functional cross-compilation environment and then
create a Linux image based on the image definition for core-image-minimal from source as defined
in the Yocto metadata layer.
Figure 23.1: High-level overview of Poky task execution
During the creation of our image, BitBake will parse its configuration, include any additional
layers, classes, tasks or recipes defined, and begin by creating a weighted dependency chain. This
process provides an ordered and weighted task priority map. BitBake then uses this map to determine
what packages must be built in which order so as to most efficiently fulfill compilation dependencies.
Tasks needed by the most other tasks are weighted higher, and thus run earlier during the build process.
The task execution queue for our build is created. BitBake also stores the parsed metadata summaries
and if, on subsequent runs, it determines that the metadata has changed, it can re-parse only what
BitBake recipe for grep
DESCRIPTION = "GNU grep utility"
HOMEPAGE = "http://savannah.gnu.org/projects/grep/"
BUGTRACKER = "http://savannah.gnu.org/bugs/?group=grep"
SECTION = "console/utils"
LICENSE = "GPLv3"
LIC_FILES_CHKSUM = "file://COPYING;md5=86d9c814277c1bfc4ca22af94b59ee"
PR = "r"
SRC_URI = "${GNU_MIRROR}/grep/grep-${PV}.tar.gz"
SRC_URI[md5sum] = "3e3451a38bd615cb113cbeaf252dc"
SRC_URI[sha256sum]="e9118eac72ecc71191725a7566361ab7643edfd3364869a47b78dc934a35797"
inherit autotools gettext
EXTRA_OECONF = "--disable-perl-regexp"
do_configure_prepend () {
    rm -f ${S}/m4/init.m4
}
do_install () {
autotools_do_install
install -d ${D}${base_bindir}
mv ${D}${bindir}/grep ${D}${base_bindir}/grep.${PN}
mv ${D}${bindir}/egrep ${D}${base_bindir}/egrep.${PN}
mv ${D}${bindir}/fgrep ${D}${base_bindir}/fgrep.${PN}
}
pkg_postinst_${PN}() {
update-alternatives --install ${base_bindir}/grep grep grep.${PN} 1
update-alternatives --install ${base_bindir}/egrep egrep egrep.${PN} 1
update-alternatives --install ${base_bindir}/fgrep fgrep fgrep.${PN} 1
}
pkg_prerm_${PN}() {
update-alternatives --remove grep grep.${PN}
update-alternatives --remove egrep egrep.${PN}
update-alternatives --remove fgrep fgrep.${PN}
}
has changed. The BitBake scheduler and parser are some of the more interesting architectural
designs of BitBake and some of the decisions surrounding them and their implementation by BitBake
contributors will be discussed later.
BitBake then runs through its weighted task queue, spawning threads (up to the number defined
by BB_NUMBER_THREADS in conf/local.conf) that begin executing those tasks in the predetermined
order. The tasks executed during a package's build may be modified, prepended to or appended to
through its recipe. The basic, default package task order of execution starts by fetching and unpacking
package source and then configuring and cross-compiling the unpacked source. The compiled source
is then split up into packages and various calculations are made on the compilation result, such as
the creation of debug package information. The split packages are then packaged into a supported
package format; RPM, ipk and deb are supported. BitBake will then use these packages to build the
root file system.
Poky Build System Concepts
One of the most powerful properties of the Poky build system is that every aspect of a build is
controlled by metadata. Metadata can be loosely grouped into configuration files or package recipes.
A recipe is a collection of non-executable metadata used by BitBake to set variables or define
additional build-time tasks. A recipe contains fields such as the recipe description, the recipe version,
the license of the package and the upstream source repository. It may also indicate that the build
process uses autotools, make, distutils or any other build process, in which case the basic
functionality can be defined by classes it inherits from the OE-Core layer's class definitions in
./meta/classes. Additional tasks can also be defined, as well as task prerequisites. BitBake also
supports _prepend and _append as a method of extending task functionality, injecting the code
indicated by the _prepend or _append suffix into the beginning or end of a task.
Configuration files can be broken down into two types. There are those that configure BitBake
and the overall build run, and those that configure the various layers Poky uses to create different
configurations of a target image. A layer is any grouping of metadata that provides some sort of
additional functionality. These can be BSPs for new devices, additional image types or additional
software outside of the core layers. In fact, the core Yocto metadata, meta-yocto, is itself a layer
applied on top of the OE-Core metadata layer, meta, which adds additional software and image types
to the OE-Core layer.
An example of how one would use layering is by creating a NAS device for the Intel n660
(Crownbay), using x32, the new 32-bit native ABI for x86-64, with a custom software layer that adds
a user interface.
Given the task at hand, we could split this functionality out into layers. At the lowest level we
would utilize a BSP layer for Crownbay that would enable Crownbay-specific hardware functionality,
such as video drivers. As we want x32, we would use the experimental meta-x32 layer. The NAS
functionality would be layered on top of this by adding the Yocto Project's example NAS layer,
meta-baryon. And lastly, we'll use an imaginary layer called meta-myproject to provide the
software and configuration to create a graphical user interface for configuration of the NAS.
During the setup of the BitBake environment, some initial configuration files are generated by
sourcing oe-init-build-env. These configuration files allow us quite a bit of control over how
and what Poky generates. The first of these configuration files is bblayers.conf. This file is what
we will use to add additional layers in order to build our example project.
Here's an example of a bblayers.conf file:
# LAYER_CONF_VERSION is increased each time build/conf/bblayers.conf
# changes incompatibly
LCONF_VERSION = "4"
BBFILES ?= ""
BBLAYERS = " \
/home/eflanagan/poky/meta \
/home/eflanagan/poky/meta-yocto \
/home/eflanagan/poky/meta-intel/crownbay \
/home/eflanagan/poky/meta-x32 \
/home/eflanagan/poky/meta-baryon \
/home/eflanagan/poky/meta-myproject \
"
The BitBake layers file, bblayers.conf, defines a variable BBLAYERS that BitBake uses to look for
BitBake layers. In order to fully understand this, we should also look at how our layers are actually
constructed. Using meta-baryon (available from git://git.yoctoproject.org/meta-baryon) as our
example layer, we want to examine the layer configuration file. This file, conf/layer.conf, is what
BitBake parses after its initial parsing of bblayers.conf. From here it adds additional recipes,
classes and configuration to the build.
Figure 23.2: Example of BitBake layering
Here's meta-baryon's layer.conf:
# Layer configuration for meta-baryon layer
# Copyright 2011 Intel Corporation
# We have a conf directory, prepend to BBPATH to prefer our versions
BBPATH := "${LAYERDIR}:${BBPATH}"
# We have recipes-* directories, add to BBFILES
BBFILES := "${BBFILES} ${LAYERDIR}/recipes-*/*/*.bb ${LAYERDIR}/recipes-*/*/*.bbappend"
BBFILE_COLLECTIONS += "meta-baryon"
BBFILE_PATTERN_meta-baryon := "^${LAYERDIR}/"
BBFILE_PRIORITY_meta-baryon = "7"
All of the BitBake configuration files help generate BitBake's datastore, which is used during
the creation of the task execution queue. During the beginning of a build, BitBake's BBCooker
class is started. The cooker manages the build task execution by baking the recipes. One of the
first things the cooker does is attempt to load and parse configuration data. Remember, though,
that BitBake is looking for two types of configuration data. In order to tell the build system where
it should find this configuration data (and in turn where to find recipe metadata), the cooker's
parseConfigurationFiles method is called. With few exceptions, the first configuration file that
the cooker looks for is bblayers.conf. After this file is parsed, BitBake then parses each layer's
layer.conf file.
Once layer configuration files are parsed, parseConfigurationFiles then parses bitbake.conf,
whose main purpose is to set up global build-time variables, such as directory structure naming
for various rootfs directories and the initial LDFLAGS to be used during compile time. Most end
users will never touch this file, as almost anything that needs changing here would be changed in a
recipe context rather than build-wide, or could be overridden in a configuration file such as local.conf.
As this file is parsed, BitBake also includes configuration files that are relative to each layer in
BBLAYERS and adds the variables found in those files to its datastore.
Here is a portion of bitbake.conf showing included configuration files:
include conf/site.conf
include conf/auto.conf
include conf/local.conf
include conf/build/${BUILD_SYS}.conf
include conf/target/${TARGET_SYS}.conf
include conf/machine/${MACHINE}.conf
23.2 BitBake Architecture
Before we delve into some of BitBake's current architectural design, it would help to understand
how BitBake once worked. In order to fully appreciate how far BitBake has come, we will consider
the initial version, BitBake 1.0. In that first release of BitBake, a build's dependency chain was
determined based on recipe dependencies. If something failed during the build of an image, BitBake
would move on to the next task and try to build it again later. What this means, obviously, is that
builds took a very long time. One of the things BitBake also did was keep each and every variable that
a recipe used in one very large dictionary. Given the number of recipes and the number of variables
and tasks needed to accomplish a build, BitBake 1.0 was a memory hog. At a time when memory
was expensive and systems had much less of it, builds could be painful affairs. It was not unheard of for
a system to run out of memory (writing to swap!) as it slogged through a long-running build. In
its first incarnation, while it did the job (sometimes), it did it slowly while consuming an enormous
amount of resources. Worse, as BitBake 1.0 had no concept of a data persistence cache or shared
state, it also had no ability to do incremental builds. If a build failed, one would have to restart it
from scratch.
A quick diff between the current BitBake version used in Poky edison, 1.13.3, and 1.0 shows
the implementation of BitBake's client-server architecture, the data persistence cache, its datastore, a
copy-on-write improvement for the datastore, shared state implementation and drastic improvements
in how it determines task and package dependency chains. This evolution has made it more reliable,
more efficient and more dynamic. Much of this functionality came out of necessity for quicker, more
reliable builds that used fewer resources. Three improvements to BitBake that we will examine are
the implementation of a client-server architecture, optimizations around BitBake's data storage and
work done on how BitBake determines its build and task dependency chain.
BitBake IPC
Since we now know a good deal about how the Poky build system uses configurations, recipes and
layers to create embedded images, we're prepared to begin to look under the hood of BitBake and
examine how this is all combined. Starting with the core BitBake executable, bitbake/bin/bitbake,
we can begin to see the process BitBake follows as it begins to set up the infrastructure needed to
begin a build. The first item of interest is BitBake's Interprocess Communication (IPC). Initially,
BitBake had no concept of a client-server split. This functionality was factored into the BitBake design
over a period of time in order to allow BitBake to run multiple processes during a build, as it was
initially single-threaded, and to allow different user experiences.
Figure 23.3: Overview of BitBake IPC
All Poky builds are begun by starting a user interface instance. The user interface provides a
mechanism for logging of build output, build status and build progress, as well as for receiving
events from build tasks through the BitBake event module. The default user interface used is knotty,
BitBake's command line interface. Called knotty, or "(no) tty", since it handles both ttys and non-ttys,
it is one of a few interfaces that are supported. One of these additional user interfaces is Hob. Hob
is the graphical interface to BitBake, a kind of BitBake commander. In addition to the typical
functions you would see in the knotty user interface, Hob (written by Joshua Lock) brings the ability
to modify configuration files, add additional layers and packages, and fully customize a build.
BitBake user interfaces have the ability to send commands to the next module brought up by
the BitBake executable, the BitBake server. Like the user interface, BitBake also supports multiple
different server types, such as XMLRPC. The default server that most users use when executing
BitBake from the knotty user interface is BitBake's process server. After bringing up the server, the
BitBake executable brings up the cooker.
The cooker is a core portion of BitBake and is where most of the particularly interesting things
that occur during a Poky build are called from. The cooker is what manages the parsing of metadata,
initiates the generation of the dependency and task trees, and manages the build. One of the functions
of BitBake's server architecture is allowing multiple ways of exposing the command API, indirectly,
to the user interface. The command module is the worker of BitBake, running build commands and
triggering events that get passed up to the user interface through BitBake's event handler. Once the
cooker is brought up from the BitBake executable, it initializes the BitBake datastore and then begins
to parse all of Poky's configuration files. It then creates the runqueue object, and triggers the build.
BitBake DataSmart Copy-on-Write Data Storage
In BitBake 1.0, all BitBake variables were parsed and stored in one very large dictionary during
the initialization of that version's data class. As previously mentioned, this was problematic in
that very large Python dictionaries are slow on writes and member access, and if the build host
runs out of physical memory during the build, we end up using swap. While this is less likely in
most systems in late 2011, when OpenEmbedded and BitBake were first starting up, the average
computer's specification usually included less than one or two gigabytes of memory.
This was one of the major pain points in early BitBake. Two major issues needed to be worked
out in order to help increase performance: one was precomputation of the build dependency chain;
the other was to reduce the size of data being stored in memory. Much of the data being stored
for a recipe doesn't change from recipe to recipe; for example, with TMPDIR, BB_NUMBER_THREADS
and other global BitBake variables, having a copy of the entire data environment per recipe stored
in memory was inefficient. The solution was Tom Ansell's copy-on-write dictionary that "abuses
classes to be nice and fast". BitBake's COW module is both an especially fearless and clever hack.
Running python bitbake/lib/bb/COW.py and examining the module will give you an idea of how
this copy-on-write implementation works and how BitBake uses it to store data efficiently.
The DataSmart module, which uses the COW dictionary, stores the data from the initial Poky
configuration, data from .conf files and .bbclass files, in a dict as a data object. Each of these
objects can contain another data object of just the diff of the data. So, if a recipe changes something
from the initial data configuration, instead of copying the entire configuration in order to localize it,
a diff of the parent data object is stored at the next layer down in the COW stack. When an attempt is
made to access a variable, the data module will use DataSmart to look into the top level of the stack.
If the variable is not found it will defer to a lower level of the stack until it does find the variable or
throws an error.
One of the other interesting things about the DataSmart module centers around variable expansion.
As BitBake variables can contain executable Python code, one of the things that needs to be done
is run the variable through BitBake's bb.codeparser to ensure that it's valid Python and that it
contains no circular references. An example of a variable containing Python code is this example
taken from ./meta/conf/distro/include/tclibc-eglibc.inc:
LIBCEXTENSION = "${@['', '-gnu'][(d.getVar('ABIEXTENSION', True) or '') != '']}"
This variable is included from one of the OE-Core configuration files,
./meta/conf/distro/include/defaultsetup.conf, and is used to provide a set of default options
across different distro configurations that one would want to lay on top of Poky or OpenEmbedded.
This file imports some eglibc-specific variables that are set depending on the value of another
BitBake variable, ABIEXTENSION. During the creation of the datastore, the Python code within this
variable needs to be parsed and validated to ensure tasks that use this variable will not fail.
BitBake Scheduler
Once BitBake has parsed the configuration and created its datastore, it needs to parse the recipes
required for the image and produce a build chain. This is one of the more substantial improvements
to BitBake. Originally, BitBake took its build priorities from a recipe. If a recipe had a DEPENDS, it
would try to figure out what to build in order to satisfy that dependency. If a task failed because it
lacked a prerequisite needed for its buildout, it was simply put to the side and attempted later. This
had obvious drawbacks, both in efficiency and reliability.
As no precomputed dependency chain was established, task execution order was figured out
during the build run. This limited BitBake to being single-threaded. To give an idea of how painful
single-threaded BitBake builds can be, the smallest image core-image-minimal on a standard
developer machine in 2011 (Intel Core i7, 16 gigabytes of DDR3 memory) takes about three or four
hours to build a complete cross-compilation toolchain and use it to produce packages that are then
used to create an image. For reference, a build on the same machine with BB_NUMBER_THREADS at
14 and PARALLEL_MAKE set to -j 12 takes about 30 to 40 minutes. As one could imagine, running
single-threaded with no precomputed order of task execution on slower hardware that had less
memory with a large portion wasted by duplicate copies of the entire datastore took much longer.
Dependencies
When we talk of build dependencies, we need to make a distinction between the various types. A
build dependency, or DEPENDS, is something we require as a prerequisite so that Poky can build the
required package, whereas a runtime dependency, RDEPENDS, requires that the image the package is
to be installed on also contains the package listed as an RDEPENDS. Take, for example, the package
task-core-boot. If we look at the recipe for it in
meta/recipes-core/tasks/task-core-boot.bb
we will see two BitBake variables set: RDEPENDS and DEPENDS. BitBake uses these two fields
during the creation of its dependency chain.
Here is a portion of task-core-boot.bb showing DEPENDS and RDEPENDS:
DEPENDS = "virtual/kernel"
...
RDEPENDS_task-core-boot = "\
base-files \
base-passwd \
busybox \
initscripts \
...
Packages aren't the only thing in BitBake with dependencies. Tasks also have their own dependencies.
Within the scope of BitBake's runqueue, we recognize four types: internally dependent,
DEPENDS dependent, RDEPENDS dependent and inter-task dependent.
Internally dependent tasks are set within a recipe and add a task before and/or after another
task. For example, in a recipe, we could add a task called do_deploy by adding the line addtask
deploy before do_build after do_compile. This would add a dependency for running the
do_deploy task prior to do_build being started, but after do_compile is completed. DEPENDS and
RDEPENDS dependent tasks are tasks that run after a denoted task. For example, if we wanted to run
do_deploy of a package after the do_install of its DEPENDS or RDEPENDS, our recipe would include
do_deploy[deptask] = "do_install" or do_deploy[rdeptask] = "do_install". For inter-task
dependencies, if we wanted a task to be dependent on a different package's task we would add, using
the above example of do_deploy, do_deploy[depends] = "<target's name>:do_install".
RunQueue
As an image build can have hundreds of recipes, each with multiple packages and tasks, each with its
own dependencies, BitBake is now tasked with trying to sort this out into something it can use as an
order of execution. After the cooker has gotten the entire list of packages needed to be built from the
initialization of the bb.data object, it will begin to create a weighted task map from this data in order
to produce an ordered list of tasks it needs to run, called the runqueue. Once the runqueue is created,
BitBake can begin executing it in order of priority, tasking out each portion to a different thread.
Within the provider module, BitBake will first look to see if there is a PREFERRED_PROVIDER
for a given package or image. As more than one recipe can provide a given package and as tasks are
defined in recipes, BitBake needs to decide which provider of a package it will use. It will sort all
the providers of the package, weighting each provider by various criteria. For example, preferred
versions of software will get a higher priority than others. However, BitBake also takes into account
package version as well as the dependencies of other packages. Once it has selected the recipe from
which it will derive its package, BitBake will iterate over the DEPENDS and RDEPENDS of that
recipe and proceed to compute the providers for those packages. This chain reaction will produce a
list of packages needed for image generation as well as providers for those packages.
Runqueue now has a full list of all packages that need to be built and a dependency chain. In
order to begin execution of the build, the runqueue module now needs to create the TaskData object
so it can begin to sort out a weighted task map. It begins by taking each buildable package it has
found, splitting the tasks needed to generate that package and weighing each of those tasks based
on the number of packages that require it. Tasks with a higher weight have more dependents, and
therefore are generally run earlier in the build. Once this is complete, the runqueue module then
prepares to convert the TaskData object into a runqueue.
The creation of the runqueue is somewhat complex. BitBake first iterates through the list of task
names within the TaskData object in order to determine task dependencies. As it iterates through
TaskData, it begins to build a weighted task map. When it is complete, if it has found no circular
dependencies, unbuildable tasks or any such problems, it will then order the task map by weight and
return a complete runqueue object to the cooker. The cooker will begin to attempt to execute the
runqueue, task by task. Depending upon image size and computing resources, Poky may take from
half an hour to many hours to generate a cross-compilation toolchain, a package feed and the embedded
Linux image specified. It is worth noting that from the time bitbake <image_name> is executed
on the command line, the entire process up to the point just before the task execution queue starts
running takes only a few seconds.
23.3 Conclusion
In my discussions with community members and my own personal observations, I've identified a few
areas where things should, perhaps, have been done differently, as well as a few valuable lessons. It
is important to note that armchair quarterbacking a decade-long development effort is not meant
as a criticism of those who've poured their time and effort into a wholly remarkable collection of
software. As developers, the most difficult part of our job is predicting what we will need years down
the road and how we can set up a framework to enable that work now. Few can achieve that without
some road bumps.
The first lesson is to be sure to develop a written, agreed-upon standards document that is well
understood by the community. It should be designed for maximum flexibility and growth.
One place where I've personally run into this issue is with my work in OE-Core's license manifest
creation class, especially with my experiences working with the LICENSE variable. As no clearly
documented standard existed for what LICENSE should contain, a review of the many recipes
available showed many variations. The various LICENSE strings contained everything from Python
abstract-syntax-tree-parsable values to values that one would have little hope of gaining meaningful
data from. There was a convention that was commonly used within the community; however, the
convention had many variations, some less correct than others. This wasn't the problem of the
developer who wrote the recipe; it was a community failure to define a standard.
As little prior work was actually done with the LICENSE variable outside of checking for its
existence, there was no particular concern about a standard for that variable. Much trouble could
have been avoided had a project-wide agreed-upon standard been developed early on.
The next lesson is a bit more general and speaks to an issue seen not only within the Yocto Project
but in other large-scale projects that are systems-design specific. It is one of the most important
things developers can do to limit the amount of effort duplication, refactoring and churn their project
encounters: spend time, lots of time, on front-end planning and architectural design.
If you think you've spent enough time on architectural design, you probably haven't. If you think
you haven't spent enough time on architectural design, you definitely haven't. Spending more time on
front-end planning won't stop you from later having to rip apart code or even do major architectural
changes, but it will certainly reduce the amount of duplicated effort in the long run.
Designing your software to be as modular as possible, knowing that you will end up revisiting
areas for anything from minor tweaks to major rewrites, will make it so that when you do run into
these issues, code rewrites are less hair-raising.
One obvious place where this would have helped in the Yocto Project is identifying the needs of
end users with low-memory systems. Had more thought been put into BitBake's datastore earlier,
perhaps we could have predicted the problems associated with the datastore taking up too much
memory and dealt with them earlier.
The lesson here is that while it is nearly impossible to identify every pain point your project will
run into during its lifetime, taking the time to do serious front-end planning will help reduce the
effort needed later. BitBake, OE-Core and Yocto are all fortunate in this regard as there was a fair
amount of architectural planning done early. This enabled us to make major changes to the
architecture without too much pain and suffering.
23.4 Acknowledgements
First, thank you to Chris Larson, Michael Lauer, and Holger Schurig and the many, many people
who have contributed to BitBake, OpenEmbedded, OE-Core and Yocto over the years. Thank you
also goes to Richard Purdie for letting me pick his brain, both on historical and technical aspects
of OE, and for his constant encouragement and guidance, especially with some of the dark magic of
BitBake.
[chapter24]
ZeroMQ
Martin Sústrik
ØMQ is a messaging system, or message-oriented middleware, if you will. It's used in environments
as diverse as financial services, game development, embedded systems, academic research and
aerospace.
Messaging systems work basically as instant messaging for applications. An application decides
to communicate an event to another application (or multiple applications), it assembles the data to
be sent, hits the send button and there we go: the messaging system takes care of the rest.
Unlike instant messaging, though, messaging systems have no GUI and assume no human beings
at the endpoints capable of intelligent intervention when something goes wrong. Messaging systems
thus have to be both fault-tolerant and much faster than common instant messaging.
ØMQ was originally conceived as an ultra-fast messaging system for stock trading and so the
focus was on extreme optimization. The first year of the project was spent devising benchmarking
methodology and trying to define an architecture that was as efficient as possible.
Later on, approximately in the second year of development, the focus shifted to providing a
generic system for building distributed applications and supporting arbitrary messaging patterns,
various transport mechanisms, arbitrary language bindings, etc.
During the third year the focus was mainly on improving usability and flattening the learning
curve. We've adopted the BSD Sockets API, tried to clean up the semantics of individual messaging
patterns, and so on.
Hopefully, this chapter will give an insight into how the three goals above translated into the
internal architecture of MQ, and provide some tips for those who are struggling with the same
problems.
Since its third year ØMQ has outgrown its codebase; there is an initiative to standardise the wire protocols it uses, there is an experimental implementation of a ØMQ-like messaging system inside the Linux kernel, and so on. These topics are not covered in this book; however, you can check online resources[1, 2, 3] for further details.

[1] http://www.25bpm.com/concepts
[2] http://groups.google.com/group/sp-discuss-group
[3] http://www.25bpm.com/hits

24.1 Application vs. Library
ØMQ is a library, not a messaging server. It took us several years of working on the AMQP protocol (a financial industry attempt to standardise the wire protocol for business messaging), writing a reference implementation of it and participating in several large-scale projects heavily based on messaging technology, to realise that there's something wrong with the classic client/server model of a smart messaging server (broker) and dumb messaging clients.
Our primary concern at the time was performance: if there's a server in the middle, each message has to cross the network twice (from the sender to the broker and from the broker to the receiver), incurring a penalty in terms of both latency and throughput. Moreover, if all the messages are passed through the broker, at some point it is bound to become the bottleneck.
A secondary concern was related to large-scale deployments: when a deployment crosses organisational boundaries, the concept of a central authority managing the whole message flow no longer applies. No company is willing to cede control to a server in a different company; there are trade secrets and there's legal liability. The result in practice is that there's one messaging server per company, with hand-written bridges to connect it to the messaging systems of other companies. The whole ecosystem is thus heavily fragmented, and maintaining a large number of bridges for every company involved doesn't make the situation better. To solve this problem, we need a fully distributed architecture, an architecture where every component can be governed by a different business entity. Given that the unit of management in a server-based architecture is the server, we can solve the problem by installing a separate server for each component. In such a case we can further optimize the design by making the server and the component share the same process. What we end up with is a messaging library.
ØMQ was started when we got an idea about how to make messaging work without a central server. It required turning the whole concept of messaging upside down and replacing the model of an autonomous centralised store of messages at the center of the network with a "smart endpoint, dumb network" architecture based on the end-to-end principle. The technical consequence of that decision was that ØMQ, from the very beginning, was a library, not an application.
In the meantime we've been able to prove that this architecture is both more efficient (lower latency, higher throughput) and more flexible (it's easy to build arbitrarily complex topologies instead of being tied to the classic hub-and-spoke model).
Figure 24.1: ØMQ being used by different libraries
One of the unintended consequences, however, was that opting for the library model improved the usability of the product. Over and over again, users express their happiness about the fact that they don't have to install and manage a stand-alone messaging server. It turns out that not having a server is the preferred option, as it cuts operational cost (no need for a messaging server admin) and improves time-to-market (no need to negotiate the running of the server with the client, the management or the operations team).
The lesson learned is that when starting a new project, you should opt for the library design if at all possible. It's pretty easy to create an application from a library by invoking it from a trivial program; however, it's almost impossible to create a library from an existing executable. A library offers much more flexibility to its users while at the same time sparing them non-trivial administrative effort.
24.2 Global State
Global variables don't play well with libraries. A library may be loaded several times in a process, but even then there's only a single set of global variables. Figure 24.1 shows the ØMQ library being used by two different and independent libraries; the application then uses both of those libraries. When such a situation occurs, both instances of ØMQ access the same variables, resulting in race conditions, strange failures and undefined behaviour.
To prevent this problem, the ØMQ library has no global variables. Instead, a user of the library is responsible for creating the global state explicitly. The object containing the global state is called a context. While from the user's perspective a context looks more or less like a pool of worker threads, from ØMQ's perspective it's just an object to store any global state that we happen to need. In the picture above, libA would have its own context and libB would have its own as well. There would be no way for one of them to break or subvert the other.
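For illustration, here is a minimal sketch of explicit context creation using the public libzmq C API (zmq_ctx_new and zmq_ctx_term; older releases exposed the same idea as zmq_init/zmq_term). It is only a sketch of the "no hidden global state" rule, not code from ØMQ itself.

#include <zmq.h>
#include <cassert>

int main()
{
    // Each independent user of the library creates its own context.
    // All "global" state lives inside this object, so two contexts in
    // one process cannot interfere with each other.
    void *ctx_a = zmq_ctx_new();
    void *ctx_b = zmq_ctx_new();
    assert(ctx_a && ctx_b);

    // Sockets are always created from a particular context.
    void *sock = zmq_socket(ctx_a, ZMQ_REQ);
    assert(sock);

    zmq_close(sock);
    zmq_ctx_term(ctx_a);
    zmq_ctx_term(ctx_b);
    return 0;
}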
The lesson here is pretty obvious: don't use global state in libraries. If you do, the library is likely to break when it happens to be instantiated twice in the same process.
24.3 Performance
When ØMQ was started, its primary goal was to optimize performance. Performance of messaging systems is expressed using two metrics: throughput, how many messages can be passed during a given amount of time; and latency, how long it takes for a message to get from one endpoint to the other.
Which metric should we focus on? What's the relationship between the two? Isn't it obvious? Run the test, divide the overall time of the test by the number of messages passed, and what you get is latency. Divide the number of messages by the time, and what you get is throughput. In other words, latency is the inverse value of throughput. Trivial, right?
Instead of starting to code straight away, we spent some weeks investigating the performance metrics in detail, and we found out that the relationship between throughput and latency is much more subtle than that, and that the metrics are often quite counter-intuitive.
Imagine A sending messages to B. (See Figure 24.2.) The overall time of the test is 6 seconds. There are 5 messages passed. Therefore the throughput is 0.83 msgs/sec (5/6) and the latency is 1.2 sec (6/5), right?
Have a look at the diagram again. It takes a different time for each message to get from A to B: 2 sec, 2.5 sec, 3 sec, 3.5 sec, 4 sec. The average is 3 seconds, which is pretty far from our original calculation of 1.2 seconds. This example shows the misconceptions people are intuitively inclined to make about performance metrics.
Now have a look at the throughput. The overall time of the test is 6 seconds. However, at A it takes just 2 seconds to send all the messages. From A's perspective the throughput is 2.5 msgs/sec (5/2). At B it takes 4 seconds to receive all the messages, so from B's perspective the throughput is 1.25 msgs/sec (5/4). Neither of these numbers matches our original calculation of 0.83 msgs/sec.
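To make the arithmetic concrete, here is a small illustrative program (the timestamps are simply a set consistent with the example above, not measured data) that computes the naive figures, the per-endpoint throughputs and the average latency, showing that none of them is the inverse of another.

#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical timestamps (seconds) matching the example: A sends five
    // messages over 2 seconds; they take 2, 2.5, 3, 3.5 and 4 seconds to
    // arrive, so B receives them between t=2 and t=6.
    std::vector<double> sent  = {0.0, 0.5, 1.0, 1.5, 2.0};
    std::vector<double> recvd = {2.0, 3.0, 4.0, 5.0, 6.0};
    const double n = sent.size();

    double test_time = recvd.back() - sent.front();                   // 6 s
    double naive_throughput = n / test_time;                          // 0.83 msg/s
    double naive_latency = test_time / n;                             // 1.2 s

    double sender_throughput = n / (sent.back() - sent.front());      // 2.5 msg/s
    double receiver_throughput = n / (recvd.back() - recvd.front());  // 1.25 msg/s

    double latency_sum = 0.0;
    for (int i = 0; i < 5; ++i)
        latency_sum += recvd[i] - sent[i];
    double avg_latency = latency_sum / n;                             // 3 s

    std::printf("naive: %.2f msg/s, %.2f s\n", naive_throughput, naive_latency);
    std::printf("sender: %.2f msg/s, receiver: %.2f msg/s\n",
                sender_throughput, receiver_throughput);
    std::printf("average latency: %.2f s\n", avg_latency);
    return 0;
}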
To make a long story short, latency and throughput are two different metrics; that much is obvious. The important thing is to understand the difference between the two and their mutual relationship.
Figure 24.2: Sending messages from A to B
Latency can be measured only between two different points in the system; there's no such thing as latency at point A. Each message has its own latency. You can average the latencies of multiple messages; however, there's no such thing as the latency of a stream of messages.
Throughput, on the other hand, can be measured only at a single point of the system. There's a throughput at the sender, there's a throughput at the receiver, there's a throughput at any intermediate point between the two, but there's no such thing as the overall throughput of the whole system. And throughput makes sense only for a set of messages; there's no such thing as the throughput of a single message.
As for the relationship between throughput and latency, it turns out there really is a relationship; however, the formula involves integrals and we won't discuss it here. For more information, read the literature on queueing theory.
There are many more pitfalls in benchmarking messaging systems that we won't go into here. The stress should rather be placed on the lesson learned: make sure you understand the problem you are solving. Even a problem as simple as "make it fast" can take a lot of work to understand properly. What's more, if you don't understand the problem, you are likely to build implicit assumptions and popular myths into your code, making the solution either flawed or at least much more complex or much less useful than it could possibly be.
24.4 Critical Path
We discovered during the optimization process that three factors have a crucial impact on performance:
• the number of memory allocations
• the number of system calls
• the concurrency model
However, not every memory allocation or every system call has the same effect on performance. The performance metric we are interested in for messaging systems is the number of messages we can transfer between two endpoints during a given amount of time. Alternatively, we may be interested in how long it takes for a message to get from one endpoint to another.
However, given that ØMQ is designed for scenarios with long-lived connections, the time it takes to establish a connection or the time needed to handle a connection error is basically irrelevant. These events happen very rarely, so their impact on overall performance is negligible.
The part of a codebase that gets used very frequently, over and over again, is called the critical path; optimization should focus on the critical path.
Let's have a look at an example: ØMQ is not extremely optimized with respect to memory allocations. For example, when manipulating strings, it often allocates a new string for each intermediate phase of the transformation. However, if we look strictly at the critical path, the actual message passing, we'll find that it uses almost no memory allocations. If messages are small, there is just one memory allocation per 256 messages (these messages are held in a single large allocated memory chunk). If, in addition, the stream of messages is steady, without huge traffic peaks, the number of memory allocations on the critical path drops to zero (the allocated memory chunks are not returned to the system, but reused over and over again).
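The paragraph above describes small messages being carved out of large preallocated chunks that are recycled rather than freed. The following is a hypothetical sketch of that idea; the names (msg_slot_t, chunk_pool_t) and the slot size are invented for illustration, and cleanup is omitted for brevity. Only the chunk size of 256 comes from the text.

#include <cstddef>
#include <vector>

struct msg_slot_t { char data[64]; };            // fixed-size small message slot

// One underlying allocation per 256 messages; exhausted chunks are kept on a
// spare list and reused instead of being returned to the system.
class chunk_pool_t
{
public:
    msg_slot_t *allocate()
    {
        if (current_ == nullptr || used_ == kChunkSize) {
            if (!spare_.empty()) {               // reuse a retired chunk
                current_ = spare_.back();
                spare_.pop_back();
            } else {
                current_ = new msg_slot_t[kChunkSize];   // one malloc per 256 msgs
            }
            used_ = 0;
        }
        return &current_[used_++];
    }

    // Called once a whole chunk's worth of messages has been consumed.
    void retire(msg_slot_t *chunk) { spare_.push_back(chunk); }

private:
    static const std::size_t kChunkSize = 256;
    msg_slot_t *current_ = nullptr;
    std::size_t used_ = 0;
    std::vector<msg_slot_t *> spare_;
};

int main()
{
    chunk_pool_t pool;
    for (int i = 0; i < 600; ++i)
        pool.allocate();     // only three underlying allocations for 600 slots
    return 0;
}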
Lesson learned: optimize where it makes a difference. Optimizing pieces of code that are not on the critical path is wasted effort.
24.5 Allocating Memory
Assuming that all the infrastructure has been initialised and a connection between two endpoints has been established, there's only one thing to allocate when sending a message: the message itself. Thus, to optimize the critical path we had to look into how messages are allocated and passed up and down the stack.
It is common knowledge in the high-performance networking field that the best performance is achieved by carefully balancing the cost of message allocation against the cost of message copying.[4] For small messages, copying is much cheaper than allocating memory. It makes sense to allocate no new memory chunks at all and instead to copy the message to preallocated memory whenever needed. For large messages, on the other hand, copying is much more expensive than memory allocation. It makes sense to allocate the message once and pass a pointer to the allocated block instead of copying the data. This approach is called "zero-copy".

[4] For example, http://hal.inria.fr/docs//29/28/31/PDF/Open-MX-IOAT.pdf. See the different handling of small, medium and large messages.
Figure 24.3: Message copying (or not)
ØMQ handles both cases in a transparent manner. A ØMQ message is represented by an opaque handle. The content of very small messages is encoded directly in the handle, so making a copy of the handle actually copies the message data. When the message is larger, it is allocated in a separate buffer and the handle contains just a pointer to the buffer. Making a copy of the handle does not result in copying the message data, which makes sense when the message is megabytes long (Figure 24.3). It should be noted that in the latter case the buffer is reference-counted so that it can be referenced by multiple handles without the need to copy the data.
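A hypothetical sketch of this dual representation follows. The real zmq_msg_t is more elaborate; the field names, the inline-size threshold and the simplifications (no error checking, copy_from assumes an uninitialised target) are purely illustrative.

#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Illustrative message handle: small payloads live inline in the handle,
// large payloads live in a separate, reference-counted buffer.
class msg_t
{
public:
    static const std::size_t kInlineMax = 32;    // illustrative threshold

    void init(const void *data, std::size_t size)
    {
        size_ = size;
        if (size <= kInlineMax) {
            std::memcpy(inline_, data, size);    // copy into the handle itself
            shared_ = nullptr;
        } else {
            shared_ = new shared_t;              // one allocation, then refcounted
            shared_->data = std::malloc(size);
            std::memcpy(shared_->data, data, size);
            shared_->refs.store(1);
        }
    }

    // Copying the handle copies at most kInlineMax bytes; large payloads are
    // shared by bumping the reference count instead of being duplicated.
    void copy_from(const msg_t &other)
    {
        *this = other;
        if (shared_)
            shared_->refs.fetch_add(1);
    }

    void close()
    {
        if (shared_ && shared_->refs.fetch_sub(1) == 1) {
            std::free(shared_->data);
            delete shared_;
        }
        shared_ = nullptr;
    }

private:
    struct shared_t { void *data; std::atomic<int> refs; };
    char inline_[kInlineMax];
    std::size_t size_ = 0;
    shared_t *shared_ = nullptr;
};

int main()
{
    msg_t a, b;
    a.init("hi", 2);     // stored inline: copying the handle copies 2 bytes
    b.copy_from(a);
    a.close();
    b.close();
    return 0;
}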
Lesson learned: when thinking about performance, don't assume there's a single best solution. It may happen that there are several subclasses of the problem (e.g., small messages vs. large messages), each having its own optimal algorithm.
24.6 Batching
It has already been mentioned that the sheer number of system calls in a messaging system can result in a performance bottleneck. Actually, the problem is much more generic than that. There's a non-trivial performance penalty associated with traversing the whole call stack, and thus, when creating high-performance applications, it's wise to avoid as much stack traversal as possible.
Figure 24.4: Sending four messages
Consider Figure 24.4. To send four messages, you have to traverse the entire network stack four times (i.e., ØMQ, glibc, the user/kernel space boundary, the TCP implementation, the IP implementation, the Ethernet layer, the NIC itself, and back up the stack again).
However, if you decide to join those messages into a single batch, there is only one traversal of the stack (Figure 24.5). The impact on message throughput can be overwhelming: up to two orders of magnitude, especially if the messages are small and hundreds of them can be packed into a single batch.
Figure 24.5: Batching messages
On the other hand, batching can have a negative impact on latency. Take, for example, the well-known Nagle's algorithm, as implemented in TCP. It delays outbound messages for a certain amount of time and merges all the accumulated data into a single packet. Obviously, the end-to-end latency of the first message in the packet is much worse than the latency of the last one. Thus, it's common for applications that need consistently low latency to switch Nagle's algorithm off. It's even common to switch off batching on all levels of the stack (e.g., the NIC's interrupt coalescing feature).
But no batching means extensive traversing of the stack and results in low message throughput. We seem to be caught in a throughput-versus-latency dilemma.
ØMQ tries to deliver consistently low latencies combined with high throughput using the following strategy: when the message flow is sparse and doesn't exceed the network stack's bandwidth, ØMQ turns all batching off to improve latency. The trade-off here is somewhat higher CPU usage; we still have to traverse the stack frequently. However, that isn't considered a problem in most cases.
When the message rate exceeds the bandwidth of the network stack, the messages have to be queued, stored in memory until the stack is ready to accept them. Queueing means the latency is going to grow. If a message spends one second in the queue, the end-to-end latency will be at least one second. What's even worse, as the size of the queue grows, latencies increase gradually. If the size of the queue is not bounded, the latency can exceed any limit.
It has been observed that even when the network stack is tuned for the lowest possible latency (Nagle's algorithm switched off, NIC interrupt coalescing turned off, etc.), latencies can still be dismal because of the queueing effect described above.
In such situations it makes sense to start batching aggressively. There's nothing to lose, as the latencies are already high anyway. On the other hand, aggressive batching improves throughput and can empty the queue of pending messages, which in turn means the latency will gradually drop as the queueing delay decreases. Once there are no outstanding messages in the queue, batching can be turned off to improve latency even further.
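A hedged sketch of that adaptive policy follows. This is not ØMQ's actual code; it merely illustrates the decision rule described above (pass messages straight through while the stack keeps up, batch only once a backlog builds up). write_to_stack() stands in for one traversal of the network stack.

#include <cstdio>
#include <deque>
#include <string>
#include <vector>

class adaptive_sender_t
{
public:
    void send(const std::string &msg)
    {
        if (queue_.empty() && stack_ready_) {
            write_to_stack({msg});            // sparse traffic: no batching
            stack_ready_ = false;             // the stack is now busy with this write
        } else {
            queue_.push_back(msg);            // stack busy: start queueing
        }
    }

    // Invoked when the network stack signals it can accept more data.
    void on_stack_ready()
    {
        stack_ready_ = true;
        if (queue_.empty())
            return;
        // Backlog present: flush everything accumulated so far in a single
        // traversal of the stack.
        std::vector<std::string> batch(queue_.begin(), queue_.end());
        queue_.clear();
        write_to_stack(batch);
        stack_ready_ = false;
    }

private:
    void write_to_stack(const std::vector<std::string> &batch)
    {
        std::printf("one stack traversal, %zu message(s)\n", batch.size());
    }

    std::deque<std::string> queue_;
    bool stack_ready_ = true;
};

int main()
{
    adaptive_sender_t s;
    s.send("a");               // sent immediately
    s.send("b");               // stack busy: queued
    s.send("c");               // queued
    s.on_stack_ready();        // "b" and "c" go out as one batch
    return 0;
}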
One additional observation is that batching should only be done at the topmost level. If the messages are batched there, the lower layers have nothing to batch anyway, so all the batching algorithms underneath do nothing except introduce additional latency.
Lesson learned: to get optimal throughput combined with optimal response time in an asynchronous system, turn off all the batching algorithms on the low layers of the stack and batch at the topmost level. Batch only when new data are arriving faster than they can be processed.
24.7 Architecture Overview
Up to this point we have focused on generic principles that make ØMQ fast. From now on we'll have a look at the actual architecture of the system (Figure 24.6).
The user interacts with ØMQ using so-called "sockets". They are pretty similar to TCP sockets, the main difference being that each socket can handle communication with multiple peers, a bit like unbound UDP sockets do.
The socket object lives in the user's thread (see the discussion of threading models in the next section). Aside from that, ØMQ runs multiple worker threads that handle the asynchronous parts of the communication: reading data from the network, enqueueing messages, accepting incoming connections, etc.
There are various objects living in the worker threads. Each of these objects is owned by exactly one parent object (ownership is denoted by a simple solid line in the diagram). The parent can live in a different thread than the child. Most objects are owned directly by sockets; however, there are a couple of cases where an object is owned by an object which is in turn owned by a socket. What we get is a tree of objects, with one such tree per socket. The tree is used during shutdown; no object can shut itself down until it has closed all its children. This way we can ensure that the shutdown process works as expected; for example, that pending outbound messages are pushed to the network prior to terminating the sending process.

Figure 24.6: ØMQ architecture
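The ownership tree drives shutdown roughly as sketched below. This is only an illustration of the rule "children terminate before their parent"; the names are invented, and in the real library termination requests and acknowledgements travel asynchronously as commands between threads.

#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Illustrative ownership tree: every object has one parent, and an object
// may finish shutting down only after all of its children have done so.
class own_t
{
public:
    explicit own_t(std::string name) : name_(std::move(name)) {}

    own_t *add_child(std::string name)
    {
        children_.push_back(std::make_unique<own_t>(std::move(name)));
        return children_.back().get();
    }

    void terminate()
    {
        for (auto &child : children_)
            child->terminate();               // children go first
        children_.clear();
        // e.g. flush pending outbound messages here before going away
        std::printf("%s terminated\n", name_.c_str());
    }

private:
    std::string name_;
    std::vector<std::unique_ptr<own_t>> children_;
};

int main()
{
    own_t socket("socket");
    own_t *session = socket.add_child("session");
    session->add_child("engine");
    socket.terminate();   // prints engine, session, socket, in that order
    return 0;
}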
Roughly speaking, there are two kinds of asynchronous objects: objects that are not involved in message passing and objects that are. The former have to do mainly with connection management. For example, a TCP listener object listens for incoming TCP connections and creates an engine/session object for each new connection. Similarly, a TCP connector object tries to connect to a TCP peer and, when it succeeds, creates an engine/session object to manage the connection. When such a connection fails, the connector object tries to re-establish it.
The latter are the objects that handle the data transfer itself. These objects are composed of two parts: the session object is responsible for interacting with the ØMQ socket, and the engine object is responsible for communication with the network. There's only one kind of session object, but there's a different engine type for each underlying protocol ØMQ supports. Thus, we have TCP engines, IPC (inter-process communication) engines, PGM[5] engines, etc. The set of engines is extensible; in the future we may choose to implement, say, a WebSocket engine or an SCTP engine.
The sessions exchange messages with the sockets. There are two directions in which to pass messages, and each direction is handled by a pipe object. Each pipe is basically a lock-free queue optimized for fast passing of messages between threads.

[5] PGM is a reliable multicast protocol; see RFC 3208.
Finally, there's a context object (discussed in the previous sections but not shown in the diagram) that holds the global state and is accessible to all the sockets and all the asynchronous objects.
24.8 Concurrency Model
One of the requirements for ØMQ was to take advantage of multi-core boxes; in other words, to scale throughput linearly with the number of available CPU cores.
Our previous experience with messaging systems showed that using multiple threads in the classic way (critical sections, semaphores, etc.) doesn't yield much performance improvement. In fact, a multi-threaded version of a messaging system can be slower than a single-threaded one, even when measured on a multi-core box. Individual threads simply spend too much time waiting for each other while, at the same time, causing a lot of context switching that slows the system down.
Given these problems, we decided to go for a different model. The goal was to avoid locking entirely and let each thread run at full speed. Communication between threads would be provided via asynchronous messages (events) passed between them. This, as insiders know, is the classic actor model.
The idea was to launch one worker thread per CPU core; having two threads share the same core would only mean a lot of context switching for no particular advantage. Each internal ØMQ object, such as, say, a TCP engine, would be tightly bound to a particular worker thread. That, in turn, means that there's no need for critical sections, mutexes, semaphores and the like. Additionally, these ØMQ objects won't be migrated between CPU cores, thus avoiding the negative performance impact of cache pollution (Figure 24.7).
Figure 24.7: Multiple worker threads
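A minimal sketch of the pattern just described: a worker thread owns a mailbox of commands, and other threads interact with the objects bound to that worker only by posting commands to it. The class names are illustrative, not ØMQ's actual internals, and for brevity the mailbox here is guarded by a mutex rather than by the lock-free pipes discussed in the next sections.

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class worker_t
{
public:
    worker_t() : thread_([this] { loop(); }) {}

    ~worker_t()
    {
        post([this] { stop_ = true; });   // stop command runs on the worker itself
        thread_.join();
    }

    // May be called from any thread; the command executes on the worker thread.
    void post(std::function<void()> cmd)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            mailbox_.push(std::move(cmd));
        }
        cond_.notify_one();
    }

private:
    void loop()
    {
        while (!stop_) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return !mailbox_.empty(); });
                cmd = std::move(mailbox_.front());
                mailbox_.pop();
            }
            cmd();   // runs on the worker thread; the objects it touches need no locks
        }
    }

    std::queue<std::function<void()>> mailbox_;
    std::mutex mutex_;
    std::condition_variable cond_;
    bool stop_ = false;
    std::thread thread_;
};

int main()
{
    worker_t worker;
    worker.post([] { std::printf("running on the worker thread\n"); });
    return 0;   // destructor posts the stop command and joins
}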
This design makes a lot of traditional multi-threading problems disappear. Nevertheless, there's a need to share each worker thread among many objects, which in turn means there has to be some kind of cooperative multitasking. This means we need a scheduler; objects need to be event-driven rather than being in control of the entire event loop; we have to take care of arbitrary sequences of events, even very rare ones; we have to make sure that no object holds the CPU for too long; etc.
In short, the whole system has to become fully asynchronous. No object can afford to do a blocking operation, because that would block not only the object itself but also all the other objects sharing the same worker thread. All objects have to become, whether explicitly or implicitly, state machines. With hundreds or thousands of state machines running in parallel, you have to take care of all the possible interactions between them and, most importantly, of the shutdown process.
It turns out that shutting down a fully asynchronous system in a clean way is a dauntingly complex task. Trying to shut down a thousand moving parts, some of them working, some idle, some in the process of being initiated, some of them already shutting down by themselves, is prone to all kinds of race conditions, resource leaks and the like. The shutdown subsystem is definitely the most complex part of ØMQ. A quick check of the bug tracker indicates that some 30-50% of reported bugs are related to shutdown in one way or another.
Lesson learned: when striving for extreme performance and scalability, consider the actor model; it's almost the only game in town in such cases. However, if you are not using a specialised system like Erlang or ØMQ itself, you'll have to write and debug a lot of infrastructure by hand. Additionally, think, from the very beginning, about the procedure for shutting the system down. It's going to be the most complex part of the codebase, and if you have no clear idea how to implement it, you should probably reconsider using the actor model in the first place.
24.9 Lock-Free Algorithms
Lock-free algorithms have been in vogue lately. They are simple mechanisms for inter-thread communication that don't rely on the kernel-provided synchronisation primitives, such as mutexes or semaphores; rather, they do the synchronisation using atomic CPU operations, such as atomic compare-and-swap (CAS). It should be understood that they are not literally lock-free; instead, locking is done behind the scenes at the hardware level.
ØMQ uses a lock-free queue in pipe objects to pass messages between the user's threads and ØMQ's worker threads. There are two interesting aspects to how ØMQ uses the lock-free queue.
First, each queue has exactly one writer thread and exactly one reader thread. If there's a need for 1-to-N communication, multiple queues are created (Figure 24.8). Given that this way the queue doesn't have to take care of synchronising writers (there's only one writer) or readers (there's only one reader), it can be implemented in an extra-efficient way.
Figure 24.8: Queues
Second, we realised that while lock-free algorithms were more efficient than classic mutex-based algorithms, atomic CPU operations are still rather expensive (especially when there's contention between CPU cores), and doing an atomic operation for each message written and/or each message read was slower than we were willing to accept.
The way to speed it up, once again, was batching. Imagine you had 10 messages to be written to the queue. This can happen, for example, when you receive a network packet containing 10 small messages. Receiving a packet is an atomic event; you cannot get half of it. This atomic event results in the need to write 10 messages to the lock-free queue. There's not much point in doing an atomic operation for each message. Instead, you can accumulate the messages in a "pre-write" portion of the queue that's accessed solely by the writer thread, and then flush it to the queue using a single atomic operation.
The same applies to reading from the queue. Imagine the 10 messages above were already flushed to the queue. The reader thread can extract each message from the queue using an atomic operation. However, that's overkill; instead, it can move all the pending messages to a "pre-read" portion of the queue using a single atomic operation. Afterwards, it can retrieve the messages from the pre-read buffer one by one. The pre-read portion is owned and accessed solely by the reader thread, and thus no synchronisation whatsoever is needed in that phase.
Figure 24.9: Lock-free queue
The arrow on the left of Figure 24.9 shows how the pre-write buffer can be flushed to the queue simply by modifying a single pointer. The arrow on the right shows how the whole content of the queue can be shifted to the pre-read buffer by doing nothing but modifying another pointer.
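The following is a simplified, hypothetical sketch of such a single-producer/single-consumer queue with a writer-private pre-write position and a reader-private pre-read position. It is not ØMQ's actual pipe implementation; it only shows how a whole batch can be published with one atomic store and fetched with one atomic load.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>

template <typename T, std::size_t N>
class spsc_batch_queue_t
{
public:
    // Writer side: accumulate without touching shared state; only flush()
    // performs an atomic operation.
    bool write(const T &value)
    {
        std::size_t next = (pre_write_ + 1) % N;
        if (next == read_cache_) {
            read_cache_ = read_pos_.load(std::memory_order_acquire);
            if (next == read_cache_)
                return false;                    // queue full
        }
        buffer_[pre_write_] = value;
        pre_write_ = next;
        return true;
    }

    // One atomic store publishes everything written since the last flush.
    void flush() { write_pos_.store(pre_write_, std::memory_order_release); }

    // Reader side: atomics are used only when the local batch is exhausted;
    // one load then makes the whole flushed batch available.
    bool read(T *value)
    {
        if (pre_read_ == write_cache_) {
            // Release everything consumed so far, then grab the new batch.
            read_pos_.store(pre_read_, std::memory_order_release);
            write_cache_ = write_pos_.load(std::memory_order_acquire);
            if (pre_read_ == write_cache_)
                return false;                    // nothing flushed yet
        }
        *value = buffer_[pre_read_];
        pre_read_ = (pre_read_ + 1) % N;
        return true;
    }

private:
    std::array<T, N> buffer_;
    std::atomic<std::size_t> write_pos_{0};   // shared: last flushed position
    std::atomic<std::size_t> read_pos_{0};    // shared: last released position
    std::size_t pre_write_ = 0;               // writer-private
    std::size_t pre_read_ = 0;                // reader-private
    std::size_t read_cache_ = 0;              // writer-private cache of read_pos_
    std::size_t write_cache_ = 0;             // reader-private cache of write_pos_
};

int main()
{
    spsc_batch_queue_t<int, 16> q;
    for (int i = 0; i < 10; ++i)
        q.write(i);          // no atomic operations yet
    q.flush();               // one atomic store publishes all 10 messages

    int v;
    while (q.read(&v))       // one atomic load fetches the whole batch
        std::printf("%d ", v);
    std::printf("\n");
    return 0;
}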
Lesson learned: lock-free algorithms are hard to invent, troublesome to implement and almost impossible to debug. If at all possible, use an existing proven algorithm rather than inventing your own. When extreme performance is required, don't rely solely on lock-free algorithms. While they are fast, the performance can be significantly improved by doing smart batching on top of them.
24.10 API
The user interface is the most important part of any product. It's the only part of your program visible to the outside world, and if you get it wrong the world will hate you. In end-user products it's either the GUI or the command line interface. In libraries it's the API.
In early versions of ØMQ the API was based on AMQP's model of exchanges and queues.[6] I spent the end of 2009 rewriting it almost from scratch to use the BSD Sockets API instead. That was the turning point; ØMQ adoption soared from that point on. While before it had been a niche product used by a bunch of messaging experts, afterwards it became a handy, commonplace tool for anybody. In a year or so the size of the community increased tenfold, some 20 bindings to different languages were implemented, etc.
The user interface defines the perception of a product. With basically no change to the functionality, just by changing the API, ØMQ changed from an "enterprise messaging" product to a "networking" product. In other words, the perception changed from "a complex piece of infrastructure for big banks" to "hey, this helps me to send my 10-byte-long message from application A to application B".

[6] See the AMQP specification at http://www.amqp.org/specification/1./amqp-org-download. From a historical perspective it's interesting to have a look at the white paper from 2007 that tries to reconcile AMQP with a brokerless model of messaging: http://www.zeromq.org/whitepapers:messaging-enabled-network.
Lesson learned: understand what you want your project to be and design the user interface accordingly. Having a user interface that doesn't align with the vision of the project is a 100% guaranteed way to fail.
One of the important aspects of the move to the BSD Sockets API was that it wasn't a revolutionary, freshly invented API, but an existing and well-known one. Actually, the BSD Sockets API is one of the oldest APIs still in active use today; it dates back to 1983 and 4.2BSD Unix. It has been widely used and stable for literally decades.
The above fact brings a lot of advantages. Firstly, it's an API that everybody knows, so the learning curve is ludicrously flat. Even if you've never heard of ØMQ, you can build your first application in a couple of minutes because you can reuse your BSD Sockets knowledge.
Secondly, using a widely implemented API enables integration of ØMQ with existing technologies. For example, exposing ØMQ objects as "sockets" or "file descriptors" allows TCP, UDP, pipe, file and ØMQ events to be processed in the same event loop. Another example: the experimental project to bring ØMQ-like functionality to the Linux kernel[7] turned out to be pretty simple to implement. By sharing the same conceptual framework it can reuse a lot of infrastructure already in place.
Thirdly, and probably most importantly, the fact that the BSD Sockets API survived almost three decades despite numerous attempts to replace it means that there is something inherently right in the design. The BSD Sockets API designers have, whether deliberately or by chance, made the right design decisions. By adopting the API we can automatically share those design decisions without even knowing what they were and what problem they were solving.

[7] https://github.com/25bpm/linux-2.6
Lesson learned: while code reuse has been promoted from time immemorial and pattern reuse joined in later on, it's important to think of reuse in an even more generic way. When designing a product, have a look at similar products. Check which have failed and which have succeeded; learn from the successful projects. Don't succumb to Not Invented Here syndrome. Reuse the ideas, the APIs, the conceptual frameworks, whatever you find appropriate. By doing so you allow users to reuse their existing knowledge. At the same time you may be avoiding technical pitfalls you are not even aware of at the moment.
24.11 Messaging Patterns
In any messaging system, the most important design problem is how to provide a way for the user to specify which messages are routed to which destinations. There are two main approaches, and I believe this dichotomy is quite generic and applicable to basically any problem encountered in the domain of software.
One approach is to adopt the Unix philosophy of "do one thing and do it well". What this means is that the problem domain should be artificially restricted to a small and well-understood area. The program should then solve this restricted problem in a correct and exhaustive way. An example of such an approach in the messaging area is MQTT.[8] It's a protocol for distributing messages to a set of consumers. It can't be used for anything else (say for RPC), but it is easy to use and does message distribution well.

[8] http://mqtt.org/
The other approach is to focus on generality and provide a powerful and highly configurable system. AMQP is an example of such a system. Its model of queues and exchanges provides the user with the means to programmatically define almost any routing algorithm they can think of. The trade-off, of course, is a lot of options to take care of.
ØMQ opts for the former model because it allows the resulting product to be used by basically anyone, while the generic model requires messaging experts to use it. To demonstrate the point, let's have a look at how the model affects the complexity of the API. What follows is an implementation of an RPC client on top of a generic system (AMQP):
connect ("192.168..111")
exchange.declare (exchange="requests", type="direct", passive=false,
durable=true, no-wait=true, arguments={})
exchange.declare (exchange="replies", type="direct", passive=false,
durable=true, no-wait=true, arguments={})
reply-queue = queue.declare (queue="", passive=false, durable=false,
exclusive=true, auto-delete=true, no-wait=false, arguments={})
queue.bind (queue=reply-queue, exchange="replies",
routing-key=reply-queue)
queue.consume (queue=reply-queue, consumer-tag="", no-local=false,
no-ack=false, exclusive=true, no-wait=true, arguments={})
request = new-message ("Hello World!")
request.reply-to = reply-queue
request.correlation-id = generate-unique-id ()
basic.publish (exchange="requests", routing-key="my-service",
mandatory=true, immediate=false)
reply = get-message ()
ØMQ, on the other hand, splits the messaging landscape into so-called "messaging patterns". Examples of the patterns are "publish/subscribe", "request/reply" and "parallelised pipeline". Each messaging pattern is completely orthogonal to the other patterns and can be thought of as a separate tool. What follows is a re-implementation of the above application using ØMQ's request/reply pattern. Note how all the option tweaking is reduced to the single step of choosing the right messaging pattern (REQ):
s = socket (REQ)
s.connect ("tcp://192.168.0.111:5555")
s.send ("Hello World!")
reply = s.recv ()
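For readers who want to map the pseudocode above onto the real library, this is roughly what the same REQ client looks like with the libzmq C API. The endpoint address is just the placeholder from the example, and error handling is reduced to asserts.

#include <zmq.h>
#include <cassert>
#include <cstdio>
#include <cstring>

int main()
{
    void *ctx = zmq_ctx_new();
    void *s = zmq_socket(ctx, ZMQ_REQ);       // choose the request/reply pattern
    assert(ctx && s);

    int rc = zmq_connect(s, "tcp://192.168.0.111:5555");
    assert(rc == 0);

    const char request[] = "Hello World!";
    rc = zmq_send(s, request, std::strlen(request), 0);
    assert(rc != -1);

    char reply[256];
    int n = zmq_recv(s, reply, sizeof(reply) - 1, 0);   // blocks until the reply arrives
    assert(n != -1);
    if (n > (int) sizeof(reply) - 1)
        n = sizeof(reply) - 1;                // zmq_recv reports the full (possibly truncated) size
    reply[n] = '\0';
    std::printf("received: %s\n", reply);

    zmq_close(s);
    zmq_ctx_term(ctx);
    return 0;
}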
Up to this point we've argued that specific solutions are better than generic solutions. We want our solution to be as specific as possible. However, at the same time we want to provide our customers with as wide a range of functionality as possible. How can we solve this apparent contradiction?
The answer consists of two steps:
1. Define a layer of the stack to deal with a particular problem area (e.g. transport, routing, presentation, etc.).
2. Provide multiple implementations of the layer. There should be a separate, non-intersecting implementation for each use case.
Let's have a look at the example of the transport layer in the Internet stack. It's meant to provide services such as transferring data streams, applying flow control, providing reliability, etc., on top of the network layer (IP). It does so by defining multiple non-intersecting solutions: TCP for connection-oriented reliable stream transfer, UDP for connectionless unreliable packet transfer, SCTP for the transfer of multiple streams, DCCP for unreliable connections, and so on.
Note that each implementation is completely orthogonal: a UDP endpoint cannot speak to a TCP endpoint, nor can an SCTP endpoint speak to a DCCP endpoint. This means that new implementations can be added to the stack at any moment without affecting the existing portions of the stack. Conversely, failed implementations can be forgotten and discarded without compromising the viability of the transport layer as a whole.
The same principle applies to the messaging patterns defined by ØMQ. Messaging patterns form a layer (the so-called "scalability layer") on top of the transport layer (TCP and friends). Individual messaging patterns are implementations of this layer. They are strictly orthogonal: a publish/subscribe endpoint can't speak to a request/reply endpoint, and so on. Strict separation between the patterns in turn means that new patterns can be added as needed and that failed experiments with new patterns won't hurt the existing patterns.
Lesson learned: when solving a complex and multi-faceted problem, it may turn out that a monolithic, general-purpose solution is not the best way to go. Instead, we can think of the problem area as an abstract layer and provide multiple implementations of this layer, each focused on a specific, well-defined use case. When doing so, delineate the use case carefully. Be sure about what is in scope and what is not. If you restrict the use case too aggressively, the applicability of your software may be limited. If you define the problem too broadly, however, the product may become too complex, blurry and confusing for its users.
24.12 Conclusion
As our world becomes populated with lots of small computers connected via the Internet (mobile phones, RFID readers, tablets and laptops, GPS devices, etc.), the problem of distributed computing ceases to be the domain of academic science and becomes a common, everyday problem for every developer to tackle. The solutions, unfortunately, are mostly domain-specific hacks. This article summarises our experience with building a large-scale distributed system in a systematic manner. It focuses on problems that are interesting from a software architecture point of view, and we hope that designers and programmers in the open source community will find it useful.
Colophon
The image on the cover is of New York's Equitable Building, photographed and modified by James Howe (http://www.jameshowephotography.com).
The cover font is Junction by Caroline Hadilaksono. The text font is TeX Gyre Termes and the heading font is TeX Gyre Heros, both by Bogusław Jackowski and Janusz M. Nowacki. The code font is Inconsolata by Raph Levien.