The Vietnam of Computer Science
History
PBS has a good synopsis of the war, but for those who are more
interested in Computer Science than Political/Military History, the
short version goes like this:
South Indochina, now known as Vietnam, Thailand, Laos and
Cambodia, has a long history of struggle for autonomy. Before
French colonial rule (which began in the mid-1800s), South
Indochina wrestled for regional independence from China. During
World War Two, the Japanese conquered the area, only to be later
"liberated" by the Allies, leading France to resume its colonial rule
(as did the British in their colonial territories elsewhere in Asia and
India). Following WWII, however, the people of South Indochina
renewed their fight for independence, this time against their
colonial rulers.
Johnson's War
At the time of the Kennedy assassination, Vietnam had 16,000
American advisers in place, most of whom weren't involved in daily
combat operations. Kennedy's Vice President and successor,
however, Lyndon Baines Johnson, was not convinced that this
limited advisory role would be enough to keep South Vietnam from
falling.
Nixon's Promise
Unfortunately, the American negotiating position was seriously
weakened by the very protests that had brought the Americans to
the negotiating table in the first place.
War's End
The Second South Indochina War was over, America had
experienced the most profound defeat in its history, and
Vietnam became synonymous with "quagmire". Its impact on
American culture was immeasurable: it taught an entire
generation of Americans to fear and mistrust their government, it
taught American leaders to fear any amount of US military
casualties, and it brought the phrase "clear exit strategy" directly into
the American political lexicon. Not until Ronald Reagan used the
American military to "liberate" the small island nation of
Grenada would American military intervention again be considered a
possible tool of diplomacy by American presidents, and even then
only with great sensitivity to domestic concern, as Bill Clinton
would find out during his peacekeeping missions to Somalia and
Kosovo. In quantifiable terms, too, Vietnam's effects clearly fell
short of Johnson's goal of a war in "cold blood". The final tally: 3 million
Americans served in the war, 150,000 were seriously wounded, 58,000
died, and over 1,000 remain missing in action, not to mention nearly
a million NVA/Viet Cong troop casualties, 250,000 South Vietnamese
casualties, and hundreds of thousands--if not millions, as some
historians have argued--of civilian casualties.
Lessons of Vietnam
Vietnam presents an interesting problem to the student of military
and political history--exactly what went wrong, when, and where?
Obviously, the US government's unwillingness to admit its failures
during the war makes for an easy scapegoat, but no government in
the history of modern society has ever been entirely truthful with its
population about its fortunes of war; one example among many is
the same US government's careful censorship of its activities during
World War Two, known in American history as "the last 'good' war".
It's also tempting to point
to the lack of a military objective as the crucial failing point of
Vietnam, but other non-military objectives have been successfully
executed by the US and other governments without the kind of
colossal failure accompanying Vietnam's story. Moreover, it's
important to note that the US did, in fact, have a clear objective in
what it wanted out of the conflict in South Indochina: to stop the fall
of the South Vietnamese government and, barring that, the cessation
of the "spread" of Communism. Was it the reluctance of the US
government to unleash the military to its fullest capabilities, as
General William Westmoreland always claimed? Certainly the
failure in Vietnam was not a conventional military one; the casualty
figures alone make it clear that, measured in those terms, the US
was clearly winning.
So what were the principal failures in Vietnam? And, more
importantly, what does all this have to do with O/R Mapping?
Recognizing that all analogies fail eventually, and that the subject of
Vietnam is deeper than this essay can examine, there are still
lessons to be learned here in an entirely different arena. One of the
key lessons of Vietnam was the danger of what's colloquially called
"the Slippery Slope": that a given course of action might yield some
early success, yet further investment into that action yields
decreasingly commensurate results and increasibly dangerous
obstacles whose only solution appears to be greater and greater
commitment of resources and/or action. Some have called this "the
Drug Trap", after the way pharmaceuticals (legal or illegal) can have
diminished effect after prolonged use, requiring upped dosage in
order to yield the same results. Others call this "the Last Mile
Problem": that as one nears the end of a problem, it becomes
increasingly difficult in cost terms (both monetary and abstract) to
find a 100% complete solution. All are basically speaking of the
same thing--the difficulty of finding an answer that allows our hero
to "finish off" the problem in question, completely and satisfactorily.
We begin the analysis of Object/Relational Mapping--and its
relationship to the Second South Indochina War--by examining the
reasons for it in the first place. What drives developers away from
using traditional relational tools to access a relational database, and
to prefer instead tools such as O/R-M's?
A brief review of the relational model helps frame the answer. The
relational model organizes data into relation variables, each holding a set of
tuples whose values conform to a fixed heading of attributes; these are
commonly referred to as tables (relation variables), rows (tuples), and columns
(attributes), and a collection of relation variables as a database. These basic element
types can be combined against one another using a set of operators (described in some
detail in Chapter 7 of [Date04]): restrict, project, product, join, divide, union,
intersection and difference, and these form the basis of the format and approach of
SQL, the universally accepted language for interacting with a relational system
from operator consoles or programming languages. The use of these operators allows
for the creation of derived relation values--relations that are calculated from other
relation values in the database. For example, we can create a relation value that
shows the number of people living in each city by applying the
project and restrict operators to a People relation variable.
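To make this concrete, here is a minimal sketch in Java/JDBC; the PEOPLE table, its columns, and the connection details are illustrative assumptions, not taken from the essay:

import java.sql.*;

public class DerivedRelationExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical in-memory database; any JDBC-accessible RDBMS would do.
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:demo", "SA", "");
             Statement stmt = conn.createStatement();
             // The WHERE clause is a restrict, the SELECT list a project;
             // GROUP BY/COUNT derives a new relation value from PEOPLE.
             ResultSet rs = stmt.executeQuery(
                 "SELECT CITY, COUNT(*) AS HEADCOUNT FROM PEOPLE " +
                 "WHERE COUNTRY = 'US' GROUP BY CITY")) {
            while (rs.next())
                System.out.println(rs.getString("CITY") + ": " + rs.getInt("HEADCOUNT"));
        }
    }
}

Note that the result set is itself relational in form--rows of city/count pairs--which is the closure property that makes these operators composable.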
Already, it's fairly clear that there are distinct differences
between how the relational world and the object world view the "proper"
design of a system, and more will become apparent as the discussion
progresses. It's important to note, however, that so long as
programmers prefer to use object-oriented programming languages
to access relational data stores, there will always be some kind of
object-relational mapping taking place--the two models are simply
too different to bridge silently. (Arguably, the same is true of object-
to-XML mapping, though that subject is beyond the scope of this essay.)
Frequently, adopting an O/R-M also means a new database instance gets
deployed, with the exact schema the O/R-M-based solution was built
against, creating yet another silo of data in an IT environment where
pressure is building to reduce such silos.
Typically, the first querying approach an object system offers is "Query-By-
Example" (QBE), in which the developer fills in a blank instance of the domain
object with the field values to match, and the system retrieves all objects that
match. The problem with the QBE approach is obvious: while it's perfectly sufficient for
simple queries, it's not nearly expressive enough to support the more complex styles of
query that we frequently need to execute--"find all Persons named Smith or
Cromwell" and "find all Persons NOT named Smith" are two examples. While it's not
impossible to build QBE approaches that handle these (and more complex) scenarios,
doing so complicates the API significantly. More importantly, it also forces the
domain objects into an uncomfortable position: they must support nullable
fields/properties, which may violate the domain rules the object would
otherwise seek to enforce--a Person without a name isn't a very useful object, in many
scenarios, yet this is exactly what a QBE approach demands of the domain objects
stored within it. (Practitioners of QBE will often argue that it's not unreasonable for
an object's implementation to take this into account, but again, this is neither easy nor
frequently done.)
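To illustrate, a QBE interaction might look something like the following; the QueryByExample class and PersonCollection type are hypothetical, invented purely for illustration:

// Hypothetical QBE API: fill in only the fields to match.
Person template = new Person();       // every field starts out null
template.setLastName("Smith");        // match on last name only
// All null fields are ignored by the matcher--which is precisely why
// Person must tolerate a null first name, null age, and so on.
PersonCollection matches = QueryByExample.find(template);

The template object must begin life with every field unset, which is exactly the nullability problem described above.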
As a result, usually the second step is to have the object system
support a "Query-By-API" approach, in which queries are constructed
by query objects, usually something of the form:
Query q = new Query();
q.From("PERSON").Where(
new EqualsCriteria("PERSON.LAST_NAME", "Smith"));
ObjectCollection oc = QueryExecutor.execute(q);
Here, the query is not based on an empty "template" of the object to be retrieved, but
off of a set of "query objects" that are used together to define a Command-style object
for executing against the database. Multiple criteria are connected using some kind of
binomial construct, usually "And" and "Or" objects, each of which contain unique
Criteria objects to test against. Additional filtration/manipulation objects can be
tagged onto the end, usually by appending calls such as "OrderBy(field-name)" or
"GroupBy(field-name)". In some cases, these method calls are actually objects
constructed by the programmer and strung together explicitly.
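Under this style, the "find all Persons named Smith or Cromwell" query from earlier might be expressed as follows; the OrCriteria class is an assumption, modeled on the EqualsCriteria shown above:

// Compose criteria with an explicit Or object rather than SQL text.
Query q = new Query();
q.From("PERSON").Where(
    new OrCriteria(
        new EqualsCriteria("PERSON.LAST_NAME", "Smith"),
        new EqualsCriteria("PERSON.LAST_NAME", "Cromwell")))
 .OrderBy("PERSON.LAST_NAME");
ObjectCollection oc = QueryExecutor.execute(q);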
Developers quickly note that the above approach is (generally)
much more verbose than the traditional SQL approach, and certain
styles of queries (particularly the more unconventional joins, such as
outer joins) are much more difficult--if not impossible--to represent
in the Query-By-API approach.
On top of this, we have a more subtle problem: the reliance
on developer discipline. Both the table name ("PERSON") and the
column name in the criteria ("PERSON.LAST_NAME") are plain
strings, taken as-is and fed to the system at runtime, with no sort of
validity-checking until then. This presents a classic problem in
programming, the "fat-finger" error, where a developer
doesn't actually query the "PERSON" table, but the "PRESON" table
instead. While a quick unit-test against a live database instance will
reveal the error, this presumes two facts--that
the developers are religious about adopting unit-testing, and that
the unit tests are run against database instances. While the former
is slowly becoming more of a guarantee as more and more
developers become "test-infected" (borrowing Gamma's and Beck's
choice of terminology), the latter is still entirely open to discussion
and interpretation, owing to the fact that setting up and tearing
down a database instance appropriately for unit tests is still a
nontrivial exercise.
A common mitigation is to generate schema-bound classes or constants from the
database metadata as part of the build, so that table and column references become
compile-checked identifiers rather than raw strings. This solves part of the
schema-awareness problem and the "fat-fingering" problem, but it still leaves the
developer vulnerable to the concerns over verbosity, and it still doesn't address the
complexity of putting together a more complex query, such as a multi-table (or
multi-class, if you will) query joined on several criteria in a variety of ways.
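A sketch of what such generated artifacts might look like; the PersonSchema class below is hypothetical, standing in for whatever a particular tool would emit:

// Generated from database metadata at build time (hypothetical output).
public final class PersonSchema {
    public static final String TABLE = "PERSON";
    public static final String LAST_NAME = "PERSON.LAST_NAME";
    private PersonSchema() { }    // constants only; never instantiated
}

// A typo like PersonSchema.LAST_NMAE now fails at compile time,
// where the raw string "PERSON.LAST_NMAE" would fail only at runtime.
Query q = new Query();
q.From(PersonSchema.TABLE).Where(
    new EqualsCriteria(PersonSchema.LAST_NAME, "Smith"));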
So, then, the next task is to create a "Query-By-Language"
approach, in which a new language, similar to SQL but "better"
somehow, is written to support the kind of complex and powerful
queries normally supported by SQL; OQL and HQL are two examples
of this. The problem here is that frequently these languages are a
subset of SQL and thus don't offer the full power of SQL. More
importantly, the O/R layer has now lost an important "selling point",
that of the "objects and only objects" mantra that begat it in the
first place; using a SQL-like language is almost just like using SQL
itself, so how can it be more "objectish"? While developers may not
need to be aware of the physical schema of the data model (the
query language interpreter/executor can do the mapping discussed
earlier), developers will need to be aware of how object associations
and properties are represented within the language, and the subset
of the object's capabilities within the query language--for example,
is it possible to write something like this?
SELECT Person p1, Person p2
FROM Person
WHERE p1.getSpouse() == null
AND p2.getSpouse() == null
AND p1.isThisAnAcceptableSpouse(p2)
AND p2.isThisAnAcceptableSpouse(p1);
In other words, scan through the database and find all single people who find each
other acceptable. While the "isThisAnAcceptableSpouse" method is clearly a method
that belongs on the Person class (each Person instance may have its own criteria by
which to judge the acceptability of another single--are they blonde, brunette, or
redhead, are they making more than $100,000 a year, and so on), it's not clear if
executing this method is possible in the query language, nor is it clear if it should be.
Even for the most trivial implementations, a serious performance hit will be likely,
particularly if the O/R layer must turn the relational column data into objects in order
to execute the query. In addition, we have no guarantee that the developer wrote this
method to be at all efficient, and no way to enforce any sort of performance-aware
implementation.
(Critics will argue that this is a workable problem, proposing two
possible solutions. One is to encode the preference data in a
separate table and make that part of the query; this will result in a
hideously complicated query, several pages in length, that will
likely require a SQL expert to untangle later when new
preference criteria need to be added. The other is to encode this
"acceptability" implementation in a stored procedure within the
database, which removes the code entirely from the object model
and leaves us without an "object"-based solution whatsoever--
acceptable, but only if you accept the premise that not all
implementation can rest inside the object model itself, which rejects
the "objects and nothing but objects" premise with which many O/R
advocates open their arguments.)
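The discussion that follows refers to the classic two-step retrieval pattern: a cheap summary query to drive a selection list, followed by a detail query for the one row chosen. A reconstruction of the idea in Java--the PERSON columns here are illustrative, not the essay's own:

// Query 1: only the summary data plus the identifier, enough to
// populate a selection list.
String summaryQuery =
    "SELECT ID, FIRST_NAME, LAST_NAME FROM PERSON";
// Query 2: the remainder of the data, fetched only for the row the
// user actually selected.
String detailQuery =
    "SELECT HOME_ADDRESS, BIRTH_DATE, SPOUSE_ID FROM PERSON WHERE ID = ?";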
In particular, take notice that only the data desired at each stage of the process is
retrieved--in the first query, the necessary summary information and identifier (for the
subsequent query, in case first and last name wouldn't be sufficient to identify the
person directly), and in the second, the remainder of the data to display. In fact, most
SQL experts will eschew the "*" wildcard column syntax, preferring instead to name
each column in the query, both for performance and maintenance reasons--performance, since the database will better optimize the query, and maintenance,
because there will be less chance of unnecessary columns being returned as DBAs or
developers evolve and/or refactor the database table(s) involved. This notion of being
able to return a part of a table (though still in relational form, which is important for
reasons of closure, described above) is fundamental to the ability to optimize these
queries this way--most queries will, in fact, only require a portion of the complete
relation.
This presents a problem for most, if not all, object/relational
mapping layers: the goal of any O/R is to enable the developer to
see "nothing but objects", and yet the O/R layer cannot tell, from
one request to another, how the objects returned by the query will
be used. For example, it is entirely feasible that most developers will
want to write something along the lines of:
Person[] all = QueryManager.execute(...);
Person selected = DisplayPersonsForSelection(all);
DisplayPersonData(selected);
In other words, once the Person to be displayed has been chosen from
the array of Persons, no further retrieval action is necessary--after all, you have your
object; what more should be necessary?
The problem here is that the data to be displayed in the first
Display...() call is not the complete Person, but a subset of that data;
here we face our first problem, in that an object-oriented system like
C# or Java cannot return just "parts" of an object--an object is an
object, and if the Person object consists of 12 fields, then all 12
fields will be present in every Person returned. This means that the
system faces one of three uncomfortable choices: one, require that
Person objects must be able to accommodate "nullable" fields,
regardless of the domain restrictions against that; two, return the
Person completely filled out with all the data comprising a Person
object; or three, provide some kind of on-demand load that will
obtain those fields if and when the developer accesses those fields,
even indirectly, perhaps through a method call.
(Note that some object-based languages, such as ECMAScript, view
objects differently than class-based languages, such as Java or C#
or C++, and as a result, it is entirely possible to return objects which
contain varying numbers of fields. That said, however, few
languages possess such an approach, not even everybody's favorite
dynamic-language poster child, Ruby, and until such languages
become widespread, such discussion remains outside the realm of
this essay.)
For most O/R layers, this means that objects and/or fields of objects
must be retrieved in a lazy-loaded manner, obtaining the field data
on demand, because retrieving all of the fields of all of the Person
objects/relations would "clearly" be a huge waste of bandwidth for
this particular scenario. Typically, the object's entire set of fields will
be retrieved when any field not yet returned is accessed. (This
approach is preferred to a field-by-field approach because there's
less chance of the "N+1 query problem", in which retrieving all the
fields of N objects one at a time requires one query for the initial
set plus N additional queries, one per object.)
Summary
Given, then, that objects-to-relational mapping is a necessity in a
modern enterprise system, how can anyone proclaim it a quagmire
from which there is no escape? Again, Vietnam serves as a useful
analogy here--while the situation in South Indochina required a
response from the Americans, there were a variety of responses
available to the Kennedy and Johnson Administrations, including the
same kind of response that the recent fall of Suharto in Indonesia
generated from the US, which is to say, none at all. (Remember,
Eisenhower and Dulles didn't consider South Indochina to be a part
of the Domino Theory in the first place; they were far more
concerned about Japan and Europe.)
Several possible solutions present themselves to the O/R-M problem,
some requiring some kind of "global" action by the community as a
whole, some more approachable to development teams "in the
trenches":
1. Abandonment. Developers simply give up on objects entirely, and return to a
programming model that doesn't create the object/relational impedance
mismatch. While distasteful, in certain scenarios an object-oriented approach
creates more overhead than it saves, and the ROI simply isn't there to justify
the cost of creating a rich domain model. ([Fowler] discusses this in some
depth.) This eliminates the problem quite neatly, because if there are no
objects, there is no impedance mismatch.
2. Wholehearted acceptance. Developers simply give up on relational storage
entirely, and use a storage model that fits the way their languages of choice
look at the world. Object-storage systems, such as the db4o project, solve the
problem neatly by storing objects directly to disk, eliminating many (but not
all) of the aforementioned issues; there is no "second schema", for example,
because the only schema used is that of the object definitions themselves.
While many DBAs will faint dead away at the thought, in an increasingly
service-oriented world--one which eschews the idea of direct data access,
instead requiring all access to go through the service gateway, thus
encapsulating the storage mechanism away from prying eyes--it becomes
entirely feasible to imagine developers storing data in a form that's much
easier for them to use, rather than one that's easier for DBAs to manage.
3. Manual mapping. Developers simply accept that it's not such a hard problem
to solve manually after all, and write straight relational-access code to return
relations to the language, access the tuples, and populate objects as necessary.
In many cases, this code might even be automatically generated by a tool
examining database metadata, eliminating some of the principal criticism of
this approach (that being, "It's too much code to write and maintain").
4. Acceptance of O/R-M limitations. Developers simply accept that there is no
way to efficiently and easily close the loop on the O/R mismatch, and use an
O/R-M to solve 80% (or 50% or 95%, or whatever percentage seems
appropriate) of the problem and make use of SQL and relational-based access
(such as "raw" JDBC or ADO.NET) to carry them past those areas where an
O/R-M would create problems. Doing so carries its own fair share of risks,
however, as developers using an O/R-M must be aware of any caching the
O/R-M solution does within it, because the "raw" relational access will clearly
not be able to take advantage of that caching layer.
5. Integration of relational concepts into the languages. Developers simply
accept that this is a problem that should be solved by the language, not by a
library or framework. For the last decade or more, the emphasis in solutions
to the O/R problem has been on trying to bring objects closer to the
database, so that developers can focus exclusively on programming in a single
paradigm (that paradigm being, of course, objects). Over the last several years,
however, interest in "scripting" languages with far stronger set and list
support, like Ruby, has sparked the idea that perhaps another solution is
appropriate: bring relational concepts (which, at heart, are set-based) into
mainstream programming languages, making it easier to bridge the gap
between "sets" and "objects". Work in this space has thus far been limited,
constrained mostly to research projects and/or "fringe" languages, but several
interesting efforts are gaining visibility within the community, such as
functional/object hybrid languages like Scala or F#, as well as direct
integration into traditional O-O languages, such as the LINQ project from
Microsoft for C# and Visual Basic. One such effort that failed, unfortunately,
was the SQL/J strategy; even there, the approach was limited, not seeking to
incorporate sets into Java, but simply allowing embedded SQL calls to be
preprocessed and translated into JDBC code by a translator.
6. Integration of relational concepts into frameworks. Developers simply
accept that this problem is solvable, but only with a change of perspective.
Instead of relying on language or library designers to solve this problem,
developers take a different view of "objects" that is more relational in nature,
building domain frameworks that are more directly built around relational
constructs. For example, instead of creating a Person class that holds its
instance data directly in fields inside the object, developers create a Person
class that holds its instance data in a RowSet (Java) or DataSet (C#) instance,
which can be assembled with other RowSets/DataSets into an easy-to-ship
block of data for update against the database, or unpacked from the database
into the individual objects.
Note that this list is not presented in any particular order; while some options are
more attractive than others, which are "better" is a value judgment that every
developer and development team must make for themselves.
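As promised in option 3, here is a sketch of the manual-mapping approach: straight JDBC code that runs the query, walks the tuples, and populates objects by hand. The PERSON columns and the Person constructor are illustrative:

import java.sql.*;
import java.util.*;

public class PersonDao {
    private final Connection conn;

    public PersonDao(Connection conn) { this.conn = conn; }

    // Straight relational access, no O/R-M layer: the mapping is explicit,
    // visible, and easily generated by a tool from database metadata.
    public List<Person> findByLastName(String lastName) throws SQLException {
        String sql = "SELECT ID, FIRST_NAME, LAST_NAME FROM PERSON WHERE LAST_NAME = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lastName);
            try (ResultSet rs = ps.executeQuery()) {
                List<Person> result = new ArrayList<>();
                while (rs.next()) {
                    result.add(new Person(
                        rs.getInt("ID"),
                        rs.getString("FIRST_NAME"),
                        rs.getString("LAST_NAME")));
                }
                return result;
            }
        }
    }
}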
Just as it's conceivable that the US could have achieved some
measure of "success" in Vietnam had it kept to a clear strategy and
understood a clearer relationship between commitment and
results (ROI, if you will), it's conceivable that the object/relational
problem can be "won" through careful and judicious application of a
strategy that is clearly aware of its own limitations. Developers must
be willing to take the "wins" where they can get them, and not fall
into the trap of the Slippery Slope by looking to create solutions that
increasingly cost more and yield less. Unfortunately, as the history
of the Vietnam War shows, even an awareness of the dangers of the
Slippery Slope is often not enough to avoid getting bogged down in
a quagmire. Worse, it is a quagmire that is simply too attractive to
pass up, a Siren song that continues to draw development teams
from corporations of all sizes (including Microsoft, IBM,
Oracle, and Sun, to name a few) onto the rocks, with spectacular
results. Lash yourself to the mast if you wish to hear the song, but
let the sailors row.
References
[Date04] Date, C.J. An Introduction to Database Systems, 8th Edition. Addison-Wesley, 2004.
[Fowler] Fowler, Martin. Patterns of Enterprise Application Architecture. Addison-Wesley, 2002.