Abstract
The complexity of natural history collection information and similar information within the scope
of biodiversity informatics poses significant challenges for effective long term stewardship of that
information in electronic form. This paper discusses the principles of good relational database
design, how to apply those principles in the practical implementation of databases, and
examines how good database design is essential for long term stewardship of biodiversity
information. Good design and implementation principles are illustrated with examples from the
realm of biodiversity information, including an examination of the costs and benefits of different
ways of storing hierarchical information in relational databases. This paper also discusses
typical problems present in legacy data, how they are characteristic of efforts to handle complex
information in simple databases, and methods for handling those data during data migration.
Introduction
The data associated with natural history
collection materials are inherently complex.
Management of these data in paper form
has produced a variety of documents such
as catalogs, specimen labels, accession
books, stations books, map files, field note
files, and card indices. The simple
appearance of the data found in any one of
these documents (such as the columns for
identification, collection locality, date
collected, and donor in a handwritten
catalog ledger book) masks the inherent
complexity of the information. The
appearance of simplicity overlying highly
complex information provides significant
challenges for the management of natural
history collection information (and other
systematic and biodiversity information) in
electronic form. These challenges include
management of legacy data produced during the history of the capture of natural history collection information in electronic form.
Information modeling
The heart of conceptual database design is
information modeling. Information modeling
has its basis in set algebra, and can be
approached in an extremely complex and
mathematical fashion. Underlying this
complexity, however, are two core concepts:
atomization and reduction of redundant
information. Atomization means placing
only one instance of a single concept in a
single field in the database. Reduction of
redundant information means organizing a
database so that a single text string
representing a single piece of information
(such as the place name Democratic
Republic of the Congo) occurs in only a
single row of the database. This one row is
then related to other information (such as
localities within the DRC) rather than each
row containing a redundant copy of the
country name.
As information modeling has a firm basis in
set theory and a rich technical literature, it is
usually introduced using technical terms.
This technical vocabulary include terms that
describe how well a database design
applies the core concepts of atomization
and reduction of redundant information (first
normal form, second normal form, third
normal form, etc.). I agree with Hernandez (2003) that this vocabulary does not make the best introduction to information modeling and, for the beginner, masks the important underlying concepts. I will thus begin with the underlying concepts of atomization and reduction of redundant information.
Atomization
1) Place only one concept in each
field.
Legacy data often contain a single field for
taxon name, sometimes with the author and
year also included in this field. Consider
the taxon name Palaeozygopleura
hamiltoniae (HALL, 1868). If this name is
placed as a string in a single field
Palaeozygopleura hamiltoniae (Hall,
1868), it becomes extremely difficult to pull
the components of the name apart to, say,
display the species name in italics and the
author in small caps in an html document:
<em>Palaeozygopleura hamiltoniae</em>
(H<font size=-2>ALL</font>, 1868), or to
associate them with the appropriate tags in
an XML document. It likewise is much
harder to match the search criteria
Genus=Loxonema and Trivial=hamiltoniae
to this string than if the components of the
name are separated into different fields. A
taxon name table containing fields for
Generic name, Subgeneric name, Trivial
Epithet, Authorship, Publication year, and
Parentheses is capable of handling most
identifications better than a single text field.
However, there are many more complexities:
subspecies, varieties, forms, cf., near,
questionable generic placements,
questionable identifications, hybrids, and so
forth, each of which may need its own field
to effectively handle the wide range of
different variations of taxon names that can
be used as identifications of collection
objects. If a primary purpose of the data set
is nomenclatural, then substantial thought
needs to be placed into this complexity. If
the primary purpose of the data set is to
record information associated with collection
objects, then recording the name used and
indicators of uncertainty of identification are
the most important concepts.
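As a minimal sketch (the field sizes and the surrogate key are assumptions not specified in the text), an atomized taxon name table along the lines described above might be declared as:

CREATE TABLE taxon_name (
    taxon_name_id     INTEGER PRIMARY KEY,  -- surrogate key, an assumption
    generic_name      VARCHAR(40),
    subgeneric_name   VARCHAR(40),
    trivial_epithet   VARCHAR(40),
    authorship        VARCHAR(100),
    publication_year  VARCHAR(4),
    parentheses       BOOLEAN               -- true if author and year belong in parentheses
);
-- Palaeozygopleura hamiltoniae (Hall, 1868) atomized into separate fields
INSERT INTO taxon_name (taxon_name_id, generic_name, trivial_epithet,
                        authorship, publication_year, parentheses)
    VALUES (1, 'Palaeozygopleura', 'hamiltoniae', 'Hall', '1868', TRUE);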
Reducing Redundant
Information
The most serious enemy of clean data in
long-lived database systems is redundant
copies of information. Consider a locality
table containing fields for country, primary
division (province/state), secondary division
(county/parish), and named place
(municipality/city). The table will contain
multiple rows with the same value for each
of these fields, since multiple localities can
occur in the vicinity of one named place.
The problem is that multiple different text
strings represent the same concept and
different strings may be entered in different
rows to record the same information. For
example, Philadelphia, Phil., City of
Philadelphia, Philladelphia, and Philly are all
variations on the name of a particular
named place. Each makes sense when written on a specimen label in the context of other information (such as country and state), or when viewed as a single locality record.
A locality table with redundant (and inconsistent) country values:

Country          Primary Division
USA              Montana
USA              Pennsylvania
USA              New York
United States    Massachusetts

The same information normalized into a country table and a primary division table linked by fk_c_country_id:

Country id   Name
300          USA
301          Uganda

Primary Division id   fk_c_country_id   Primary Division
300                   300               Montana
301                   300               Pennsylvania
302                   300               New York
303                   300               Massachusetts
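A sketch of how the normalized country and primary division tables above might be declared in SQL (field sizes are assumptions):

CREATE TABLE country (
    country_id  INTEGER PRIMARY KEY,
    name        VARCHAR(60) NOT NULL
);
CREATE TABLE primary_division (
    primary_division_id  INTEGER PRIMARY KEY,
    fk_c_country_id      INTEGER NOT NULL REFERENCES country (country_id),
    primary_division     VARCHAR(60) NOT NULL
);
-- the text string 'USA' is stored once and referenced by key from each primary division
INSERT INTO country (country_id, name) VALUES (300, 'USA');
INSERT INTO primary_division (primary_division_id, fk_c_country_id, primary_division)
    VALUES (300, 300, 'Montana');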
Entity-Relationship modeling
Understanding the concepts to be stored in
the database is at the heart of good
database design (Teorey, 1994; Elmasri
and Navathe, 1994). The conceptual design
phase of the database life cycle should
produce a result known as an information
model (Bruce, 1992). An information model
consists of written documentation of
concepts to be stored in the database, their
relationships to each other, and a diagram
showing those concepts and their
relationships (an Entity-Relationship or E-R diagram). A number of information models for the biodiversity informatics community exist (e.g. Blum, 1996a; 1996b; Berendsohn et al., 1999; Morris, 2000; Pyle, 2004); most are derived at least in part from the concepts in the ASC model (ASC, 1992).
Information models define entities, list
attributes for those entities, and relate
entities to each other. Entities and
attributes can be loosely thought of as
tables and fields. Figure 5 is a diagram of a
locality entity with attributes for a mysterious
localityid and for country and
primary division. As in the example above,
this entity can be implemented as a table
with localityid, country, and primary division
fields (Table 4).
Table 4. A locality table with localityid, country, and primary division fields (rows include the primary divisions Montana and Pennsylvania).
Figure 6. Comparison between entity and attributes as depicted in a typical CASE tool E-R diagram in a
variant of the IDEF1X format (left) and in the Chen format (right, which is more useful for pencil and paper
modeling). The E-R diagrams found in this paper have variously been drawn with the CASE tools xCase
and Druid or the diagram editor DiA.
Table 5. A portion of a BioLink (CSIRO, 2001) tblDistributionRegion table.

intDistributionRegionID   vchrRegionName
15                        Australia
16                        Queensland
17                        Uganda
18                        Pennsylvania
Primary key
Primary keys are the means by which we
locate a single row in a table. The value for
a primary key must be unique to each row.
The primary key in one row must have a
different value from the primary key of every
other row in the table. This property of
uniqueness is best enforced by the
10
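For example, a surrogate primary key can be declared when a table is created and left to the database to populate. The table and field names here are hypothetical, and AUTO_INCREMENT is MySQL syntax (other systems use sequences or identity columns):

CREATE TABLE collections_object (
    collections_objectid  INTEGER PRIMARY KEY AUTO_INCREMENT,  -- surrogate primary key
    catalog_number        VARCHAR(20) NOT NULL,
    UNIQUE (catalog_number)   -- also enforce uniqueness of the natural key
);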
Example: Identifications of
Collection Objects
Consider the issue of handling
identifications that have been applied to
collection objects. The simplest way of
handling this information is to place a single
identification field (or set of atomized
genus_&_higher, species, authorship, year,
and parentheses fields) into a collection
object table. This approach can handle only
a single identification per collection object,
unless each collection object is allowed
more than one entry in the collection object
table (producing duplicate catalog numbers
in the table for each collection object with
more than one identification). In many
sorts of collections, a collection object tends to accumulate additional identifications over time.
Figure 10. Normalized handling of identifications and identifiers. Identifications is an associative entity
relating Collection objects, species names and people.
Figure 11. Normalized handling of identifications with denormalized handling of the people who performed the identifications (allowing multiple entries in identification containing the name of a single identifier).
Table 6 (fragment). Identification fields from a flat collection object table, with each identification held as a single text string.

Identification   Previous identification   Preparations
Lunatia pilla    Natica clausa             Shell, alcohol
Velutina nana    Velutina velutina         Shell

Table 7. Catalog number and identification fields from Table 6 atomized so that each field now contains only one concept.

Id_gen     Id_sp   P_id_gen   P_id_sp    Preparations
Lunatia    pilla   Natica     clausa     Shell, alcohol
Velutina   nana    Velutina   velutina   Shell
Vocabulary
Information modeling has a widely used
technical terminology to describe the extent
to which data conform to the mathematical
ideals of normalization. One commonly
encountered part of this vocabulary is the
phrase normal form. The term first normal
form means, in essence, that a database
has only one concept placed in each field
and no repeating information within one row,
that is, no repeating fields and no repeating
values in a field. Fields containing the value "1863, 1865, 1885" (repeating values), or the value "Palaeozygopleura hamiltoniae Hall" (more than one concept), or the fields Current_identification and Previous_identification (repeating fields), are examples of violations of first normal form. In
second normal form, primary keys do not
contain redundant information, but other
fields may. That is, two different rows of a
table may not contain the same values in
their primary key fields in second normal
form. For example, a collection object table
containing a field for catalog number serving
as primary key would not be able to contain
more than one row for a single catalog
number for the table to be in second normal
form. We do not expect a table of collection objects to contain information about the same collection object in two different rows.
The same information split into normalized tables, with collection objects, preparations, identifications, and taxon names each in their own table, linked by keys:

Collection object
Coll_obj_ID   Catalog_no
100           641455

Preparation
Coll_obj_ID   Preparations
100           Shell
100           Alcohol

Identification
Coll_obj_ID   C_taxon_ID   ID_order
100           1            Current
100           2            Previous

Taxon
Taxon_ID   Id_genus   Id_sp
1          Lunatia    pilla
2          Natica     clausa
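As a sketch only, the normalized structure above might be declared using the field names that appear in the query further below (data types and sizes are assumptions):

CREATE TABLE taxon (
    taxonid     INTEGER PRIMARY KEY,
    taxon_name  VARCHAR(100)
);
CREATE TABLE collections_object (
    collections_objectid  INTEGER PRIMARY KEY,
    catalog_number        VARCHAR(20)
);
CREATE TABLE identification (
    identificationid        INTEGER PRIMARY KEY,
    c_collections_objectid  INTEGER NOT NULL REFERENCES collections_object (collections_objectid),
    c_taxonid               INTEGER NOT NULL REFERENCES taxon (taxonid),
    id_order                VARCHAR(10)   -- e.g. 'Current' or 'Previous'
);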
Physical design
An information model is a conceptual design
for a database. It describes the concepts to
be stored in the database. Implementation
of a database from an information model involves converting the entities, attributes, and relationships of the model into the tables, fields, and keys of a working database.
[Table fragment: collection object rows with repository ANSP; catalog_number values 34000, 34001, 28342, and 100382; and date values such as 1902/--/--, 1986/--/--, and 1998/05/--.]
SELECT DISTINCT collections_object.catalog_number
FROM taxon
LEFT JOIN identification
    ON taxon.taxonid = identification.c_taxonid
LEFT JOIN collections_object
    ON identification.c_collections_objectid = collections_object.collections_objectid
WHERE
    taxon.taxon_name = 'Chicoreus ramosus';
[Table fragment: identification rows with Generic_higher and trivial values Murex sp., Murex ramosus, and Murex bicornis, and a date_identified field.]
Maintaining integrity
In a spreadsheet or a flat file database,
deleting a record is a simple matter of
removing a single row. In a relational
database, removing records and changing
the links between records in related tables
becomes much more complex. A relational
database needs to maintain database
integrity. An important part of maintaining
integrity is knowing what to do with
related records when you delete a record on
one side of a join. Consider a scenario:
You are cataloging a collection object and
you enter data about it into a database
(identification, locality, catalog number, kind
of object, etc...). You then realize that you
entered the data for this object yesterday,
and you are creating a duplicate record that
you want to delete. How far does the delete
go? You no doubt want to get rid of the
duplicate record in the collection object table
and the identifications attached to this
record, but you don't want to keep following
the links out to the authority file for taxon
names and delete the names of any taxa
used in identifications. If you delete a
collections object you do not want to leave
orphan identifications floating around in the
database unlinked to any collections object.
These identifications (carrying a foreign key
for a collections object that doesn't exist)
can show up in subsequent queries and
have the potential to become linked to new
collections objects (silently adding incorrect
identifications to them as they are created).
Such orphan records, which retain links to
no longer existent records in other tables,
violate the relational integrity of the
database.
When you delete a record, you may or may not want to delete related records in other tables along with it.
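The behavior of deletes can be declared as part of the foreign keys themselves. The constraint names below are hypothetical, and the sketch assumes tables like those shown earlier, declared without these constraints:

-- deleting a collections object automatically removes its identifications
ALTER TABLE identification
    ADD CONSTRAINT fk_identification_collobj
    FOREIGN KEY (c_collections_objectid)
    REFERENCES collections_object (collections_objectid)
    ON DELETE CASCADE;
-- but a taxon name that is still used in an identification cannot be deleted
ALTER TABLE identification
    ADD CONSTRAINT fk_identification_taxon
    FOREIGN KEY (c_taxonid)
    REFERENCES taxon (taxonid)
    ON DELETE RESTRICT;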
Interface design
Simultaneously with the conceptual and
physical design of the back end of a
database, you should be creating a design
for the user interface to access the data.
Existing user interface screens for a legacy
database, paper and pencil designs of new
screens, and mockups in database systems
with easy form design tools such as
Filemaker and MS Access are of use in
interface design. I feel that the most
important aspect of interface design for
databases is to fit the interface to the
workflow, abstracting the user interface
away from the underlying complex data
structures and fitting it to the tasks that
users perform with the data. A typical user
interface problem is to place the user
interface too close to the data by creating
one data entry screen for each table in the
database. In anything other than a very
simple database, having the interface too
close to the data ends up in a bewildering
profusion of pop up windows that leave
users entirely confused about where they
are in data entry and how the current open
window relates to the task at hand.
Practical Implementation
Be Pragmatic
Most natural history collections operate in
an environment of highly limited resources.
Properly planning, designing, and
implementing a database system following
all of the details of some of the information
models that have been produced for the
community (e.g. Morris 2000) is a task
beyond the resources of most collections. A
reasonable estimate for a 50 to 100 table
database system includes about 500-1000
stored procedures, more than 100,000 lines
of user interface code, one year of design,
two or more years of programming, a
development team including a database
programmer, database administrator, user
interface programmer, user interface
designer, quality control specialist, and a
technical writer, all running to some
$1,000,000 in costs. Clearly natural history
collections that are developing their own
database systems (rather than using
external vendors or adopting community
based tools such as BioLink [CSIRO, 2001]
or Specify) must make compromises.
These compromises should involve
selecting the most important elements of
their collections information, spending the
most design, data cleanup, and programing
effort on those pieces of information, and
then omitting the least critical information or
storing it in less than third normal form data
structures.
A possible candidate for storage in less than
ideal form is the generalized Agent concept.
Approaches to management of
date information
Dates in collection data are generally
problematic as they are often known only to
a level of precision less than a single day.
Dates may be known to the day, or in some
cases to the time of day, but often they are
known only to the month, or to the year, or
to the decade. In some cases, dates are
known to be prior to a particular date (e.g.
the date donated may be known but the
date collected may not other than that it is
sometime prior to the date donated). In
other cases dates are known to be within a
range (e.g. between 1932-June-12 and
1932-July-15); in yet others they are known
to be within a set of ranges (e.g. collected in
the summers of 1852 to 1855). Designing
database structures to be able to effectively
store, retrieve, sort, and validate this range
of possible forms for dates is not simple
(Table 18).
Using a single field with a native date data
type to hold collections date information is
generally a poor idea as date data types
require each date to be specified to the
precision of one day (or finer). Simply
storing dates as arbitrary text strings is
flexible enough to encompass the wide
variety of formats that may be encountered,
but storing dates as arbitrary strings does
not ensure that the values added are valid
dates, in a consistent format, are sortable,
or even searchable.
Storage of dates effectively requires the
implementation of an indeterminate or
arbitrary precision date range data type
supported by code. An arbitrary precision
date data type can be implemented most
simply by using a text field and enforcing a
format on the data allowed into that field (by
binding a picture statement or format
expression to the control used for data entry
into that field, or to the validation rules for the field).
Table 18. Alternative structures for storing collection date information and the issues associated with each.

Single date field, date data type: Native sorting, searching, and validation. Unable to store date ranges; will introduce false precision into data.

Single date field, character: Can sort on start date; can handle single dates and date ranges easily. Needs a minimum of pattern or format applied to entry data; requires code to test date validity.

Start date and end date fields, two character fields, 6 character fields, or 6 integer fields: Able to handle date ranges and arbitrary precision dates. Straightforward to search and sort. Requires some code for validation.

Start date and end date fields, two date fields: Native sorting and validation. Straightforward to search. Able to handle date ranges and arbitrary precision. Requires a carefully designed user interface with supporting code for efficient data entry.

Date table containing date field and start/end field; date field, numeric fields, character field, or character fields: Handles single dates, multiple non-continuous dates, and date ranges. Needs complex user interface and supporting code.
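A minimal sketch of the character-field approach follows. The table and field names are hypothetical, and the LIKE patterns only enforce the general shape of the value, not that it is a real date; fuller validation still needs supporting code, as noted above.

CREATE TABLE collecting_event (
    collecting_event_id   INTEGER PRIMARY KEY,
    -- dates stored as strings of the form 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'
    date_collected_start  VARCHAR(10),
    date_collected_end    VARCHAR(10),
    CHECK (date_collected_start LIKE '____-__-__'
        OR date_collected_start LIKE '____-__'
        OR date_collected_start LIKE '____')
);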
Handling hierarchical
information
Hierarchies are pervasive in biodiversity
informatics. Taxon names, geopolitical
entities, chronostratigraphic units and
collection objects are all inherently
hierarchical concepts; they can all be represented as trees.
Table 19. A flat table holding a higher classification.

T_Order           Family      Sub Family   Genus
Caenogastropoda   Muricidae   Muricinae    Murex
Caenogastropoda   Muricidae   Muricinae    Chicoreus
Caenogastropoda   Muricidae   Muricinae    Hexaplex

Table 20. Examples of problems with a hierarchy placed in a single flat file.

Class        T_Order           Family      Sub Family   Genus
Gastropoda   Caenogastropoda   Muricidae   Muricinae    Murex
Gastropod    Caenogastropoda   Muricidae   Muricinae    Chicoreus
Gastropoda   Neogastropoda     Muricinae   Muricidae    Hexaplex
Denormalized table
A typical legacy structure for the storage of
higher taxonomy is a single table containing
separate fields for Genus, Family, Order
and other higher ranks (Table 19). (Note
that order is a reserved word in SQL [as in
the ORDER BY clause of a SELECT
statement] and most SQL dialects will not
allow reserved words to be used as field or
table names. We thus need to use some
variant such as T_Order as the actual field
name). Placing a hierarchy in a single flat
table suffers from all the risks associated
with non normal structures (Table 20).
Placing higher taxa in a flat table like Table
19 does allow for very easy extraction of the
higher classification of a taxon in a single
query as in the following examples. A flat
file is often the best structure to use for a
read only copy of a database used to power
a searchable website. Asking for the family
to which a particular genus belongs is very
simple:
SELECT family FROM higher_taxonomy
WHERE genus = 'Chicoreus';
Edge Representation

In an edge representation, each row of the table holds a taxon name together with the name of its immediate higher taxon (and, here, its rank):

Taxon_name        Rank     Higher_taxon
Gastropoda        Class    [root]
Caenogastropoda   Order    Gastropoda
Muricidae         Family   Caenogastropoda
Murex             Genus    Muricidae
Chicoreus         Genus    Muricidae

[A second edge-representation table fragment, with Higher_taxon values [root], Gastropoda, Murex, Muricidae, and Muricidae, is not fully recoverable here.]
Code Example 1.

// Pseudocode to illustrate repeated queries on an edge
// representation table to walk up the tree from the current
// node to a higher taxon node at the rank of family.
// Note: handles some but not all error conditions.
$root_marker = "[Root]";        // value of higher taxon of root node
$max_traverse = 1000;           // larger than any leaf to root path
$rank = "";                     // holds rank of parent nodes
$targetName = $initialTarget;   // start with some taxon
$counter = 0;                   // counter to trap for infinite loops
while (($rank <> "Family")                  // desired result
       and ($counter < $max_traverse)       // errors: infinite loop/no parent
       and ($targetName <> $root_marker))   // error: reached root
{
    $counter++;                 // increment counter
    $sql = "SELECT taxon_name, rank, higher_taxon
            FROM higher_taxon
            WHERE taxon_name = '$targetName'";
    $results = getresults($connection, $sql);
    if (numberofrows($results) == 0)
    {
        // error condition: no parent found for current node
        $counter = $max_traverse;    // exploit infinite loop trap
    } else {
        // a parent exists for the current node
        $row = getrow($results);
        $currentname = getcolumn($row, "taxon_name");
        $rank = getcolumn($row, "rank");
        $targetName = getcolumn($row, "higher_taxon");
    } // endif
} // wend
Table 24. Storing the path from the current node to the root of the tree redundantly in a parentage field; example shows a subset of fields from BioLink's tblBiota (CSIRO, 2001).

intBiotaID (PK)   vchrFullName      intParentID   vchrParentage
1                 Gastropoda                      /1
2                 Caenogastropoda   1             /1/2
3                 Muricidae         2             /1/2/3
4                 Chicoreus         3             /1/2/3/4
5                 Murex             3             /1/2/3/5
Tree Visitation
A different method for storing hierarchical
structures in a database is the tree visitation
algorithm (Celko, 1995a). This method
works by traversing the tree and storing the
step at which each node is entered in one
field and in another field, the step at which
each node is last exited. A table that holds
a tree structure includes two fields (t_left
and t_right [note that left and right are
usually reserved words in SQL and can't be
used as field names in many database
systems]). To store a tree using these two
fields, set a counter to zero, then traverse
the tree, starting from the root. Each time
you enter a node, increment the counter by
one and store the counter value in the t_left
field. Each time you leave a node for the
last time, increment the counter by one and
store the counter value in the t_right field.
You traverse the tree depth first, visiting a node, then the first of its children, then the first of that child's children, passing down until you reach a leaf node, then going back to the leaf's parent and visiting its next child.
Table 25 holds the classification for 6 taxa.
The tree traversal used to set the left and
right values is shown below in Figure 21.
The value of the counter on entry into a
node is shown in yellow; this is the value
stored in the left field for that node. The
value of the counter on the last exit from a
node is shown in red; this is the value stored in the t_right field for that node.
Code Example 2.

// Pseudocode to illustrate a query on an edge representation table containing a parentage
// string. Uses field names found in BioLink table tblBiota.
// Note: does not handle any error conditions.
// Note: does not include code to place higher taxa in correct order.
$targetName = $initialTarget;      // name of the taxon whose parentage you want to find
$sql = "SELECT vchrParentage
        FROM tblBiota
        WHERE vchrFullName = '$targetName'";   // get parentage string
$results = getresults($connection, $sql);      // run query on database connection
$row = getrow($results);                       // get the result set from the query
$parentage = getcolumn($row, "vchrParentage"); // get the parentage string
// the parentage string is a list of ids separated by slashes, e.g. /1/2/3/4
@array_parentage = split($parentage, "/");     // split the parentage string into ids
$sql = "SELECT vchrFullName FROM tblBiota WHERE ";
$first = TRUE;
for ($x = 1; $x < rowsin(@array_parentage); $x++) {
    // for each id in the parentage, build a query to get the name of the taxon
    if ($first == FALSE) { $sql = $sql + " or "; }
    // Note: enclosing integer values in quotes in sql queries is usually not
    // required, but is a good programing practice to help prevent sql injection
    // attacks
    $sql = $sql + "intBiotaID = '" + @array_parentage[$x] + "'";
    $first = FALSE;
}
$results = getresults($connection, $sql);   // now run assembled query to get names
for ($x = 1; $x < rowsin($results); $x++) {
    $row = getrow($results);
    @array_taxa[$x] = getcolumn($row, "vchrFullName");
}
Table 25 (fragment). t_left and t_right values from the tree traversal shown in Figure 21.

t_left   t_right
1        12
2        11
3        10
4        9
5        6
7        8
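With t_left and t_right values stored in this way, all of the taxa contained within a higher taxon can be returned in a single query. A minimal sketch, assuming a hypothetical table treetable with taxon_name, t_left, and t_right fields:

SELECT child.taxon_name
FROM treetable AS parent, treetable AS child
WHERE parent.taxon_name = 'Gastropoda'
  AND child.t_left BETWEEN parent.t_left AND parent.t_right;

Retrieval of a subtree is thus a single query, at the cost of having to recalculate left and right values when nodes are inserted or moved.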
Indexing
An index allows a DBMS to very rapidly look up rows in a table.
choice of fields to index is a very important
aspect of database implementation, and
careful tuning of indexes is an important
database administration task for a large
database. Being able to reduce the time for
a database operation from 10 seconds to 2 seconds can make a real difference to the people using the database.
Figure 22. An index can greatly increase the speed at which data is retrieved from a database, but can slow
the rate of data insertion as both the table and its indexes need to be updated.
Figure 23. Page locks. Many database systems store data internally in pages, and an operation that locks a
row in a page may also lock all other queries trying to access any other row on the same page.
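Indexes are created on particular fields of particular tables. A sketch using hypothetical table and field names consistent with the examples above:

-- speed up lookups of collection objects by their catalog number
CREATE INDEX idx_catalog_number ON collections_object (catalog_number);
-- speed up searches on taxon name
CREATE INDEX idx_taxon_name ON taxon (taxon_name);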
Data Stewardship
We often state to groups on tours through
natural history collections that our
specimens are of no value without their
associated data. We are well used to
thinking of stewardship of our collection
objects and their associated paper records,
but we must also think carefully about the
stewardship of electronic data associated
with those specimens. I will divide data
stewardship into two concepts: 1) Data
security, that is, maintaining the data in an environment where it is safe from malicious damage and accidental loss. Data security encompasses network security, administration, pre-planning, and risk mitigation. 2) Data quality control, that is,
procedures to assist the users of the
database in maintaining clean and
meaningful data. Data quality control is of
particular importance during data migration,
as we are often performing large scale
transformations of collections records, and
thus need to carefully plan to prevent both
the loss of information and the addition of
new erroneous information. Data
stewardship involves the short term data
security tasks of system administration and
quality control and the long term perspective
of planning for movement of the data
through multiple database lifecycles.
Data Security
Providing carefully planned multiple layers
of security has become an essential part of
maintaining the integrity of electronic data.
Any computer connected to the Internet is a
target for attack, and any computer is a
target for theft. Simply placing a computer on a network puts the data it holds at risk.
Threat analysis
Computer and network security covers a
vast array of complex issues. Just as in
other areas of collection management,
assessing the risks to biodiversity and
collections information is a necessary
precursor to effective allocation of resources
to address those risks. Threat analysis is a
comprehensive review and ranking of the
risks associated with computer
infrastructure and electronic data. A threat
analysis of natural history collection data
housed within an institution will probably
suggest that the highest risks are as follows.
Internal and external security threats exist
for biodiversity information and its
supporting infrastructure. The two greatest
threats are probably equipment theft and
non-targeted hacking attacks that seek to
use machines as resources. An additional
severe risk is silent data corruption (or
malicious change) creeping into databases
and not being noticed for years. Risks also
exist for certain rare species. The release
of collecting locality information for rare and
endangered species may be a threat to
those species in the wild. Public distribution
of information about your institution's
holdings of valuable specimens may make
you a target for theft. In addition, for some
collections, access to some locality
information might be restricted by collecting
agreements with landowners and have
contractual limits on its distribution. A threat
analysis should suggest to you where
resources should be focused to maintain the
security of biodiversity data.
Michael Wojcik put it nicely in a post to Bugtraq: "What you need is a weighted threat model, so you can address threats in an appropriate order. (The exact metric is debatable, but it should probably combine..."
Implementing Security
If you go to discuss database security with
your information technology support people,
and they tell you not to worry because the
machine is behind the firewall, you should
worry. Modern network security, like
navigation, should never rely on any single
method. As the warning on marine charts
goes, "The prudent navigator will not rely solely on any single aid to navigation." The core idea in modern computer network security is defense in depth. Security for
your data should include a UPS, regular
backup, offsite backup, testing backup
integrity and the ability to restore from
backup, physical access control, need-limited access to resources on the network,
appropriate network topology (including
firewalls), encryption of sensitive network
traffic (e.g. use ssh rather than telnet),
applying current security patches to
software, running only necessary services
on all machines, and having an incident
response capability and disaster recovery
plan in place.
Other than thinking of defense in depth, the
most important part of network security is
implementing a rational plan given the
available resources that is based upon a
threat analysis. It is effectively impossible to
secure a computer network against attacks
from a government or a major corporation.
All that network security is able to do is to
raise the barrier of how difficult it will be to
penetrate your network (either electronically
or physically) and increase the resources
that an intruder would need to obtain or alter
your data. Deciding what resources to
expend and where to put effort to secure
your information should be based on an
analysis of the threats to the data and your
computer infrastructure.
While electronic penetration risks do exist
for collections data (such as extraction of
endangered species locality data by illicit
collectors, or a malicious intruder placing a
child pornography ftp site on your web server), the most probable threats remain equipment theft and non-targeted hacking attacks that seek to use your machines as resources.
Hardware
Any machine on which a database runs
should be treated as a server. If that
machine loses power and shuts down, the
database may be left in an inconsistent
state and become corrupt. DBMS software
usually stores some portion of the changes
being made to a database in memory. If a
machine suddenly fails, only portions of the
information needed to make a consistent
change to the data may have been written
to disk. Power failures can easily cause
data corruption. A minimum requirement for
a machine hosting a database (but not a
machine that is simply connecting to a
database server as a client) is an
uninterruptible power supply connected to
the server with a power monitoring card and
with software that is capable of shutting
down services (such as the DBMS) in the
event of a prolonged power failure. In a
brief power outage, the battery in the UPS
provides power to keep the server running.
As the battery starts to run down, it can
signal the server that time is running out,
and software on the server can shut down
the applications on the server and shut
down the server itself. Another level of
protection that may be added to a server is
to use a set of hard disks configured as a
redundant RAID array (e.g. level 5 RAID).
RAID arrays are capable of storing data redundantly, so that the failure of a single disk does not result in the loss of any data.
Backup
Any threat analysis of biodiversity data will
end up placing backups at or near the top of
the risk mitigation list. A well planned
backup scheme is of vital importance for a
collection database. A good plan arises
directly out of a threat analysis. The rate of
change of information in a database and the
acceptable level of loss of that information
will strongly influence the backup plan.
Each database has different rates of
change, different backup needs and thus
requires a plan appropriate for its needs.
Regardless of how good a backup plan is in
place, no records related to collection
objects data should be held solely in
electronic form. Paper copies of the data
(preferably labels, plus printed catalog
pages) are an essential measure to ensure
the long term security of the data. In the
event of a catastrophic loss of electronic
data, paper records provide an insurance
that the information associated with
collection objects will not be lost.
Backups of electronic data should include
regular on-site backups, regular off-site
transport of backups, making copies on
durable media, and remote off-site
exchanges. The frequency and details of
the backup scheme will depend on the rate
of data entry and the threat analysis
assessment of how much data you are
willing to re-enter in the event of failure of
the database. A large international project
with a high rate of change might mandate
the ability to restore up to the moment of
failure. Such a project might use a backup
scheme using daily full backups, hourly
backups of incremental changes since the
last full backup, and a transaction log that
allows restoration of data from the last
backup to the most recent entry in the
transaction log, requiring the use of a SQL
server DBMS capable of this form of robust
backup. (A transaction log records every update, insert, and delete made to the database and allows recovery from the last full backup up to the most recent entry in the log.)
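As an illustration only (Transact-SQL syntax for Microsoft SQL Server, with a hypothetical database name and file paths), a full backup and a transaction log backup can be taken with statements such as:

BACKUP DATABASE collections
    TO DISK = 'D:\backups\collections_full.bak';
BACKUP LOG collections
    TO DISK = 'D:\backups\collections_log.trn';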
Access Control
Any person with high level access to a
database can do substantial damage, either
accidentally or on purpose. Anyone with
high level access to a database can do
substantial damage with a simple command
such as DROP DATABASE, or more subtly
through writing a script that gradually
degrades the data in a database. No purely
electronic measures can protect data from
the actions of highly privileged users.
Database users should be granted only the
minimum privileges they need to do their
work. This principle of minimum privilege is
a central concept in computer security
(Hoglund & McGraw, 2004). Users of a
biodiversity database may include guests,
data entry personnel, curators,
programmers, and system administrators.
Guests should be granted read (select) access only, and that only to portions of the
database. Low level data entry personnel
need to be able to enter data, but should be
unable to edit controlled vocabularies (such
as lists of valid generic names), and
probably should not be able to create or
view transactions involving collection objects
such as acquisitions and loans. Higher level
users may need rights to alter controlled
vocabularies, but only system
administrators should have the ability to
grant access rights or create new users.
Database management systems include, to
varying degrees of granularity, the ability to
grant users rights to particular operations on
particular objects in a database. Many
support some form of the SQL command
GRANT rights ON resource TO user. Most
access control is best implemented by
simply using the access control measures
present in the database system, rather than
coding access control as part of the
business rules of a user interface to the
database. Restriction of access to single
records in the database (row level access
control), however, usually needs to be
implemented in higher layers.
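A sketch of such grants, with hypothetical user names and the collections_object table used in earlier examples:

-- a guest account gets read-only access to a single table
GRANT SELECT ON collections_object TO guest_user;
-- data entry staff may add and edit records, but not delete them
GRANT SELECT, INSERT, UPDATE ON collections_object TO data_entry;
-- rights that are no longer needed should be revoked
REVOKE UPDATE ON collections_object FROM data_entry;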
Physical access control is also important. If
a database server is placed in some readily
accessible space, a passerby might shut it
down improperly causing database
corruption, unplug it during a disk write
operation causing physical failure of a hard
disk, or simply steal it. Servers are best
maintained in spaces with access limited to a small number of trusted staff.
System Administration
Creating new user accounts, deactivating
old user accounts and other such
maintenance of user access rights are tasks
that fall to a database administrator. A few
years ago, such maintenance of rights on
the local machine, managing backups and
managing server loads were the principle
tasks of a system administrator. Today,
defense of the local network has become an absolutely essential system administration
task. A key part of that defense is
maintaining software with current security
patches. Security vulnerabilities are
continuously brought to light by the
computer security research community.
Long experience with commercial vendors
unresponsive to anything other than public
disclosure has led to a widespread adoption
of an ethical standard practice in the
security community. An ethical security
researcher is expected to notify software
vendors of newly discovered vulnerabilities,
then provide a 30 day grace period for the
vendor to fix the vulnerability, followed by
public release of details of the vulnerability.
Immediate public release of vulnerabilities is
considered unethical as it does not provide
software vendors with any opportunity to
protect people using their software. Waiting
an indefinite period of time for a vendor to
patch their software is also considered
unethical, as this leaves the entire user
community vulnerable without knowing it (it
being the height of hubris for a security
researcher to assume that they are the only
person capable of finding a particular
vulnerability). A 30 day window from vendor
notification to public release is thus
considered a reasonable ethical
compromise that best protects the interests
of the software user community. Some
families of vulnerabilities (such as C strcpy
buffer overflows, and SQL injection) have
proven to be very widespread and relatively
easy to locate in both open and closed
source software. A general perception
among security researchers (e.g. Hoglund
and McGraw, 2004 p.9) is that many
exploits are known in closed circles among
malicious and criminal programmers well before they are discovered by security researchers and patched by vendors.
Figure 24. A possible network topology with a web accessible database readily accessible to the Internet,
and a master database behind more layers of security. The web accessible copy of a database can be
denormalized and heavily indexed relative to the master database.
function sanitize() {
    // strip any characters that are not letters from the genus and
    // trivial epithet before they are used in a query
    global $genus, $trivial;
    $genus = preg_replace("/[^A-Za-z]/", "", $genus);
    $trivial = preg_replace("/[^a-z]/", "", $trivial);
}
Quality Control
Quality control on data entry covers a series
of questions: Are literal data captured
correctly? Are inferences correctly made
from the literal data? Are data correctly
captured into the database, with information
being entered into the correct fields in the
correct form? Are links to resources being
made correctly (e.g. image files of
specimens)? Are the data and inferences
actually correct? Some of these questions
can be answered by a proofreader, others
by queries run by a database administrator,
whereas others require the expert
knowledge of a taxonomist. Tools for
quality control include both components
built into the database and procedures for
people to follow in interacting with the
database.
At the database level controls can be added
to prevent some kinds of data entry errors.
At the most basic level, field types in a
database can limit the scope of valid entries.
Integer fields will throw an error if a data
entry person tries to enter a string. Date
fields require a valid date. Most data in
biodiversity databases, however, goes into
string fields that are very loosely
constrained. Sometimes fields holding
string data can be tightly constrained on the
database level, as in fields that are declared
as enumerations (limited in their field
definition to a small set of values). In some cases a string field can instead be linked to a separate table that holds a controlled vocabulary of allowed values.
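A sketch of both approaches (the ENUM type is MySQL syntax, the CHECK constraint is a more widely portable alternative, and the table and field names are hypothetical):

CREATE TABLE specimen_part (
    specimen_part_id  INTEGER PRIMARY KEY,
    -- an enumeration limits the field to a fixed list of values (MySQL syntax)
    preparation       ENUM('Shell', 'Alcohol', 'Dry') NOT NULL,
    -- a check constraint is a more portable way to constrain a string field
    sex               VARCHAR(10) CHECK (sex IN ('male', 'female', 'unknown'))
);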
Error amplification
Large complex data sets that are used for
analyses by many users, who often lack a
clear understanding of data quality, are
susceptible to error amplification. Natural
history collection data sets have historically
been used by specialists examining small
numbers of records (and their related
specimens). These specialists are usually
highly aware of the variability in quality of
specimen data and routinely supply
corrections to identifications and
provenance data. Consider, in contrast, the
potential for large data sets of collection
information to be linked to produce, for
example, range maps for species
distributions based on catalog data.
Consider such a range map incorporating
erroneous data points that expand the
apparent range of a species outside of its
true range. Now consider a worker using
these aggregate data as an aid to identifying
a specimen of another species, taken from
outside the range of the first species, then
mistakenly identifying their specimen as a
member of the first species, and then
depositing the specimen in a museum
(which catalogs it and adds its incorrect data
to the aggregate view of the range of the
first species). Data sets that incorporate
errors (or inadequate metadata) from which
inferences can be made, and to which new
data can be added based on those
inferences are subject to error amplification.
Error amplification is a known issue in
molecular sequence databases such as GenBank (Pennisi, 1999; Jeong and Chen, 2001), where the incorrect annotation of one sequence can propagate as other sequences are annotated based upon it.
Documentation Problems
A common legacy problem during data
migration is poor documentation of the
original database design. Consider a
database containing a set of three fields for
original genus, original specific epithet, and
original subspecies epithet, plus another
three fields for current genus, current
specific epithet, and current subspecies
epithet, plus a single field for author and
year of publication (Figure 25).
Atomization problems
Sometimes legacy datasets include multiple
concepts placed together in a single field. A
common offender is the use of two fields to
hold author, year of publication, and
parentheses. Rows make plain sense for
combinations that use the original genus
(and don't parenthesize the author), but end
up with odd looking data for changed
combinations (Table 26).
Table 26. Parentheses included in the author field (only the Generic name column header is recoverable here).
Normalization Problems:
Multiple values in one field
Legacy datasets very often contain fields
that hold multiple repeated pieces of the
same sort of information. These fields are
usually remains of efforts to store one to
many relationships in the flat file format
allowed by early database systems. The
goal in parsing these fields is to split them
into their components and to place these
components into separate rows in a new
table, which is joined to the first in a many to
one relationship.
If users have been consistent in the
delimiter used to separate multiple repeated
components within the field (e.g. semicolon
as a separator in R1; R2), then writing a
parser is quite straightforward. Twenty five
to thirty year old data, however, can often
contain inconsistencies, and you will need to
carefully examine the content of these fields
to make sure that the content is consistent.
This is especially true if the text content is
complex and some of the text could contain the delimiter character itself.
Normalization Problems:
Duplicate values with
misspellings
If the structures that data are stored in are
not in at least third normal form, they will
allow a field to contain multiple rows
containing duplicate values. A locality table
that contains fields for country and state will
contain duplicate values for country and
state for as many localities as exist in a
single state. Over time, new values entered
by different people, imports from external
data sources, and changes to the database will introduce variant spellings of what should be identical values.
A list of the distinct values found in the legacy fields, with map_country and map_primary_division columns added to hold the corrected values:

country   primary_division   map_country     map_primary_division
USA       PA                 United States   Pennsylvania
US        Pennsylvania       United States   Pennsylvania
          Pennsylvana        United States   Pennsylvania
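One way to apply such a mapping is to extract the distinct values, correct them in a mapping table, and then update the legacy data from that table. A sketch, assuming hypothetical locality and cleanup_map tables and using MySQL's multiple-table UPDATE syntax:

-- list the distinct values present in the legacy fields
SELECT DISTINCT country, primary_division FROM locality;
-- after the map_ columns of cleanup_map have been filled in by hand,
-- write the corrected values back to the legacy table
UPDATE locality, cleanup_map
   SET locality.country = cleanup_map.map_country,
       locality.primary_division = cleanup_map.map_primary_division
 WHERE locality.country = cleanup_map.country
   AND locality.primary_division = cleanup_map.primary_division;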
Figure 29. Layers in a database system. A database with server side code can focus on maintaining the
integrity of the data, while the user interface can be separated and connect either locally to the server or
remotely over a network (using various network transport layers). A stable and well defined application
programming interface can allow multiple clients of different sorts to connect to the back end of the database
and allow both client and server code to be altered independently.
Figure 30. BioLink (Windows) and OBIS Toolkit (Java, cross platform) interfaces to a BioLink database on a
MS SQLServer, with a web copy of the database. The separation of user interface and database layers
allowed the OBIS Indo-Pacific Mollusc project to use both BioLink's Windows client and a custom cross-platform Java application for global (US, France, Australia) distributed data entry.
Conclusion
Biological information is inherently complex.
Management and long term stewardship of
information related to biological diversity is a
challenging task, indeed at times a daunting
task. Stewardship of biodiversity
information relies on good database design.
The most important principles of good
database design are to atomize information,
to reduce redundant information, and to
design for the task at hand. To atomize
information, design tables so that each field
contains only one concept. To reduce
redundant information, design tables so that
each piece of information is stored in only one row. Management of database systems
requires good day to day system
administration. Database system
administration needs to be informed by a
threat analysis, and should employ multiple overlapping means of protecting the security and integrity of the data.
Acknowledgments
This document had its origins in a
presentation I gave at the Use of Digital
Technology in Museums workshop at
SPNHC in 2003. I would like to thank Tim
White for encouraging me to talk about the
fundamentals of database design using
examples from the domain of natural history
collections information in that workshop.
This paper has been substantially improved
by comments from and discussions with
Susan F. Morris, Kathleen Sprouffske,
James Macklin, and an anonymous
reviewer. Gary Rosenberg has provided
many stimulating discussions over many
years on the management and analysis of
collections and biodiversity data. I would
also like to thank the students in the
Biodiversity Informatics Seminar for the
REU program at ANSP in the summer of
2004 for many fruitful and pertinent
discussions. The PH core tables example
owes much to extensive discussions, design
work and coding with James Macklin. A
large and generous community of open
source programmers have produced a wide
variety of powerful tools for working with
data including the database software
MySQL, and PostgreSQL, and the CASE
tool Druid.
References
Anley, C. 2002. Advanced SQL Injection in SQL Server Applications. NGSSoftware Insight
Security Research [WWW PDF Document] URL
http://www.nextgenss.com/papers/advanced_sql_injections.pdf
ASC [Association of Systematics Collections] 1992. An information model for biological
collections. Report of the Biological Collections Data Standards Workshop, 8-24 August
1992. ASC. Washington DC. [Text available at: WWW URL
http://palimpsest.stanford.edu/lex/datamodl.html ]
ANSI X3.135-1986. Database Language SQL.
Beattie, S., S. Arnold, C. Cowan, P. Wagle, C. White, and A. Shostack 2002. Timing the
Application of Security Patches for Optimal Uptime. LISA XVI Proceedings p.101-110.
[WWW Document] URL
http://www.usenix.org/events/lisa02/tech/full_papers/beattie/beattie_html/
Beccaloni, G.W., M.J. Scoble, G.S. Robinson, A.C. Downton, and S.M. Lucas 2003.
Computerizing Unit-Level Data in Natural History Card Archives. pp. 165-176 In M.J.
Scoble ed. ENHSIN: The European Natural History Specimen Information Network. The
Natural History Museum, London.
Bechtel, K. 2003. Anti-Virus Defence In Depth. InFocus 1687:1 [WWW document] URL
http://www.securityfocus.com/infocus/1687
Berendsohn, W.G., A. Anagnostopoulos, G. Hagedorn, J. Jakupovic, P.L. Nimis, B.
Valdés, A. Güntsch, R.J. Pankhurst, and R.J. White 1999. A comprehensive
reference model for biological collections and surveys. Taxon 48: 511-562. [Available at:
WWW URL http://www.bgbm.org/biodivinf/docs/CollectionModel/ ]
Blum, S. D. 1996a. The MVZ Collections Information Model. Conceptual Model. University of
California at Berkeley, Museum of Vertebrate Zoology. [WWW PDF Document] URL
http://www.mip.berkeley.edu/mvz/cis/mvzmodel.pdf and
http://www.mip.berkeley.edu/mvz/cis/ORMfigs.pdf
Blum, S. D. 1996b. The MVZ Collections Information Model. Logical Model. University of
California at Berkeley, Museum of Vertebrate Zoology. [WWW PDF Document] URL
http://www.mip.berkeley.edu/mvz/cis/logical.pdf
Blum, S.D., and J. Wieczorek 2004. DiGIR-bound XML Schema proposal for Darwin Core
Version 2 content model. [WWW XML Schema] URL
http://www.digir.net/schema/conceptual/darwin/core/2.0/darwincoreWithDiGIRv1.3.xsd
Bruce, T.A. 1992. Designing quality databases with IDEF1X information models. Dorset
House, New York. 547pp.
Carboni, A. et al. 2004. Druid, the database manager. [Computer software package distributed
over the Internet] http://sourceforge.net/projects/druid [version 3.5, 2004 July 4]
Celko, J. 1995a. SQL for smarties. Morgan Kaufmann, San Francisco.
Celko, J. 1995b. Instant SQL Programming. WROX, Birmingham UK.
Chen, P.P-S. 1976. The Entity-Relationship Model: Toward a unified view of data. ACM
Transactions on Database Systems. 1(1):9-36.
Codd, E.F. 1970. A Relational Model of Data for Large Shared Data Banks. Communications
of the ACM. 13(6):377-387
Connolly, T., C. Begg, and A. Strachan 1996. Database Systems: A practical approach to
design, implementation and management. Addison Wesley, Harrow, UK.
CSIRO 2001. BioLink 1.0 [Computer software package distributed on CD ROM] CSIRO
Publishing. [Version 2.0 distributed in 2003].
DuBois, P. 2003 [2nd ed.]. MySQL. Sams Publishing, Indiana.
DuBois, P. et al., 2004. Reference Manual for the 'MySQL Database System' [4.0.18] [info doc
file, available from http://www.mysql.com] MySQL AB.
Eisenberg, A., J. Melton, K. Kulkarni, J-E Michels, and F. Zemke 2003. SQL:2003 Has
been Published. Sigmod Record 33(1):119-126.
Elmasri, R. and S.B. Navathe 1994. Fundamentals of Database Systems. Benjamin
Cummings, Redwood City, CA.
Ferguson, N. and B. Schneier 2003. Practical Cryptography. John Wiley and Sons.
Hernandez, M.J. 2003. Database Design for Mere Mortals. Addison Wesley, Boston.
Hoglund, G. and G. McGraw 2004. Exploiting Software: How to Break Code. Addison Wesley,
Boston. 512pp.
ISO/IEC 9075:2003. Information technology -- Database languages -- SQL --.
Jeong, S.S. and R. Chen 2001. Functional misassignment of genes. Nature Biotechnology.
19:95.
Linsenbardt, M.A., and M.S. Stigler 1999. SQL SERVER 7 Administration. Osborne/McGraw
Hill, Berkeley CA. 680pp.
National Research Council, Committee on the Preservation of Geoscience Data and
Collections 2002. Geoscience Data and Collections: National resources in peril. The
National Academies Press, Washington DC. 108pp.
Morris, P.J. 2000. A Data Model for Invertebrate Paleontological Collections Information.
Paleontological Society Special Publications 10:105-108,155-260.
Morris, P.J. 2001. Posting to [email protected]: Security Advisory for BioLink users. Date:
2001-03-02 15:36:40 -0500 [Archived at: http://listserv.nhm.ku.edu/cgibin/wa.exe?A2=ind0103&L=taxacom&D=1&O=D&F=&S=&P=1910]
Pennisi, E. 1999. Keeping Genome Databases Clean and Up to Date. Science 286: 447-450
Petuch, E.J. 1989. New species of Malea (Gastropoda Tonnidae) from the Pleistocene of
Southern Florida. The Nautilus 103:92-95.
PostgreSQL Global Development Group 2003. PostgreSQL 7.4.2 Documentation: 22.2. File
system level backup [WWW Document] URL
http://www.postgresql.org/docs/7.4/static/backup-file.html
Pyle, R.L. 2004. Taxonomer: A relational data model for handling information related to
taxonomic research. Phyloinformatics 1:1-54
Resolution Ltd. 1998. xCase Professional. [Computer software package distributed on CD
ROM] Jerusalem, RESolution Ltd. [Version 4.0 distributed in 1998]
Schwartz, P.J. 2003. XML Schema describing the Darwin Core V2 [WWW XML Schema] URL
http://digir.net/schema/conceptual/darwin/2003/1.0/darwin2.xsd
Smith, M. 2002. SQL Injection: Are your web applications vulnerable? SpiLabs SPIDynamics,
Atlanta. [WWW PDF Document] URL
http://www.spidynamics.com/papers/SQLInjectionWhitePaper.pdf
Teorey, T.J. 1994. Database Modeling & Design. Morgan Kaufmann, San Francisco.
Wojcik, M. 2004. Posting to [email protected]: RE: Suggestion: erase data posted to
the Web Date: 2004-07-08 07:59:09 -0700. [Archived at:
http://www.securityfocus.com/archive/1/368351]
Field Summary

Field               Sql type       PrimaryKey   NotNull   Default          AutoIncrement
Herb Sheet ID       Integer
Name                Varchar(32)                           [Current User]
Date                Timestamp                             Now
Verified by         Varchar(32)                           Null
Verification Date   Date                                  Null
CaptureNotes        Text
VerificationNotes   Text
CurrentCollection   Varchar(255)
Verified            Boolean                               False
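As a sketch only, the field summary above might translate into a table declaration along these lines (MySQL-style syntax; field names have been converted to a single-word form, and the defaults for Name and Date would normally be supplied by the data entry interface):

CREATE TABLE herbarium_sheet (
    herb_sheet_id       INTEGER PRIMARY KEY AUTO_INCREMENT,
    name                VARCHAR(32),              -- data entry person, defaulted to the current user by the interface
    date_created        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    verified_by         VARCHAR(32) DEFAULT NULL,
    verification_date   DATE DEFAULT NULL,
    capture_notes       TEXT,
    verification_notes  TEXT,
    current_collection  VARCHAR(255),
    verified            BOOLEAN DEFAULT FALSE
);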
Fields
must have verification rights in order to verify a herbarium sheet and to enter a value in
this field.
Example Value: Macklin, James
Name: Verification Date
Type: Date
Definition: Date the herbarium sheet data were verified. Should be a valid date since the
inception of the database (2003 and later).
Domain: Date. Valid date greater than the inception date of the database (2003).
Business Rules: Optional. Automatic. Default to Null. On verification of record, fill with
current date. Current user must have verification rights in order to verify a herbarium sheet
and to enter a value in this field. Value must be more recent than value found in
Herbarium Sheet: Date (records can only be verified after they are created).
Example Value: 2004-08-12
Name: CaptureNotes
Type: Text
Definition: Notes concerning the capture of the data on the herbarium sheet made by the
person capturing the data. May be questions about difficult to read text, or other reasons
why the data related to this sheet should be examined by someone more experienced
than the data entry person.
Domain: Free text memo.
Business rules: Manual, Optional.
Example Values: "Unable to read taxon name Q_____ ___ba", "Locality description hard
to read, please check".
Name: VerificationNotes
Type: Text
Definition: Notes made by the verifier of the specimen. May be working notes by verifier
prior to verification of record.
Domain: Free Text Memo.
Business Rules: Manual, Optional. May contain a value even if Verified By and
Verification Date are null.
Example Values: "Taxon name is illegible", "Fixed locality description"
Name: CurrentCollection
Type: Varchar(255)
Definition: Collection in which herbarium sheet is currently stored. Data entry is currently
for the type collection, with some records being added for material in the general
systematic collection that are imaged to provide virtual loans.
Domain: "PH systematic collection", "PH type collection".
Business Rules: Required, Semimanual, default to "PH type collection". Should data
entry expand to routine capture of data from systematic collection, change to Manual and
allow users to set their own current default value.
Name: Verified
Type: Boolean
Definition: Flag to indicate if data associated with a herbarium sheet has been verified.
Domain: True, False.
Business rules: Required, Semiautomatic. Default to False. Set to true when record is
verified. Set the three fields Herbarium Sheet: Verified, Herbarium Sheet: Verified by, and
Herbarium Sheet: Verification Date together. Only a user with rights to verify material can
change the value of this field. Once a record has been verified once maintain these three
verification stamps and do not allow subsequent re-verifications to alter this information.