An Introduction To The Relational Model of Data and The Structured Query Language For Demographers
An Introduction To The Relational Model of Data and The Structured Query Language For Demographers
An Introduction To The Relational Model of Data and The Structured Query Language For Demographers
WORKING PAPER
PAC2003-0002
_______________________________________________________________________________
An Introduction to the Relational Model of Data and the Structured Query Language for Demographers
Samuel Clark
February 14, 2003
Printed: 2003-02-14
An Introduction to the Relational Model of Data and the Structured Query Language for Demographers
Samuel Clark
Abstract
The Relational Model of Data (RMD) underlies the vast majority of commercially available database management systems sold today, and the Structured Query Language (SQL) is the language through which relational databases are accessed and manipulated. It is in the nature of Demography to store and manipulate large quantities of information, and consequently, it is useful for Demographers to be conversant with the RMD and SQL. This communication provides an ultra brief introduction to the RMD and SQL, assuming the reader has no background in the theory of data management. Following that, a number of SQL queries that are particularly relevant to Demography are presented and discussed, including counts and frequencies, the calculation of person years, event/exposure rates, and the creation of persontime datasets suitable for event history analysis. Because this communication aims to impart some practical knowledge, all of the demographically relevant queries are constructed in a Microsoft Access database that is freely available for download as part of this communication.
Page 1 of 45
Printed: 2003-02-14
Table of Contents
1 The Relational Model of Data 1.1 Relations 1.2 Operators 1.3 Constraints The Structured Query Language 2.1 DDL - Data Definition Language 2.1.1 CREATE TABLE Example 2.1.2 ALTER TABLE Example 2.1.3 DROP TABLE Example 2.2 DML - Data Manipulation Language 2.2.1 INSERT Example 2.2.2 UPDATE Example 2.2.3 DELETE Example 2.2.4 SELECT and UNION Examples 2.2.5 ORDER BY and GROUP BY Clauses 2.3 SQL in Practical Use Examples of SQL Queries for Demographers 3.1 A Useful Desktop Relatio nal DaMicrosoft Access 3.1.1 MS Access JOIN Syntax 3.1.2 MS Access User-Defined Functions 3.2 Tables & Queries Used in Examples 3.2.1 People Table 3.2.2 Ages Table Sexes Table 3.2.3 Sex-Age-Groups Query 3.2.4 Dates Table 3.2.5 SQL Queries for Demographers Counts and Frequencies 4.1 Explanation of query: qrySelectFrequencyBySexAge_General 4.2 Explanation: qrySelectFrequencyBySexAge_General_AllAges 4.3 A Further Note on Counts and SQL Person-Years 5.1 Explanation of function: ftnIntervalIntersection 5.2 Explanation of query: qrySelectPersonYearsBySexAge_General Event-Exposure Rates 6.1 Explanation of query: qrySelectMortalityRate_General Person-Time Format Tables 7.1 Explanation of query: qryMakePersonYears Summary Bibliography 3 3 3 4 5 5 5 6 6 6 6 7 8 8 11 15 17 17 18 19 19 19 20 21 21 23 23 24 25 26 27 28 34 34 37 39 40 42 44 45
5 6 7 8 9
Page 2 of 45
Printed: 2003-02-14
1.1 Relations
A relation is composed of a heading and a body. The heading is a predicate composed of a number of attributes, each with its own domain of possible values. The body is a collection of tuples (or rows) that represent true propositions formed by choosing a value from the domain of each of the attributes specifie d in the predicate. Each tuple represents a set of attribute values that when substituted into the predicate form a true proposition. As a consequence: There is no order specified (or necessary) for the attributes; There is no order specified (or necessary) for the tuples; Each tuple contains a unique value for each attribute; and, There are no duplicate tuples (each true proposition is recorded exactly once).
1.2 Operators
The operators used to manipulate relations are similar, and in large part analogous, to the familiar set operators in mathematics. In fact, relations are carefully defined sets. All of the operators take relations as arguments and produce a relation as their result this is the relational closure property. Closure is critically important because it insures that one can write and evaluate nested relational expressions; those in which an expression can be substituted in place of a relation. The most fundamental operators used to manipulate relations include: Restrict; Project; Union; and, Join .
Restrict produces a relation whose tuples satisfy a condition. In other words, Restrict allows one to select a subset of the tuples in a relation. Project produces a relation with only the attributes specified in the Project expression. The result contains all of the tuples of the original relation with only those attributes specified in the Project expression. Restrict produces a row-wise subset, and Project produces an attribute-wise subset. Union produces a relation containing all of the tuples from two specified relations of the same type. Because all tuples in a relation must conform to the Page 3 of 45
Printed: 2003-02-14
same predicate, the two relations contributing to the union must be based on the same predicate, or in other words, be of the same type. This insures that the result is a valid relation, and closure is maintained. Join produces a relation that consists of all tuples produced by forming all possible pairs of tuples from two specified relations. Most often, there is at least one common attribute shared by the two relations, and the result of the join is subject to the condition that the values of the common attribute in both relations are equal. Join does precisely what its name implies - it takes two relations and 'glues' them together into a 'wider' relation. If you imagine the input relations as sets of tuples, Join takes the Cartesian product of the two sets of tuples, and then selects only those new tuples for which the values of the common attribute (s) of the input tuples are equal. There are many different varieties of the Join operator, the one described here being the most general. In addition to the manipulation operators, there is a set of operators used to create and modify relations and update the values of their attributes. These will be discussed briefly in the section 2.1 below.
1.3 Constraints
The two crucial data integrity constraints have to do with unique values that identify tuples. A primary key is an attribute whose value is unique across all tuples in a relation. In general it is useful for all relations to have a primary key. The primary key can consist of a calculated attribute that combines values from other attributes to yield a unique value. However, for reasons not discussed here, it is often cleaner and more efficient to have a single-attribute unique key in each relation. Tuples in two relations often share a semantic relationship that is represented by a link between the relations. Depending on the cardinality of the relationship (how many tuples in each relation are related to how many tuples in the other), primary key values from the parent relation are stored in a special attribute of the child relation. This attribute is called a foreign key. For each tuple in the child relation the foreign key attribute contains the (unique) value of the primary key attribute of a tuple in the parent relation. Foreign keys are constrained i the sense that a value stored in a foreign key attribute must exist n as a value of the primary key in one of the tuples of the parent relation. In circumstances where tuples in the child relation are nonsensical in the absence of corresponding tuples in the parent relation, the use of primary and foreign key constraints insures that no such orphan (and meaningless) tuples ever exist in the child relation. Constraints come in many other forms including the careful construction of the domains (or types) from which attribute values may be drawn and manipulated.
Page 4 of 45
Following are examples of these commands and the output they produce using IBM's implementation of SQL, available in their DB2 Universal Database management system. Three tables are created and will be used in the next set of examples.
Example 2-1
Example 2-2 creates a table named 'locations' with fields id, name, and type. Again, id is the primary key, and none of the fields can contain a null value.
CREATE TABLE locations (id Integer NOT NULL PRIMARY KEY, name VarChar(20) NOT NULL, Type Char(1) NOT NULL);
Example 2-2
Example 2-3 creates a table named 'lives_at' with fields id, pid, and lid. This is Page 5 of 45
a many-to -many relationship table that stores information linking people to the places where they live; hence, the name 'lives_at'. The id field is a primary key, no nulls are allowed in any fields, and the two fields pid and lid store values from the id fields of the people and locations tables and are, therefore, foreign keys.
CREATE TABLE lives_at (id Integer NOT NULL PRIMARY Key, pid Integer NOT NULL, lid Integer NOT NULL, FOREIGN KEY (pid) REFERENCES people (id), FOREIGN KEY (lid) REFERENCES locations (id));
Example 2-3
Example 2-4
Example 2-5
We will discuss INSERT through UNION and leave the interested reader to investigate the others.
Page 6 of 45
INSERT INTO people (id,name,sex) VALUES (1,'Mary','F'), (2,'Paul','M'), (3,'Simon','M'), (4,'Jane','F'), (5,'Albert','M'), (6,'Ruth','F'); ID ----------1 2 3 4 5 6 NAME -------------------Mary Paul Simon Jane Albert Ruth SEX DOB --- ---------F M M F M F -
Example 2-6
Example 2-7 populates the locations table with the values listed below:
INSERT INTO locations (id,name,type) VALUES (1,'Nairobi','U'), (2,'Accra','U'), (3,'Mtubatuba','R'), (4,'Sinafala','R'), (5,'Cairo','U'); ID ----------1 2 3 4 5 NAME -------------------Nairobi Accra Mtubatuba Sinafala Cairo TYPE ---U U R R U
Example 2-7
Example 2-8 populates the lives_at table with the listed values, shown below:
INSERT INTO lives_at (id,pid,lid) VALUES (1,1,4), (2,2,5), (3,3,2), (4,4,2), (5,5,1), (6,6,3); ID PID LID ----------- ----------- ----------1 1 4 2 2 5 3 3 2 4 4 2 5 5 1 6 6 3
Example 2-8
Page 7 of 45
UPDATE people SET dob = '1950-6-4' WHERE id = 1; UPDATE people SET dob = '1966-10-12' WHERE id = 2; UPDATE people SET dob = '1980-4-1' WHERE id = 3; UPDATE people SET dob = '1981-9-12' WHERE id = 4; UPDATE people SET dob = '1991-12-2' WHERE id = 5; UPDATE people SET dob = '1991-7-11' WHERE id = 6; ID NAME ----------- -------------------1 Mary 2 Paul 3 Simon 4 Jane 5 Albert 6 Ruth
Example 2-9
SEX DOB --- ---------F 06/04/1950 M 10/12/1966 M 04/01/1980 F 09/12/1981 M 12/02/1991 F 07/11/1991
Example 2-10
the name field from the people table. The statement is run on the tables created in the previous examples, and the result is shown below:
SELECT p.name AS name, l.name AS location FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id ORDER BY p.name; NAME -------------------Albert Jane Mary Paul Ruth Simon LOCATION -------------------Nairobi Accra Sinafala Cairo Mtubatuba Accra
Example 2-11
Example 2-12 is the same as Example 2-11 with the addition of an additional restriction, p.dob < '1980 -01-01', that retrieves only records for which the date of birth is prior to January 1, 1980, below:
SELECT p.name AS name, l.name AS location FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id AND p.dob < '1980-01-01' ORDER BY p.name; NAME -------------------Mary Paul LOCATION -------------------Sinafala Cairo
Example 2-12
Example 2-13 produces a full Cartesian product of the people table with itself. In this case no restriction is put on the result of the Cartesian product resulting in a table consisting of records that pair each record in the people table with every other record in the people table, including itself. Again, a projection is performed to retrieve just the name fields from the resulting table. Last, the order by clause at the end of this statement sorts the resulting table first on the name field from the first instance of the people table and then on the name field of the second instance of the people table. The final result is shown below in Example 2-13.
Page 9 of 45
SELECT p1.name AS name1, p2.name AS name2 FROM people AS p1, people AS p2 ORDER BY p1.name, p2.name; NB: No WHERE clause resulting in the complete Cartesian product of the people table with itself NAME1 -------------------Albert Albert Albert Albert Albert Albert Jane Jane Jane Jane Jane Jane Mary Mary Mary Mary Mary Mary Paul Paul Paul Paul Paul Paul Ruth Ruth Ruth Ruth Ruth Ruth Simon Simon Simon Simon Simon Simon NAME2 -------------------Albert Jane Mary Paul Ruth Simon Albert Jane Mary Paul Ruth Simon Albert Jane Mary Paul Ruth Simon Albert Jane Mary Paul Ruth Simon Albert Jane Mary Paul Ruth Simon Albert Jane Mary Paul Ruth Simon
Example 2-13
Example 2-14 selects all of the records from the people table and performs a union of those records to create the result below in which all of the records in the people table appear twice (in violation of one of the basic precepts of the Relational Model of Data because the keyword ALL was used after UNION without the ALL option identical records only appear once in the table resulting from a UNION command). The second version of this query demonstrates a nested relational expression (the UNION expression). The UNION is performed first and then the outer SELECT statement sorts the result by name, below:
Page 10 of 45
SELECT * FROM people UNION ALL SELECT * FROM people; ID ----------1 2 3 4 5 6 1 2 3 4 5 6 NAME -------------------Mary Paul Simon Jane Albert Ruth Mary Paul Simon Jane Albert Ruth SEX DOB --- ---------F 06/04/1950 M 10/12/1966 M 04/01/1980 F 09/12/1981 M 12/02/1991 F 07/11/1991 F 06/04/1950 M 10/12/1966 M 04/01/1980 F 09/12/1981 M 12/02/1991 F 07/11/1991
SELECT * FROM (SELECT * FROM people UNION ALL SELECT * FROM people) AS individuals ORDER BY name; ID ----------5 5 4 4 1 1 2 2 6 6 3 3 NAME -------------------Albert Albert Jane Jane Mary Mary Paul Paul Ruth Ruth Simon Simon SEX DOB --- ---------M 12/02/1991 M 12/02/1991 F 09/12/1981 F 09/12/1981 F 06/04/1950 F 06/04/1950 M 10/12/1966 M 10/12/1966 F 07/11/1991 F 07/11/1991 M 04/01/1980 M 04/01/1980
Example 2-14
New people table after adding Louise, Adam, Bill and Madeline: ID ----------1 2 3 4 5 6 7 8 9 10 NAME -------------------Mary Paul Simon Jane Albert Ruth Louise Adam Bill Madeline SEX DOB --- ---------F 06/04/1950 M 10/12/1966 M 04/01/1980 F 09/12/1981 M 12/02/1991 F 07/11/1991 F 03/14/1986 M 11/01/1964 M 03/10/1997 F 03/26/2001
New lives_at table after adding lives_at links for the four new people: ID PID LID ----------- ----------- ----------1 1 4 2 2 5 3 3 2 4 4 2 5 5 1 6 6 3 7 7 4 8 8 5 9 9 2 10 10 2 NB: The locations table remains unchanged SELECT l.name AS location, p.id AS personid FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id ORDER BY l.name DESC; LOCATION PERSONID -------------------- ----------Sinafala 1 Sinafala 7 Nairobi 5 Mtubatuba 6 Cairo 2 Cairo 8 Accra 3 Accra 4 Accra 9 Accra 10
Example 2-15
Page 12 of 45
SELECT l.name AS location, count(p.id) AS frequency FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id GROUP BY l.name ORDER BY l.name DESC; LOCATION FREQUENCY -------------------- ----------Sinafala 2 Nairobi 1 Mtubatuba 1 Cairo 2 Accra 4 SELECT l.name AS location, max(p.id) AS MaximumValue FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id GROUP BY l.name ORDER BY l.name; LOCATION MAXIMUMVALUE -------------------- -----------Accra 10 Cairo 8 Mtubatuba 6 Nairobi 5 Sinafala 7
Four new people are added to the people table with ID numbers from 7 to 10 in order to have more than one location with multiple residents. The first example query identifies the ID numbers of all of the people living at each location, and it is obvious that several locations have more than one resident. The next two queries group over location and calculate the count and the maximum (max) of the people.id field in each group identified by a location. Example 2-16 below provides a more useful look at the GROUP BY and ORDER BY clauses. Each persons age in years is calculated on October 22, 2002 and displayed rounded to two decimal places using this bit of code:
AGE = DECIMAL(ROUND((DAYS('2002-10-22')-DAYS(p.dob))/365.25,2),4,2)
The first query using AGE groups over sex and calculates the average age for each sex. The next produces a table with one row for each person listing their name, location and age, and this forms the basis for the next two queries. The final two queries group by location and calculate the number of people at each location and the sum and averages of their ages, respectively.
Page 13 of 45
SELECT p.name, p.sex, DECIMAL(ROUND( (DAYS('2002-10-22')DAYS(p.dob))/365.25,2),4,2) A age20021022 S FROM people AS p; NAME -------------------Mary Paul Simon Jane Albert Ruth Louise Adam Bill Madeline SEX AGE20021022 --- ----------F 52.38 M 36.03 M 22.56 F 21.11 M 10.89 F 11.28 F 16.61 M 37.97 M 5.62 F 1.57
SELECT p.sex, DECIMAL(ROUND(avg((DAYS('2002-10-22') DAYS(p.dob))/365.25),2),4,2) AS avg_age20021022 FROM people AS p GROUP BY p.sex ORDER BY p.sex; SEX AVG_AGE20021022 --- --------------F 20.59 M 22.61 SELECT p.name AS name, l.name AS location, DECIMAL(ROUND((DAYS('2002 -1022')-DAYS(p.dob))/365.25 ,2),4,2) AS age20021022 FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id ORDER BY l.name, p.name; NAME -------------------Bill Jane Madeline Simon Adam Paul Ruth Albert Louise Mary LOCATION AGE20021022 -------------------- ----------Accra 5.62 Accra 21.11 Accra 1.57 Accra 22.56 Cairo 37.97 Cairo 36.03 Mtubatuba 11.28 Nairobi 10.89 Sinafala 16.61 Sinafala 52.38
Example 2-16
Page 14 of 45
SELECT l.name AS location, count(p.id) AS frequency, DECIMAL(ROUND(sum((DAYS('2002 -10-22')-DAYS(p.dob))/365.25),2),4,2) AS sum_age20021022 FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id GROUP BY l.name ORDER BY l.name; LOCATION FREQUENCY SUM_AGE20021022 -------------------- ----------- --------------Accra 4 50.86 Cairo 2 74.00 Mtubatuba 1 11.28 Nairobi 1 10.89 Sinafala 2 68.99 SELECT l.name AS location, count(p.id) AS frequency, DECIMAL(ROUND(avg((DAYS('2002 -10-22')-DAYS(p.dob))/365.25),2),4,2) AS avg_age20021022 FROM people AS p, lives_at AS la, locations AS l WHERE p.id = la.pid AND la.lid = l.id GROUP BY l.name ORDER BY l.name; LOCATION FREQUENCY AVG_AGE20021022 -------------------- ----------- --------------Accra 4 12.71 Cairo 2 37.00 Mtubatuba 1 11.28 Nairobi 1 10.89 Sinafala 2 34.50
Page 15 of 45
Although SQL i a computational complete language, it is usually used as a s data manipulation sublanguage, that in combination with a fully-featured host language, provides all of the functionality necessary to manage and manipulate the data, to perform the logic necessary for the application to run, and to interact with the user. Common host languages include C, C++, Java, Visual Basic and Cobol.
Page 16 of 45
Access has many reliability problems in a multi-user environment, and additionally, some standard SQL queries produce wrong results when exe cuted in Access. Additionally, some data consumers have policies that discourage consumption of data managed using Access. When reliability and accuracy are important, Access should be used with great care, or not at all. Microsofts SQL Server and IBMs DB2 Database Management Systems are good alternatives. Access is discussed and used here purely for convenience and because many readers have access to Access and not to other more robust Relational Database Management Systems.
Page 17 of 45
SQL that actually comprises the query, thus opening up some of the richer functionality of Accesss SQL. Finally, we must examine two Access-specific features of SQL before proceeding. The first is Accesss syntax for the SQL JOIN operator, and the second is the way in which Access introduces user-defined functions that can be integrated into SQL statements.
The <conditions> specify the relationship(s) between fields of the left and right tables that must evaluate to true in order for a result record to contain values from both records of the left and right tables. The LEFT, RIGHT and INNER options specify whether field values from records on the LEFT or RIGHT are included in the result when there is no matching record in the other table. For instance, the result of a LEFT JOIN includes a record corresponding to (and containing field values of) each record of the table on the left, but only contains values in the fields from the table on the right from matching r ecords of the table on the right. Likewise, the result of a RIGHT JOIN is the same except that it contains a record corresponding to each record of the table on the right. An INNER JOIN contains only values from matching records of both tables; the result of an INNER JOIN contains one record for each set of matching records from the left and right tables. The behavior of Accesss JOIN is demonstrated below in Example 3-1. LEFT
ID Name 1 Bill 2 Mary 3 Rachel ID
RIGHT
Name 1 Rob 2 Mary 3 Tim
Example 3-1
SELECT tblLEFT.Name, tblRIGHT.Name FROM tblLEFT INNER JOIN tblRIGHT ON tblLEFT.Name = tblRIGH T.Name; tblLEFT.Name Mary tblRIGHT.Name Mary
SELECT tblLEFT.Name, tblRIGHT.Name FROM tblLEFT LEFT JOIN tblRIGHT ON tblLEFT.Name = tblRIGHT.Name; tblLEFT.Name Bill Mary Rachel tblRIGHT.Name Mary
Page 18 of 45
SELECT tblLEFT.Name, tblRIGHT.Name FROM tblLEFT RIGHT JOIN tblRIGHT ON tblLEFT.Name = tblRIGHT.Name; tblLEFT.Name Mary tblRIGHT.Name Rob Mary Tim
Most of the examples presented here operate on the four people described in Table 1, the Small People table . There are only four people so that by-hand checking and verifying of the results can be done easily. A much larger people table, the Big People table, is included in the example database called tblPeopleBig containing 1200 individuals created by the module called mdlPopulatePeopleTable. It is left to the interested reader to understand how the larger people table is constructed and to investigate the operation of the example queries using the larger people table. A copy of the Small People table is also included called tblPeopleSmall; to switch between Small and Big and back again, simply copy and paste the desired table giving it the name tblPeople.
Page 19 of 45
Table 1 : tblPeople (Small People Table) ID 1 2 3 4 Sex Male Female Male Female DOB DOD 1/1/2000 12/31/2050 1/1/2000 7/15/2000 1/1/1980 12/31/2050 1/1/2000 7/15/2080
Each record in the Big Ages ( Table 2, below) and Small Ages tables (Table 3, below) contains a description of an age group, and by editing, adding or deleting records in the ages table, one is easily able to customize the age groups used by the queries to calculate demographic indicators. The result is that the user does not need to edit SQL statements to alter the age groups applied in each calculation. The ages table contains fields to describe: Each Each Each Each age age age age groups unique ID: ID, groups label: AgeGroup, groups starting age: StartAge, and groups ending age: EndAge.
Again, to switch between Small and Big and back again, simply copy and paste the desired table giving it the name tblAges.
Table 2: tblAges Big Ages Table ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 AgeGroup 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 StartAge 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 EndAge 0 4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 89
Page 20 of 45
ID 20 21 22 23 24 25
AgeGroup 90-94 95-99 100 -104 105 -109 110 -114 115+
Table 3 : tblAges Small Ages Table ID 1 2 3 4 5 AgeGroup 0 1-4 5-1 9 20-49 50+ StartAge 0 1 5 20 50 EndAge 0 4 19 49 250
Sexes Table
Like the ages table, the sexes table (Table 4, below) contains one row to describe each sex that needs to be taken into account in the calculations. Although male and female obviously dominate the list of sexes, it is often the case that many sex unknown values occur, and it is certainly the case that there are non-trivial numbers of people who are not male, or female and may be coded with some other label. So, in keeping with the metadata-centric approach to these calculations, the user is easily able to specify the sex categories that are necessary for the calculations s/he wants to make. The examples only use the customary male and female categories. The sexes table contains fields to describe: Each sexs unique ID: ID, Each sexs label: Sex,
Table 4: tblSexes ID Sex 1 Female 2 Male
Page 21 of 45
SELECT tblSexes.Sex, tblAges.AgeGroup, tblAges.StartAge, tblAges.EndAge FROM tblSexes, tblAges ORDER BY tblSexes.Sex, tblAges.StartAge; NB: Notice that there is no WHERE clause in this SELECT statement; as a result there is no restriction on the Cartesian product formed by the two tables, and the whole Cartesian product is returned each sex paired with each age.
Example 3-2
Operating on the sexes and ages tables (Table 2), the sex-age-groups query produces the results displayed in Table 5, below.
Table 5: qrySelectSexAgeGroups Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male AgeGroup 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100-104 105-109 110-114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 StartAge 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 EndAge 0 4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 89 94 99 104 109 114 250 0 4 9 14 19 24 29 34 39 44 49 54 59 64 69 74 79 84 89 94
Page 22 of 45
The example table contains dates describing calendar years from 1995 through 2109.
Table 6: tblDates ID 1 2 3 4 5 6 : 110 111 112 113 114 115 BeginDate 1/1/1995 1/1/1996 1/1/1997 1/1/1998 1/1/1999 1/1/2000 : 1/1/2104 1/1/2105 1/1/2106 1/1/2107 1/1/2108 1/1/2109 EndDate 12/31/1995 12/31/1996 12/31/1997 12/31/1998 12/31/1999 12/31/2000 : 12/31/2104 12/31/2105 12/31/2106 12/31/2107 12/31/2108 12/31/2109
Page 23 of 45
Example 4-1
In Example 4-1, the qrySelectFrequencyBySexAge_General_AllAges query calls the Count query which actually counts the number of people alive on [Date] by sex and age. The qrySelectFrequencyBySexAge_General_AllAges insures that all age groups for both sexes appear in the final result with a zero when the age group is empty. The results of running the query in Example 4-1 on the Big People table on January 1, 2050 are contained in Table 7, below:
Table 7: qrySelectFrequencyBySexAge_General_AllAges Run on the Big People Table on 1/1/2050 Sex Female Female Female Female Female Female Female Female Female AgeGroup 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 Frequency 6 20 29 31 32 32 32 32 37
Page 24 of 45
Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male
AgeGroup 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+
Frequency 33 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 28 31 29 28 28 28 28 23 27 31 1 0 0 0 0 0 0 0 0 0 0 0 0 0
almost all cases the input data does not have dates specified to the precision of a fraction of a day (i.e. hours and minutes), and this additional precision in the duration of one year is therefore spurious and unwarranted. 365 days is the recommended duration of one year. The SELECT statement begins by specifying the fields that are to be returned by the query: Sex, AgeGroup and a count of the ID field of the people table. Sex comes from the people table, AgeGroup from the ages table and the count() function aggregates and counts over the categories defined by the GROUP BY clause toward the end of the query: Sex, AgeGroup and StartAge. StartAge is redundant with AgeGroup and is included in the GROUP BY clause only to allow us to include it in the ORDER BY clause so that the resulting records can be sorted by sex and the starting age of each age group (a numeric value as opposed to the AgeGroup label which is text and does not sort in numeric order). The FROM clause indicates that a Cartesian product is formed between the people and ages tables. This has the effect of pairing each person in the people table with all of the age groups in the ages table. The WHERE clause restricts the records included in the result such that there is a unique record for each person; the one that satisfies the conditions: 1) that the perso ns current age (calculated on [Date]) falls within the time period defined by the days that begin and end (inclusive) the age group with which the record is associated, and 2) that the person has been born on or before [Date] and survives to at least [Date].
Dedicated crosstab engines can be much faster at the expense of the literally infinite flexibility offered by SQL.
Page 27 of 45
5 Person-Years
Person years form the denominator of all event-exposure rates, and consequently calculating person years of exposure is of fundamental importance to Demographers. The calculation of person years is traditionally skipped over and assumed t be relatively trivial, but when one is confronted with such a o calculation, numerous difficulties arise; including, over what time period should the calculation be made, how many days, hours and minutes are actually in a year, what do we do about months with different numbers of days and leap years, how are age groups defined, and finally , how do we construct reproducible, defensible , efficient, and reusable calculations that address all these concerns and produce a reasonable estimate of the person years exposed by various categories of the unit of analysis? Finally, given that the data are often stored in a relational database, how can we use SQL to perform the person years calculation in a generic, reusable way with user-defined parameters defining the time period covered by the calculation and the analysts preference for the exact duration of one year? Example 5-1, below is an efficient solution to these challenges using standard SQL and one user-defined function.
qrySelectPersonYearsBySexAge_General: PARAMETERS [Begin] DateTime, [End] DateTime, [DaysInYear] IEEESingle; SELECT tblPeople.Sex, tblAges.AgeGroup, Round((Sum(ftnIntervalIntersection([Begin],[End],[DOB]+[StartAge]*[DaysIn Year],IIf(IsNull([DOD]),[DOB]+([EndAge]+1)*[DaysInYear]1,IIf([DOD]<[DOB]+([EndAge]+1)*[DaysInYear] 1,[DOD],[DOB]+([EndAge]+1)*[DaysInYear]-1))))/[DaysInYear]),2) AS PersonYears FROM tblAges, tblPeople WHERE (((tblPeople.DOB)<=[End]) AND ((tblPeople.DOD)>=[Begin])) OR (((tblPeople.DOB)<=[End]) AND ((tblPeople.DOD) Is Null)) GROUP BY tblPeople.Sex, tblAges.AgeGroup, tblAges.StartAge ORDER BY tblPeople.Sex, tblAges.StartAge; ftnIntervalIntersection: Public Function ftnInterva lIntersection( _ ByVal dteB1 As Long, _ ByVal dteE1 As Long, _ ByVal dteB2 As Long, _ ByVal dteE2 As Long) As Long 'Logic ' - Takes two intervals defined by their beginning and ending dates as input ' - For each interval, the begin date must precede the end date and the begin ' and end date cannot be the same ' - Valid time during the intervals is assumed to include all of the days ' spanned by the interval including the begin date but excluding the end ' date ' - The two intervals are compared and the number of days contained in any ' overlap is calculated 'Return values ' valid : >=0 the number of days that fall within both of the input intervals ' logic error : -1 ' invalid input : 0
Example 5-1
Page 28 of 45
'Zero results from either invalid input or no overlap; this prevents one invalid interval 'from killing a query that calls this function thousands of times, without invalidating the 'overall result 'Verify that the intervals are valid intervals If (dteE1 - dteB1 >= 0) And (dteE2 - dteB2 > = 0) Then 'Intersect the intervals and return the length of the intersecting segment 'Left and Right defined by interval 1 and 2 'Add one to the result to account for the fact that the subtraction does not include 'the terminal day, and we want closed-closed interval arithmetic 'Left overlap If (dteB1 < dteB2) And (dteE1 >= dteB2) And (dteE1 <= dteE2) Then ftnIntervalIntersection = dteE1 - dteB2 + 1 'Left no overlap ElseIf (dteB1 < dteB2) And (dteE1 < dteB2) Then ftnIntervalIntersection = 0 'Right overlap ElseIf (dteB1 >= dteB2) And (dteB1 <= dteE2) And (dteE1 > dteE2) Then ftnIntervalIntersection = dteE2 - dteB1 + 1 'Right no overlap ElseIf (dteB1 > dteE2) And (dteE1 > dteE2) Then ftnIntervalIntersection = 0 '2 contains 1 ElseIf (dteB1 >= dteB2) And (dteE1 <= dteE2) Then ftnIntervalIntersection = dteE1 - dteB1 + 1 '1 contains 2 ElseIf (dteB1 <= dteB2) And (dteE1 >= dteE2) Then ftnIntervalIntersection = dteE2 - dteB2 + 1 'Error in logic: the defined cases do not capture the temporal relationship 'between the two intervals Else ftnIntervalIntersection = -1 End If 'One of the intervals is not valid: i.e. it is of zero length or the end date precedes 'the begin date Else ftnIntervalIntersection = 0 End If End Function
The primary reason the calculation of person years is complex is that it involves temporal logic. One must account for a number of states that an individual can occupy through time, perform logic to determine if the state occupied by an individual at a given point in time can be included in the accumulation of person years, and finally accumulate and present the total person years exposed for various categories of the unit of analysis usually sex and age, at a minimum. Because a temporal state is defined as a duration of time during which an attribute of interest does not change, it is most efficient to store only a representation of the events that cause an individuals state to change. In the framework of temporal states, an event is simply an instantaneous (on the time scale of interest) occurrence that results in a change of state for the individual(s) who experienced it. For the calculation of person years, the time scale is one day, the unit of analysis is people, and the events of interest are individuals birthdays and the beginning and end of the period over which the Page 29 of 45
person year calculation is made. The birthdays designate the instant when individuals change state from one age to the next, and the beginning and end of the period over which the calculation is made encapsulate the duration over which we want an observation of person years exposed. In both cases the anchored duration over which a given state persists is called an episode. An episode has a defined (by events) beginning and ending and corresponds to the anchored duration of time during which a well-defined state associated with a meaningful thing persists. The birthdays are stored in an abstract way in the ages table (tblAges) as unanchored durations of time that when compared to (or anchored by) individuals dates of birth identify the dates when the individuals experience a birthday and transition to the next age. In a generic and easily modified way, the ages table defines the age-episodes of individuals. The dates defining the period over which the calculation is to be made are provided by the analyst and define and anchor the observation-episode of interest. By comparing the age-episodes to the observation-episode one can determine how much of each age -episode of each individual falls within or overlaps the observation-episode. It is then straightforward to aggregate by sex and agegroup and take the sum of the overlaps to yield the total person years exposed by sex and age. The overlap logic is accomplished by the user-defined function ftnIntervalIntersection; taking as arguments the beginning and end dates for the two episodes being compared, and returning the number of days that the two episodes overlap, or equivalently, the number of dates shared by the two episodes. Table 8 through Table 10, below , present the results of running the qrySelectPersonYearsBySexAge_General query on the Big People and Small People tables. In Table 8, total person years exposed by sex and age are calculated over the period 1/1/2000 to 1/1 /2100 for the Big People table. All sex-age groups are presented, even when there are no person years exposed, so that the results can be easily copied (cut and pasted) to a presentation table. The numbers look plausible, and when combined with an unders tanding of the construction of the Big People table, a little reflection convinces us that the totals of sex -age-specific person years presented are accurate.
Table 8: qrySelectPersonYearsBySexAge_General: Run on the Big People Table over the Period 1/1/2000 to 1/1/2100 Sex Female Female Female Female Female Female Female Female Female Female Female AgeGroup 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 PersonYears 620.43 2383.61 2787.85 2561.85 2325.37 2089.59 1848.84 1597.3 1344.67 1092.37 855.07
Page 30 of 45
Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male
AgeGroup 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+
PersonYears 721.99 656.94 586.11 511.13 430.3 369.25 284.26 209.08 114.4 34.79 0 0 0 0 570.63 2200.75 2537.87 2314.19 2100.96 1887.06 1678.11 1479.97 1282.9 1085.51 873.12 706.7 621.85 542.8 467.87 398.81 309.96 245.06 170.33 115.11 44.82 0.07 0 0 0
Table 9 and Table 10 present the same query run on the Small People table over the periods 1/1/2000 to 1/1/2100 and 1/1/1980 to 12/31/2000 respectively. Because there are only four records in the Small People table, it is easy to verify that the results of the query are accurate. Table 11 is the result of running the same query over the period 1/1/1980 to 12/31/2000 using the Small Ages table to produce the calculation using fewer, wider (more aggregated) age groups.
Table 9: qrySelectPersonYearsBySexAge_General: Run on the Small People Table over the Period 1/1/2000 to 1/1/2100 Sex Female Female Female Female Female Female AgeGroup 0 1-4 5-9 10-14 15-19 20-24 PersonYears 1.54 4 5 5 5 5
Page 31 of 45
Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male
AgeGroup 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+
Table 10 : qrySelectPersonYearsBySexAge_General: Run on the Small People Table over the Period 1/1/1980 to 12/31/2000 Sex Female Female Female Female Female Female AgeGroup 0 1-4 5-9 10-14 15-19 20-24 PersonYears 1.54 0 0 0 0 0
Page 32 of 45
Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male
AgeGroup 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100 -104 105 -109 110 -114 115+
PersonYears 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 5 5 5 1.02 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Table 11 : qrySelectPersonYearsBySexAge_General: Run on the Small People Table over the Period 1/1/1980 to 12/31/2000 Using the Small Ages Age-Groups Sex Female Female Female Female Female AgeGroup 0 1-4 5-1 9 20-49 50+ PersonYears 1.54 0 0 0 0
Page 33 of 45
PersonYears 2 4 15 1.02 0
The next two sections contain a detailed examination of the person years query and its user-defined overlap function and should be skipped by the casual reader.
on which the time interval for the calculation starts and ends and [DaysInYear] as a single precision floating point number specifying the number of days in one year. The SELECT clause specifies three fields from the Cartesian product of tblAges and tblPeople: Sex, AgeGroup, and PersonYears, discussed below. The Cartesian product pairs every age group from the ages table with every individual in the people table. In this way every individual has the potential to contribute person-time to the total person years exposed in every age group. The WHERE clause insures that the individuals life overlaps with the analysis interval at all, and the interval intersection function ( above ) determines how much of each age interval for each person overlaps with the analysis interval. The WHERE clause specifies some temperal logic to insure that the only records eligible for inclusion are those for which some part of the individuals life overlaps with the analysis interval; formally, that the following condition holds: ((tblPeople.DOB)<=[End]) AND ((tblPeople.DOD)>=[Begin])) OR ((tblPeople.DOB)<=[End]) AND (tblPeople.DOD) Is Null)). In other words, those for which: (the individual is born before the end of the analysis interval) AND (the individual dies after the beginning of the analysis interval) OR (the individual is born before the end of the analysis interval) AND (the individual has not yet died by the end of the analysis interval) The resulting re cords that remain after these WHERE conditions are met form the basis for the person years calculation. Each of these records consists of an age group identifier (string describing the ages included in the age group), the beginning [StartAge] and ending [EndAge] ages of the age group in years since birth, and an individuals sex, date of birth [DOB], and date of death [DOD] (null if the indivi ual is still alive at the last time observed). d The [DOB], [DOD], [StartAge], and [EndAge] fields are used to determine the date of all of the birthdays that an individual has that result in a change from one age group to the next, as defined by the ages table. Each record corresponds to the unique matching of an age group with an individual, and each individual is matched with all age groups. Each of these age group-individual records is then fed to the interval intersection function passing it the beginning and end of the analysis interval, [Begin] and [End] as the definition of the first interval, and passing it the dates on which the individual started and stopped living through the age group as the second interval. The interval intersection function returns the number of days that the individual lived in the age group during the analysis interval and stores it in the [PersonYears] field. The records are then grouped by sex, age group, and the start age of each age group. Over each of these groups the number of days stored in the [PersonYears] field is summed and divided by the value of [DaysInYear] (passed to the query as a parameter) to yield the total number of years exposed in each Page 35 of 45
sex-age group. Following is a commented copy of the code fragment that is called to calculate the overlap between the age group and the analysis interval; what populates the [PersonYears] field in the Person Years query. Sum over sex-age groups. Sum( Call the interval intersection function. ftnIntervalIntersection( The first episode is the analysis interval defined by the [Begin] and [End] date values passed to the query as parameters. [Begin],[End], The second episode is the age interval as defined by the individuals date of birth, date of death, and the age s specified in the ages table. The beginning date of the age interval is the individuals date of birth plus the start age of the age group multiplied times the number of days in a year, [DaysInYear] passed to the query as a parameter. [DOB]+[StartAge]*[DaysInYear], The end of the age interval is complicated. If the individual has not died, the end of the age interval is the individuals date of birth plus the end age of the age group multiplied times the number of days in one year. The +1 and -1 account for the fact that the end age for the age group is specified as the last completed age, not the age of the next older age group, and we do not count the last day an age interval because we count the first. If the individual has died, if the date of d eath is before the end of the age interval, the end of the age interval is the individuals date of death, otherwise, the end of the age interval is the individuals date of birth plus the end age of the age group multiplied times the number of days in one year, same issue with the +1 and -1. IIf(IsNull([DOD]), [DOB]+([EndAge]+1)*[DaysInYear]-1, IIf([DOD]<[DOB]+([EndAge]+1)*[DaysInYear]-1, [DOD], [DOB]+([EndAge]+1)*[DaysInYear] -1)) ) The sum of the overlap days in each sex-age group is divided by the number of days in one year to yield the person years exposed in each sex-age group. )/[DaysInYear])
Page 36 of 45
6 Event-Exposure Rates
Dividing the number of events occurring to a defined group of individuals by the number of person years that they were exposed to the risk of the event yields the staple measure of Demography: the event-exposure rate. Amongst many others, standard mortality and fertility rates are calculated in this way. Now that we have SQL queries to calculate event counts and person years exposed, it is an easy matter to combine them into a single query that calculates an event-exposure rate. Such a query is presented in Example 6-1.
qrySelectMortalityRate_General: SELECT qrySelectSexAgeGroups.Sex, qrySelectSexAgeGroups.AgeGroup, qrySelectDeathsBySexAge_General.Deaths, qrySelectPersonYearsBySexAge_General.PersonYears, Round((IIf([PersonYears]=0 Or IsNull([Deaths]),0,1000*([Deaths]/[PersonYears]))),2) AS MortalityRate FROM (qrySelectSexAgeGroups LEFT JOIN qrySelectDeathsBySexAge_General ON (qrySelectSexAgeGroups.AgeGroup = qrySelectDeathsBySexAge_General.AgeGroup) AND (qrySelectSexAgeGroups.Sex = qrySelectDeathsBySexAge_General.Sex)) LEFT JOIN qrySelectPersonYearsBySexAge_General ON (qrySelectSexAgeGroups.AgeGroup = qrySelectPersonYearsBySexAge_General.AgeGroup) AND (qrySelectSexAgeGroups.Sex = qrySelectPersonYearsBySexAge_General.Sex) ORDER BY qrySelectSexAgeGroups.Sex, qrySelectSexAgeGroups.StartAge;
Example 6-1
The results of running this query on both the Big and Small People tables using both the Big and Small Ages tables over various periods are displayed in Table 12 through Table 14. The age group, number of deaths, number of person years and finally the mortality rate per 1,000 is displayed. This query provides a re-usable, clean, efficient and flexible way to extract mortality rates directly from a relational database without the need for any external logic or processing. Calculation of any event-exposure rate can be performed in an analogous way.
Table 12 : qrySelectMortalityRate_General: Run on the Small People Table over the Period 1/1/1980 to 7/15/2080 Using the Small Ages Age-Groups Sex Female Female Female Female Female Male Male Male Male Male AgeGroup 0 1-4 5-19 20-49 50+ 0 1-4 5-19 20-49 50+ Deaths PersonYears MortalityRate 1.54 649.35 4 0 15 0 30 0 1 30.59 32.69 2 0 8 0 30 0 60 0 2 22.08 90.58 1
Table 13 : qrySelectMortalityRate_General:
Page 37 of 45
Run on the Big People Table over the Period 1/1/2000 to 12/31/2099 Using the Small Ages Age-Groups Sex Female Female Female Female Female Male Male Male Male Male AgeGroup 0 1-4 5-19 20-49 50+ 0 1-4 5-19 20-49 50+ Deaths 3 25 84 198 9 23 96 162 PersonYears 620.42 2383.57 7674.93 8827.57 3917.84 570.62 2200.72 6952.91 8286.44 3622.98 MortalityRate 4.84 10.49 10.94 22.43 0 15.77 10.45 13.81 19.55 0
Table 14 : qrySelectMortalityRate_General: Run on the Big People Table over the Period 1/1/2000 to 12/31/2099 Using the Big Ages Age-Groups Sex Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Female Male Male Male Male Male Male Male Male Male Male Male AgeGroup 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100-104 105-109 110-114 115+ 0 1-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 Deaths 3 25 24 32 28 35 29 34 36 34 30 PersonYears 620.42 2383.57 2787.8 2561.79 2325.33 2089.55 1848.79 1597.25 1344.62 1092.33 855.03 721.96 656.9 586.07 511.09 430.25 369.21 284.21 209.04 114.36 34.75 0 0 0 0 570.62 2200.72 2537.83 2314.16 2100.92 1887.02 1678.07 1479.94 1282.86 1085.47 873.07 MortalityRate 4.84 10.49 8.61 12.49 12.04 16.75 15.69 21.29 26.77 31.13 35.09 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15.77 10.45 14.19 12.1 15.23 13.25 18.47 17.57 18.71 23.95 34.36
9 23 36 28 32 25 31 26 24 26 30
Page 38 of 45
Male Male Male Male Male Male Male Male Male Male Male Male Male Male
50-54 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100-104 105-109 110-114 115+
706.65 621.81 542.75 467.83 398.77 309.91 245.02 170.3 115.07 44.79 0.07 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Page 39 of 45
Page 40 of 45
qryMakePersonYears: PARAMETERS [DaysInYear] IEEESingle; SELECT tblPeople.ID , tblPeople. ex, tblPeople.DOB, tblPeople.DOD, S Round(((1+[BeginDate]-[DOB])/[DaysInYear]),2) AS AgeAtBeginning, IIf(((Not IsNull([DOD])) And ([DOD]>=[BeginDate]) And ([DOD]<=[EndDate])),True,False) AS Died, tblDates.BeginDate, tblDates.EndDate INTO tblPersonYears FROM tblDates, tblPeople WHERE (((IIf(([DOB]<=[BeginDate]) And (IsNull([DOD]) Or ([DOD]>=[BeginDate])),True,False))=True)) ORDER BY tblPeople.ID, tblDates.BeginDate;
Example 7-1
The abbreviated result of running the Make Person Years query on the Small People table using a Dates table with records for the years 1970 through 2084 is shown Table 15. The complete set of person years for individuals 1 and 2 are presented along with the first two person years lived by individual 3. The rest of the result set is omitted for the sake of brevity. Each person year contains fields describing the individuals ID, sex, date of birth, date of death, age at the beginning of the time interval, whether or not they died during the in terval, the date of the beginning of the interval, and the date of the end of the interval. This is a very simple data structure that could be used (with an appropriately large sample) to produce sex-age-specific estim ates of the annual hazard (duration-specific probability) of death. A conventional life table could be constructed from those values converted directly to standard life table agespecific probabilities of death (n qx ) values. Additionally with this technique, the null hypothesis that male and female probabilities of death are equal at each age could be tested, and the null hypothesis that age-specific mortality rates do not change from one historical period to the next could be tested among many other possibilities.
Table 1 5: qryMakePersonYears: Run on the Small People Table over the Period 1/1/1970 to 12/31/2084 ID 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Sex Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male DOB 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 DOD AgeAtBeginning Died BeginDate EndDate 12/31/2050 0 0 1/1/2000 12/31/2000 12/31/2050 1.01 0 1/1/2001 12/31/2001 12/31/2050 2.01 0 1/1/2002 12/31/2002 12/31/2050 3.01 0 1/1/2003 12/31/2003 12/31/2050 4.01 0 1/1/2004 12/31/2004 12/31/2050 5.01 0 1/1/2005 12/31/2005 12/31/2050 6.01 0 1/1/2006 12/31/2006 12/31/2050 7.01 0 1/1/2007 12/31/2007 12/31/2050 8.01 0 1/1/2008 12/31/2008 12/31/2050 9.01 0 1/1/2009 12/31/2009 12/31/2050 10.01 0 1/1/2010 12/31/2010 12/31/2050 11.01 0 1/1/2011 12/31/2011 12/31/2050 12.01 0 1/1/2012 12/31/2012 12/31/2050 13.01 0 1/1/2013 12/31/2013 12/31/2050 14.01 0 1/1/2014 12/31/2014 12/31/2050 15.01 0 1/1/2015 12/31/2015 12/31/2050 16.01 0 1/1/2016 12/31/2016 12/31/2050 17.02 0 1/1/2017 12/31/2017 12/31/2050 18.02 0 1/1/2018 12/31/2018 12/31/2050 19.02 0 1/1/2019 12/31/2019 12/31/2050 20.02 0 1/1/2020 12/31/2020 12/31/2050 21.02 0 1/1/2021 12/31/2021
Page 41 of 45
ID 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 ?
Sex Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Male Female Male Male ?
DOB 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/2000 1/1/1980 1/1/1980 ?
DOD AgeAtBeginning Died BeginDate EndDate 12/31/2050 22.02 0 1/1/2022 12/31/2022 12/31/2050 23.02 0 1/1/2023 12/31/2023 12/31/2050 24.02 0 1/1/2024 12/31/2024 12/31/2050 25.02 0 1/1/2025 12/31/2025 12/31/2050 26.02 0 1/1/2026 12/31/2026 12/31/2050 27.02 0 1/1/2027 12/31/2027 12/31/2050 28.02 0 1/1/2028 12/31/2028 12/31/2050 29.02 0 1/1/2029 12/31/2029 12/31/2050 30.02 0 1/1/2030 12/31/2030 12/31/2050 31.02 0 1/1/2031 12/31/2031 12/31/2050 32.02 0 1/1/2032 12/31/2032 12/31/2050 33.03 0 1/1/2033 12/31/2033 12/31/2050 34.03 0 1/1/2034 12/31/2034 12/31/2050 35.03 0 1/1/2035 12/31/2035 12/31/2050 36.03 0 1/1/2036 12/31/2036 12/31/2050 37.03 0 1/1/2037 12/31/2037 12/31/2050 38.03 0 1/1/2038 12/31/2038 12/31/2050 39.03 0 1/1/2039 12/31/2039 12/31/2050 40.03 0 1/1/2040 12/31/2040 12/31/2050 41.03 0 1/1/2041 12/31/2041 12/31/2050 42.03 0 1/1/2042 12/31/2042 12/31/2050 43.03 0 1/1/2043 12/31/2043 12/31/2050 44.03 0 1/1/2044 12/31/2044 12/31/2050 45.04 0 1/1/2045 12/31/2045 12/31/2050 46.04 0 1/1/2046 12/31/2046 12/31/2050 47.04 0 1/1/2047 12/31/2047 12/31/2050 48.04 0 1/1/2048 12/31/2048 12/31/2050 49.04 0 1/1/2049 12/31/2049 12/31/2050 50.04 -1 1/1/2050 12/31/2050 7/15/2000 0 -1 1/1/2000 12/31/2000 12/31 /2050 0 0 1/1/1980 12/31/1980 12/31/2050 1.01 0 1/1/1981 12/31/1981 ? ? ? ? ?
beginning of the interval, and date of the end of the interval. Calculated fields include the persons age at the beginning of the interval (making use of the number of days in one year) and a boolean variable indicating whether or not the person died during the interval. The WHERE clause restricts the records in the result set to include only those for which the person was alive at the beginning of the time interval. The result is a dataset containing one record for each person-time interval lived in the People table corresp onding to a time interval included in the dates table. A collection of both time-varying and time invariant attributes of the people are defined at the beginning of each person-time interval, and whether or not the person died during the interval is recorded.
Page 43 of 45
8 Summary
This communication provides a very basic introduction to the Relational Model of Data and the Standard Query Language used to manage data in a Relational Database. Illustrative Examples are provided using the most standard version of SQL commercially available, that found in IBMs DB2 Database Management System. After covering the basics, various queries useful to Demographers are discussed and described in detail. Because of its ease of use and widespread availability, Microsofts Access database is used to construct and run all of the Demographically interesting examples. Both the DB2 script that runs the illustrative examples and the Access database with the Demography -related examples are available for download as part of this communication. If you are part of an educational or research organization, you can acquire a free copy of IBMs DB2 Database Management System through IBMs Scholars program, see http://www.ibm.com/db2. Microsoft Access is a standard part of Microsofts Office productivity suite, or it can be purchased as a standalone product.
Page 44 of 45
9 Bibliography
Allen, J. F. 1983. "Maintaining Knowledge about Temporal Intervals." Communications of the ACM, 26(11 November 1983):832-843. Allison, P. 1982. "Discrete -time Methods for the Analysis of Event Histories." Pp. 61-98 in Sociological Methodology , edited by S. Leinhardt. San Francisco: Jossey-Bass. . 1984. Event History Analysis. Regression for Longitudinal Event Data . Beverly Hills: Sage. Codd, E. F. 1970. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13(6):377-387. . 1990. The Relational Model for Database Management Version 2. Reading Massachusetts: Addison-Wesley. Date, C. J. 2000. An Introduction to Database Systems, 7th ed. Reading Massachusetts: Addison-Wesley. Rodrguez, G. 2002. "Lecture Notes for Generalized Linear Statistical Models". http://data.princeton.edu/wws509 . October, 2002 .
Page 45 of 45