SQL For Beginners
SQL For Beginners
SQL For Beginners
Front matter 5
2
CONTENTS 3
Glossary 167
Front matter
Preface
This book can be printed, but it’s best accessed as an electronic pdf. That’s
because, if you see this colour, you can click it (it’s a link). Many things
are sneakily hyperlinked within the document. For example, clicking on an
item in the table of contents will jump you to that item in the book, and
clicking a reference to a figure will jump you to that figure. This book also
features a glossary of terms. If you see a new term appearing in bold, you
can jump to the glossary definition by clicking on the bold word.
If you already know which type of SQL you want to learn, keep in mind
that we cover standard SQL, along with the MySQL and T-SQL variants. If
these words mean nothing to you, then read on, friend. All will be revealed.
Introduction
Look, SQL is a polarising language. I’m not sure if anyone truly likes it. I’ve
found it’s best to come clean about this from the start. Regardless, I swear
that SQL is fun. Don’t believe me? Take my course and tell me you hated it.
Every query is a mini puzzle to solve. Maybe SQL is less fun when you’re an
experienced engineer, but who cares? SQL is powerful. It has been around
since the 1970s, and today it is the standard bridge between data engineers
and data analysts.
A wise man once told me that a wise woman once told him that a drunk
person’s words are a sober person’s thoughts. On this account, don’t take
it from me, but from a reddit post by a drunk senior engineer:
5
CONTENTS 6
To me, that last line is SQL in a nutshell. I may hate it, but I recom-
mend it. And look, I suspect most of my intended audience aren’t around
to “make bank”. You’re more likely here to learn to use SQL on research
projects involving some kind of dataset that doesn’t seem to fit well in a
CSV file. I recommend learning SQL, not so you can demand a higher salary
in your next engineering job, but for two main reasons:
1. You’re taking my course, so you probably already have found you need
to use SQL. If not, then you’re probably in a position where you could
use it soon, to great effect.
2. This book will teach you a model for structuring and processing data.
As we’ll learn, parts of that model are used in many different lan-
guages, not just SQL. This model underlies SQL and has stood a grand
test of time.
In the coming few pages, you and I will unpack the following quote:
Just kidding, let’s not unpack that quote. Right now, we don’t care what
a “domain specific language” is, and we can just call it a “programming
language”. What we will do, very briefly, is learn about relational databases,
and the model that underlies SQL. Here is a more important quote to get
you on your way:
Perhaps you’re thinking: “hey listen here, Danny, I don’t care about database
management, I just want to use SQL to get my dataset so I can analyse it and
be on my way.”
CONTENTS 7
1
I mean oracle in the sense of a magic person who answers questions, not in the sense
of Oracle Corporation which, incidentally, manufactures database systems.
Chapter 1
8
CHAPTER 1. THE RELATIONAL MODEL 9
Under the hood, a DBMS interacts with a computer via some low-level
magic, mostly of interest to engineers. Above the hood, the DBMS interacts
with humans. So, the DBMS aims to share a conceptual representation (i.e.,
a structure) of the data with these humans. For the DBMS, this conceptual
representation is stored as a catalogue of information called metadata (lit-
erally, data about data). The use of metadata, kept separate from the main
data, allows the DBMS to maintain a nice layer of abstraction between its
CHAPTER 1. THE RELATIONAL MODEL 10
The paper draws on some of the mathematical theory of relations (or set
theory and first-order predicate logic). It applies this math to the task
of modelling (i.e., structuring) a database with metadata. The resulting
model, called the relational model, turns out to have some very nice math-
ematical properties that are super useful for a DBMS to take advantage of.
Much of Ted’s seminal paper on the relational model looks a bit like this:
Figure 1.3: Some undefined mathematical symbols to scare us away from Ted’s paper.
[Codd, 1970].
Friends
FriendID FirstName LastName FavColour
1 X A red
2 Y B blue
3 Z C NULL
If the Friends table includes all the names of my friends, then the above
operation will give me a list of their first names, X, Y , and Z. We will look
at the SELECT and FROM keywords closely in time. For now, we’re going to
focus on building our vocabulary on the anatomy of tables.
Friends
FriendID FirstName LastName FavColour
WRON
Friends G
FriendID FirstName LastName FavColour PetName1 PetName2
1 X A red NULL NULL
2 Y B blue Chikin NULL
3 Z C NULL Cauchy Gauss
According to the above table, my friend X has no pets, Y has one pet
(Chikin), and Z has two pets (Cauchy and Gauss). This set-up is problematic
for a few reasons.
• Firstly, I have to store NULL in every entry where there is no pet. This
takes up space and is cumbersome.
• Secondly, if I meet a new friend who has three pets, then we need
to add an extra column to the table. In this case, after adding a new
column (PetName3 ), we would have to insert new NULL values under
PetName3 into every row that doesn’t have 3 pets. With many friends,
and many pets, the amount of NULL values quickly escalates.
CHAPTER 1. THE RELATIONAL MODEL 15
The new table will contain data on all the pets. Each pet will receive its
own unique identifier, petID. A new and crucial little attribute, FriendID,
will describe which friend each pet belongs to.
CORRE
Pets CT
PetID PetName PetDOB FriendID
1 Chikin 24/09/2016 2
2 Cauchy 01/03/2012 3
3 Gauss 01/03/2012 3
Now, for example, if we want to retrieve the details on the owner of the
pet named Chikin, then we can start by looking up the Pets.FriendID for
Chikin (finding that it’s equal to 2) and then we match Chikin to its owner
by finding out where the column Friends.FriendID is equal to 2.
Pets Friends
PetID . . . FriendID FriendID . . .
1 2 1
2 3 2
3 3 3
Figure 1.8: Finding the details of the friend who has the pet with PetID of 1.
For the above to work smoothly, each entry stored under Pets.FriendID
must correspond an existing entry stored under Friends.FriendID (recall
our notation that Pets.FriendID means “the FriendID column of the Pets
table”). We need every Pets.FriendID entry to have a matching entry in
Friends.FriendID, so that we can be guaranteed that every pet will have an
owner. This requirement, in database lingo, is called referential integrity.
In Section 1.4.3, we will learn why referential integrity is important.
CHAPTER 1. THE RELATIONAL MODEL 16
Going the other way, if we want the details of all pets belonging to, say,
my friend Z, then we start by finding, in the Friends table, that my friend
Z has FriendID equal to 3, and then we can search for every Pets.FriendID
entry that is also equal to 3.
Pets Friends
PetID . . . FriendID FriendID . . .
1 2 1
2 3 2
3 3 3
Figure 1.9: Finding the details of all pets who belong to the friend with FriendID of 3.
Take a moment to convince yourself that this works, and that the above-
mentioned problems with our previous table (Figure 1.6) have been solved.
By using two tables instead of one, we have removed many NULL values,
while also making the database more flexible. Part of being “more flexible,”
is that each pet now has its own unique ID number (PetID). This means, if
we want to, we can add a new one-to-many relationship between pets and
something else (like pet toys), just the same way that we created a one-to-
many relationship between Friends and Pets. That is, we would create a
new table (say, PetToys) with an attribute PetID, that “points at” the PetID
column of the Pets table, indicating to which pet each toy belongs (just how
we indicated to which friend each pet belongs). Try it yourself now, as an
exercise.
In this section, we modelled a one-to-many relationship. In the next
section, we will expand on this idea to cover the two remaining types of
relationships that can arise between two tables in a relational database:
one-to-one relationships, and many-to-many relationships. All three kinds
of relationships are managed by wonderful attributes called primary and
foreign key pairs.
ample, the primary key for the Friends table is the FriendID column, and
the primary key for the Pets table is the PetID column. Since the role of a
primary key is to uniquely identify rows, we must ensure that every primary
key entry is unique. That is, no two rows in a table can share the same value
for their primary key entries.
It is good practice to give every table a primary key. When creating a SQL
database, it’s easy to specify which attributes are primary keys. However,
unfortunately, some database administrators didn’t get the memo. So, in
the databases that you use, you may potentially encounter tables having
no clear primary key. In such cases, as we’ll soon learn, when writing SQL
queries, we may have to choose an attribute to play the role of primary key,
even when the database doesn’t recognise it officially as a primary key. The
same goes for foreign keys.
A foreign key is any column (or collection of columns) where each entry
is equal to one, and only one, primary key entry in some other table. Thus,
given a foreign key entry, we can always find the unique primary key entry
that it is equal to. For this reason, we say that the foreign key is “pointing
at”, or that it references, the corresponding primary key. When creating a
foreign key, the creator writes SQL code to tell the database which primary
key the foreign key is “pointing at”.
Looking at these tables, can you decide if either (i) or (ii) below are true?
(ii) For each PostCode entry in Houses, is there definitely at least one en-
try in Suburbs with that PostCode?
• If (i) is false, then two suburbs can share the same post code. For such
a pair of suburbs, these tables would not allow us to decide which of
the two suburbs a given house belongs to.
and one back can be scratched by more than one friend. In practice, we can
model a many-to-many relationship using one new table and two one-to-
many relationships. In other words, we make one new table, and use two
primary/foreign key pairs.
In this Scratched table, the foreign key ScratcherID references the primary
key FriendID from the Friends table. This lets us know which friend did
the back scratching. Similarly, the foreign key ScratcheeID references the
primary key FriendID from the Friends table. This lets us know whose back
was being scratched.
Friends Friends
FriendID FirstName ... FriendID FirstName ...
1 X 1 X
2 Y 2 Y
3 Z 3 Z
Scratched
ScratcherID Date Time ScratcheeID
1 05/09/2018 12:00pm 2
1 05/09/2018 12:30pm 3
2 06/09/2018 11:00am 1
3 07/09/2018 10:00am 1
In this example, both foreign keys reference the primary key from the
Friends table. In Figure 1.10, we are visualising this as if there are two
copies of Friends. In general, a many-to-many relationship can exist be-
tween any two tables, i.e., not necessarily between one table (Friends) and
itself. In any case, we always model a many-to-many relationship using
two one-to-many relationships: that is, a new table (Scratched) and two
primary/foreign key pairs, as we have done in Figure 1.10.
For practice, let’s model one more many-to-many relationship. This
time, between pets and friends. A pet can play with more than one friend,
and a friend can play with more than one pet. For whatever reason, we
decide to keep count. We need a new table, PlayCount, and two primary/-
foreign key pairs.
CHAPTER 1. THE RELATIONAL MODEL 20
Pets Friends
PetID PetName ... FriendID FirstName ...
1 Chikin 1 X
2 Cauchy 2 Y
3 Gauss 3 Z
PlayCount
PetID Count FriendID
1 3 1
1 5 2
3 4 2
We can see from the PlayCount table (Figure 1.11) that my friend X
played with Chikin 3 times, Y played with Chikin 5 times, and Y played
with Gauss 4 times. Nobody played with Cauchy.
So, there we have it. A many-to-many relationship between two tables
is really just two one-to-many relationships, with a new intermediate table
(PlayCount in Figure 1.11, or Scratched in Figure 1.10) to store the two new
foreign keys, along with any attributes of the relationship itself (such as a
Count attribute, or the Date and Time).
Remember, way back in Section 1.4.2, when I said that every table should
have a primary key? Well, where is the primary key in PlayCount? Where is
it? It’s not there. We need to make it. What’s more, it won’t be any single
column. The primary key of PlayCount will be two columns.
So far, we’ve only seen primary keys that are a single column. But, re-
member, the primary key can be more than one column. Of course, as we
well know, the primary key also has to be unique, and it must have no
NULL values. To achieve uniqueness in Scratched, we need to use all four
columns as the primary key - gasp! all four columns? Yes, because one friend
can scratch the same friend’s back on different dates and times, so we need
all four columns, to achieve uniqueness. So, the primary key needs to be
all four columns in Scratched, but what about in PlayCount? Well, to get
uniqueness in PlayCount, we only need the primary key to be composed of
two columns: the two foreign keys. Indeed, when the same friend plays with
the same pet on different occasions, we can just increment the correspond-
ing Count attribute, rather than adding a whole new row to PlayCount and
violating the uniqueness of the two foreign keys.
CHAPTER 1. THE RELATIONAL MODEL 21
WRON
Friends G
FriendID FirstName ... PptCountry PptNo PptExpiry
1 X Australia E1321 12/03/2021
2 Y New Zealand LA123 01/09/2032
3 Z Monaco S9876 19/06/2028
Assuming that each friend has only one passport, we say that there is a
one-to-one relationship between friends and passports. The above table
may often be perfectly fine for capturing this one-to-one relationship. In
many cases, there is no need to introduce a new table when modelling a
one-to-one relationship, at all. Indeed, individual tables capture one-to-
one relationships themselves, already. For example, by including a First-
Name column in the Friends table, we are implying that there is a one-to-
one relationship between a friend and their own first name.
However, for keeping track of my friends passport details, I’ve decided I
need more than one table. This implies I want to think of friends as separate
entities to their passports. One reason might be that, perhaps, many of my
friends do not have passports, and I don’t want to create unnecessary NULL
values (recall Figure 1.6). Another reason could be about the kind of data
I want to delete when I delete a friend. Suppose I lose my friend, X. Well,
I definitely want to delete their details from my Friends table. In the next
table, I’ve deleted my ex-friend, X:
Friends
FriendID FirstName ... PptCountry PptNo PptExpiry
2 Y New Zealand LA123 01/09/2032
3 Z Monaco S9876 19/06/2028
CORRE
Passports CT
PptNo PptCountry PptExpiry FriendID
E1321 Australia 12/03/2021 NULL
LA123 New Zealand 01/09/2032 2
S9876 Monaco 19/06/2028 3
In the above table, the foreign key FriendID will reference the FriendID
column of the Friends table, just as it would when modelling a one-to-many
relationship. Now, if I lose a friend whose passport details I’m holding onto,
then I can delete them from my Friends table and insert NULL into the cor-
responding FriendID entry in the Passports table.
This is a good time to talk about data redundancy. There is still a prob-
lem with my Passports table. If I have a NULL value in the FriendID column,
then I won’t be able to find the name of the person who the corresponding
passport belongs to. Hmm, should I include my friend’s names in the Pass-
ports table as well, so the names won’t get deleted when I delete a friend?
Think about this for a moment. Will this cause any issues? Yes, it may:
if I keep friends’ names in both the Friends table and the Passports table,
then the same piece of data will be repeated in two different locations in my
database. This is known as data redundancy.
The problems with data redundancy are that it takes up unnecessary
space, it can lead to inconsistencies in the data (if mistakes are made during
data entry), and it leads to complications when updating data (because we’ll
need to update the data in multiple locations). The process of restructuring
a relational database to reduce redundancy and preserve the integrity of
data is known as normalisation.
To solve our problem with passport names, it would be best to re-think
our database a little. For example, we could have a table called Contacts,
with details of all the people we know. We could have one-to-one rela-
tionships between Contacts and each of two other tables, named Friends
and Enemies, that contain friend-specific and enemy-specific data (like,
favourite colours for friends, and secret hideouts for enemies). Relational
database design is a deep and interesting topic, lying mostly outside the
scope of these notes. So, next time you meet a good database engineer,
give them a high five (and then employ them).
CHAPTER 1. THE RELATIONAL MODEL 23
Exercise 1.5.1
We’ll start with some quick questions to warm up.
1. Answer true or false to each of the following:
a) SQL is a relational database management system.
b) Microsoft SQL Server is a programming language.
c) A primary key must always be unique.
d) A primary key must never be NULL .
e) A foreign key must always be unique.
f) A foreign key must never be NULL .
g) In each table, only one column can be the primary key.
2. Referential integrity demands that (choose one): (i) the primary key
must always exist; (ii) the foreign key must always exist; or, (iii) for
every foreign key entry, there must be a corresponding primary key
entry.
Exercise 1.5.2
For the following two tables, you are told that each home can have many
tenants, but each tenant lives in just one home.
CHAPTER 1. THE RELATIONAL MODEL 24
Tenant
Home
TenantID FirstName
HomeID Street
1 Thomas
1 11 Fisher Avenue
2 Skylar
2 3 Cook Bend
3 Huong
3 17 Nightingale Court
4 Ananya
Tenant
TenantID FirstName HomeID
1 Thomas 1
2 Skylar 1
3 Huong 2
4 Ananya 3
Exercise 1.5.3
A European travel agent has a collection of ‘language immersion’ vacation
packages. Each vacation is within one country. The VacationHistory table
below, lists details of each vacation that a traveller has been on.
CHAPTER 1. THE RELATIONAL MODEL 25
VacationHistory
TravellerID FirstName Country Language StartDate VacationID
1 Lennon France French 2018-01-14 1
1 Lennon France French 2017-05-23 1
1 Lennon Spain Spanish 2016-05-20 2
2 Viviana France French 2017-03-09 1
2 Viviana Spain Spanish 2018-03-22 2
2 Viviana Germany German 2012-11-10 3
3 Zhang Germany German 2018-12-31 3
Vacation Traveller
VacationID Country Language TravellerID FirstName
1 France French 1 Lennon
2 Spain Spanish 2 Viviana
3 Germany German 3 Zhang
VacationRecord
TravellerID VacationID StartDate
1 1 2018-01-14
1 1 2017-05-23
1 2 2016-05-20
2 1 2017-03-09
2 2 2018-03-22
2 3 2012-11-10
3 3 2018-12-31
Chapter 2
SQL is a standard
One of the great things about SQL, is that it’s more than just a language.
It’s also a standard. Since 1987, SQL has been a standard of both the Amer-
ican National Standards Institute (ANSI), and the International Organisa-
tion for Standardisation (ISO). This means, every SQL database manage-
ment system (including Oracle, MySQL, T-SQL, SQLite, PostgreSQL, etc),
implements some number of the same basic SQL standards, in some way or
another. I will refer to these different implementations as ‘dialects’ of SQL.
Each dialect of SQL differs enough from all the others that, if you wanted
to switch from one dialect to another, you would have to spend some time
independently learning the new syntax. However, in theory, since all these
dialects implement the same SQL standards, it shouldn’t take you long.
Likewise, if a company decides to invest in a new database management
system, then it is much easier for them to switch from one SQL dialect to
another than if they changed to an entirely different standard (i.e., stopped
using SQL entirely).
The different dialects of SQL also add new features, on top of the SQL
standard, that make life significantly easier to use when writing code. In the
interests of variety and understanding, we’ll use two dialects in this book.
We’ll use one proprietary dialect, T-SQL, owned by Microsoft corporation,
and one free (open source) dialect, MySQL, managed by Oracle corpora-
27
CHAPTER 2. BASIC SQL QUERIES 28
tion. Both dialects are hugely popular, being among the top three database
management systems globally.
Of course, most of the code we learn will be SQL standard, meaning it will
work on both T-SQL and MySQL. Whenever we introduce something pecu-
liar to only one dialect of SQL (e.g., to T-SQL), we’ll make clear what the
alternative is (if it exists) in the other dialect (e.g., in MySQL). If we don’t
mention T-SQL or MySQL, it means what we are introducing something
that works in both dialects (i.e., a part of the standard).
What is a query?
Tipping our hats to English grammar terminology, the words ‘clause’ and
‘statement’ are frequently used to describe parts of the SQL language. A
SQL clause is the smallest logical component of a SQL statement that lets
you filter or customise how you want your data to be manipulated or re-
turned to you. Examples are SELECT, FROM, WHERE, GROUP BY, HAVING and ORDER
BY. Clauses become parts of statements. A SQL statement is somewhat com-
parable to an English language sentence. It contains one or more clauses.
For example, the statement SELECT FirstName FROM Friends returns the first
names of all my friends. In this chapter, we’ll introduce and visualise the
inner workings of the basic fundamental clauses in SQL, using them to build
basic fundamental statements. The action of using statements to request
data or information from a database is called a query.
Formal languages
Technically, SQL clauses are based on a pair of formal mathematical lan-
guages, called relational algebra and relational calculus. We won’t be cover-
ing the mathematics in this course, since these are mostly the domain of
computer scientists and theoreticians. That which we will be learning, SQL,
is a practical engineering approximation to these formal languages. It may
make us sound fancy to name-drop a couple of formal languages, but in
the end you will see that all we are learning is a collection of fairly intuitive
ways to chop tables up and recombine them into new tables.
FROM Friends
Friends
FriendID FirstName LastName FavColour
1 X A red
2 Y B blue
3 Z C NULL
=⇒
RESULT
FirstName FavColour
X red
Y blue
Z NULL
The SQL syntax is designed to try to mimic the English language a bit. This
means that the order in which you write clauses is not necessarily the same
order that you would think about doing them procedurally. This can lead
to a fair bit of confusion for people who have done some programming in
other languages, where the order that things are written matches the order
that things are actually done. For example, in the above query, the FROM
clause is executed first, bringing up the Friends table, and the SELECT clause
CHAPTER 2. BASIC SQL QUERIES 30
SELECT *
FROM Friends;
When using SELECT, it is good practice to avoid using the * keyword at all,
even when you want to select all columns of a table. This is because, in
practice, you might end up writing other code that depends on your SELECT
clause returning a fixed set of columns (maybe even in a fixed order). As
we’ll see later, it’s possible for the structure of a table to be changed in the
database (e.g., by adding or dropping columns), which would alter the result
of calling SELECT *, and possibly break any code that depends on the orig-
inally intended result. We will, however, often use SELECT * in this course,
for convenience and brevity.
When selecting attributes with SELECT, it’s also easy to rename columns
in the result table, using the AS keyword. For example, we can write
RESULT
My_friends_name Their_fav_colour
X red
Y blue
Z NULL
So if, for some twisted reason, you decide to name a column select instead
of My_friends_name, then you could write:
2.2.3 ORDER BY
We can use the ORDER BY clause at the end of a query, simply to order the
rows of the result.
SELECT *
FROM Friends
ORDER BY FriendID;
The above will just return the Friends table, exactly as in Figure 1.4. This
is because the Friends table is already ordered by FriendID. To reverse the
order, we can use the keyword DESC, specifying a descending order:
SELECT *
FROM Friends
ORDER BY FriendID DESC;
RESULT
FriendID FirstName LastName FavColour
3 Z C NULL
2 Y B blue
1 X A red
SELECT *
FROM Friends
ORDER BY LastName DESC;
The above query returns the table in Figure 2.3, since Figure 2.3 already
happens to be sorted in descending order of LastName, alphabetically.
Character strings can also contain numbers and other non-alphabetic
symbols. When the ‘alphabetic’ order is extended to cover non-alphabetic
symbols, it is called the a ‘lexicographic’ order. This order can cause some
confusion when character strings containing numbers are ordered. For ex-
ample, consider the following table, Numbers:
Numbers
Num NumString
111 ‘111’
31 ‘31’
32 ‘32’
211 ‘211’
In Figure 2.4, I’ve used quote marks to clarify that NumString stores the
numbers as strings, rather than as numbers. Let’s order by Num:
SELECT *
FROM Numbers
ORDER BY Num;
RESULT
Num NumString
31 ‘31’
32 ‘32’
111 ‘111’
211 ‘211’
SELECT *
FROM Numbers
ORDER BY NumString;
RESULT
Num NumString
111 ‘111’
211 ‘211’
31 ‘31’
32 ‘32’
We will learn more about all of the data types in Table 2.1, as we progress
through the course. The two that I want to comment on here are DECIMAL(p,s)
and VARCHAR(n). The DECIMAL(p,s) data type holds any decimal number with
a maximum of p digits, and with s digits after the decimal place. So, for ex-
ample, the largest number that can be stored in DECIMAL(5,2) is 999.99; it
can store the number 0.12, but it cannot store 0.123. The VARCHAR(n) data
type can hold any string of characters, with a maximum length of n. So,
VARCHAR(5) can hold the word ‘smart’, or the string ‘a12’, but it can’t hold
the word ‘smarty’.
SELECT NumString
FROM Numbers
ORDER BY CAST(NumString AS INT);
RESULT
Num NumString
31 ‘31’
32 ‘32’
111 ‘111’
211 ‘211’
Notice, in Figure 2.7, that the above query has ordered NumString using
the numerical order, rather than the lexicographic order. Nice! Aside from
ordering, there are many more cases where it can be beneficial to use CAST
(and we’ll see an important example in Section 3.2.2).
There are a large number of scalar functions, but three of the most use-
ful categories of scalar functions are:
CHAPTER 2. BASIC SQL QUERIES 35
• mathematical functions;
• string functions; and
• date and time functions.
Next, I’ll present some commonly used scalar functions, along with a few
examples. I’ll also provide some links to more of these functions, so you
can expand your repertoire as you learn SQL. In most of these examples,
I’ll use this table of restaurant orders:
Orders
OrderID Item Price OrderDT
1 Boiled leaves 2.99 2021-12-31 15:13:00
2 Bow wow 15 2021-12-31 15:34:00
3 Cackleberry stew 32.55 2022-01-01 09:32:00
4 Mug of murk 4.40 2022-01-01 10:16:00
Mathematical functions
Function Description
SQRT Square root
ROUND Rounding
RAND Generate random number
Should we desire, we can take the square roots of prices in the Orders table:
RESULT
SquareRoot
1.7291616465790582
3.872983346207417
5.705260730238365
2.0976176963403033
Figure 2.9: The (long and ugly) square roots of prices, from Orders.
CHAPTER 2. BASIC SQL QUERIES 36
We can also ‘nest’ functions, one inside the other, to do a sequence of trans-
formations on the same column. For example, since the output in Fig-
ure 2.13 is pretty hard to look at, we might want to round the numbers
to only two decimal places. We can do that with the ROUND function:
RESULT
SquareRoot
1.73
3.87
5.71
2.1
Figure 2.10: The less ugly square roots of prices, from Orders.
Not all of the mathematical functions operate on columns. The RAND func-
tion, for example, just produces a random number between 0 and 1.
SELECT RAND();
The full list of T-SQL mathematical functions is available here, and the
MySQL functions are available here.
String functions
Function Description
CONCAT Concatenate columns
SUBSTRING Extract characters
RESULT
Summary
The Boiled leaves cost 2.99
The Bow wow cost 15.00
The Cackleberry stew cost 32.55
The Mug of murk cost 4.40
The SUBSTRING function can cut characters out of a string according to their
position. For example, starting with the six-letter word ‘amazed’, we can
use SUBSTRING('amazed',2,4) to cut out the letters in positions 2 to 4, re-
turning the word ‘maze’. Here’s another example, on Orders:
SELECT SUBSTRING(Item, 1, 3) AS FirstThree
FROM Orders;
RESULT
FirstThree
Boi
Bow
Cac
Mug
For a full list of T-SQL string functions, click here, and for MySQL string
functions click here.
Function Description
DAY Extract the day (of the month)
MONTH Extract the month
YEAR Extract the year
The above date and time functions can be used to extract parts from date
or time data. For example:
CHAPTER 2. BASIC SQL QUERIES 38
RESULT
TheDay TheMonth
31 12
31 12
01 01
01 01
The list of T-SQL date and time functions is here, and the MySQL date and
time functions are here. Note, by following these links, you can also find
functions to return the current date and time.
2.2.6 WHERE
While SELECT specifies which columns to return, the WHERE clause specifies
which rows to return. The returned rows are chosen based on whether they
meet a search condition. A search condition is a logical statement that
evaluates to either true or false, for any given row. For example, the search
condition 1 = 1 will always be true.
SELECT *
FROM Friends
WHERE 1 = 1;
The WHERE clause in the above query is pointless because it does not exclude
any rows. In essence, the role of a WHERE clause is to exclude rows. If no
rows are to be excluded then it would be neater to avoid writing the clause,
since our query would return all rows of the table just the same. The act
of excluding rows of a table based on meeting a search search condition is
often referred to as filtering.
Search conditions can get fairly complicated, to the point where they
can, and often do, have whole separate queries nested inside them (but
we’ll open that can of worms later). Simple search conditions compare two
expressions via a logical operator, such as the symbols = (equals), < (less
than), or <= (less than or equal to). We give more details on logical operators
in Section 2.4.1, and more on search conditions in Section 2.4.4.
To create a more useful search condition than 1 = 1, we can include the
name of an attribute (i.e., column) as one of the expressions. For example,
CHAPTER 2. BASIC SQL QUERIES 39
the search condition FavColour = 'red' evaluates to true for every row that
has 'red' in the FavColour column.
Recall that clauses are not executed, in practice, in the same order that
they appear in the SQL syntax. The first clause to be executed is usually
FROM. The last clause to be executed is usually SELECT.
FROM Friends
Friends
FriendID FirstName LastName FavColour
1 X A red
2 Y B blue
3 Z C NULL
=⇒
RESULT
FirstName LastName
X A
stances of some entity that satisfy some condition”. This could be, “give
me all flight arrival and departure times for flights leaving Melbourne and
arriving in New Zealand on the 20th of November.” But, with what we’ve
learned, we can only filter based on the rows of the table that is referred to
in the FROM clause. For example, the following would produce an error:
SELECT FirstName
FROM Friends
WHERE PetName = 'Chikin';
Msg 207, Level 16, State 1, Line 1 Invalid column name 'PetName'.
Figure 2.15: Error message printed because PetName is not in the Friends table.
Figure 2.16: Error produced when an alias from SELECT is used in WHERE.
The above error is produced because the alias Their_fav_colour doesn’t ex-
ist yet when the WHERE clause executes. This is because SELECT is executed
last in the above query. Of the queries we’ve learned so far, the order of ex-
ecution is: FROM, WHERE, SELECT, and finally ORDER BY. You may not always
CHAPTER 2. BASIC SQL QUERIES 41
remember the exact order of execution, as a beginner, but if you pay atten-
tion to error messages (like that in Figure 2.16) then they will often nudge
you in the right direction.
=⇒
SELECT FirstName, PetName
RESULT
FirstName PetName
Y Chikin
Z Cauchy
Z Gauss
The above can be achieved more succinctly, using aliases. An alias is a sin-
gle letter, or a short word, that allows us to refer to a table without having
to write its full name. In the following, we choose the letters F and P as
aliases for Friends and Pets, respectively.
SELECT *
FROM Friends F JOIN Pets P ON F.FriendID = P.FriendID;
In SQL, there sometimes exists optional keywords that have no effect on the
meaning of a query. For example, when writing an alias, we can optionally
preceded the alias with the word AS, as in:
SELECT *
FROM Friends AS F JOIN Pets AS P ON F.FriendID = P.FriendID;
So, sometimes there are lots of ways to do the same thing! In fact, the
following are two other equivalent ways to write the above query.
CHAPTER 2. BASIC SQL QUERIES 43
SELECT *
FROM Friends F INNER JOIN Pets
ON F.FriendID = Pets.FriendID;
SELECT *
FROM Friends F, Pets P
WHERE F.FriendID = P.FriendID;
The last approach is called an implicit join, because the JOIN command
is not written explicitly but signalled by the comma between Friends F and
Pets P, and because the WHERE clause, instead of the ON clause, has been
used to specify the join condition F.FriendID = P.FriendID. We will avoid
using the implicit join, because it is less clear and is no longer part of the
SQL standard.
Table1 Table2
A B C D E A
1 Ignorance is slavery. 3 1
2 War is weakness. 4 2
3 Freedom is strength. 1 3
4 Friendship is peace. 2 4
If we join the tables, by comparing the primary key (Table1.A) with the
associated foreign key (Table2.A), then we get the intended table:
A B C D E
1 Ignorance is slavery. 3
2 War is weakness. 4
3 Freedom is strength. 1
4 Friendship is peace. 2
But if we instead mistakenly join the tables using the wrong join condition,
say, Table1.A = Table2.E, we get
A B C D A
1 Ignorance is strength. 3
2 War is peace. 4
3 Freedom is slavery. 1
4 Friendship is weakness. 2
Great, so we know how to join tables now, and Figure 2.20 warned us
about the potentially bleak results of choosing the wrong join condition.
But, what if we want our query to keep all rows of one table, even if they
don’t match any rows from the other table? For this, we need LEFT JOIN.
The LEFT JOIN in the above query keeps everything from Friends (the ta-
ble appearing to the left – hence the word ‘left’). To achieve this, it will re-
turn NULL values in place of any corresponding missing attributes from Pets.
So, since X has no pets, the query returns X’s name, but inserts NULL in the
position where X’s PetName would go if they had a pet:
CHAPTER 2. BASIC SQL QUERIES 45
FirstName PetName
X NULL
Y Chikin
Z Cauchy
Z Gauss
In the wild, you might see LEFT JOIN written as LEFT OUTER JOIN, though
it does the same thing as LEFT JOIN (so the word ‘outer’ is redundant). You
might also see a RIGHT JOIN appear in the wilderness, which does the same
thing as a left join, but on the right side instead of the left side. In other
words, the two queries below are equivalent.
Table 2.5: The standard comparison operators, with examples that return TRUE.
For a 25 year old male, Gender = 'M' evaluates to TRUE, and Age > 35 evalu-
ates to FALSE. So, the above will become
At this point, the AND operator does its thing, converting the above state-
ment into one logical value. You can use the truth tables in Figure 2.22, to
find out the answer.
Logical operators include the familiar words AND, OR and NOT. They com-
bine two or more instances of TRUE, FALSE or NULL , and produce new in-
stances of TRUE, FALSE or NULL . This should become more clear while looking
at the truth tables below.
CHAPTER 2. BASIC SQL QUERIES 47
AND
true AND true = true
false AND true = false
true AND false = false
false AND false = false
OR
true OR true = true
false OR true = true
true OR false = true
false OR false = false
NOT
NOT true = false
NOT false = true
Operator
ALL
ANY
SOME
EXISTS
BETWEEN
IN
LIKE
We’re not going to learn all of these right now. By the time you’re done
with Section 4.1, you’ll be able to read SQL documentation and figure out
what they do yourself. Some of them, we will look at in more detail soon.
Comparison operators, logical operators, and other operators, all go hand-
in-hand with search conditions, so we see them more in the next section.
CHAPTER 2. BASIC SQL QUERIES 48
The above search condition will exclude every row not representing a
male over the age of 35. We have used the brackets to make the order of
operations clearer. You don’t always have to use these brackets, but it is
often good for clarity. Search conditions can get about as complicated as
you like. For example, the below will ensure your query results include only
people who are both female and over 35, or both male and under 25.
WHERE ((Gender = 'F') AND (Age > 35)) OR ((Gender = 'M') AND (Age < 25))
Operator precedence
A quick route to great suffering is to forget about operator precedence when
writing a search condition. Operator precedence refers to the order that
governs which operators are executed first in a search condition.
Precedence Operators
1 Anything in round brackets
2 =,<,>,<=,>=,!=,!<,!> (comparison operators)
3 NOT
4 AND
5 OR, ALL, ANY, SOME, EXISTS, BETWEEN, IN, LIKE
Notice from Table 2.7, that AND is evaluated before OR. This can cause some
sneaky errors. Consider, for example, the following two search conditions:
Example 2.4.1. Table 2.7 will allow us to evaluate the following compli-
cated (and poorly written) search condition. You may also want to refer
back to the truth tables for logical operators, in Figure 2.22.
1. Starting with ‘anything in round brackets’, and noting that the ex-
pression (TRUE OR FALSE) evaluates to (TRUE), we have:
1 < 2 AND 2 = 2 OR 1 = 1 AND NOT (TRUE)
SELECT Name
FROM RandomPeople
WHERE Gender IN (SELECT Gender
FROM RandomPeople
GROUP BY Gender
HAVING AVG(Age) > 40);
There’s a lot going on above. We haven’t yet learned the clauses GROUP BY,
or HAVING, or anything about AVG. But, if you squint at the query, you can
see there is a whole second query that appears nested in brackets after the
keyword IN. That whole query is actually part of the search condition - so,
that whole nested query is actually part of the WHERE clause. We’ll find out
nested queries work later, in Section 3.4.
This process is called pattern matching. Along with %, standard SQL pro-
vides another pattern matching character, _ (the underscore). The under-
score matches any single character. So, the statement
having three underscores, will match any name starting with ‘Bi’ and being
exactly five characters in length.
Later, we will see that the T-SQL dialect adds additional wildcard char-
acters, and that MySQL (but not T-SQL) also has its own additional version
of the much more powerful regular expression pattern matching, which in-
troduces many more wildcard tools, allowing MySQL programmers to per-
form very imaginative pattern matching. If you’re keen to look at that now,
more details on these can be found under the T-SQL documentation for
LIKE, and under the MySQL documentation for pattern matching (but these
may go a little over your head at this stage).
CHAPTER 2. BASIC SQL QUERIES 51
NULL = NULL
It doesn’t return TRUE! In fact, it returns NULL . This makes sense, when you
realise that NULL literally means ‘unknown’ or ‘does not exist,’ and that
every NULL value is treated as distinct from every other NULL value (that is,
no two unknowns are necessarily the same). In fact, the same kind of thing
happens when we compare anything at all to NULL . Take, for example
10 = NULL .
The above operation returns NULL . This also makes sense, because we can-
not be sure whether the unknown thing, represented by the NULL , is actu-
ally equal to 10 or not, so we have to return NULL to indicate that the result
is unknown.
This behaviour of NULL with comparison operators can lead to some
particularly sneaky mistakes, in SQL code, that can produce incorrect re-
sults without ever causing any errors (so you’ll possibly never notice the
mistake). The solution is often to use the standard SQL clause IS NULL, as
we will see in the following example.
Example 2.4.2. Perhaps the most common mistake that I’ve seen creep up
with NULL values, begins with some variation of the following. We want
to retrieve all Friends whose favourite colour is not blue, and we implicitly
expect the result to include all Friends whose favourite colour is NULL . The
mistake is that we write:
Think about it like this: the WHERE clause will only keep rows where the
search condition FavColour != 'blue' evaluates to TRUE. Now, since the ex-
pression NULL != 'blue' evaluates to NULL , the rows with FavColour NULL
are discarded, alongside the rows with FavColour blue. The result is:
RESULT
FirstName FavColour
X red
Figure 2.23: Friends whose FavColour is definitely not blue (so, excluding NULL values).
CHAPTER 2. BASIC SQL QUERIES 52
If we want to include NULL values in the result, we can make the search
condition return TRUE for NULL values, by making use of the IS NULL clause:
CASE WHEN FirstName = 'X' THEN 'Dr. X' ELSE FirstName END
The above expression will return ’Dr. X’ for anyone with FirstName X.
Otherwise, it will just return their FirstName. Here it is within a query:
SELECT
CASE WHEN FirstName = 'X' THEN 'Dr. X' ELSE FirstName END
AS NewNames
FROM Friends;
RESULT
NewNames
Dr. X
Y
Z
The CASE WHEN syntax is very verbose, so I had to move it (as well as the alias,
AS NewNames) to a new line. New lines don’t affect the behaviour of the query
CHAPTER 2. BASIC SQL QUERIES 53
at all – they are just there for aesthetics. Hopefully, it’s still easy to see that
CASE WHEN is positioned like any other column in the SELECT clause.
Example 2.4.3. As another example, let’s return to the simple query that
produced Figure 2.1. That query was just the following:
Now, suppose we want to alter the FavColour column, to return the word
‘yes’ when a friend has a favourite colour, and ‘no’ when the favourite colour
is null. This means we’ll be using the search condition FavColour IS NULL.
Here is the query:
SELECT FirstName,
CASE WHEN FavColour IS NULL THEN 'no' ELSE 'yes' END AS HasFavCol
FROM Friends;
RESULT
FirstName HasFavCol
X yes
Y yes
Z no
For the above, the output will correspond to whichever is the first search
condition to evaluate to TRUE. If none of them are TRUE, then final_output
is returned. Here, we’ll use multiple cases for renaming friends:
RESULT
NewNames
Dr. X
Prof. Y
Z
Exercise 2.5.1
This exercise should be done by hand (e.g., pen and paper). We’ll practice
using SELECT, FROM, CASE WHEN, ORDER BY and aliases. You’re given the one
table:
Alphanumeric
Number Letter NumString
1 a ‘34’
2 b ‘121’
Note, the NumString column uses quote marks to indicate that the numbers
are stored as strings rather than actual numbers.
1. Write down the result of the query below.
-- This is written in T-SQL syntax
SELECT T.Letter AS [Letter of Alphabet], T.Number AS Num
FROM Alphanumeric T;
SELECT T.Number,
CASE WHEN ... THEN ...
WHEN ... THEN ...
END AS ...
FROM Alphanumeric T;
CHAPTER 2. BASIC SQL QUERIES 55
When finished filling in the blanks, write down the result of the query.
4. Write down the result of the following query.
SELECT *
FROM Alphanumeric
ORDER BY NumString DESC;
Note, DESC indicates the rows will be ordered from greatest to least
(descending order).
5. The following query produces an error. Can you explain why?
SELECT FirstName AS MyFriendName
FROM Friends
WHERE MyFriendName = 'X';
Note that the order of execution of SQL clauses is not necessarily the
same as the order they are written. The clause FROM executes before
WHERE, which in turn executes before SELECT.
RESULT
Letter of Alphabet Num
a 1
b 2
2. The MySQL and T-SQL syntax taught in this course are rarely dif-
ferent, thanks to the SQL standard. However, this is one case where
they differ. In MySQL, column names can be quoted using the back-
tick character instead of square brackets:
-- This is written in MySQL syntax
SELECT T.Letter AS `Letter of Alphabet`, T.Number AS Num
FROM Alphanumeric T;
RESULT
Number LETTERS
1 The letter is A
2 The letter is B
RESULT
Number Letter NumString
1 a ‘34’
2 b ‘121’
Exercise 2.5.2
This exercise should be done by hand (e.g., with pen and paper). We will
practice joining two tables together, and using a simple WHERE clause. You
are given the following two tables.
Home
HomeID Street
1 11 Fisher Avenue
2 3 Cook Bend
3 17 Nightingale Court
Tenant
TenantID FirstName HomeID
1 Thomas 1
2 Skylar 1
3 Huong 2
4 Ananya 3
RESULT
TenantID FirstName HomeID Street
1 Thomas 1 11 Fisher Avenue
2 Skylar 1 11 Fisher Avenue
3 Huong 2 3 Cook Bend
4 Ananya 3 17 Nightingale Court
3. The query takes the result in answer 1, filters out any rows where
HomeID is not equal to 1, and then selects only the FirstName and
Street columns.
RESULT
FirstName Street
Thomas 11 Fisher Avenue
Skylar 11 Fisher Avenue
Exercise 2.5.3
With this exercise, we’ll learn how to join 3 tables together. You should do
this exercise by hand.
1. You are given the three tables below:
Write down the result of the query above. Hint: you should take
your answer from part 1, and join that to Table3, using column
Y as the primary/foreign key pair.
c) Write down the result of the following query.
SELECT T1.A, T2.X, T2.Y, T3.B
FROM Table3 T3 JOIN Table2 T2 ON T3.Y = T2.Y
JOIN Table1 T1 ON T2.X = T1.X;
PlayCount Friends
PetID Count FriendID FriendID FirstName ...
1 3 1 1 X
1 5 2 2 Y
3 4 2 3 Z
Pets
PetID PetName ...
1 Chikin
2 Cauchy
3 Gauss
RESULT
A X Y
1 x1 y2
2 x2 y1
RESULT
A X Y B
1 x1 y2 2
2 x2 y1 1
c) The result of this query is the same as part b, since it does not
matter what order the tables are joined in, provided Table1 joins
to Table2 and Table2 joins to Table3.
2. The solutions are as follows.
a) The relationship is many-to-many.
b) The result of joining the three tables is:
RESULT
... PetName PetID Count FriendID FirstName ...
Chikin 1 3 1 X
Chikin 1 5 2 Y
Gauss 3 4 2 Y
Exercise 2.5.4
We’ll now join 3 tables together in a more realistic scenario. You will need
to link information from one table, to information from another table, but
you will not be able to do it without joining via a third table. You are given
the following three tables.
CHAPTER 2. BASIC SQL QUERIES 60
Person
P_ID FName LName S_ID BirthYr Y_ID Z_ID E_CF
32 Bob Smith 24 2004 2 E2 H2
1 Sam Smith 12 2002 2 J8 I7
16 Ivy Smith 32 1997 8 M5 66
5 Joy Jones NULL 1999 7 B4 32
9 Sky Jones NULL 2011 8 E3 9
Suburb
S_ID Name PostCode D_ID
24 Balwyn 3103 1
12 Glen 3146 1
32 Hawthorn 3122 3
Demographics
D_ID G_ID M_ID T_ID StartBracket EndBracket
3 32 3 4 50000 100000
1 1 7 39 150000 200000
2 4 2 38 100000 150000
We are told that P_ID is the primary key of the Person table, S_ID is the
primary key of the Suburb table, and D_ID is the primary key of the De-
mographics table. Suppose we want to link a person’s age to the average
annual income bracket (StartBracket and EndBracket) of their suburb.
1. If we want to join 3 tables, how many primary/foreign key pairs do
we need to use?
2. Which primary/foreign key pairs do we need to use to link a person’s
BirthYear to the income bracket of their suburb?
3. Write a query that produces a table with two columns: the BirthYr of
each person, beside the StartBracket and EndBracket of the suburb
they live in.
4. Write down the table produced by your solution to question 3.
RESULT
BirthYr StartBracket EndBracket
2004 150000 200000
2002 150000 200000
1997 50000 100000
Exercise 2.5.5
In this exercise, we will practice working with search conditions. You may
want to refer carefully to the operator precedence rules in Table 2.7. I will
try to trick you into making mistakes regarding operator precedence and
NULL values.
1. Write down what each of the following expressions evaluates to.
a) 1 = 1 AND 'a' = 'a'
b) NOT 0 < 1
c) 1 != 1 OR 2 = 2
d) NULL != NULL
e) NULL = NULL
f) 1 = 1 OR 2 < 3 AND 3 != 3
g) NOT 1 > 2 AND 'blue' = 'green'
2. Consider the following table of house sales.
HouseSales
Suburb Bedrooms PriceThousands
Bundoora 4 550
Bundoora 2 700
Alphington 5 1200
SELECT *
FROM HouseSales
WHERE Suburb = 'Bundoora'
AND (PriceThousands < 600 OR Bedrooms > 3);
Atoms
Element Num Mass
Argon 18 39.948
Potassium 19 NULL
Calcium NULL NULL
Scandium 21 44.956
RESULT
Suburb Bedrooms PriceThousands
Bundoora 4 550
Alphington 5 1200
RESULT
Suburb Bedrooms PriceThousands
Bundoora 4 550
RESULT
Element Num Mass
Potassium 19 NULL
Calcium NULL NULL
RESULT
Element Num Mass
Exercise 2.5.6
In this exercise you will practice using LEFT JOIN. You are given the follow-
ing two tables.
Atoms Components
Element Num Mass Molecule Num
Argon 18 39.948 SO3 Ar 18
Potassium 19 NULL Ar·HCCH 18
Calcium NULL NULL ArCa 18
Scandium 21 44.956 ArCa 20
CHAPTER 2. BASIC SQL QUERIES 64
1. We’ll warm up with a regular join (also known as an inner join). Write
down the result of the following query.
SELECT A.Element, A.Num, C.Molecule
FROM Atoms A JOIN Components C ON A.Num = C.Num;
RESULT
Element Num Molecule
Argon 18 SO3 Ar
Argon 18 Ar·HCCH
Argon 18 ArCa
2. For a left join, every row from the left table (Atoms) is included in
the result at least once. However, the inner join (from solution 1) is
also present. The result is:
RESULT
Element Num Molecule
Argon 18 SO3 Ar
Argon 18 Ar·HCCH
Argon 18 ArCa
Potassium 19 NULL
Calcium NULL NULL
Scandium 21 NULL
1. Go to the set-up page on the course repository and follow the instruc-
tions to set up a free local MySQL or T-SQL database management
system, and fill it with the data provided. This guide will also instruct
you to choose and download an appropriate SQL editor for writing
and executing code.
2. There are many SQL resources, but I personally like the excellent MySQL
tutorial at www.selectstarsql.com. Read the front matter if you like,
bookmark the tutorial, and come back to it later. I believe it’s always
important to learn the basics from more than one author. This par-
ticular tutorial also lets you execute MySQL code in the browser.
3. Search online for a simple syntax guide. For MySQL, I like the w3schools
guide. For T-SQL, I like the one from dofactory. Don’t spend too long
on this. It’s just to let you know they exist. Later, we’ll also learn to
read the proper T-SQL and MySQL documentation directly.
4. After completing the setup step (item 1 above), you have access to
multiple databases. Use the directory tree in your SQL editor to begin
investigating them. You can switch databases with the USE keyword.
For example, to use the Sandpit database, execute:
USE Sandpit;
GO -- 'GO' is for T-SQL only, remove this line for MySQL.
5. In the Sandpit database you can find all the tables from these notes.
T-SQL organises tables into ‘schemas’. For example, Friends is in the
Notes schema, so it is named Notes.Friends. MySQL, on the other
hand, doesn’t support schemas using the ‘.’ symbol. So, in MySQL, I
have named them with underscores instead. This means, the Friends
table in the Notes schema is called Notes_Friends, in MySQL. This
naming convention allows us to pretend MySQL has schemas. Use
the directory tree to determine some of the names of other schemas.
6. Use the directory tree to figure out some of the table names and col-
umn names in the Notes schema. You may notice that some column
names are slightly different to the ones in these notes. Why might
that be? Any ideas?
7. Use your editor’s interface to view the columns that are present in
the Notes.Friends table. What do you see? Can you determine the
data types of each column? Can you determine whether NULL values
are allowed in each column? If you like, you can learn more about
data types (and find the data types varchar and int) in the T-SQL
or MySQL documentation (note, in MySQL, int is sometimes called
integer). Don’t spend too long on the docs now!
CHAPTER 2. BASIC SQL QUERIES 66
8. Open a new query tab. In the Sandpit database, execute the following
query that selects all of the rows and columns of the Notes.Pets table.
SELECT *
FROM Notes.Pets;
Exercise 2.7.1
Complete each of the tasks below, using SQL queries. They all relate to
the tables in the Notes schema of the Sandpit database. We will practice
using SELECT, FROM and WHERE.
4. We have not yet worked with dates, but we will now. The Scratched
table contains a column ScratchDate. Execute the following query
and explain what it does:
SELECT ScratcherID, ScratchDate, ScratchTime, ScratcheeID
FROM Notes.Scratched -- in MySQL, use Notes_Scratched instead
WHERE ScratchDate = '20180905';
5. Replace the search condition in the above query with one that re-
turns all records where ScratchDate is on or before 6th Sep, 2018.
Note, the date format is 'YYYYMMDD', and you can use the <= compar-
ison operator.
6. Retrieve the ScratcheeID and ScratcherID for all people who have
participated in back scratching between the hours of 11AM and 12PM
(inclusive). You can use the comparison operators <= and >=, as well
as the logical operator AND. The time format is HH:MM:SS (24 hour for-
mat).
7. Retrieve the ScratcherID for all people who scratched a back either
at 11AM on Sep 6th, 2018, or at 10AM on Sep 7th, 2018. You can use
the logical operator OR.
2.
-- In T-SQL
SELECT PetName
FROM Notes.Pets P
WHERE P.FriendID = 3;
-- In MySQL
SELECT PetName
FROM Notes_Pets P -- 'Notes' is part of the table name
WHERE P.FriendID = 3; -- P is an alias
3.
SELECT FirstName, LastName
FROM Notes.Friends
WHERE FavColour = 'red';
4. The query returns all records with ScratchDate 5th Sep, 2018.
CHAPTER 2. BASIC SQL QUERIES 68
5.
SELECT ScratcherID, ScratchDate, ScratchTime, ScratcheeID
FROM Notes.Scratched
WHERE ScratchDate <= '20180906';
6.
SELECT ScratcherID, ScratcheeID
FROM Notes.Scratched
WHERE ScratchTime >= '11:00:00' AND ScratchTime <= '12:00:00';
7.
SELECT ScratcherID
FROM Notes.Scratched
WHERE
(ScratchDate = '2018-09-06' AND ScratchTime = '11:00:00')
OR
(ScratchDate = '2018-09-07' AND ScratchTime = '10:00:00');
Exercise 2.7.2
We’ll now practice using CASE WHEN, ORDER BY and column aliases (with quot-
ing). For this question, we will use the Colours table in the Ape schema, as
well as the Scratched table in the Notes schema. Both are in the Sandpit
database.
2.
SELECT *
FROM Ape.Colours
ORDER BY ColourName DESC;
3.
SELECT *
FROM Notes.Scratched
ORDER BY ScratchDate, ScratchTime;
4.
SELECT *
FROM Notes.Scratched
ORDER BY ScratchDate DESC, ScratchTime;
5.
SELECT ColourID,
Comments,
CASE WHEN ColourName = 'magenta' THEN 'purple'
WHEN ColourName = 'turqoise' THEN 'blue'
ELSE ColourName END AS ColourName
FROM Ape.Colours;
Exercise 2.7.3
We will now practice writing search conditions with the operators BETWEEN,
IN, LIKE and NOT. These questions all relate to the Houses and Suburbs ta-
bles, in the Notes schema of the Sandpit database.
1. Use the IN operator to return the names of all home owners in the
post codes 3128, 3142 and 3083.
Hint: the condition MyColumn IN (1,2,3) returns TRUE when an entry
of MyColumn equals either 1,2 or 3.
2. Use the LIKE operator to get the street address (ignoring suburb name)
CHAPTER 2. BASIC SQL QUERIES 70
of all houses that are on an avenue. This means, any house where
the street address ends in ‘Ave’. You can read about LIKE on page 50
of these notes.
3. Use NOT, with LIKE, to get the house_ID of every house that is not on
an avenue.
4. Use LIKE (and other operators) to get the street address (ignoring
suburb name) of all houses that have a post code starting with ‘31’
and that also cost strictly less than $300,000.
5. We haven’t seen the BETWEEN operator yet.
The following two search conditions are equivalent:
-- using >=, and <=
MyNumber >= 1 AND MyNumber <= 2
2. I have used an empty space in '% Ave' to make sure ‘Ave’ is preceded
by an empty space.
SELECT house_address
FROM Notes.Houses
WHERE house_address LIKE '% Ave';
3.
SELECT house_address
FROM Notes.Houses
WHERE house_address NOT LIKE '% Ave';
5.
CHAPTER 2. BASIC SQL QUERIES 71
SELECT *
FROM Notes.Suburbs
WHERE vaccination_rate BETWEEN 0.4 AND 0.7;
Exercise 2.7.4
We will now practice dealing with NULL values.
1. Run the following query and explain what is wrong with it.
SELECT *
FROM Notes.Friends -- In MySQL, replace with Notes_Friends
WHERE FavColour = NULL;
2. Use the IS NULL operator to find all houses where the post code is
unknown.
3. Find all houses where the post_code is not unknown.
4. Retrieve all house IDs and post codes from the Houses table, but
for any NULL post codes, change the entry to ‘UNKNOWN’. The post
code column in the result table should be called ‘post_code_modified’.
3.
SELECT *
FROM Notes.Houses
WHERE post_code IS NOT NULL;
4.
CHAPTER 2. BASIC SQL QUERIES 72
SELECT house_ID,
CASE WHEN post_code IS NULL THEN 'UNKNOWN'
ELSE post_code END AS post_code_modified
FROM Notes.Houses;
Exercise 2.7.5
Now it’s time to practice using JOIN, LEFT JOIN and RIGHT JOIN. This exer-
cise will use tables from the Ape and Notes schemas, in the Sandpit database.
1. Join the Friends and Pets tables, using the correct primary/foreign
key pair. From the result, how many pets does the friend named ‘Z’
have, and what are their names?
2. Join Table1 and Table2 using the correct primary and foreign key
pair. First, use your SQL editor to find out which columns are the
primary and foreign keys, in each table. In your result, retrieve only
columns B and C from Table1, and only column D from Table2.
3. To get the initials of an ape, we can use the SUBSTRING function (see
Section 2.2.5). The expression SUBSTRING(FirstName, 1, 1) extracts
the first letter of FirstName. Try it out first with the following query:
Now, for all apes that have a favourite colour, list their initials, next
to the name of their favourite colour.
4. Modify your solution to question 3, so that the result also includes
any apes that do not have a favourite colour.
5. Modify your solution to question 3, so that the result also includes
any colours that are not the favourite of any ape.
3.
SELECT SUBSTRING(FirstName, 1, 1) AS FirstInitial,
SUBSTRING(LastName, 1, 1) AS LastInitial
ColourName
FROM Ape.Friends F JOIN Ape.Colours C
ON F.FavColourID = C.ColourID;
Exercise 2.7.6
In this exercise, we will practice joins involving three tables. This exercise
will use tables from the Ape and Notes schemas, in the Sandpit database.
The syntax for a three way join should look like this:
SELECT *
FROM Table1 T1
JOIN Table2 T2 ON T1.attribute1 = T2.attribute2
JOIN Table3 T3 ON T2.attribute3 = T3.attribute4;
1. In the Ape schema, join the EatingFrom table to both the Banana-
Tree table and the Friends table, using the appropriate primary/for-
eign key pairs.
2. Modify your solution to question 1, so that it only produces results
for trees planted in July 2016.
3. Produce a single table that holds the first name of each friend who
CHAPTER 2. BASIC SQL QUERIES 74
scratched a back, the first name of the friend whose back was scratched,
and the date and time of the scratching. Finally, order the result by
the date of scratching (in ascending order). Make sure that you use
column aliases to give the resulting FirstName columns appropriate
names (e.g., ScratcherName and ScratcheeName), so that the two
can’t be confused.
2.
SELECT *
FROM Ape.BananaTree B
JOIN Ape.EatingFrom E ON B.TreeID = E.TreeID
JOIN Ape.Friends F ON F.FriendID = E.FriendID
WHERE B.YearPlanted = 2016 AND B.MonthPlanted = 7;
3.
SELECT Sr.FirstName AS ScratcherName,
Se.FirstName AS ScratcheeName,
S.ScratchDate,
S.ScratchTime
FROM Notes.Friends Sr
JOIN Notes.Scratched S ON Sr.FriendID = S.ScratcherID
JOIN Notes.Friends Se ON Se.FriendID = S.ScratcheeID
ORDER BY ScratchDate;
Chapter 3
In the previous chapter, we learned how to write queries that chop up tables
and join them together, taking advantage of SELECT, FROM and JOIN, while us-
ing search conditions and various operators in the WHERE clause. But, so far,
all of our skills just retrieve ‘raw’ data as it appears already in the database.
Our next step, is to start learning to derive new data from this raw data.
Aggregation and grouping are fundamental SQL concepts for deriving
data. Aggregating queries work by partitioning the rows of a table into
groups, and then applying aggregation functions to return a single value for
each group. There are many applications for this, but one worth mention-
ing is, when we want to avoid extracting a very large dataset, aggregating
queries allow us to summarise the dataset – extracting only the smaller
aggregated dataset.
As we will see next, the GROUP BY clause determines how the groups are
partitioned. Then, the HAVING clause decides which (if any) groups to dis-
card. Finally, an aggregating function can be used to get basic summary in-
formation (such as the average, or the standard deviation) within each of
the groups.
3.1 GROUP BY
The GROUP BY clause does pretty much what it says on the tin: it groups the
rows of a table (using the entries within one or more columns). The easiest
way to understand it is with a few examples. The following query groups
the Pets table by FriendID, and then selects the FriendID column.
75
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 76
SELECT FriendID
FROM Pets
GROUP BY FriendID;
Pay attention to the fact that GROUP BY is executed before SELECT, even though
SELECT was written first:
FROM Pets
Pets
PetID PetName PetDOB FriendID
1 Chikin 24/09/2016 2
2 Cauchy 01/03/2012 3
3 Gauss 01/03/2012
=⇒ 3
GROUP BY FriendID
SELECT FriendID
RESULT
FriendID
2
3
After grouping (with GROUP BY), in the second step within Figure 3.1, each
row of the table represents one group. The last two rows of the Pets ta-
ble were placed into a single group, together, because they both shared the
same FriendID. So, after grouping, the table had two rows instead of three.
We told SQL to group by FriendID, so SQL made sure it returned only one
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 77
value for each row in the FriendID column. However, we didn’t tell SQL
what to do with the values in the other columns (PetID, PetName and Pet-
DOB), so it placed those values into tuple entries, which we are represent-
ing here with red parentheses.
After grouping, we were able to execute SELECT FriendID with no issues,
in the third step within Figure 3.1. However, if in that step we had chosen to
SELECT any of the other columns, then SQL would have produced an error.
An example of this error is displayed Figure 3.2.
SELECT PetDOB
FROM Pets
GROUP BY FriendID;
Figure 3.2: Error message printed if PetDOB column is selected with tuples in it.
The above error was returned because SQL cannot return a RESULT table
that has any tuple entries. The reason SQL cannot return tuple entries like
this, is that it wants to return only one value for each entry. This property of
entries is referred to as atomicity. After applying GROUP BY, in Figure 3.2,
the entry in the second row of the PetDOB column contains two values,
Cauchy and Gauss. So, the error was caused by trying to SELECT PetDOB. In
general, when there are tuple entries, SQL doesn’t automatically check if
the tuples contain only one unique value in them.
In summary, when we use GROUP BY, we can’t select any columns that
will end up with tuple entries. That is, we can’t select any columns that we
haven’t also included in our GROUP BY clause. How restrictive! This causes a
lot of confusion for new SQL programmers, but it’s a very natural restriction
when you get used to it. It (hopefully) begins to become natural when you
understand that all entries in SQL query results must be atomic.
Thankfully, the tuple entries are designed to be dealt with in various
ways. One way to deal with those pesky tuples is to add their columns to
the GROUP BY clause. SQL will only group rows together if all of the columns
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 78
in the GROUP BY clause share the same values. Since Cauchy and Gauss were
born on the same day, see what happens if we GROUP BY PetDOB, FriendID:
Figure 3.3: An intermediate step in query execution, grouping by both PetDOB and FriendID
Keep in mind that we can’t execute GROUP BY on its own without a SELECT
statement; the above illustration displays only the intermediate step achieved
by GROUP BY. Rows are formed into groups based on whether or not all the
columns in the GROUP BY clause have matching entries. We can see this in
action in Figure 3.3. By chance, the two pets that have FriendID equal to 3
also share the same birthday. So, the groups were unchanged compared to
Figure 3.2. What did change, is that now we don’t have tuples in the Pet-
DOB column. Since the tuples are gone from PetDOB, we can now SELECT
PetDOB in our query as well, without causing an error.
Example 3.1.1. For this GROUP BY example, we’ll use a table called Letters:
Letters
A B Num
a b 1
a c 2
a b 3
a c 4
Letters
A B Num
a b 1 A B Num
=⇒
a c 2 (a, a) b (1, 3)
a b 3 (a, a) c (2, 4)
a c 4
The ‘a’ entries in column A weren’t grouped together, because we didn’t ask
SQL to check them, so they were just placed into tuple entries, according
to the groups they belong to, as determined by column B. If we instead
choose to GROUP BY A, we get:
Letters
A B Num
a b 1
=⇒ A B Num
a c 2
a (b, c, b, c) (1, 2, 3, 4)
a b 3
a c 4
Letters
A B Num
a b 1 A B Num
=⇒
a c 2 a b (1, 3)
a b 3 a c (2, 4)
a c 4
Notice that, unlike last time we grouped by A, the four rows containing
‘a’ in column A were not all merged into one row. This is because we also
grouped by B at the same time, and rows are only merged if all columns in
the GROUP BY clause match. Now we can SELECT either A, or B, or both, if we
like, because both are in the GROUP BY clause, so neither column is left with
any tuples in it.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 80
RandomPeople
Name Gender Age
Beyoncé F 37
Laura Marling F 28
Darren Hayes M 46
Bret McKenzie M 42
Jack Monroe NB 30
=⇒
GROUP BY Gender
AVG(Age) AS AverageAge
RESULT
Gender AverageAge
F 32.5
The above query returns the average age of all females in the Random-
People table. Pay close attention to the order in which the clauses in the
above query are executed. Remember, the actual order of execution does
not match the order in which things are written. In particular, notice that
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 82
WHERE is executed before GROUP BY. This order will be important to recall
when we start using the HAVING clause.
We call AVG an aggregation function. There are a host of other aggre-
gation functions available, which vary slightly between dialects of SQL. In
Section 4.1, we will learn to read the best source of information on these
functions: the (T-SQL or MySQL) documentation. Here is a table of simple
and useful aggregation functions:
SELECT COUNT(Gender)
FROM RandomPeople;
The above query first groups all of the entries from Gender, forming a sin-
gle tuple entry (F,F,M,M,NB). Then, it counts the number of values in that
tuple. The query returns a single number, 5. Clearly, we would also get
the number 5 if we instead chose to COUNT(Name), or COUNT(Age). What’s the
point of counting any specific column then? There is a point.
The COUNT function skips over any NULL values in whatever column you
feed it. Remember the FavColour column of the Friends table (Figure 1.4)?
If we SELECT COUNT(FavColour) FROM Friends, then the FavColour tuple will
look like (red, blue, NULL ), and the query will return 2, rather than 3, ig-
noring the NULL value. If we don’t want to ignore NULL values at all when
counting, then the * expression comes to the rescue. For example, writ-
ing the query SELECT COUNT(*) FROM RandomPeople, is like asking explicitly
to “count the number of rows of RandomPeople”.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 83
3.2.1 COUNT(DISTINCT)
We have just seen that COUNT(ColumnName) returns the number of non-NULL
entries in the column called ColumnName. We can also modify COUNT with
the DISTINCT keyword. For example:
The above query will return a table with a single row and column, contain-
ing the number 3: that is, the number of distinct genders in RandomPeo-
ple (Figure 3.8). We can, of course, group the records before applying the
COUNT(DISTINCT) function, as we do here (with Letters, Figure 3.4):
SELECT B, COUNT(DISTINCT A) AS NumA, COUNT(A) AS NumRows
FROM Letters
GROUP BY B;
RESULT
B NumA NumRows
b 1 2
c 1 2
Figure 3.11: Counting the distinct entries in A, grouping by B (from Letters table).
RESULT
Gender AverageAge
F 32
Compare the above to the output of the same query in Figure 3.10, where
it was assumed that Age was a decimal number. You’ll notice the result
should be 32.5, not 32. There is a way around this problem, via ‘casting’ the
Age column to a different data type, like DECIMAL(5,2), using CAST(Age AS
DECIMAL(5,2)), before aggregating:
Note, the above query will produce average ages with more than 2 decimal
places. This is because aggregation functions will, when applied to DECIMAL
data types, will ‘play it safe’ by returning numbers with more decimal places
than they started with.
The above query achieves what I set out to do, but it doesn’t display the
NumPeople counts next to the labels for each group. I would like to see ‘B
people’ and ‘non-B people’ displayed next to the counts. Now, this is one
place where SQL can be annoyingly verbose: we have to repeat the whole
CASE WHEN expression in the select list.
RESULT
NameGroup NumPeople
B people 2
non-B people 3
In MySQL only, the above query can be simplified using the alias from the
SELECT list (which we called NameGroup). This does not work in T-SQL:
Another trick to avoid repeating CASE WHEN, that works in both MySQL and
T-SQL, is to use the WITH clause. The WITH clause is a little above our skill
level at the moment, but we will learn about it in Section 4.4.3.
3.3 HAVING
In the previous section (Figure 3.10), we saw an example in which the WHERE
clause was used to discard all the rows that didn’t satisfy Gender = 'F'.
When using GROUP BY, it’s also possible to discard entire groups of rows,
based on the output of aggregation functions. In other words, it’s possi-
ble to use search conditions like AVG(Age) > 40. The natural thing to try is
WHERE AVG(Age) > 40, putting the search condition in the WHERE clause. How-
ever, this produces an error!
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 86
Figure 3.14: Error message printed if an aggregation function is used in the WHERE clause.
Notice from Table 3.2, that WHERE executes before GROUP BY, and having ex-
ecutes after GROUP BY. So, the WHERE clause acts on individual rows, while
the HAVING clause acts on groups of rows. Here’s an example of how we dis-
card entire groups of rows based on a search condition with an aggregation
function:
=⇒
HAVING AVG(Age) > 40
AVG(Age)AS AverageAge
RESULT
Gender AverageAge
M 44
In the above execution diagram (Figure 3.15), we see that the search con-
dition AVG(Age)> 40, in the HAVING clause, was executed after GROUP BY, so
that AVG was able to act on the grouped data.
In the query for Figure 3.15, notice that we used AVG(Age) twice: once
in the HAVING clause and once in the SELECT clause. Why is that? In the
second step of Figure 3.15, we see that the clause HAVING AVG(Age) > 40 did
not actually insert the average ages into the RESULT table. The ages weren’t
inserted into the RESULT table until SELECT Gender, AVG(Age) AS AverageAge
was executed. This is useful, for example, when we want to discard groups
using one aggregation function, and select columns using a different one,
as in this next query. This next query uses STDEV(Age) to return the sample
standard deviation of ages, for each gender whose average age is greater
than 40:
The HAVING clause works with any aggregation function (see examples in
Table 3.1). For the above query, we used the search condition AVG(Age) >
40, but a variety of search conditions are possible, ranging from very simple
to highly complicated. We covered search conditions back in Section 2.4.4.
In the next section, we’ll see just how complicated search conditions can
get.
(i) Find all the RandomPeople with Gender = 'F' OR Gender = 'NB'.
(ii) Find all the genders in RandomPeople having AVG(Age) < 40.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 89
Naturally, there should be a way to combine (i) and (ii) into a single
query that says, “find all the RandomPeople whose Gender has an average
age less than 40.” The easiest way to do this, is with a nested query.
Many people will, at some point, try to do the above in an apparently
even easier way, without a nested query, by writing something like this:
SELECT Name
FROM RandomPeople
GROUP BY Gender
HAVING AVG(Age) < 40;
SELECT Name
In the final step of Figure 3.17, we try to SELECT Name, but Name has some
pesky red tuple entries in it, that were generated by GROUP BY. In other
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 90
SELECT Name
FROM RandomPeople
GROUP BY Gender, Name
HAVING AVG(Age) < 40;
In the above, I grouped by both Gender and Name, so the name column will
have no red tuples, so I won’t get an error. However, the query doesn’t
achieve my aim, because the groups will now be separated according to
Gender and Name. So, when I then run HAVING AVG(Age) < 40, the aver-
ages won’t be calculated within whole genders, but instead within groups
of people who share both the same gender and the same name. In the Ran-
domPeople table, this amounts to each person belonging to their own one
person group, so the average ages would just be each person’s own age.
Well, how do I get the names I want? Why is it so damn hard, when I can
see the names right there??!
The solution is to use a nested query. To wrap our heads around how
this will work, let’s for a moment use the word RESULT to denote the nested
query. Consider the following query as a step towards “find all the Ran-
domPeople whose Gender has an average age less than 40”:
SELECT Name
FROM RandomPeople
WHERE Gender IN (RESULT);
The above query uses a search condition with a command we haven’t seen
in action yet: the IN operator. We saw the IN operator in Table 2.6, but we
didn’t see what it does. That’s because I wanted to wait until we could use
IN with a nested query. For the above query, IN checks if Gender matches
anything contained in RESULT. So, to find the people we are after, we need
RESULT to contain (F,NB).
Here, RESULT can be a whole query of its own, in which case we call
the query ‘nested’. We can replace the word RESULT with the query that
produces the RESULT we want. So, plugging in a query that returns (F,NB):
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 91
SELECT Name
FROM RandomPeople
WHERE Gender IN (SELECT Gender
FROM RandomPeople
GROUP BY Gender
HAVING AVG(Age) < 40);
And there we have it, our first use of a nested query. It’s really just two
queries, where the ’inner’ (i.e., nested) query executes first, and then the
result of that execution gets passed to the IN operator as part of the search
condition. This means the whole nested query is part of the search con-
dition in the WHERE clause. Nested queries can be used in many places (in-
cluding in FROM, HAVING, SELECT and even GROUP BY). We will see some more
examples as we go, as well as in the exercises.
SELECT Name
FROM RandomPeople RP
WHERE age > (SELECT AVG(age)
FROM RandomPeople
WHERE gender = RP.gender);
The alias RP is defined in the outer query, but it is used in the WHERE clause
of the inner query. With a basic nested query, we think of the inner query
being executed first, followed by the outer query. That can’t happen here,
because the inner query depends on part of the outer query. This depen-
dence on the outer query is the reason for the name ‘correlated’. This is
a correlated nested query. The execution of this query is very different to
a regular nested query. Here, we have to think of the nested query being
executed multiple times: once for each row of the RandomPeople table. The
following execution diagram will help us visualise this.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 92
RP.Gender = 'F'
Row 1
Name Gender Age
RP.Name RP.Gender RP.Age
Beyoncé F 37
Beyoncé F 37 32.5
Laura Marling F 28
RP.Gender = 'F'
Row 2
Name Gender Age
RP.Name RP.Gender RP.Age
Beyoncé F 37
Laura Marling F 28 32.5
Laura Marling F 28
RP.Gender = 'M'
Row 3
Name Gender Age
RP.Name RP.Gender RP.Age
Darren Hayes M 46
Darren Hayes M 46 44
Bret McKenzie M 42
RP.Gender = 'M'
Row 4
Name Gender Age
RP.Name RP.Gender RP.Age
Darren Hayes M 46
Bret McKenzie M 42 44
Bret McKenzie M 42
SELECT name
Name
Beyoncé
Darren Hayes
Figure 3.18: Execution diagram for a correlated nested query. The table RandomPeople RP
was given its alias, RP, in the outer query. So, the nested query executes once for each row
of RandomPeople RP
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 93
We can describe the procedure in Figure 3.18 with the following steps. In
the steps, we use a different colour for RandomPeople RP than we do for
RandomPeople, to emphasize that these are treated as distinct copies of
the same table.
(1) First, split the RandomPeople RP table into rows, giving 5 rows. For
each row, do the following:
(i) Extract the RP.Gender entry from the current row, and use this
value of RP.Gender in the nested query. For example, for row 1,
the gender is 'F', so use WHERE Gender = 'F' in the nested query,
to find the two rows of RandomPeople that have Gender 'F'.
(ii) For the rows found in step (i), compute AVG(Age). For example,
using WHERE Gender = 'F', the average age will be 32.5.
(2) After repeating steps (i) and (ii) for all 5 rows of RandomPeople RP, we
will have 5 average ages, giving the tuple (32.5, 32.5, 44, 44, 30). Call this
tuple RESULTS, and add it as a temporary column to RandomPeople.
(3) Execute SELECT name FROM RandomPeople WHERE age > RESULTS, replac-
ing RandomPeople with the temporary table found in step (2).
Hopefully, by studying the above steps along with Figure 3.18, you will
develop a good sense of how a correlated nested query works. In particu-
lar, since the nested query was executed 5 times (called ‘looping’), there are
more steps involved than a regular (uncorrelated) nested query, which only
executes once. So, clearly, a correlated nested query may be less efficient
than a regular nested query, in terms of execution time. However, remem-
ber, it is the job of the Database Management System (DBMS) to try to op-
timise your query for you, ‘under the hood’. So, in many cases, a correlated
nested query may end up being very efficient, due to optimisations carried
out by the DBMS, that we don’t really need to learn about. In practice, you
should worry about writing queries that you understand first. Then, if they
are slow, you can try to make them faster.
If you have grasped the execution pattern, then you’ll soon see, during
the exercises, that we can achieve some complicated things fairly easily, by
taking advantage of the ‘looping’ behaviour of a correlated nested query.
3.5 Windowing
Window functions are the last big piece of the puzzle in our chapter on
grouping and aggregation. They aren’t usually introduced in a beginner
SQL course, but I think they should be, because they give a complete picture
on how grouping can be used. The kind of ‘grouping’, that we do in this
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 94
Since we used GROUP BY Gender, the above will return one row for each gen-
der, giving the minimum age within each gender:
RESULT
Gender MinimumAge
F 28
M 42
NB 30
Now, if we just drop the GROUP BY clause from the query, we get:
The error in Figure 3.20 is caused because the aggregation function MIN, in
the absence of GROUP BY, still returns one row per group. The difference is
that, in the absence of GROUP BY, the whole table is treated as one group. So,
MIN(Age) just returns a single value (the minimum age from the whole ta-
ble, 28). The error happens because we included Gender in SELECT Gender,
MIN(Age): we violated the atomicity constraint, by trying to select multiple
genders, (F, M, NB), for just a single value of age, 28.
Now, let’s modify the query, so that it forms partitions, rather than
groups. We do this with the OVER clause:
RESULT
Gender MinimumAge
F 28
F 28
M 28
M 28
NB 28
Figure 3.21: Every row gets the minimum age from the whole table.
The query has given us the minimum age from the whole table, 28, returned
once for each row of the whole table. So, the returned table in Figure 3.21
has as many rows as RandomPeople, and it has 28 in every row. In sum-
mary, the ‘window’ for the aggregation function was the whole table, but
the result was returned once for every row.
Notice that, unlike GROUP BY, the OVER clause went in the SELECT clause
and attached directly to the MIN function. This is because, unlike GROUP BY,
the OVER clause always needs to work with a function. In the next section,
we discuss all the types of functions that work with OVER. But, first, notice
that OVER has some round brackets next to it. Those round brackets are
begging for us to input some information about what columns we want to
use for partitioning, so we don’t just partition by the whole table.
Here’s an example where we partition by Gender, within the OVER clause.
To do this, we must use the PARTITION BY command:
RESULT
Gender MinimumAge
F 28
F 28
M 42
M 42
NB 30
Figure 3.22: Every row gets the minimum age from that gender.
Table 3.3: Some of the analytic functions in T-SQL and MySQL. These work with the OVER
clause, along with any function in Table 3.1.
Function Returns
FIRST_VALUE Entry in first row of partition
LAST_VALUE Entry in last row of partition
LAG Entry one row behind current row
LEAD Entry one row ahead of current row
ROW_NUMBER Number of current row in partition
RANK Rank of current row in partition
DENSE_RANK Rank, without gaps
PERCENT_RANK Percentage of rank value
CUME_DIST Cumulative distribution value
NTILE Bucket numbers (like histogram)
a partition.
You have your whole life to experiment with window functions (how ex-
citing), and we’re going to get practice during the exercises. For now, let’s
take a look at ordering rows within the OVER clause.
SausageSizzleSummary
SaleDate Product Sales
1999-12-31 pork 3
1999-12-31 veggie 3
2000-01-01 pork 2
2000-01-01 veggie 7
2000-01-02 pork 6
2000-01-02 veggie 6
2000-01-03 pork 6
2000-01-03 veggie 2
2000-01-04 pork 1
2000-01-05 veggie 5
• Which day(s) had the highest sales overall, for a single product type,
and which product was it?
You can easily answer these questions yourself, just by examining the
SausageSizzleSummary table, but we would like an automated way to com-
pute all of the ranks. To start with, we’ll examine the differences between
the three window functions ROW_NUMBER, RANK and DENSE_RANK.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 98
SELECT Sales,
ROW_NUMBER() OVER(ORDER BY Sales) AS row_num_sales,
RANK() OVER(ORDER BY Sales) AS rank_sales,
DENSE_RANK() OVER(ORDER BY Sales) AS dense_rank_sales
FROM SausageSizzleSummary;
RESULT
sales row_number rank dense_rank
1 1 1 1
2 2 2 2
2 3 2 2
3 4 4 3
3 5 4 3
5 6 6 4
6 7 7 5
6 8 7 5
6 9 7 5
7 10 10 6
In the above query, ORDER BY Sales tells the ranking functions to use Sales
from lowest to highest (ascending order). Examining the result (Figure 3.24),
we can see that the ROW_NUMBER function gives a distinct number for each
row, while RANK and DENSE_RANK only change when Sales changes. We can
also see that RANK skips the next available ranking value after a tie, while
DENSE_RANK does not skip any values.
Our goal is to rank the daily sales, both overall and by product. So, for
our purpose, we will use DENSE_RANK twice: once with no partition, and once
with a partition by Product. We also want to override the default (ascend-
ing) ordering: the keyword DESC will tell DENSE_RANK to order by descending
number of sales.
RESULT
product saleDate overall_sales_rank product_sales_rank sales
pork 2000-01-02 2 1 6
pork 2000-01-03 2 1 6
pork 1999-12-31 4 2 3
pork 2000-01-01 5 3 2
pork 2000-01-04 6 4 1
veggie 2000-01-01 1 1 7
veggie 2000-01-02 2 2 6
veggie 2000-01-05 3 3 5
veggie 1999-12-31 4 4 3
veggie 2000-01-03 5 5 2
• The day with the overall highest selling product was New Year’s Day,
and it was veggie sausages (with 7 sales).
• The highest pork sales were on Jan 2nd and 3rd (tied at 6 sales).
Aside from answering questions like those above, ranks have many ap-
plications. For example, ranks are a common tool in non-parametric statis-
tics, and are particularly useful for computing quantiles, including the me-
dian.
Exercise 3.6.1
In this exercise, we’ll look at grouping and aggregating (that is, using ag-
gregation functions like AVG or COUNT). We will also look at using CAST to
change data types.
1. Order the following by which one executes first (not which is written
first in a query): GROUP BY, FROM, HAVING, WHERE, ORDER BY, SELECT.
2. This query makes use of the Letters table, below.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 100
Letters
A B Num
a b 1
a c 2
a b 3
a c 4
b)
SELECT B, MAX(Num)
FROM Letters
GROUP BY B;
c)
SELECT A, B, MAX(Num) AS MaxNum, MIN(Num) AS MinNum
FROM Letters
GROUP BY A,B;
Friends
FriendID FirstName LastName FavColour
1 X A red
2 Y B blue
3 Z C NULL
RandomPeople
Name Gender Age
Beyoncé F 37
Laura Marling F 28
Darren Hayes M 46
Bret McKenzie M 42
Jack Monroe NB 30
a)
SELECT FirstName
FROM Friends
GROUP BY FavColour;
b)
SELECT FavColour
FROM Friends
GROUP BY FavColour;
c)
SELECT AVG(Age)
FROM RandomPeople
WHERE AVG(Age) < 55
GROUP BY Gender;
d)
SELECT AVG(Age)
FROM RandomPeople
GROUP BY Gender
HAVING Age > 20;
e)
SELECT Gender, MAX(Age) AS AgeMax, MIN(Age) AS AgeMin
FROM RandomPeople
GROUP BY Gender
HAVING COUNT(*) < 3;
f)
SELECT COUNT(*)
FROM RandomPeople;
SausageSizzleSummary
SaleDate Product Sales
1999-12-31 pork 3
1999-12-31 veggie 3
2000-01-01 pork 2
2000-01-01 veggie 7
2000-01-02 pork 6
2000-01-02 veggie 6
2000-01-03 pork 6
2000-01-03 veggie 2
2000-01-04 pork 1
2000-01-05 veggie 5
You are told that the Sales column uses an INT data type. The fol-
lowing query aims to calculate the exact average number of sales for
each SaleDate. Explain what is wrong with the query, and then write
a new query that fixes the problem.
RESULT
a) MAX(Num)
4
RESULT
B MAX(Num)
b)
b 3
c 4
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 103
RESULT
A B MaxNum MinNum
c)
a b 3 1
a c 4 2
d) Error: the search condition Age > 20 does not contain an aggre-
gation function. Search conditions without aggregation func-
tions must be used in WHERE, not in HAVING.
4. The query does not produce an error, but it does not produce the
exact average sales. We are told that Sales is stored as an INT, imply-
ing the result will be rounded down to the nearest whole number. To
fix this, we can cast Sales to a DECIMAL:
SELECT SaleDate, AVG(CAST(Sales AS DECIMAL)) AS AvgSales
FROM SausageSizzleSummary
GROUP BY SaleDate;
Exercise 3.6.2
In this exercise, we will practice using GROUP BY with the CASE WHEN expres-
sion. You are given the following EduStudy table.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 104
EduStudy
Id Income Education
EI13 low 5
EI122 low 1
EI281 low-mid 4
EI3332 middle 3
EI4751 high-mid 3
EI12 high 2
In order to study the relationship between income and education, 5000 sur-
vey participants were categorised according to their education level (with
1 low to 5 high), and their income bracket. The above EduStudy table holds
a subset of the results.
1. The research team would like to categorise all participants according
to the following rules:
• if Income is either low or low-mid, and Education is strictly
greater than 3, the category should be ‘Group A’;
• if Income is either high or high-mid, and Education is strictly
less than 3, the category should be ‘Group B’; and
• otherwise, the category should be NULL .
By hand, add a column to EduStudy, called Category, and fill it in
according to the above rules.
2. Fill in the blanks to create the Category column from question 1.
SELECT *,
CASE WHEN ...
THEN ...
WHEN ...
THEN ...
END AS Category
FROM EduStudy;
RESULT
Id Income Education Category
EI13 low 5 Group A
EI122 low 1 NULL
EI281 low-mid 4 Group A
EI3332 middle 3 NULL
EI4751 high-mid 3 NULL
EI12 high 2 Group B
3. The query returns an error (in T-SQL, at least), because the alias
called Category is created in the SELECT clause, so it cannot be used
in the GROUP BY clause (which is executed before SELECT).
4. The query will work in both MySQL and T-SQL if we copy the whole
CASE WHEN expression into the GROUP BY clause, in place of Category.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 106
There are other ways to achieve the above, without repetition (e.g.,
using the WITH clause, which we will cover in Section 4.4.3). For now,
the above is reliable.
Exercise 3.6.3
In this exercise, we will experiment with nested queries. We will use the
EduStudy table again.
EduStudy
Id Income Education
EI13 low 5
EI122 low 1
EI281 low-mid 4
EI3332 middle 3
EI4751 high-mid 3
EI12 high 2
SELECT Income
FROM EduStudy
GROUP BY Income
HAVING AVG(Education) >= 3;
b) The table resulting from part a contains the total number of par-
ticipants in each income group. We can join this table to EduS-
tudy to get our desired result, but which columns should we use
as the primary/foreign key pair?
c) Given your answers above, fill in the appropriate subquery and
the appropriate join condition, in the following query, in order
to achieve our goal.
SELECT T2.*, Num
FROM (subquery) T1 JOIN EduStudy T2 ON ...;
SELECT Id
FROM EduStudy
WHERE Income IN (SELECT Income
FROM EduStudy
GROUP BY Income
HAVING AVG(Education) >= 3);
RESULT
Income Num
low 2
low-mid 1
middle 1
high-mid 1
high 1
Exercise 3.7.1
In this exercise we will practice basic usage of GROUP BY.
SELECT B
FROM Notes.Letters
GROUP BY B;
2. Now we will execute a query that fails and returns an error that is
very common when using GROUP BY. Execute the below, and explain
why the error occurred. You may need to revise Section 3.1, and in
particular page 79.
SELECT A, B
FROM Notes.Letters
GROUP BY B;
3. In the query below, we’ve edited question 2, so that it uses the aggre-
gation function MAX. This prevents the error from occurring. Execute
it and explain what the query does (and why it prevents the error).
SELECT MAX(A), B
FROM Notes.Letters
GROUP BY B;
SELECT A, B
FROM Notes.Letters
GROUP BY A, B;
5.
SELECT Gender
FROM Notes.RandomPeople
GROUP BY Gender;
6. It achieves the same thing as question 5, but without using GROUP BY.
Exercise 3.7.2
In this exercise, we will practice basic aggregation functions.
5. Fix the query from question 4 using GROUP BY TreeID, but then ex-
plain why the result is not very useful.
6. How long do banana trees take to grow? Retrieve the average height
of banana trees for each year planted.
7. Use the aggregation function COUNT to display the number of times
that each FavColourID appears in Ape.Friends (including NULL ). The
COUNT function counts the number of rows (keeping in mind that any-
thing in the SELECT clause will be executed after GROUP BY). The COUNT
function will not count NULL values in the column that it is asked to
count. For example, COUNT(FavColourID) will not count rows where
FavColourID is NULL .
8. The apes want to know if planting in certain months will lead to
taller banana trees. For each month planted in Ape.BananaTree, get
the maximum height of the trees. But, restrict your results to trees
planted before 2017, so they’ve had time to grow.
9. Get the minimum width for each month and year in Ape.BananaTree.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 111
6.
SELECT YearPlanted, AVG(Height) AS AvgTreeHeight
FROM Ape.BananaTree
GROUP BY YearPlanted;
7.
-- alternatively, you could use COUNT(*)
SELECT FavColourID, COUNT(FriendID) AS ColourCount
FROM Ape.Friends
GROUP BY FavColourID;
8.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 112
Exercise 3.7.3
We’ll now practice basic usage of the HAVING clause.
3.
SELECT MIN(width) AS MinWidth,
MAX(width) AS MaxWidth
FROM Ape.BananaTree
HAVING AVG(height) >= 5;
4.
SELECT MIN(width) AS MinWidth,
MAX(width) AS MaxWidth
FROM Ape.BananaTree
WHERE TreeID IN (1,4,6,16,18)
HAVING AVG(height) >= 5;
Exercise 3.7.4
In this exercise, we will practice using nested queries (i.e., subqueries).
Using what you know about order of execution regarding WHERE and
LEFT JOIN, can you modify the above query to be simpler (and avoid
using a nested query)? You might want to start by copying out the
subquery and executing it on its own.
2. Below is another way to do the same thing as the query in question
1. . . but wait! No it isn’t! The query below returns an empty table. It
looks like it should work. What a headache. Why is this happening?
3. The EduStudy table is included in the Notes schema. The query be-
low involves repetition of the same CASE WHEN expression twice (once
in SELECT, and once in GROUP BY). Rewrite the query, using a subquery
(appearing in a FROM clause) to remove this repetition.
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 114
Second hint: you can use the following as the nested query:
SELECT gender, AVG(age) AS AverageAge
FROM Notes.RandomPeople
GROUP BY gender;
4.
SELECT RP.PersonName
FROM Notes.RandomPeople RP JOIN (
SELECT gender, AVG(age) AS AverageAge
FROM Notes.RandomPeople
GROUP BY gender) R
ON RP.gender = R.gender
WHERE RP.age > R.AverageAge;
Exercise 3.7.5
In this exercise we will explore windowed aggregation functions, using the
OVER clause. We will also learn an unnecessarily difficult way to calculate
the variance of a column.
2.
SELECT *, AVG(CAST(Education AS DECIMAL)) OVER() AS avg_edu
FROM Notes.EduStudy;
3.
SELECT Education - AVG(CAST(Education AS DECIMAL))
OVER() AS avg_edu_diff
FROM Notes.EduStudy;
4.
SELECT POWER(T.avg_edu_diff, 2) AS square_deviation
FROM (
SELECT Education - AVG(CAST(Education AS DECIMAL))
OVER() AS avg_edu_diff
FROM Notes.EduStudy
) T;
CHAPTER 3. AGGREGATING, GROUPING AND WINDOWING 117
5.
SELECT AVG(POWER(T.avg_edu_diff, 2))
FROM (
SELECT Education - AVG(CAST(Education AS DECIMAL))
OVER() AS avg_edu_diff
FROM Notes.EduStudy
) T;
Exercise 3.7.6
In this exercise we’ll practice using windowed aggregation functions with
the OVER and PARTITION BY clauses.
2.
SELECT B.BananaID,
B.TasteRank,
B.TreeID,
T.YearPlanted,
MIN(B.TasteRank) OVER(PARTITION BY B.TreeID) AS MinTaste
FROM Ape.Banana B
JOIN Ape.BananaTree T ON B.TreeID = T.TreeID;
SELECT *
FROM (
SELECT B.BananaID,
B.TasteRank,
B.TreeID,
T.YearPlanted,
MIN(B.TasteRank)
OVER(PARTITION BY B.TreeID) AS MinTaste
FROM Ape.Banana B
JOIN Ape.BananaTree T ON B.TreeID = T.TreeID
) S
WHERE S.MinTaste != 1;
Chapter 4
Independent development
In this chapter, our goal is to develop the three most important skills for
working independently on SQL queries: reading documentation, creating
data, and developing test cases. Reading SQL documentation includes un-
derstanding the esoteric ‘Syntax Conventions’. Creating test data involves
creating tables, altering tables, inserting data, and updating data. Devel-
oping test cases involves taking a structured approach to thinking about
how a SQL query might produce incorrect results.
119
CHAPTER 4. INDEPENDENT DEVELOPMENT 120
Those reference sheets allow you to start making sense of the rest of the
documentation. The essential parts are only a page long, and many con-
ventions are shared by both dialects (T-SQL and MySQL). We will explain
the conventions in detail in this section.
Once you have a reference sheet of conventions, start by heading over
for a quick peek at the Data Manipulation Language (DML) pages, via
the pair of links below. DML is the part of SQL that is devoted to inserting,
updating, deleting and querying data. DML exists in contrast to the Data
Definition Language (DDL) – the part of SQL that is dedicated to creating,
altering and dropping whole tables. Here are links to the DML pages:
You can poke around there, for a while, being overwhelmed by it all.
And, once you’ve done that, you can head over to the most important part,
for most people. The SELECT documentation:
If you scroll to the large syntax box (that has something resembling
code in it), the first thing you’ll notice is, the description of SELECT looks
like a completely horrible mess. The second thing you may notice is, if you
look very carefully, almost every clause that we’ve covered so far is actually
in that mess. You can find SELECT, FROM, WHERE, JOIN, GROUP BY, HAVING,
and ORDER BY, all dispersed within a terrifying medley of symbols, that only
reflect the cold indifference of a world of computer scientists and engineers,
who exist only to communicate with machines, and who, no doubt, com-
pletely despise us for wanting to fly too close to the sun.
In reality, I have much respect for the people who thought of this suc-
cinct way to describe large parts of the language. With a little work, hope-
fully, you’ll see why the Syntax Conventions are useful. Indeed, these peo-
ple do not hate us entirely. In the case of T-SQL, you’ll see our friends at
Microsoft have put some thought into simplifying it, with a kind of simpli-
fied summary box, at the top of the SELECT docs page. Here’s my own ver-
sion of that summary box, with a few slight modifications, focusing mostly
on things that we’ve seen so far:
[ INTO <new_table> ]
[ WHERE <search_condition> ]
[ HAVING <search_condition> ]
Figure 4.1: A simplified version of the SELECT documentation, showing a mixture of T-SQL
and MySQL syntax conventions.
Syntax Conventions indicate the kinds of SQL statements that are valid.
So, we can reference Figure 4.1 to derive a large number of acceptable SQL
statements. In particular, the Syntax Conventions are a tool to indicate:
which clauses are optional versus necessary, which expressions can be re-
peated multiple times, and the order to write the clauses in. What Syntax
Conventions don’t do, is they don’t explain what each valid SQL statement
actually does. They give you a clue for what you can write, not why you
should write it.
CHAPTER 4. INDEPENDENT DEVELOPMENT 122
I’ve intentionally made Figure 4.1 a little more confusing than it needs
to be, because I’ve mixed MySQL and T-SQL Syntax Conventions, for a chal-
lenge! This approach will give us material to work with below, and will ul-
timately help us get comfortable with the conventions faster.
In Figure 4.1, there are a few keywords we haven’t seen yet (ALL, DISTINCT,
and INTO). There are also a few lower-case words, vertical lines, square
brackets, and curly braces. The Sytax Conventions table, below, explains
those symbols:
Table 4.1: The most important Syntax Conventions in MySQL and T-SQL.
Example 4.1.1. For the first example, consider the first line from Figure 4.1:
The parts in square brackets are optional. So, the above tells us we must
start with the word SELECT, and this must be followed with something called
a select_expr. The select_expr is a placeholder. The MySQL SELECT docs
define it with a sentence:
Figure 4.2: The definition of select_expr from the MySQL SELECT docs.
The part [, select_expr] ... means we can optionally add as many column
names as we want, separated by commas. So, we can validly write:
CHAPTER 4. INDEPENDENT DEVELOPMENT 123
SELECT FirstName
We are also given the option [ALL | DISTINCT]. This means we can option-
ally write either ALL or DISTINCT, but not both. Despite not having seen
these two keywords before, we now know that we can validly write:
Searching far down the MySQL SELECT docs, we get the explanation of
these two options:
Figure 4.3: The explanation for [ALL DISTINCT] from the MySQL SELECT docs.
We will practice applying DISTINCT during the exercises. Keep in mind, the
above explanation in MySQL also applies to T-SQL, for this and many other
clauses, being part of the SQL standard.
Example 4.1.2. As a second example, consider the excerpt from Figure 4.1:
We already dissected the first line, in Example 4.1.1. The second line tells us
we can optionally include FROM <table_source>. Then, the T-SQL conven-
tion [,...n] tells us we can repeat <table_source>, separated by commas.
The MySQL would have been [, <table_source>] ...
Finding the exact meaning of the <table_source> placeholder requires
a bit of digging. In the T-SQL SELECT docs there are links to the T-SQL
CHAPTER 4. INDEPENDENT DEVELOPMENT 124
Figure 4.4: The definition of <table_source>, from the T-SQL FROM clause page.
<table_source> ::=
{
complicated stuff
}
The symbol ::= is used by T-SQL to indicate the beginning of a definition for
the valid set of statements that can replace a placeholder. So, simplifying
the complicated stuff, we might see:
<table_source> ::=
{
table_or_view_name [ [ AS ] table_alias ]
| user_defined_function [ [ AS ] table_alias ]
}
ceded by the keyword AS. Indeed, these are the alias and optional AS key-
word that we learned about in Section 2.3.
Example 4.1.3. For practice, here’s a toy example, that applies a few basic
Syntax Conventions to an English language sentence, rather than SQL:
<greeting> ::= Hello. [Do you {love | hate} reading the docs?]
• Hello.
The curly braces, together with the vertical bar, have insisted that we
choose between ‘love’ and ‘hate.’
Figure 4.5: A simplified excerpt from the T-SQL SELECT Clause page.
Some things jumped out at me. For example, it says I can write:
So, knowing in advance that SELECT can be paired with the FROM and JOIN
clauses, I can write:
Knowing also what SELECT * does, I guessed that the query above would
give me all columns of the Friends table, along with just the PetName col-
umn from the Pets table (after joining the two tables). At that point, I just
executed the above query on my database, and found that, indeed, that’s
what it does!
That’s not all I found. Looking at Figure 4.5 again, I saw I could write:
Executing the above query on my T-SQL database confirms that this ex-
tracts the first two rows of the Friends table. Executing it in MySQL returns
an error, confirming that this syntax is specific to T-SQL, and so it must not
be in the SQL standard.
That’s quite enough of reading documentation for now! A large part of de-
CHAPTER 4. INDEPENDENT DEVELOPMENT 127
USE Sandpit;
However, it’s often useful to create your own fresh database from scratch,
so that you can experiment without changing the structure of an existing
database. The following will create a database named MyExperiments:
USE MyExperiments;
Once you’re done experimenting, or just want to start fresh, you can delete
the entire MyExperiments database, with the following. Warning: drop-
ping a database deletes all tables and data in that database.
USE MyExperiments;
GO
USE MyExperiments;
Here are some useful comments on the syntax shown in Figure 4.6:
• The condition not null, appearing after the data types for FriendID
and PetID, ensures that these columns can never contain NULL values.
In the case of the FriendID in the Pets table, this ensures that we never
store the details of a pet whose owner is unknown. Excluding NULL
values is also necessary for any column that we plan to make into a
primary key, since primary keys must never be NULL .
• The line PRIMARY KEY (FriendID) specifies that the column FriendID
is a primary key. Lastly, the line FOREIGN KEY (FreindID) REFERENCES
Friends (FriendID) specifies that the FriendID in the Pets table is a
foreign key ‘pointing at’ (i.e., referencing) the FriendID in the Friends
table.
Now that you’ve seen the example syntax in Figure 4.6 and the correspond-
ing comments, here is a simplified general CREATE TABLE syntax, making use
CHAPTER 4. INDEPENDENT DEVELOPMENT 130
Figure 4.7: The basic general syntax for creating tables in SQL.
When you need to store numeric data, it is almost always best to use a
DECIMAL(p,s) data type. For example, DECIMAL(5,2) stores 5 digit numbers
with 2 decimal places, meaning the largest number that can be stored in
DECIMAL(5,2) is 999.99. If you are only storing whole numbers, and you
don’t plan to aggregate the numbers, then it may be fine to use INT instead
(see Section 3.2.2 for a discussion on the perils of aggregating INT data).
Aside from the need to learn more about data types, such as DATE and
TIME (which we will do in Section 4.3.2), the syntax in Figure 4.7 gives you
everything you need, to create most kinds of tables. However, there are
many more fancy things that can be done with the CREATE TABLE statement,
that we don’t cover in these notes. If you take a moment to briefly check
out the MySQL or T-SQL docs for creating tables, you’ll see that the CREATE
TABLE syntax can get much more complicated. Regardless, I have rarely
needed to go beyond the basic syntax described in Figure 4.7.
After executing the statement above, the new columns will automatically
be filled with NULL values. So, Friends will look like this:
Friends
FriendID FirstName LastName FavColour StartDate StartTime
1 X A red NULL NULL
2 Y B blue NULL NULL
3 Z C NULL NULL NULL
Figure 4.8: The Friends table with new StartDate and StartTime columns.
We will see how to update those NULL values, in the Section 4.3.2, but for
now we’ll keep exploring what ALTER TABLE can do. We’ve seen that adding
a column is straightforward. Dropping a column is usually straightforward.
The following statement deletes the StartDate column that we just created.
The above works fine, but what if we try to drop the foreign key FriendID
from the Pets table?
Figure 4.9: Error dropping the foreign key FriendID from Pets.
The error in Figure 4.9 has prevented us from accidentally losing the re-
lationship between Friends and Pets, since this relationship is essentially
stored in the foreign key FriendID. The name pets_ibfk_1 was automat-
ically assigned by the database management system when we created the
foreign key FriendID, so it’s not something we’ve seen before. Regardless,
now that we’ve seen the above error, we can get around it by specifically
CHAPTER 4. INDEPENDENT DEVELOPMENT 132
dropping the foreign key constraint pets_ibfk_1. This implies we are in-
tentionally removing the relationship between Friends and Pets. The fol-
lowing statement uses ALTER TABLE to drop a constraint.
Aside from adding and deleting columns, the ALTER TABLE clause can also
be used to change the data type of an existing column. Recall from Fig-
ure 4.6 that the data type for FirstName is VARCHAR(20), meaning we can
store names up to 20 characters in length. If we meet a friend with a longer
first name, we might want to increase that capacity. We’ll alter the First-
Name column, changing the data type, to double its maximum length (to
40 characters). Annoyingly, this statement has a slightly different syntax
in T-SQL to MySQL:
Altering data types can cause problems when the new data type is not com-
patible with the old data type, though this issue only arises if the column
already contains data that aren’t NULL . The database management system
will automatically try to convert the existing data to the new data type, and
if it fails to do the conversion without data loss, it will produce an error.
For example, suppose we try to convert FavColour to a VARCHAR of length 3
(keeping in mind that the word ’blue’ is already in the FavColour column,
and has length 4):
Figure 4.10: An error produced when we try to reduce the FavColour VARCHAR length to
something shorter than the word ‘blue’
The error in Figure 4.10 is returned, refusing to truncate the word ‘blue’ to
only 3 letters. This error prevents unexpected data loss. In the next section,
we’ll see how we can first manually truncate the data using UPDATE, to avoid
CHAPTER 4. INDEPENDENT DEVELOPMENT 133
this error.
• Warning: dropping a table deletes the whole table and all the data in
the table, often without warning.
Figure 4.11: Error dropping the Friends table, because Pets has a foreign key pointing at it.
The error in Figure 4.11 is produced by the above code, because the Pets
table has the foreign key friendID, pointing at the Friends table. We can
circumvent this issue by first dropping the foreign key constraint from the
Pets table, using what we learned in Section 4.2.4:
We’ve learned how to create, alter and drop tables. We’ve also seen that
altering and deleting tables can sometimes get a little hairy (e.g., when
foreign key constraints are involved or when new data types are not com-
patible with existing data). When you’re experimenting with SQL (creating
tables to test your own queries) it’s often easiest to use the commands in
Section 4.2.1 to create a whole new database, use it for your tests or experi-
ments, then drop the whole database when you’re done. We’ll see examples
of the whole process in Section 4.4.
CHAPTER 4. INDEPENDENT DEVELOPMENT 134
We now have two new friends, Kimmy and Jimmy Jenkins! The Friends table
now looks like this:
Friends
FriendID FirstName LastName FavColour
1 X A red
2 Y B blue
3 Z C NULL
4 Kimmy Jenkins yellow
5 Jimmy Jenkins NULL
Figure 4.12: The Friends table with two new rows of data.
• The data must be written in the same order as the column names:
if we wrote ('Kimmy', 4, 'Jenkins', 'yellow'), then we would have
set the FriendID to 'Kimmy' and the FirstName to 4.
• Character strings like 'Kimmy' must be written with quote marks, while
numbers like 4 should be written without quote marks. Entries for
DATE and TIME also require quote marks, but they must be written in
CHAPTER 4. INDEPENDENT DEVELOPMENT 135
the right format, which we discuss in Section 4.3.2 (and the format
was also given in the description column of Table 2.1).
• Inserted data must match the data type of the column. If we tried
to insert 'Kimmy' for the FriendID, then we’d get an error: Incorrect
integer value: 'Kimmy' for column 'FriendID'. This is because the
FriendID is an integer (defined using INT in Section 4.2.3), while 'Kimmy'
is a character string (defined using VARCHAR).
When people first learn how to insert data in SQL, it’s usually not long be-
fore they ask how to insert a whole Excel file or CSV. I’ll pre-empt that ques-
tion with an important clarification: standard SQL doesn’t have that fea-
ture! Don’t get me wrong, there are tools for it. Most decent MySQL or
T-SQL code editors will almost definitely have a built-in interface (or so-
called ‘wizard’) for importing CSV or Excel files and inserting the data into
SQL tables. Those tools are not SQL though.
If you have a CSV or Excel data file, and you want (or need) to use stan-
dard SQL code to insert the data, there is a really handy online conversion
tool available at konbert.com/convert/csv/to/sql. It converts CSV files to
SQL INSERT statements. If you’re using Excel, you’ll need to export the Ex-
cel file as a CSV first (which can be done easily in Excel).
Now we’re going to update StartDate and StartTime. When inserting or up-
dating DATE data, the date should be specified as a character string with the
format 'YYYY-MM-DD'. For TIME data, the format is 'HH:MM:SS'. The follow-
ing will set StartDate to December 30th , 1999, and StartTime to 4:30pm.
Warning: update statements like the below will happily overwrite ev-
ery entry in whatever columns we specify! This can easily lead to un-
expected data loss.
UPDATE Friends
SET StartDate = '1999-12-30', StartTime = '16:30:00';
Friends
FriendID FirstName LastName FavColour StartDate StartTime
1 X A red 1999-12-30 16:30:00
2 Y B blue 1999-12-30 16:30:00
3 Z C NULL 1999-12-30 16:30:00
Figure 4.13: The same StartDate and StartTime for every row in Friends.
Our UPDATE statement overwrote the StartDate and StartTime for every
row in the Friends table. That’s not usually desirable. The UPDATE clause can
CHAPTER 4. INDEPENDENT DEVELOPMENT 137
use a search condition (in a WHERE clause) to determine where to insert data.
The correct way to use UPDATE is usually to specify just one row to update.
The best way to specify one row, is via the primary key. In the following, we
update the StartDate and StartTime, just for our friend X, who has primary
key entry 1:
UPDATE Friends
SET StartDate = '2000-01-03', StartTime = '08:00:00'
WHERE FriendID = 1;
Friends
FriendID FirstName LastName FavColour StartDate StartTime
1 X A red 2000-01-03 08:00:00
2 Y B blue 1999-12-30 16:30:00
3 Z C NULL 1999-12-30 16:30:00
Figure 4.14: Friends after updating StartDate and StartTime for our friend X.
The DELETE statement works much like UPDATE: if no WHERE clause is given
in a DELETE statement, then every row of the table will be deleted. As
with UPDATE, it is usually best to use match a single primary key value when
deleting data:
Since there is currently no FriendID equal to 999, the above statement will
have no effect on the Friends table.
If we executed the above query on the 30th of July 2021, at 2:15pm, the
result would have looked like this:
RESULT
MemberCount ExecutionDateTime
5 2021-07-30 14:15:00
If we want to run the above query and save the result as a table, then in
MySQL we can use the CREATE TABLE ... SELECT statement, below. Subse-
quently, each month, we can perform the query again, and save the results
with the INSERT INTO ... SELECT statement. In T-SQL, on the other hand,
both tasks are achieved with a single statement, called SELECT INTO.
-- T-SQL only: initial table creation (and execute once per month)
SELECT COUNT(*) AS MemberCount, SYSDATE() AS ExecutionDateTime
INTO MemberCountHistory
FROM Membership;
Now, any time I want to access the above results, I can use the FriendsPets
view, rather than typing out the whole above query.
SELECT *
FROM FriendsPets;
RESULT
FirstName PetName
Y Chikin
Z Cauchy
Z Gauss
SausageSizzle
SaleId SaleDate Product Quantity FriendId Paid ($)
1 1999-12-31 pork 1 NULL 3
2 1999-12-31 veggie 3 NULL 9
3 1999-12-31 pork 2 1 2
4 2000-01-01 veggie 4 NULL 4
5 2000-01-01 veggie 2 2 6
Each time someone approaches the store and buys some sausages, the
sale date, type of sausage (product), and quantity of product, are recorded.
So, if someone buys 3 veggie sausages and 1 pork sausage, then two rows
will need to be added to the table. The table also includes a FriendID foreign
key column, to keep track of any sales to friends (since those sales were dis-
counted). Finally, there is a column ‘Paid ($)’ to indicate how much money
changed hands during the sale.
CHAPTER 4. INDEPENDENT DEVELOPMENT 141
Now, any time we want to see daily quantities of sausage sales, we can just
SELECT * FROM SausageSizzleSummary, returning the table in Figure 3.23.
CHAPTER 4. INDEPENDENT DEVELOPMENT 142
4.4.1 Our first real dataset (peek with TOP and LIMIT)
To illustrate testing and development, we’ll use our first real dataset: the
Stack Exchange Data Explorer platform. This database includes (at the time
of writing) over 37 million answers and 100 million comments, for more
than 26 million questions, from Stack Exchange – an excellent collection of
community Q&A sites. You can browse the Stack Exchange Data Explorer
platform and find thousands of user generated SQL queries, making it a
great learning resource. We’ll focus on a smaller subset of this database,
containing questions and answers from the Statistical Analysis Site.
CHAPTER 4. INDEPENDENT DEVELOPMENT 143
Our goal
In statistics, two dominant schools of reasoning exist that have apparently
conflicting philosophical interpretations of probability. These two schools
are referred to as ‘Bayesians’ and ‘frequentists’. Our goal is to query the
Stack Exchange Statistical Analysis posts to compare community appreci-
ation for posts mentioning the word ‘frequentist’, to those mentioning the
word ‘Bayesian’, for each month, across all years prior to (and including)
2020. So, grouping by month, we want to see the average score, average
number of views, and total number of posts, in each of the two categories,
‘frequentist’ versus ‘Bayesian’. Note that by no means does this constitute
a thorough investigation, or even a thorough research question – we’re just
doing some good old exploration.
Head over now to the Statistical Analysis Site query page, and execute the
above (T-SQL code only) to view the first 10 rows of the Posts table. Note,
you may want to create an account and log in first, so you don’t have to
keep completing the CAPTCHA each time you execute a query. The columns
that we’re interested in from Posts are Id, CreationDate, Score, ViewCount
and Body. On the right of that page, you can see database schema details,
giving the data type for each column in each table. Note there is also a data
dictionary available on the site.
CHAPTER 4. INDEPENDENT DEVELOPMENT 144
USE StackOverflowTesting;
GO -- only use GO in T-SQL, remove GO for MySQL
Note the Stack Exchange database schema uses NVARCHAR for Body, which
can can fit more types of special characters (like smiley faces). We have
used VARCHAR instead, because our test data will be simpler.
Checking the output of the above query, we see that Category is NULL when-
ever Body does not contain ‘frequentist’ or ‘Bayesian’. This is fine, since we
plan to discard those rows soon anyway. After becoming confident that
MONTH and CASE WHEN are working as expected, we can add a WHERE clause to
filter out posts we’re not interested in. We’ll exclude all years after 2020,
and remove rows where Body does not contain ‘frequentist’ or ‘Bayesian’.
Note that, annoyingly, we can’t access Category in the WHERE clause, be-
cause it is created in the SELECT clause (see Section 2.2.7 for a refresher on
this). To check the WHERE clause is working well, we’ll add a row from 2021.
The result includes a row for the group January/frequentist. That’s a prob-
lem, because our test data includes only one January 2020 row, which has
CHAPTER 4. INDEPENDENT DEVELOPMENT 146
no ‘frequentist’ in the Body. Since rows are not being excluded properly, we
can narrow the problem down to the WHERE clause. Can you find it? Hint:
see Section 2.4.4, and note the search condition in the WHERE clause is:
YEAR(CreationDate) <= 2020 AND
Body LIKE '%bayesian%' OR Body LIKE '%frequentist%'
The above gives us a side-by-side comparison of our query results with our
test data. I encourage you to run the code yourself, to look at the outputs.
You’ll notice that our query returns an AvgScore of 1 for every row. This is a
problem, because we expect to see an AvgScore of 1.5 in the March/Bayesian
group. Take a moment to try to solve the problem. Hint: see Section 3.2.2.
After some head scratching, or by finding an online post for a similar
CHAPTER 4. INDEPENDENT DEVELOPMENT 147
problem, you may realise: since Score is an INT, the output of AVG(Score)
is being truncated to an INT. We can fix this by ‘casting’ Score to a DECIMAL
data type, with CAST(Score AS DECIMAL). Along with this fix, we’ll add the
AVG ViewCount, and we’ll also COUNT the number of answers:
Running the above query, it looks like our AVG problem is fixed. Our query
is starting to look ready for a more thorough approach to testing. However,
the query has a lot of repetition, and is getting difficult to manage. We
should first remedy some of that via the WITH clause.
Here, cte_name is the alias given to subquery, and subquery is the query
that you want to give an alias to. The subquery should be followed by a
SELECT clause. Here’s a simple example, using the name PostCats in place
of cte_name.
CHAPTER 4. INDEPENDENT DEVELOPMENT 148
WITH PostCats AS (
SELECT MONTH(CreationDate) AS CreationMonth,
CASE WHEN Body LIKE '%frequentist%' THEN 'F'
WHEN Body LIKE '%bayesian%' THEN 'B'
END AS Category
FROM Posts
WHERE YEAR(CreationDate) <= 2020
)
SELECT *
FROM PostCats;
The above causes the PostCats alias to behave like a table with two columns:
CreationMonth and Category. It then simply displays the whole PostCats
table with SELECT *. One advantage of creating PostCats is that we can now
create a GROUP BY clause that uses Category, instead of repeating the whole
CASE WHEN expression (as we were forced to do previously). Similarly, we
can use Category in the WHERE clause to filter out posts that don’t contain
‘Bayesian’ or ‘frequentist’:
WITH PostCats AS (
SELECT MONTH(CreationDate) AS CreationMonth,
CASE WHEN Body LIKE '%frequentist%' THEN 'F'
WHEN Body LIKE '%bayesian%' THEN 'B'
END AS Category
FROM Posts
WHERE YEAR(CreationDate) <= 2020
)
SELECT CreationMonth, Category
FROM PostCats
WHERE Category IS NOT NULL
GROUP BY CreationMonth, Category;
WITH PostCats AS (
SELECT MONTH(CreationDate) AS CreationMonth,
CASE WHEN Body LIKE '%frequentist%' THEN 'F'
WHEN Body LIKE '%bayesian%' THEN 'B'
END AS Category,
CAST(Score AS DECIMAL) AS Score,
CAST(ViewCount AS DECIMAL) AS ViewCount
FROM Posts
WHERE YEAR(CreationDate) <= 2020
)
SELECT CreationMonth, Category,
AVG(Score) AS AvgScore,
AVG(ViewCount) AS AvgViews,
COUNT(*) AS NumPosts
FROM PostCats
WHERE Category IS NOT NULL
GROUP BY CreationMonth, Category;
Our query now involves much less repetition. Another benefit is that
the subquery PostCats is a whole separate query that we can independently
check during validity testing, if we need to. This can sometimes separate
testing into more manageable components.
For the most part, our query is done! For some satisfaction and a sanity
check, we can run this query on real data, in a web browser, on the Stack
Exchange data platform. However, if this query is to be taken seriously,
we’ll also need to test it more thoroughly.
I know, it doesn’t look very neat. The syntax could be a lot nicer. Regardless,
we carry on. There are a few things to point out about the above code. The
columns Id, CreationDate, Score, ViewCount and Body must all be named
next to the Posts alias at the start of the WITH clause. Then, each row of test
data is preceded by SELECT, and followed by UNION ALL. After the test data, a
second alias (PostCats) is defined – we are using WITH to define two aliases:
one called Posts (holding the test data) and one for PostCats. The rest of
the query is the same as the one we developed in Section 4.4.3. It is large
and messy, so from now on, to avoid taking up too much space in the notes,
I will write the test data only, so the above will look like:
Example 4.4.1. For this example, we’ll test the behaviour about the ‘bound-
ary’ year, 2020, for the CreationDate column. Notice that, for all columns
that we aren’t testing, we use the same (simple) values in every row. This
way we avoid complicating the output. We also throw in a NULL Creation-
Date, for good measure.
Given the above test data, we can easily determine the expected output.
We should see the third row get filtered out (having a year above 2020), and
CHAPTER 4. INDEPENDENT DEVELOPMENT 152
the fourth row should be filtered out (having NULL year). The other two
rows should appear in the result (both in the January/frequentist group).
Comparing these expectations to the output tells us the query is behaving
well:
RESULT
CreationMonth Category AvgScore AvgViews NumPosts
1 F 1.0 200.0 2
Figure 4.18: The output of our validity test for the boundary year 2020
Example 4.4.2. For this example, we’ll test the Body column to make sure
the LIKE operator is picking up on lowercase and uppercase letters, as well
as not being affected by the position of the target word. We will also include
a row with NULL Body.
Given the above test data, our expected output has 3 posts in the group
January/frequentist. The output we receive tells us the query is behaving
well:
RESULT
CreationMonth Category AvgScore AvgViews NumPosts
1 F 1.0 200.0 3
Figure 4.19: The output of our validity test for the boundary year 2020
SELECT
year(IDI_Clean_20181020.moh_clean.PRIMHD.moh_mhd_activity_start_date)
AS StartYear,
IDI_Clean_20181020.moh_clean.PRIMHD.snz_moh_uid
FROM
IDI_Clean_20181020.moh_clean.PRIMHD
WHERE
IDI_Clean_20181020.moh_clean.PRIMHD.moh_mhd_activity_type_code !=
'T35'
GROUP BY
year(IDI_Clean_20181020.moh_clean.PRIMHD.moh_mhd_activity_start_date),
IDI_Clean_20181020.moh_clean.PRIMHD.snz_moh_uid
ORDER BY
year(IDI_Clean_20181020.moh_clean.PRIMHD.moh_mhd_activity_start_date);
Often, a query can be simplified just by introducing aliases. The FROM clause
in the above query contains one table name:
FROM IDI_Clean_20181020.moh_clean.PRIMHD
The above table name was not given an alias, so the full table name is being
reused as a prefix to every column name in the query. Introducing a short
alias can make the query easier on the eyes. In many cases, it’s sufficient
to stop there. However, in more complicated queries, the lengths of the
column names can still be a burden. So, it can help to also give short aliases
to the column names. To be able to use aliases for the column names, we
need to assign them via a WITH clause (see Section 4.4.3). Here is the query
after being simplified via short aliases for the table and column names:
WITH Shortened AS (
SELECT M.moh_mhd_activity_start_date AS astart,
M.snz_moh_uid AS muid,
M.moh_mhd_activity_type_code AS activity
FROM IDI_Clean_20181020.moh_clean.PRIMHD AS M
)
SELECT year(astart) AS StartYear, M.muid
FROM Shortened
WHERE M.activity != 'T35'
GROUP BY year(astart), M.muid
ORDER BY year(astart);
The above query is now much easier to read than the original. Furthermore,
all of the alias assignments happen in the WITH clause, which now serves as
a reminder of the original table and column names.
CHAPTER 4. INDEPENDENT DEVELOPMENT 154
Exercise 4.7.1
In this exercise, we will investigate the perils of joining tables that are not
related by primary/foreign key pairs.
CHAPTER 4. INDEPENDENT DEVELOPMENT 155
2.
SELECT H.house_id, H.house_address, H.post_code
FROM Notes.Suburbs S
JOIN Notes.Houses H ON S.post_code = H.post_code
GROUP BY H.house_ID, H.house_address, H.post_code
HAVING COUNT(*) > 1;
CHAPTER 4. INDEPENDENT DEVELOPMENT 156
3.
In MySQL, you should replace STRING_AGG(suburb_name, ', ') with
GROUP_CONCAT(suburb_name SEPARATOR ', ').
SELECT H.house_id,
H.house_address,
H.post_code,
STRING_AGG(suburb_name, ', ') AS potential_suburbs
FROM Notes.Suburbs S
JOIN Notes.Houses H ON S.post_code = H.post_code
GROUP BY H.house_ID, H.house_address, H.post_code
HAVING COUNT(*) > 1;
4.
SELECT H.house_id, H.house_address, H.post_code, S.suburb_name
FROM Notes.Houses H
LEFT JOIN Notes.Suburbs S ON S.post_code = H.post_code
WHERE S.suburb_name IS NULL;
5. Simply place UNION between the two queries (but don’t forget to re-
move the semi-colon from the first query).
SELECT H.house_id,
H.house_address,
H.post_code,
string_agg(suburb_name, ', ') AS potential_suburbs
FROM Notes.Suburbs S
JOIN Notes.Houses H ON S.post_code = H.post_code
GROUP BY H.house_ID, H.house_address, H.post_code
HAVING COUNT(*) > 1
UNION
SELECT H.house_id, H.house_address, H.post_code, S.suburb_name
FROM Notes.Houses H
LEFT JOIN Notes.Suburbs S ON S.post_code = H.post_code
WHERE S.suburb_name IS NULL;
6. Simply place the entire query from solution 5 into a WITH clause (see
Section 4.4.3), and then count the number of rows with COUNT(*):
CHAPTER 4. INDEPENDENT DEVELOPMENT 157
WITH T AS (
SELECT H.house_id,
H.house_address,
H.post_code,
string_agg(suburb_name, ', ') AS potential_suburbs
FROM Notes.Suburbs S
JOIN Notes.Houses H ON S.post_code = H.post_code
GROUP BY H.house_ID, H.house_address, H.post_code
HAVING COUNT(*) > 1
UNION
SELECT H.house_id, H.house_address, H.post_code,
S.suburb_name
FROM Notes.Houses H
LEFT JOIN Notes.Suburbs S ON S.post_code = H.post_code
WHERE S.suburb_name IS NULL;
)
SELECT COUNT(*) AS NumHouses
FROM T;
Exercise 4.7.2
In this exercise, we will create a database. I will call it MyDatabase.
Animals
AnimalID AnimalName StatusID
1 Black Rhino 1
2 Saola 1
3 Asian Elephant 2
4 Green Turtle 2
5 Dugong 3
6 Giant Panda 3
Endangered
StatusID Specification
1 Critically endangered
2 Endangered
3 Vulnerable
CHAPTER 4. INDEPENDENT DEVELOPMENT 158
4. Execute the following two statements and explain what they achieve.
6. Add a primary key constraint to the Endangered table, and then add
the appropriate foreign key constraint to Animals.
2.
CHAPTER 4. INDEPENDENT DEVELOPMENT 159
Exercise 4.7.3
In this exercise, we will continue working with the database created in Ex-
ercise 4.7.2.
1. Write a query that joins the Animals and Endangered tables, and re-
trieves only the columns AnimalName and Specification. Save the
result using CREATE VIEW.
2. Explain why the following produces an error.
CHAPTER 4. INDEPENDENT DEVELOPMENT 160
5. A view is a stored query that behaves like a table. So, the results of
the view are always up to date.
-- display the contents of the view
SELECT *
FROM MyView;
Exercise 4.7.4
We’re now going to extend and improve on the Orders table, which we first
saw way back in Figure 2.13. You can look at the Orders table in the Notes
schema of the Sandpit database. It’s a strange table, really. For one, there
are no table numbers. Also, the items and their prices are stored alongside
CHAPTER 4. INDEPENDENT DEVELOPMENT 161
each order, which will lead to data redundancy when the same item is or-
dered twice. We’re going to create a separate table for items, and we’ll add
a table number column into Orders.
-- T-SQL syntax
SELECT Item, Price
INTO Restaurant.Menu
FROM Restaurant.Orders;
3.
-- T-SQL syntax
ALTER TABLE Restaurant.Orders
DROP COLUMN Price;
-- MySQL syntax
ALTER TABLE Restaurant_Orders
DROP COLUMN Price;
4. This syntax works in both MySQL and T-SQL. In MySQL, just replace
Restaurant.Orders with Restaurant_Orders.
ALTER TABLE Restaurant.Orders
ADD TableNo INT;
UPDATE Restaurant.Orders
SET TableNo = 1
WHERE OrderID IN (1,2);
5. We need to add the not null constraint, add the primary key con-
straint, and then finally add the foreign key constraint.
-- In MySQL replace 'ALTER COLUMN' with 'MODIFY COLUMN'
ALTER TABLE Restaurant.Menu
ALTER COLUMN Item varchar(30) not null;
Exercise 4.7.5
In this exercise, we will work with a pair of practice databases modelled af-
ter a small section of the New Zealand Integrated Data Infrastructure (IDI).
Note, many of the tables in this fake IDI database are empty. They are just
there to simulate table and schema naming conventions. You will need to
CHAPTER 4. INDEPENDENT DEVELOPMENT 163
use the ACC_Clean data dictionary. The IDI_Clean database contains indi-
vidual level data, and the IDI_Metadata database contains non-individual
level data.
1. The real IDI contains very large tables. To have a peek at these tables,
it would be very inefficient to simply execute SELECT * FROM MyTable.
Write a query that looks at the first 3 rows of the Serious_Injury table
in ACC_Clean.
2. Explain what the query below does (and, if you are using T-SQL, then
try running it). You may have to search online for details about the
T-SQL ‘Information Schema’.
SELECT *
FROM IDI_Clean.Information_Schema.Columns
WHERE Table_Catalog = 'IDI_Clean';
4. How many injuries is each employer involved in? Write a query that
lists the number of injuries caused by each employer. Use the key
named snz_employer_ird_uid to identify employers, since you do not
have access to the employer names. Also display the total amount of
claim costs related to each employer.
5. Of those people who have engaged with the ACC, how many have en-
CHAPTER 4. INDEPENDENT DEVELOPMENT 164
-- MySQL Syntax
SELECT *
FROM IDI_Clean.ACC_Clean_Serious_Injury
LIMIT 3;
5.
SELECT COUNT(*) AS NumPeople
FROM Security.concordance
WHERE snz_acc_uid IS NOT NULL
AND snz_moh_uid IS NOT NULL;
6. For the query below, in MySQL, use the GROUP_CONCAT function in-
stead of STRING_AGG.
WITH E AS (
SELECT snz_uid, 1 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp1_snz_uid = 1
UNION
SELECT snz_uid, 2 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp2_snz_uid = 1
UNION
SELECT snz_uid, 3 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp3_snz_uid = 1
UNION
SELECT snz_uid, 4 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp4_snz_uid = 1
UNION
SELECT snz_uid, 5 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp5_snz_uid = 1
UNION
SELECT snz_uid, 6 AS ethnicity_code
FROM IDI_Clean.ACC_Clean.Serious_Injury S
WHERE acc_cla_ethnic_grp6_snz_uid = 1
), M AS (
SELECT *
FROM
IDI_Metadata.clean_read_CLASSIFICATIONS.acc_ethnicity_code
)
SELECT snz_uid, STRING_AGG(description, ', ') AS ethnicities
FROM E JOIN M
ON E.ethnicity_code = M.ethnic_grp
GROUP BY snz_uid;
Bibliography
[Codd, 1970] Codd, E. F. (1970). A relational model of data for large shared
data banks. Communications of the ACM, 13(6):377–387.
166
Glossary
relational model The framework for how SQL “thinks of” data. It de-
scribes tables, and relationships between tables (e.g., with primary
and foreign key pairs). 7, 11
miniworld The part of the real world (or any imaginary world) that a given
database is designed to capture or represent. 9
entry One data value, in one particular row and column of a table. 12, 13
attribute One column of a table. The columns are essentially a set of labels
that define, conceptually, the data to be contained in the table. 12
167
Glossary 168
domain The collection of possible values for each attribute in a table. The
domain tells us what type of data (e.g., person names, phone num-
bers, country names etc) that we can store in each column of the ta-
ble. 13
primary key A primary key is any column (or collection of columns) that
has (or have, together) been chosen to uniquely identify the rows of
the table it belongs to. The entries in a primary key must be unique.
16
one-to-one relationship When one record (i.e., row) in a table can be as-
sociated with at most one record in another table, via a primary and
foreign key pair. 21
data type Data types are specified when SQL tables are created. They de-
termine the kinds of data that can be stored in a given column. Exam-
ples include INT (positive or negative whole numbers), FLOAT (approx-
imate decimal numbers), VARCHAR (variable lengths character strings),
DATE (dates) and TIME (times). 33
filtering The act of choosing rows from a table, based on meeting a search
condition, is often referred to as filtering. In SQL, this is achieved by
the WHERE clause (or, when filtering based on aggregation functions, it
is achieved by the HAVING clause). 38
tuple entries Tuple entries may be better referred to as ’arrays’, but I use
this terminology instead to keep things (hopefully) less confusing for
beginners. Tuple entries are collections of multiple values that oc-
cupy a single entry in a table. These cannot be returned by SELECT in
standard SQL. 77
window function A window function is any function that can work with
the OVER clause. These include aggregation functions, ranking func-
tions, and analytic functions. 96
Syntax Conventions The Syntax Conventions are a set of symbols that are
used in MySQL and T-SQL documentation to help communicate the
kinds of syntax that form valid SQL statements. 120
Data Manipulation Language (DML) DML is the part of the SQL language
that is dedicated to inserting, updating, deleting and querying data.
Contrast this with Data Definition Language (DDL). 120
Data Definition Language (DDL) DDL is the part of the SQL language that
is dedicated to defining objects, such as tables. DDL is used to create,
alter and drop whole tables. 120
fastball testing Fastball testing is the process of writing quick and dirty
test data to run through a query as early as possible. Fastball testing
makes sure that queries execute without error and produce the most
basic desired output. 144