SQL For Data Analysis

Download as pdf or txt
Download as pdf or txt
You are on page 1of 63
At a glance
Powered by AI
Some of the key takeaways from the workshop introduction are that SQL is a simple yet powerful language for working with relational databases and performing data analytics. The workshop aims to introduce relational database concepts and provide hands-on experience with real-world datasets from the City of Edmonton Open Data Portal.

The goals of the workshop are to introduce relational database concepts, provide hands-on experience with real-world datasets from the City of Edmonton Open Data Portal, and foster a collaborative learning environment where participants can ask questions.

Some advantages of using a RDBMS include establishing a centralized and logical view of data, minimizing data duplication, promoting data accuracy and integrity, retrieving information quickly, and enabling interoperability between systems.

Workshop

Introducing SQL:
A Foundation of Data Analytics

Robb Sombach
University of Alberta
Alberta School of Business

1
Agenda
• Introduction • SQL
• Why SQL? • Data Definition
• What about Python? R? Language (DDL)
• Data Analytics • Exercise 2
• Data Manipulation
• Relational Database Language (DML)
• What is a database? • Exercise 3
• Terminology
• SQLite • Open Data Portal
• Exercise 1 • How I prepared for
today

2
Robb Sombach
• Work Experience
• 15+ years working in the IT industry
• 10+ years Self-Employed IT Consultant
• IT Positions
• Systems Analyst / Business Analyst
• Database Administrator (Oracle / SQL Server)
• Network Administrator
• Developer

3
Robb Sombach
• Teaching Experience
• 5 years teaching at NAIT
• Computer Systems Technology (CST)
• Digital Media and Information Technology (DMIT)
• 6+ years teaching at University of Alberta
• Technology Training Centre
• Alberta School of Business

4
Resources

All Workshop files can be downloaded here

http://bit.ly/odd_2019

5
Introduction
Workshop
Introducing SQL: Foundation of Data Analytics

6
Goals
• Introduce relational database concepts
• Provides hands-on, real world database experience
using data from the City of Edmonton Open Data
Portal
• Foster a collaborative workshop
• Please interupt and ask questions

7
Why SQL?
• Simple
• Accessible
• Applicable
• Powerful
• Pervasive
• Valuable
• Universal

8
Why not Python? R?
• Difficult for beginners
• Complicated syntax
• Requires programming
knowledge (logic,
algorithms)
• Is SQL better than Python
or R?
• SQL is good for some things
• Python/R is good for other
things
• Compliment each other
• SQL is a great starting
point

9
Data Analytics
• Analytics is the discovery, interpretation, and
communication of meaningful patterns in data; and
the process of applying those patterns towards
effective decision making
• Organizations may apply analytics to business data
to describe, predict, and improve business
performance
• https://en.wikipedia.org/wiki/Analytics

10
Relational Database
Workshop
Introducing SQL: Foundation of Data Analytics

11
What is a database?
• A relational “database”
management system
(RDBMS) organizes data
• The logical structure of
the database is based
upon the information
needs of an organization
• Entities (“things” of
interest to the
organization),
AND
• Relationships (how the
Entities are associated
with each other)

12
Advantages of a RDBMS
• Establish a centralized,
logical view of data
• Minimizes data duplication
(i.e. “redundancy”)
• Promote data accuracy and
integrity
• Capacity of database
• Superior multi-user or
concurrent access
• Security
• Retrieve information
quickly https://www.bespokesoftwaredevelopment.com/blog/advantages-database-development-business/

• Inter-operability
13
Database Terminology
• Table, Entity, Relation,
(similar to an Excel
Worksheet)
• Row, Record, Instance
• Column, Field, Attribute
• Primary Key – unique and
mandatory
• Foreign Key – a cross-
reference between tables
because it references the
primary key of another
table
• Relationship – created
though foreign keys
14
How to introduce SQL?
• Microsoft Access
• https://products.office.com/en-
ca/access
• Microsoft SQL Server
• https://www.microsoft.com/en-
us/sql-server/sql-server-2017
• MariaDB, MySQL
• https://mariadb.org/
• https://www.mysql.com/
• Postgresql
• https://www.postgresql.org/
• Oracle
• https://www.oracle.com/database/
• Hadoop, Spark, Hive, Pig
• https://hadoop.apache.org/

15
A database that …
• Has full-featured SQL • Aviation-grade quality and
• Has billions and billions of testing
deployments • Zero-configuration
• Is a single-file database • Has ACID (Atomic,
• Has public domain source Consistent, Isolated, and
code Durable) transactions, even
after power loss
• Small footprint • Has a stable, enduring file
• Has a max DB size of 140 format
terabytes • Is has extensive, detailed
• Has a max row size of 1 documentation
gigabyte • Has long-term support (to
• Is faster than direct file the year 2050)
access
https://www.sqlite.org/about.html
16
SQLite
• “SQLite is the most widely deployed database in
the world with more applications than we can
count, including several high-profile projects”
• https://www.sqlite.org/famous.html
• “SQLite is an in-process library that implements a
self-contained, serverless, zero-configuration,
transactional SQL database engine”
• https://www.sqlite.org/about.html
• Perfect for learning SQL (the foundation of data
analytics)
17
Exercise 1: Download and Run
SQLite BD Browser
• Download SQLite
• Download SQLite DB Browser Portable
• https://sqlitebrowser.org/dl/

18
Exercise 1: Download and
Run SQLite
• Extract the ZIP archive to the Desktop
• Start SQLite
• SQLiteDatabaseBrowserPortable.exe
• Create a New database
• open_data_day_2019.db
• Save the database in the Data folder
• Click Cancel when prompted to create a table
• Done!

19
Exercise 1: Completed

20
SQL
Workshop
Introducing SQL: Foundation of Data Analytics

21
What is SQL?
• SQL stands for Structured Query Language
• SQL is pronounced S-Q-L or sequel
• SQL is a standard language for managing, manipulating
and querying databases
• Developed at IBM in the early 1970’s
• In 1986, ANSI and ISO standard groups officially adopted
the standard “Database Language SQL” definition
• Most SQL databases have their own proprietary
extensions in addition to the SQL standard
• SQL is the language used to ask questions (query)
of a database which will return answers (results)
22
Why is SQL the foundation
of Data Analytics?
• Data engineers and database administrators will
use SQL to ensure that everybody in their
organization has access to the data they need
• Data scientists will use SQL to load data into their
models
• Data analysts will use SQL to query tables of data
and derive insights from it

23
Components of SQL
• SQL consists of three
components which
offer everything
required to manage,
maintain and use a
database
1. Data Definition
Language
2. Data Manipulation
Language
3. Data Control
Language

24
Data Definition Language (DDL)
• This component is used to define the structure (or
schema) of the database
• For tables there are three main commands:

• CREATE TABLE table_name


• To create a table in the database
• ALTER TABLE table_name
• To add or remove columns from a table in the database
• DROP TABLE table_name
• To remove a table from the database

25
Exercise 2: Data Definition
Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the CREATE TABLE statement into
the empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• If the table is created successfully, you should receive
the following message:
• Query executed successfully: CREATE TABLE
"MOSQUITO_TRAP_DATA“
• Click Write Changes to make commit the changes
permanent
• View the changes in the Database Structure tab

26
CREATE TABLE "MOSQUITO_TRAP_DATA" (
`SAMPLEID` INTEGER PRIMARY KEY AUTOINCREMENT,
`TRAP_DATE` NUMERIC,
`GENUS` TEXT,
`SPECIES` TEXT,
`TYPE` TEXT,
`GENDER` TEXT
);

https://www.sqlite.org/lang_createtable.html 27
Exercise 2: Data Definition
Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the ALTER TABLE statements into
the empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• If the table is created successfully, you should receive
the following message:
• Query executed successfully: ALTER TABLE
"MOSQUITO_TRAP_DATA“
• Click Write Changes to make commit the changes
permanent
• View the changes in the Database Structure tab

28
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RURALNORTHWEST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RURALNORTHEAST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RURALSOUTHEAST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RIVERVALLEYEAST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RIVERVALLEYWEST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RESIDENTIALNORTH` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RURALSOUTHWEST` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `LAGOON` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `GOLFCOURSE` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `INDUSTRIALPARK` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `RESIDENTIALSOUTH` INTEGER;
ALTER TABLE "MOSQUITO_TRAP_DATA" ADD COLUMN `TOTAL` INTEGER;

https://www.sqlite.org/lang_altertable.html 29
Exercise 2: Data Definition
Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the DROP TABLE statement into the
empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• If the table is created successfully, you should receive
the following message:
• Query executed successfully: DROP TABLE
"MOSQUITO_TRAP_DATA"
• Click Write Changes to make commit the changes
permanent
• View the changes in the Database Structure tab

30
DROP TABLE "MOSQUITO_TRAP_DATA";

https://www.sqlite.org/lang_droptable.html 31
Exercise 2: Data Definition
Language
• Create the MOSQUITO_TRAP_DATA table again
using the DDL on the next slide
• Click Write Changes to make commit the changes
permanent
• View the changes in the Database Structure tab
• Done!

32
CREATE TABLE "MOSQUITO_TRAP_DATA" (
`SAMPLEID` INTEGER PRIMARY KEY AUTOINCREMENT,
`TRAP_DATE` NUMERIC,
`GENUS` TEXT,
`SPECIES` TEXT,
`TYPE` TEXT,
`GENDER` TEXT,
`RURALNORTHWEST` INTEGER,
`RURALNORTHEAST` INTEGER,
`RURALSOUTHEAST` INTEGER,
`RIVERVALLEYEAST` INTEGER,
`RIVERVALLEYWEST` INTEGER,
`RESIDENTIALNORTH` INTEGER,
`RURALSOUTHWEST` INTEGER,
`LAGOON` INTEGER,
`GOLFCOURSE` INTEGER,
`INDUSTRIALPARK` INTEGER,
`RESIDENTIALSOUTH` INTEGER,
`TOTAL` INTEGER
)

https://www.sqlite.org/lang_createtable.html 33
Exercise 1: Completed

34
Data Manipulation Language
• This component is used to manipulate data within a
table
• There are four main commands:

• SELECT
• To select rows of data from a table
• INSERT
• To insert rows of data into a table
• UPDATE
• To change rows of data in a table
• DELETE
• To remove rows of data from a table

35
Exercise 3: SELECT
Data Manipulation Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the SELECT statement into the
empty SQLite Execute SQL window
• SELECT COUNT(*) FROM MOSQUITO_TRAP_DATA;
• Click the Execute SQL button on the toolbar
• Do you get an answer? Why not?

https://www.sqlite.org/lang_select.html 36
Exercise 3: INSERT
Data Manipulation Language
• Add some data to the MOSQUITO_TRAP_DATA
table created in Exercise 2
• Type or copy/paste the INSERT statement into the
empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• Click Write Changes to make commit the changes
permanent
• View the changes in the Browse Data tab
• The MOSQUITO_TRAP_DATA table now has seven
rows of data

37
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','spencerii','Black legs','Female',0,0,0,0,0,1,0,0,0,1,1,3);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','dorsalis','Banded legs','Female',0,1,0,0,0,0,2,0,0,0,0,3);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','euedes','Banded legs','Female',1,1,0,0,2,0,0,0,0,0,0,4);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','excrucians','Banded legs','Female',1,2,0,0,2,1,0,0,0,1,0,7);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','fitchii','Banded legs','Female',0,2,0,0,1,0,0,0,0,0,4,7);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','flavescens','Banded legs','Female',6,5,8,0,0,0,5,0,0,3,1,28);
INSERT INTO "MOSQUITO_TRAP_DATA" (TRAP_DATE, GENUS, SPECIES, TYPE, GENDER, RURALNORTHWEST,
RURALNORTHEAST, RURALSOUTHEAST, RIVERVALLEYEAST, RIVERVALLEYWEST, RESIDENTIALNORTH,
RURALSOUTHWEST, LAGOON, GOLFCOURSE, INDUSTRIALPARK, RESIDENTIALSOUTH, TOTAL) VALUES ('2014-
07-01','Aedes','vexans','Banded legs','Female',3,168,1,21,38,8,16,0,0,3,32,290);

https://www.sqlite.org/lang_insert.html 38
Exercise 3: SELECT
Data Manipulation Language
• Type or copy/paste the SELECT statement into the
empty SQLite Execute SQL window
• SELECT COUNT(*) FROM MOSQUITO_TRAP_DATA;
• Click the Execute SQL button on the toolbar
• When you execute the query, you are asking the
database a question
• Can you tell me the number of rows in the
MOSQUITO_TRAP_DATA table?
• The database gives you an answer (the result) and you
should have received the following message:
• 7 rows returned in 1ms from: SELECT * FROM
MOSQUITO_TRAP_DATA;

https://www.sqlite.org/lang_select.html 39
Exercise 3: SELECT
Data Manipulation Language
• What if you want to see all the rows in your
database?
• SELECT * FROM MOSQUITO_TRAP_DATA;
• Returns all columns and rows in a table
• What if you only want to see the Genus, Species
and Total of each row?
• SELECT GENUS, SPECIES, TOTAL FROM
MOSQUITO_TRAP_DATA;
• Returns only the GENUS, SPECIES, TOTAL columns for
each row in a table

https://www.sqlite.org/lang_select.html 40
Data Manipulation Language
• The WHERE clause Operator Description
• Uses operators to extract = Equal
only those records that <> Not equal. Note: In some versions of
fulfill a specified condition SQL this operator may be written as !=

• Used to ask more > Greater than

complicated questions < Less than

• SQL will do exactly what >= Greater than or equal

you ask, not always what <= Less than or equal

you expect BETWEEN Between a certain range

• “I do not think it means LIKE Search for a pattern

what you think it means” IN To specify multiple possible values for a


column
• Inigo Montoya

https://www.sqlite.org/lang_select.html 41
Exercise 3: SELECT
Data Manipulation Language
• Show the rows that have a mosquito TYPE of “Black
legs”
• SELECT * FROM MOSQUITO_TRAP_DATA WHERE TYPE =
'Black legs';

YOUR TURN
• Write and execute a DML statement to answer the
question below:
• Which mosquito species’ were caught in the traps
placed in the west river valley?

https://www.sqlite.org/lang_select.html 42
Exercise 3: UPDATE
Data Manipulation Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the UPDATE statement into an
empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• You should receive the following message:
• Query executed successfully: … (took 1ms, 4 rows
affected)

https://www.sqlite.org/lang_update.html 43
UPDATE MOSQUITO_TRAP_DATA
SET GENDER = 'Male‘
WHERE SAMPLEID IN (1,3,5,7);

https://www.sqlite.org/lang_update.html 44
Data Manipulation Language
• The GROUP BY clause Function Description

• Used in collaboration AVG Calculates the average of a set of


values
with the SELECT COUNT Counts rows in a specified table or view
statement to arrange MAX Gets the minimum value in a set of
identical data into values

groups MIN Gets the maximum value in a set of


values

• The GROUP BY SUM Calculates the sum of values

statement is often used


with aggregate
functions

https://www.sqlite.org/lang_aggfunc.html 45
Exercise 3: SELECT
Data Manipulation Language
YOUR TURN
• Write and execute a DML statement to answer the
question below:
• How many mosquitos of each gender were caught in
traps throughout the city?

SELECT GENDER, TOTAL FROM MOSQUITO_TRAP_DATA


GROUP BY GENDER;

https://www.sqlite.org/lang_select.html 46
Exercise 3: DELETE
Data Manipulation Language
• Select the Execute SQL tab in SQLite
• Type or copy/paste the DELETE statement into an
empty SQLite Execute SQL window
• Click the Execute SQL button on the toolbar
• You should receive the following message:
• Query executed successfully: … (took 0ms, 4 rows
affected)

https://www.sqlite.org/lang_update.html 47
DELETE FROM
MOSQUITO_TRAP_DATA WHERE
GENDER = "Male";

https://www.sqlite.org/lang_delete.html 48
Exercise 3: SELECT
Data Manipulation Language
YOUR TURN
• Write and execute a DML statement to answer the
question below:
• At which traps were more mosquitos caught? Rural
north east or rural north west?
• Done!

SELECT SUM(RURALNORTHWEST) AS 'RURAL_WEST',


SUM(RURALNORTHEAST) AS 'RURAL_EAST' FROM
MOSQUITO_TRAP_DATA;

https://www.sqlite.org/lang_select.html 49
Advanced SQL
• The MOSQUITO database only has one table
• Databases with more than one table require tables
to be joined
• Foreign keys create relationships between tables
and must be joined in a DML statement

50
• Download the LED Streetlight Conversion database
called odd_streetlight.db
• Execute the query below

SELECT LED_STREETLIGHT.STREETLIGHT_ID, LED_STREETLIGHT.TYPE,


LOCATION.LOCATION
FROM LED_STREETLIGHT, LOCATION
WHERE LED_STREETLIGHT.STREETLIGHT_ID = LOCATION.STREETLIGHT_ID
AND LED_STREETLIGHT.STREETLIGHT_ID = 12;

https://www.sqlite.org/lang_select.html 51
City of Edmonton
Open Data Portal
Workshop
Introducing SQL: Foundation of Data Analytics

52
Using the Open Data Portal
• https://data.edmonton.ca/
• Data sets are usually available in comma separated
value (CSV) format
• To use the dataset requires cleaning, importing,
exploring and understand the data set
• Workshop: Exploring & Cleaning Data with OpenRefine
• Requires work

53
Data Work Flow

http://fouryears.eu/wp-content/uploads/2018/11/pipeline.png 54
How I prepared the data
sets for today
• Selected data sets from the Open Data Portal
• Downloaded the CSV and surveyed in Google
Sheets
• Cleaned the data set
• E.g. reformatted dates from MMM DD YYYY to YYYY-
MM-DD
• Imported into directly into SQLite tables
• Added primary keys
• Explored data set using DML

55
Some “Mosquitoes Trap
Data” questions
• How many mosquitos caught in 2014?
SELECT strftime('%Y', TRAP_DATE) as YEAR, SUM(TOTAL)
FROM MOSQUITO_TRAP_DATA
WHERE TOTAL <> ''
AND TOTAL > 0
GROUP BY YEAR;
• How many mosquitos of each species were caught?
• Which traps caught the most mosquitos?

https://www.sqlite.org/lang_datefunc.html 56
Some “LED Streetlight
Conversion” questions
• How many total streetlights?
• How many streetlights are converted to LED?
• How many streetlights were converted by year?
SELECT strftime('%Y', STARTDATE) as YEAR, TYPE,
COUNT(STREETLIGHT_ID)
FROM LED_STREETLIGHT
WHERE TYPE = "LED"
GROUP BY YEAR;

https://www.sqlite.org/lang_datefunc.html 57
SQL and Climate Change
• Connecting and linking various data sets
• Builds an understanding of what that data means

•Data is a universal language,


climate change is a global
problem

58
Next steps
• Playing with data and SQL forces you to think and
understand the data (builds knowledge)
• The relationships between data
• The meaning of those relationships
• The validity of the data
• SQL is iterative, often a “trial and error” process
• Don’t be afraid to make mistakes
• Team sport – discuss, share, question, collaborate
• Data is everywhere which raises questions of
privacy, security and ethics
59
Experiment

https://www.manchester.ac.uk/discover/news/major-leap-towards-storing-data-at-the-molecular-level/
60
If there’s time … (I talked too fast)

• Let’s (democratically):
1. Choose a dataset not discussed during the workshops
2. Formulate a question related to the dataset
3. Load the data into SQLite
4. Execute some DML to answer the question

61
Thank you!
• Robb Sombach
[email protected]
[email protected]
• LinkedIn

62
References
• https://opendataday.org/
• https://data36.com/sql-for-data-analysis-tutorial-
beginners/
• https://www.datascience.com/blog/to-sql-or-not-
to-sql-that-is-the-question
• https://codebeautify.org/sqlformatter

63

You might also like