IDAB Assignment 3: 1. Explain SQL Subqueries
IDAB Assignment 3: 1. Explain SQL Subqueries
IDAB Assignment 3: 1. Explain SQL Subqueries
IDAB Assignment 3
A Subquery or Inner query or a Nested query is a query within another SQL query and
embedded within the WHERE clause.
A subquery is used to return data that will be used in the main query as a condition to
further restrict the data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements
along with the operators like =, <, >, >=, <=, IN, BETWEEN, etc.
There are a few rules that subqueries must follow −
Subqueries must be enclosed within parentheses.
A subquery can have only one column in the SELECT clause, unless multiple
columns are in the main query for the subquery to compare its selected
columns.
An ORDER BY command cannot be used in a subquery, although the main query
can use an ORDER BY. The GROUP BY command can be used to perform the
same function as the ORDER BY in a subquery.
Subqueries that return more than one row can only be used with multiple value
operators such as the IN operator.
The SELECT list cannot include any references to values that evaluate to a BLOB,
ARRAY, CLOB, or NCLOB.
A subquery cannot be immediately enclosed in a set function.
The BETWEEN operator cannot be used with a subquery. However, the BETWEEN
operator can be used within the subquery.
Subqueries are most frequently used with the SELECT statement. The basic syntax is as
follows –
SELECT column_name [, column_name ]
FROM table1 [, table2 ]
WHERE column_name OPERATOR
(SELECT column_name [, column_name ]
FROM table1 [, table2 ]
[WHERE])
Subqueries also can be used with INSERT statements. The INSERT statement uses the
data returned from the subquery to insert into another table. The selected data in the
subquery can be modified with any of the character, date or number functions.
The basic syntax is as follows:
INSERT INTO table_name [ (column1 [, column2 ]) ]
SELECT [ *|column1 [, column2 ]
FROM table1 [, table2 ]
[ WHERE VALUE OPERATOR ]
The subquery can be used in conjunction with the UPDATE statement. Either single or
multiple columns in a table can be updated when using a subquery with the UPDATE
statement.
The basic syntax is as follows:
UPDATE table
SET column_name = new_value
[ WHERE OPERATOR [ VALUE ]
(SELECT COLUMN_NAME
FROM TABLE_NAME)
[ WHERE) ]
The subquery can be used in conjunction with the DELETE statement like with any other
statements mentioned above.
The basic syntax is as follows:
DELETE FROM TABLE_NAME
[ WHERE OPERATOR [ VALUE ]
(SELECT COLUMN_NAME
FROM TABLE_NAME)
[ WHERE) ]
An aggregation basically takes the values in multiple rows of data and returns one value.
Effectively collapsing all those rows into a single row containing a measure of interest. In
the world of SQL we're really talking about one of the following types of familiar
operations. Each of which operates on a table field. The sum function calculates the
arithmetic sum across the sum of field values. Similarly, the average or AVG function
calculates the average across the sum of field values. Both the sum and average
functions require that the data in the field be numeric. They won't work on other types
of data like strings or dates. This is not the case for the min and the max functions,
which return the minimum or maximum values from a set of field values. The min and
max functions will work with most data types, including numbers, strings, or dates. The
count function, returns the number of values present in a field of values. Some SQL
versions provide a few more options for aggregate functions. But the ones we
introduced here are pretty universal across all versions you're likely to encounter. For all
these functions the syntax is similar.
If we're interested in aggregate values across an entire table, the syntax is pretty simple.
We just use an aggregate function instead of a field name in our select statement. For
example, consider a transactions table. Since each row represents one transaction, if we
wanted to count the total number of transactions from the table we'd use this query. As
you can see we just select count star from transactions. When we run this code we get
the following which tells me there are nine rows in this table. Depending on the specific
version of SQL you are using the query may or may not return a column name for the
aggregation in the output data set. In this case the column name might be blank, or it
might be the whole function COUNT(*). SQL Does provide us a way to specify or change
a column name in our query through the use of something called an alias.
We saw earlier that we can get the total number of purchases in our table by using the
COUNT (*) function. To get the total dollar values of purchases or use the sum function
on the price column to get the average dollar value of purchases, I use the average
function, also on the price column. To get this aggregate by product, I'll include then on
this query using the group by statement. Here, I select the product as the grouping field
then my three aggregate functions as aggregate fields I've added aliases for each
aggregate function so the output data function makes sense. I've called COUNT(*)
PURCHASES, the SUM(PRICE) TOTAL_SALES, and the Average (Price) AVG_SALES. I then
have, FROM TRANSACTIONs, and finally, I GROUP BY PRODUCT.
When I run this query I get the following output. Notice that the resulting data set has
one row per product, with each of my aggregates as columns: purchases, total sales, and
average sales. Now let's say I wanted to aggregate my data, but I wanted to filter the
output data set in some way based on the values of the aggregates themselves.
Previously when we wanted to filter rows of data we used the where command. But
when we want to filter rows of data after they have been aggregated we use the having
command. The syntax for the having command looks like this. Everything is the same as
for the group by command, but now I've added a having command to the end of the
query, followed by some logical conditions that qualify what to filter. The logical
conditions themselves are basically the same as those we use with the where command.
Except that all the fields I reference are aggregate fields versus fields from the source
table.
Data extraction is the process of obtaining data from a database or SaaS platform so
that it can be replicated to a destination — such as a data warehouse — designed to
support online analytical processing (OLAP).
Extraction jobs may be scheduled, or analysts may extract data on demand as dictated
by business needs and analysis goals. Data can be extracted in three primary ways:
Update notification
The easiest way to extract data from a source system is to have that system issue a
notification when a record has been changed. Most databases provide a mechanism for
this so that they can support database replication (change data capture or binary logs),
and many SaaS applications provide webhooks, which offer conceptually similar
functionality.
Incremental extraction
Some data sources are unable to provide notification that an update has occurred, but
they are able to identify which records have been modified and provide an extract of
those records. During subsequent ETL steps, the data extraction code needs to identify
and propagate changes. One drawback of incremental extraction is that it may not be
able to detect deleted records in source data, because there’s no way to see a record
that’s no longer there.
Full extraction
The first time you replicate any source you have to do a full extraction, and some data
sources have no way to identify data that has been changed, so reloading a whole table
may be the only way to get data from that source. Because full extraction involves high
data transfer volumes, which can put a load on the network, it’s not the best option if
you can avoid it.
1. Check for changes to the structure of the data, including the addition of new
tables and columns. Changed data structures have to be dealt with
programmatically.
2. Retrieve the target tables and fields from the records specified by the
integration’s replication scheme.
Extracted data is loaded into a destination that serves as a platform for BI reporting,
such as a cloud data warehouse like Amazon Redshift, Microsoft Azure SQL Data
Warehouse, Snowflake, or Google BigQuery. The load process needs to be specific to
the destination.
API-specific challenges
While it may be possible to extract data from a database using SQL, the extraction
process for SaaS products relies on each platform’s application programming interface
(API). Working with APIs can be challenging:
Many APIs are not well documented. Even APIs from reputable, developer-
friendly companies sometimes have poor documentation.
APIs change over time. For example, Facebook’s “move fast and break things”
approach means the company frequently updates its reporting APIs – and Facebook
doesn’t always notify API users in advance.