Snowflake Flatten PDF
What’s inside
Semi-structured brings new insights to business
Schema? No need!
Aggregations
Schema-on-read is a reality
THE CHAMPION GUIDES
Semi-structured brings new insights to business

If you’re an experienced data architect, data engineer or data analyst, you’ve probably been exposed to semi-structured data such as JSON. IoT devices, social media sites and mobile devices all generate endless streams of JSON log files. Handling JSON data is unavoidable, but it can’t be managed the same way as more familiar structured data. Yet, to thrive in today’s world of data, knowing how to manage and derive value from this form of semi-structured data is crucial to delivering valuable insight to your organization.

One of the key differentiators in Snowflake, the data warehouse built for the cloud, is the ability to natively ingest semi-structured data such as JSON, store it efficiently and then access it quickly using simple extensions to standard SQL. This eBook will give you a modern approach to produce analytics from JSON data using SQL, easily and affordably.
Schema? No need!
Load your semi-structured data directly into a relational table
Over the last several years, we have all heard the phrase “schema-on-read” to explain the benefit of loading semi-structured data, such as JSON, into a NoSQL platform such as Hadoop. The idea here: data modeling and schema design could be delayed until long after you loaded the data. This avoids the slowdown of getting the data into a repository because you had to wait for a data modeler to first design the tables.

“Schema-on-read” implies there is a knowable schema. So, even though organizations can quickly load semi-structured data into Hadoop or a NoSQL platform, there is still more work required to actually parse the data into an understandable schema before it can be analyzed with a standard SQL-based tool. Experienced data professionals often have the burden of determining the schema and writing code to extract the data. Unlike structured data in a relational database, this requirement impedes an organization’s ability to access and utilize semi-structured data in a timely manner.

INSTANTLY QUERY SEMI-STRUCTURED DATA
Not so with the only modern data warehouse built for the cloud. With Snowflake, you can load your semi-structured data directly into a relational table. Then, you can query that data with a SQL statement and join it to other structured data, while not fretting about future changes to the “schema” of that data. Snowflake keeps track of the self-describing schema so you don’t have to; no ETL or fancy parsing algorithms required.

The built-in support to load and query semi-structured data—including JSON, XML and AVRO—is one of the remarkable benefits of Snowflake. With most of today’s traditional, on-premises and cloud-washed data warehouses, and big data environments, you have to first load this type of data to a Hadoop or NoSQL platform. Then you need to parse it, for example, with MapReduce, in order to load it into columns in a relational database. Then, and only then, can you run SQL queries or a BI/analytics tool against that data. All of this means more time, money and headache for you to allow business users to see that data.
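As a sketch of that pattern (the table and attribute names here are illustrative, not from this guide), a VARIANT column can be queried with ordinary SQL and joined to a structured table directly:

```sql
-- raw_events.event is a VARIANT column holding raw JSON documents;
-- users is an ordinary structured table. Both names are hypothetical.
select
    e.event:userId::string as user_id,
    u.user_name,
    count(*) as event_count
from raw_events e
join users u
  on u.user_id = e.event:userId::string
group by 1, 2;
```

No parsing step sits between loading raw_events and running this join; the e.event:userId reference reads the attribute straight out of the semi-structured column.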
No Hadoop or NoSQL needed in your data warehouse architecture

It’s simple. Snowflake invented a new data type called VARIANT that allows semi-structured data to be loaded, as is, into a column in a relational table.

When Snowflake loads semi-structured data, it optimizes how it stores that data internally by automatically discovering the attributes and structure that exist in the data, and using that knowledge to optimize how the data is stored. Snowflake also looks for repeated attributes across records, organizing and storing those repeated attributes separately. This enables better compression and faster access, similar to the way that a columnar database optimizes storage of columns of data.

The upshot: No Hadoop or NoSQL is needed in your data warehouse architecture for the sole purpose of holding semi-structured data. The result is a modern data warehouse built for the cloud that uses SQL, which you and your staff already know how to write. And as the data source evolves and changes over time, with new attributes, nesting, or arrays, there’s no need to re-code ETL or ELT code. The VARIANT data type does not care if the schema varies.

DATA IN, INSIGHT OUT
But that’s only half the equation. Once the data is in, how do you get the insight out? Snowflake has created extensions to SQL to reference the internal schema of the data. Because it’s self-describing, you can query the components and join the data to columns in other tables, as if you already parsed it into a standard relational format. Except there is no coding, ETL or other parsing required to prep the data.

In addition, statistics about the sub-columns are also collected, calculated and stored in Snowflake’s metadata repository. This gives Snowflake’s advanced query optimizer metadata about the semi-structured data, to optimize access to it. The collected statistics allow the optimizer to use pruning to minimize the amount of data needed for access, thus speeding the return of data.

THE DATA WAREHOUSE, REIMAGINED FOR THE CLOUD
No other solution, on-premises or cloud-washed, offers Snowflake’s optimized level of support for processing semi-structured data. Even though some traditional vendors have added features to store and access JSON and XML, those are add-ons to legacy code, using existing data types such as CLOBs, and are not natively optimized.

With these solutions, getting any kind of performance optimization requires additional DBA performance tuning. For example, in its documentation, one of the newer, cloud-washed data warehouse providers states that customers should not try to use their JSON feature at scale. This is yet another example of how cloud-washed legacy code can’t magically solve data problems.

How does Snowflake do it? First, we reimagined a data warehouse that could handle big data. Then we did the hard work to invent an entirely new data type.
1. CREATE A TABLE
I already have a Snowflake account, database and multi-cluster warehouse set up, so just like I would in any other database, I simply issue a create table DDL statement:

create or replace table json_demo (v variant);

Now I have a table with one column (“v”) with a declared data type of VARIANT.

2. LOAD SOME DATA
Now I load a sample JSON document using an INSERT and Snowflake’s PARSE_JSON function. We’re not simply loading the document as text but rather storing it as an object in the VARIANT data type, while at the same time converting it to an optimized columnar format (to query later):
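Assembled from the sample attributes used throughout this guide, an INSERT of this kind looks like the following (the exact layout of the original document string is illustrative):

```sql
-- Load the sample document as a single VARIANT value.
-- The attribute names and values match the examples used later in this guide.
insert into json_demo
select parse_json('
{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234"
  },
  "children": [
    { "name": "Jayden",  "gender": "Male",   "age": "10" },
    { "name": "Emma",    "gender": "Female", "age": "8" },
    { "name": "Madelyn", "gender": "Female", "age": "6" }
  ],
  "citiesLived": [
    { "cityName": "London",        "yearsLived": [ "1989", "1993", "1998", "2002" ] },
    { "cityName": "San Francisco", "yearsLived": [ "1990", "1993", "1998", "2008" ] },
    { "cityName": "Portland",      "yearsLived": [ "1993", "1998", "2003", "2005" ] },
    { "cityName": "Austin",        "yearsLived": [ "1973", "1998", "2001", "2005" ] }
  ]
}');
```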
While this approach is useful for testing, normally JSON would be loaded into a Snowflake table from your Snowflake staging area (S3) using a simple COPY command:

copy into myjsontable
from @my_json_stage/tutorials/dataloading/contacts.json
on_error = 'skip_file';

For more details on the many options and features of the COPY command, see Snowflake’s data loading tutorial.

3. START PULLING DATA OUT
Now that we’ve loaded an example, let’s work through how we access it. We’ll start with just getting the data from the name sub-column:

select v:fullName from json_demo;

1 row produced

row# V:FULLNAME
1    "Johnny Appleseed"

Where:
v = the column name in the json_demo table (from our create table command)
fullName = the attribute in the JSON schema

Similar to the table.column notation all SQL people are familiar with, in Snowflake, we added the ability to effectively specify a column within the column—a sub-column. However, we cannot leverage dot as our separator, as SQL syntax has already claimed that. So, the Snowflake team chose the next obvious thing: a colon to reference the JSON sub-columns and navigate that hierarchy. This structural information is dynamically derived based on the schema definition embedded in the JSON string. Snowflake’s advanced metadata engine records this information at the time it loads the JSON document into the table.

4. CASTING THE DATA
Usually we don’t want to see the double quotes around the data in the report output unless we’re going to create an extract file. Instead, we can format it as a string and give it a nicer column alias, similar to what we would do with a normal column:

select v:fullName::string as full_name
from json_demo;

1 row produced

row# FULL_NAME
1    Johnny Appleseed

Next, let’s look at a bit more of the data using the same syntax from above:

select
    v:fullName::string as full_name,
    v:age::int as age,
    v:gender::string as gender
from json_demo;

1 row produced

row# FULL_NAME         AGE  GENDER
1    Johnny Appleseed  42   Male

Again, simple SQL, and the output is similar to the results from any table you might have built in a traditional data warehouse.

At this point, you could look at a table in Snowflake with a VARIANT column and quickly start “shredding” the JSON with SQL. You can now query semi-structured data without learning the new programming language or framework required with Hadoop or NoSQL. Instead, you have a much lower learning curve to get the same result.
Yes, those examples are very simple. So let’s look at something a bit more complex. Notice that the original sample document contains some nesting of the data:

{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234"
  },
  ...

How do we pull that apart? With a very familiar table.column dot notation:

select
    v:phoneNumber.areaCode::string as area_code,
    v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

Just as fullName, age and gender are sub-columns, so too is phoneNumber. And subsequently areaCode and subscriberNumber are sub-columns of the sub-column. We can pull apart nested objects like this, and easily adapt if the schema changes and we add another sub-column.

WHAT HAPPENS IF THE STRUCTURE CHANGES?
One of the benefits of storing data in JSON is that the schema can easily change. But imagine if, in a subsequent load, the data provider changed the specification to this:

{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234",
    "extensionNumber": "24"
  },
  ...

A new attribute, extensionNumber, was added to phoneNumber! What happens to the load? Nothing. It keeps working because we ingest the string into the VARIANT column in the table.

You may ask, “What about the ETL/ELT code?” What code? There is no code, so there’s nothing to break. And existing reports? They keep working, too. The previous query will work just fine. If you want to see the new column, the SQL needs to be refactored to account for the change:

select
    v:phoneNumber.areaCode::string as area_code,
    v:phoneNumber.subscriberNumber::string as subscriber_number,
    v:phoneNumber.extensionNumber::string as extension_number
from json_demo;

In addition, if the reverse happens and an attribute is dropped, the query will not fail. Instead, it simply returns a NULL value. In this way, Snowflake insulates all the code you write from these types of dynamic changes.
One of JSON’s many cool features is the ability to specify and embed an array of data within the document. In this example, one such array is children:

"children": [
  { "name": "Jayden",  "gender": "Male",   "age": "10" },
  { "name": "Emma",    "gender": "Female", "age": "8" },
  { "name": "Madelyn", "gender": "Female", "age": "6" }
]

You will notice there are three rows in the array and each row has three sub-columns—name, gender and age. Each of those rows constitutes the value of that array entry, which includes all the sub-column labels and data. (Remember this for later.) So how do you know how many rows there are if you don’t have access to the raw data? Like this:

select array_size(v:children) from json_demo;

The function array_size determines it for us. To pull the data for each row in the array, we can use the previous dot notation, but with the added specification for the row number of the array located inside the brackets:

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

3 rows produced

row# V:CHILDREN[0].NAME
1    "Jayden"
2    "Emma"
3    "Madelyn"
Admittedly, this is interesting but not very practical. You need to know, in advance, how many values are in the array in order to construct the SQL with the right number of “union all” statements to traverse the entire array. Since JSON schemas are flexible, e.g., they can easily change, you really need a method to extract the data that is more dynamic, and able to determine how many rows are in the array for any given record, at any time.

We solve that problem with another, new, extended SQL function called FLATTEN, which takes an array and returns a row for each element in the array. You can select all the data in the array as though it were in table rows; there’s no need to figure out how many elements there are.

Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias:

select f.value:name
from json_demo, table(flatten(v:children)) f;

3 rows produced

row# F.VALUE:NAME
1    "Jayden"
2    "Emma"
3    "Madelyn"

This syntax allows us to create an inline virtual table in the FROM clause. In the SELECT, you can then reference it like a table. Notice the notation f.value:name:

f = the alias for the virtual table from the children array
value = the contents of the element returned by the FLATTEN function
name = the label of the specific sub-column we want to extract from the value

The results, in this case, are the same as the SELECT with the UNIONs. But the output column header reflects the different syntax, since we have yet to add any column aliases.

If another element is added to the array, such as a fourth child, we will not have to change the SQL. FLATTEN allows us to determine the structure and content of the array on the fly. This makes the SQL resilient to changes in the JSON document.

You can now get all the array sub-columns and format them just like a relational table:

select
    f.value:name::string as child_name,
    f.value:gender::string as child_gender,
    f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

3 rows produced

row# CHILD_NAME  CHILD_GENDER  CHILD_AGE
1    Jayden      Male          10
2    Emma        Female        8
3    Madelyn     Female        6
Putting all this together, you can write a query to get the parent’s name and the children’s names:

select
    v:fullName::string as parent_name,
    f.value:name::string as child_name,
    f.value:gender::string as child_gender,
    f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

3 rows produced

row# PARENT_NAME       CHILD_NAME  CHILD_GENDER  CHILD_AGE
1    Johnny Appleseed  Jayden      Male          10
2    Johnny Appleseed  Emma        Female        8
3    Johnny Appleseed  Madelyn     Female        6

If you just want a quick count of children by parent, you do not need FLATTEN but instead refer back to array_size:

select
    v:fullName::string as Parent_Name,
    array_size(v:children) as Number_of_Children
from json_demo;

1 row produced

row# PARENT_NAME       NUMBER_OF_CHILDREN
1    Johnny Appleseed  3

Notice there is no “group by” clause needed because the nested structure of the JSON has naturally grouped the data for us.
You may recall there are multiple arrays in the sample JSON string. You can pull from several arrays at once with no problem:

select
    v:fullName::string as Parent_Name,
    array_size(v:citiesLived) as Cities_lived_in,
    array_size(v:children) as Number_of_Children
from json_demo;

1 row produced

row# PARENT_NAME       CITIES_LIVED_IN  NUMBER_OF_CHILDREN
1    Johnny Appleseed  4                3

What about an array within an array? Snowflake can handle that, too. From the sample data you can see yearsLived is an array nested inside the array described by citiesLived:

"citiesLived": [
  { "cityName": "London",
    "yearsLived": [ "1989", "1993", "1998", "2002" ]
  },
  { "cityName": "San Francisco",
    "yearsLived": [ "1990", "1993", "1998", "2008" ]
  },
  { "cityName": "Portland",
    "yearsLived": [ "1993", "1998", "2003", "2005" ]
  },
  { "cityName": "Austin",
    "yearsLived": [ "1973", "1998", "2001", "2005" ]
  }
]

To pull that data out, we add a second FLATTEN clause that transforms the yearsLived array within the FLATTENed citiesLived array:

select
    cl.value:cityName::string as city_name,
    yl.value::string as year_lived
from json_demo,
    table(flatten(v:citiesLived)) cl,
    table(flatten(cl.value:yearsLived)) yl;

In this case the 2nd FLATTEN (alias “yl”) transforms, or pivots, the yearsLived array for each value returned from the first FLATTEN of the citiesLived array (“cl”).

The resulting output shows Year Lived by City:

16 rows produced

row# CITY_NAME      YEAR_LIVED
1    London         1989
2    London         1993
3    London         1998
4    London         2002
5    San Francisco  1990
6    San Francisco  1993
7    San Francisco  1998
8    San Francisco  2008
9    Portland       1993
10   Portland       1998
11   Portland       2003
12   Portland       2005
13   Austin         1973
14   Austin         1998
15   Austin         2001
16   Austin         2005
Similar to the previous example, you can augment this result by adding the name to show who lived where:

select
    v:fullName::string as parent_name,
    cl.value:cityName::string as city_name,
    yl.value::string as year_lived
from json_demo,
    table(flatten(v:citiesLived)) cl,
    table(flatten(cl.value:yearsLived)) yl;

16 rows produced

row# PARENT_NAME  CITY_NAME  YEAR_LIVED
Aggregations
How to execute standard SQL aggregations on semi-structured data
To answer the question you’re probably thinking: Yes! You can even execute standard
SQL aggregations on the semi-structured data. So, just like ANSI SQL, you can do a
count(*) and a group by:
select
    cl.value:cityName::string as city_name,
    count(*) as years_lived
from json_demo,
    table(flatten(v:citiesLived)) cl,
    table(flatten(cl.value:yearsLived)) yl
group by 1;

4 rows produced

row# CITY_NAME      YEARS_LIVED
1    London         4
2    San Francisco  4
3    Portland       4
4    Austin         4
You can also create much more complex analyses using the library of standard SQL
aggregation and windowing functions including LEAD, LAG, RANK and STDDEV.
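For example (a sketch, not from the original guide), a windowing function such as RANK can order the cities by how many years were spent in each, reusing the same json_demo table and double-FLATTEN pattern:

```sql
-- Rank cities by the number of entries in their yearsLived array.
select
    cl.value:cityName::string as city_name,
    count(*) as years_lived,
    rank() over (order by count(*) desc) as years_rank
from json_demo,
    table(flatten(v:citiesLived)) cl,
    table(flatten(cl.value:yearsLived)) yl
group by 1;
```

Since every city in the sample document has exactly four entries in yearsLived, all four rows tie at rank 1 here; with real data, the ranking would separate them.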
What if you don’t want to return every row in an array? Similar to standard SQL, you add a WHERE clause:

select
    cl.value:cityName::string as city_name,
    count(*) as years_lived
from json_demo,
    table(flatten(v:citiesLived)) cl,
    table(flatten(cl.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

1 row produced

row# CITY_NAME  YEARS_LIVED
1    Portland   4

To make it easier to read the SQL, notice you can even reference the sub-column alias city_name in the predicate. You can also use the full, sub-column specification cl.value:cityName.
Schema-on-read is a reality
Get the latest technology without affecting your warehouse performance

The examples we’ve walked through show how very easy it is to load and analyze information from semi-structured data with SQL, using Snowflake as both your big data and data warehouse solution. Snowflake invented a brand new, optimized data type, VARIANT, which lives in a relational table structure in a relational database. VARIANT offers native support for querying JSON without the need to analyze the structure ahead of time, or design appropriate database tables and columns, subsequently parsing the data string into that predefined schema.

VARIANT provides the same performance as all the standard relational data types. In the examples, you saw easy-to-learn extensions to ANSI-standard SQL for accessing that data in a very flexible, resilient manner. With Snowflake’s built-for-the-cloud data warehouse, you get the bonus of on-demand resource scalability that no traditional or cloud-washed data warehouse solution delivers.

With these features, Snowflake gives you a fast path to the enterprise end game: the true ability to quickly and easily load semi-structured data into a modern data warehouse and make it available for immediate analysis.