17 SparkSQL
17 SparkSQL
17 SparkSQL
201509
Course
Chapters
10
Spark
Basics
11
Working
with
RDDs
in
Spark
12
AggregaKng
Data
with
Pair
RDDs
13
WriKng
and
Deploying
Spark
ApplicaKons
Distributed
Data
Processing
with
14
Parallel
Processing
in
Spark
Spark
15
Spark
RDD
Persistence
16
Common
PaHerns
in
Spark
Data
Processing
17
Spark
SQL
and
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-2
DataFrames
and
SparkSQL
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-3
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-4
What
is
Spark
SQL?
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-5
SQL
Context
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-6
CreaKng
a
SQL
Context
import org.apache.spark.sql.SQLContext
Scala
val sqlCtx = new SQLContext(sc)
import sqlCtx._
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-7
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-8
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-9
CreaKng
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-10
Example:
CreaKng
a
DataFrame
from
a
JSON
File
File:
people.json
age name pcode
{"name":"Alice", "pcode":"94304"} null Alice 94304
{"name":"Brayden", "age":30, "pcode":"94304"} 30 Brayden 94304
{"name":"Carla", "age":19, "pcode":"10036"}
{"name":"Diana", "age":46} 19 Carla 10036
{"name":"tienne", "pcode":"94104"}
46 Diana null
null tienne 94104
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-11
CreaKng
a
DataFrame
from
a
Data
Source
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-12
Data
Sources
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-13
Generic
Load
FuncKon
Example:
JDBC
accountsDF = sqlCtx.load(source="jdbc", \
url="jdbc:mysql://dbhost/dbname?user=&password=", \
dbtable="accounts)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-14
Generic
Load
FuncKon
Example:
Third-party
or
Custom
Sources
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-15
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-16
DataFrame
Basic
OperaKons
(1)
Basic
OperaIons
deal
with
DataFrame
metadata
(rather
than
its
data),
e.g.
schema
returns
a
Schema
object
describing
the
data
printSchema
displays
the
schema
as
a
visual
tree
cache
/
persist
persists
the
DataFrame
to
disk
or
memory
columns
returns
an
array
containing
the
names
of
the
columns
dtypes
returns
an
array
of
(column-name,type)
pairs
explain
prints
debug
informaKon
about
the
DataFrame
to
the
console
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-17
DataFrame
Basic
OperaKons
(2)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-18
Working
with
Data
in
a
DataFrame
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-19
DataFrame
AcKons
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-20
DataFrame
Queries
(1)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-21
DataFrame
Queries
(2)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-22
DataFrame
Query
Strings
(1)
o p l e DF. e")
46
age name pcode pe t ( " a g
selec
null
null Alice 94304
30 Brayden 94304
name age
19 Carla 10036
people
Alice null
46 Diana null DF. Brayden 30
select
null tienne 94104 ("name
" ,"age" Carla 19
)
Diana 46
tienne null
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-23
DataFrame
Query
Strings
(2)
Example:
where
peopleDF.
where("age > 21")
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-24
Querying
DataFrames
using
Columns
(1)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-25
Querying
DataFrames
using
Columns
(2)
OR
age
val ageDF = peopleDF.select(peopleDF("age"))
null
30
19
46
null
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-26
Querying
DataFrames
using
Columns
(3)
peopleDF.select(peopleDF.name,peopleDF.age+10)
peopleDF.select(peopleDF("name"),peopleDF("age")+10)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-27
Querying
DataFrames
using
Columns
(4)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-28
SQL
Queries
peopleDF.registerTempTable("people")
sqlCtx.sql("""SELECT * FROM people WHERE name LIKE "A%" """)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-29
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-30
Saving
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-31
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-32
DataFrames
and
RDDs
(1)
peopleRDD = peopleDF.rdd
peopleDF peopleRDD
age name pcode Row[null,Alice,94304]
null Alice 94304 Row[30,Brayden,94304]
30 Brayden 94304 Row[19,Carla,10036]
19 Carla 10036 Row[46,Diana,null]
46 Diana null Row[null,tienne,94104]
null tienne 94104
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-33
DataFrames
and
RDDs
(2)
Row
RDDs
have
all
the
standard
Spark
acIons
and
transformaIons
AcKons
collect,
take,
count,
etc.
TransformaKons
map,
flatMap,
filter,
etc.
Row
RDDs
can
be
transformed
into
PairRDDs
to
use
map-reduce
methods
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-34
Working
with
Row
Objects
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-35
Example:
ExtracKng
Data
from
Rows
Row[null,Alice,94304]
Extract
data
from
Rows
Row[30,Brayden,94304]
Row[19,Carla,10036]
peopleRDD = peopleDF.rdd Row[46,Diana,null]
peopleByPCode = peopleRDD \
Row[null,tienne,94104]
.map(lambda row(row.pcode,row.name)) \
.groupByKey()
(94304,Alice)
(94304,Brayden)
(10036,Carla)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-36
ConverKng
RDDs
to
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-37
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-38
Comparing
Impala
to
Spark
SQL
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-39
Comparing
Spark
SQL
with
Hive
on
Spark
Spark
SQL
Provides
the
DataFrame
API
to
allow
structured
data
processing
in
a
Spark
applica-on
Programmers
can
mix
SQL
with
procedural
processing
Hive-on-Spark
Hive
provides
a
SQL
abstracKon
layer
over
MapReduce
or
Spark
Allows
non-programmers
to
analyze
data
using
familiar
SQL
Hive-on-Spark
replaces
MapReduce
as
the
engine
underlying
Hive
Does
not
aect
the
user
experience
of
Hive
Except
many
Kmes
faster
queries!
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-40
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-41
EssenKal
Points
Spark
SQL
is
a
Spark
API
for
handling
structured
and
semi-structured
data
Entry
point
is
a
SQLContext
DataFrames
are
the
key
unit
of
data
DataFrames
are
based
on
an
underlying
RDD
of
Row
objects
DataFrames
query
methods
return
new
DataFrames;
similar
to
RDD
transformaIons
The
full
Spark
API
can
be
used
with
Spark
SQL
Data
by
accessing
the
underlying
RDD
Spark
SQL
is
not
a
replacement
for
a
database,
or
a
specialized
SQL
engine
like
Impala
Spark
SQL
is
most
useful
for
ETL
or
incorporaKng
structured
data
into
other
applicaKons
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-42
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriHen
consent
from
Cloudera.
17-43
Homework:
Use
Spark
SQL
for
ETL
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriHen consent from Cloudera. 17-44