Andrew Psaltis - Spark Streaming
Andrew Psaltis
About Me
Recently started working at Ensighten
on the Agile Marketing Platform
Prior 4.5 years worked at Webtrends on
Streaming and Real-time Visitor
Analytics, where I first fell in love
with Spark
[Architecture diagram: Browser → Collection Tier → Message Queueing Tier (Kafka) → Analysis Tier (Spark Streaming) → In-Memory Data Store → Data Access Tier (Kafka → WebSocket Server) → Browser; a Log Replayer feeds the Collection Tier]
Apache Kafka
Overview
An Apache project initially developed at LinkedIn
Distributed publish-subscribe messaging system
Specifically designed for real-time activity streams
Does not follow the JMS standard nor use JMS APIs
Key Features
Persistent messaging
High throughput, low overhead
Uses ZooKeeper for forming a cluster of nodes
Supports both queue and topic semantics
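The queue-versus-topic distinction can be sketched with a toy in-memory broker (plain Java, no Kafka client; `ToyBroker` and its methods are illustrative names, not Kafka API — in real Kafka both behaviors fall out of consumer groups):

```java
import java.util.*;

// Toy broker: "topic" semantics deliver every message to every subscriber,
// "queue" semantics deliver each message to exactly one consumer (round-robin).
class ToyBroker {
    private final List<List<String>> subscribers = new ArrayList<>(); // topic mode
    private final List<List<String>> consumers = new ArrayList<>();   // queue mode
    private int next = 0; // round-robin cursor for queue delivery

    List<String> subscribe() {                 // topic semantics: fan-out
        List<String> inbox = new ArrayList<>();
        subscribers.add(inbox);
        return inbox;
    }

    List<String> joinQueue() {                 // queue semantics: shared work
        List<String> inbox = new ArrayList<>();
        consumers.add(inbox);
        return inbox;
    }

    void publish(String msg) {
        for (List<String> inbox : subscribers) inbox.add(msg);   // all subscribers see it
        if (!consumers.isEmpty()) {                              // only one consumer gets it
            consumers.get(next % consumers.size()).add(msg);
            next++;
        }
    }
}
```

With two subscribers and two queue consumers, publishing two messages gives each subscriber both messages but each consumer only one.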
Programming Model
A Discretized Stream or DStream is a series of
RDDs representing a stream of data
API very similar to RDDs
Input - DStreams can be created
either from live streaming data
or by transforming other DStreams
Operations
Transformations
Output Operations
Operations - Transformations
Allows you to build new streams from
existing streams
RDD-like operations
map, flatMap, filter, countByValue, reduce,
groupByKey, reduceByKey, sortByKey, join
etc.
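Since each DStream batch is just an RDD of records, the per-batch effect of a transformation can be sketched with plain Java collections (illustrative only, not the Spark API; the URL data is made up):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of what filter and countByValue do to one DStream batch.
class BatchOps {
    // filter: keep only records matching a predicate
    static List<String> filterHtml(List<String> urls) {
        return urls.stream()
                .filter(u -> u.endsWith(".html"))
                .collect(Collectors.toList());
    }

    // countByValue: count occurrences of each distinct record in the batch
    static Map<String, Long> countByValue(List<String> urls) {
        return urls.stream()
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
    }
}
```

In Spark Streaming the same operations apply batch-by-batch across the whole stream, producing a new DStream.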
[Example pipeline: Browser (msnbc clickstream) → Kafka → Spark Streaming / Spark → Kafka → WebSocket Server → Browser; a Log Replayer replays the msnbc clickstream into Kafka, and processed results flow back out]
Clickstream Examples
PageViews per batch
PageViews by Url over time
Top N PageViews over time
Keeping a current session up to date
Joining the current session with historical data
[Diagram: a DStream is a sequence of RDDs, one per batch interval — a "messages" DStream with batches at t, t+1, t+2 is produced by createStream; applying map to each batch yields an "events" DStream; each batch is stored in memory as an RDD (immutable, distributed)]
[Diagram: applying countByValue per batch turns the "events" DStream into a "pageCounts" DStream]
[Diagram: a windowed operation slides over the DStream of data with a given window length, advancing by the sliding interval]
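A windowed count boils down to: at every sliding interval, take the union of the last window-length-worth of batches and count over that union. A plain-Java sketch (illustrative; in Spark Streaming this is what operations like `countByValueAndWindow` do for you):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of a sliding-window count over a stream of batches.
// windowBatches = window length / batch interval.
class WindowedCount {
    // Count page views over the window ending at batch index `end`.
    static Map<String, Long> countOverWindow(List<List<String>> batches,
                                             int end, int windowBatches) {
        int start = Math.max(0, end - windowBatches + 1);
        return batches.subList(start, end + 1).stream()
                .flatMap(List::stream)                                   // union the batches
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
    }
}
```

As the window slides one batch forward, old batches drop out of the count and new ones enter.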
Function2<List<PageView>, Optional<Session>, Optional<Session>> updateFunction =
    new Function2<List<PageView>, Optional<Session>, Optional<Session>>() {
        @Override public Optional<Session> call(List<PageView> values,
                                                Optional<Session> state) {
            Session updatedSession = ... // update the session
            return Optional.of(updatedSession);
        }
    };
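This function is handed to `updateStateByKey`, which calls it once per key per batch: `values` holds the key's new page views from the current batch and `state` the session carried over from earlier batches. The state-threading can be simulated in plain Java (illustrative stand-ins for the deck's `PageView`/`Session` types; `null` plays the role of an absent `Optional`):

```java
import java.util.*;

// Minimal stand-in for the deck's Session type: just a page-view counter.
class Session {
    int pageViews;
    Session(int p) { pageViews = p; }
}

class SessionState {
    // Simulates one updateStateByKey round for a single key:
    // fold this batch's new page views into the previous state (null = no prior session).
    static Session update(List<String> newPageViews, Session prev) {
        int count = (prev == null ? 0 : prev.pageViews) + newPageViews.size();
        return new Session(count);
    }
}
```

Calling it batch after batch accumulates the session, which is exactly what keeps "the current session up to date" across batches.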
[Architecture recap: Browser (msnbc) → Kafka → Spark Streaming / Spark → Kafka → WebSocket Server → Browser, with the Log Replayer producing the clickstream and processed results flowing back]
Example - foreachRDD
sortedCounts.foreachRDD(
    new Function<JavaPairRDD<Long, String>, Void>() {
        public Void call(JavaPairRDD<Long, String> rdd) {
            Map<String, Long> top10List = new HashMap<>();
            for (Tuple2<Long, String> t : rdd.take(10)) {
                top10List.put(t._2(), t._1());
            }
            kafkaProducer.sendTopN(top10List);
            return null;
        }
    }
);
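The core of that example — turning a descending-sorted stream of (count, url) pairs into a top-10 map — can be checked in plain Java (with `Map.Entry` standing in for Spark's `Tuple2`; the Kafka producer and `sendTopN` are the deck's own code and not reproduced here):

```java
import java.util.*;

class TopN {
    // Take the first n (count, url) pairs, like rdd.take(10) on a sorted
    // JavaPairRDD, and build a url -> count map, mirroring
    // top10List.put(t._2(), t._1()) in the slide.
    static Map<String, Long> topN(List<Map.Entry<Long, String>> sortedDesc, int n) {
        Map<String, Long> out = new HashMap<>();
        for (Map.Entry<Long, String> t : sortedDesc.subList(0, Math.min(n, sortedDesc.size()))) {
            out.put(t.getValue(), t.getKey());
        }
        return out;
    }
}
```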
WebSockets
Provides a standard way to get data out
When the client connects
Read from Kafka and start streaming
Summary
Spark Streaming works well for
ClickStream Analytics
But
Still no good out-of-the-box output
operations for a stream
Multi-tenancy needs to be thought through
How do you stop a job?