
STREAMSETS

By: Avleen Kaur

Sensitivity: Internal & Restricted


What is StreamSets?

 StreamSets is an open source, enterprise-grade, continuous big data ingest infrastructure that accelerates time to analysis by bringing unprecedented transparency and processing to data in motion.

 It is a system for creating, executing, and operating continuous dataflows that connect various parts of the data infrastructure.

 It is essentially a “data operations platform” that approaches ETL differently, performing ETL upon ingestion of streaming data rather than as a separate three-step process.



Components of StreamSets

 Data Collector

 Dataflow Performance Manager (DPM)

 Control Hub

 Data Protector

 Data Collector Edge

 Transformer



1) StreamSets Data Collector (SDC)

 An execution engine that streams data in real time

 Like a pipe for a data stream

 Provides the crucial connection between the hops in the stream of data that is moved, collected, and processed on the way to its destination



Pipeline:
 Flow of data from origin to destination
 Three stages of pipeline
1. Origin
2. Processor
3. Destination

ORIGIN → PROCESSOR → DESTINATION

 A single origin stage is used to represent the origin system.
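In code terms, the three stages behave like a small function chain. A minimal, illustrative Python sketch (the record fields and function names are invented for this example; real SDC pipelines are assembled in the web UI, not written as code):

```python
# Illustrative sketch of the three pipeline stages. SDC pipelines are
# built in a web UI; this only models the origin -> processor -> destination flow.

def origin():
    """Origin: produces a batch of records from the source system."""
    return [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

def processor(batch):
    """Processor: transforms records as they pass through."""
    return [{**r, "name": r["name"].upper()} for r in batch]

destination = []  # Destination: where processed records are written

def run_pipeline():
    destination.extend(processor(origin()))

run_pipeline()
```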



Sample pipeline:

 Data passes through the pipeline in batches.


 Streams can be merged and branched within a pipeline.
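Branching and merging can be pictured the same way: one batch is routed down two streams, each processed independently, and the results flow into a single destination. A hedged sketch with invented routing logic:

```python
# Illustrative sketch of branching and merging streams in one pipeline.

def origin():
    return [{"id": i} for i in range(6)]

def run_batch():
    batch = origin()
    # Branch: route records down two streams based on a field value
    evens = [{**r, "branch": "even"} for r in batch if r["id"] % 2 == 0]
    odds = [{**r, "branch": "odd"} for r in batch if r["id"] % 2 == 1]
    # Merge: both streams feed the same destination
    return evens + odds

merged = run_batch()
```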



Origin:
 An origin represents the source of the pipeline. Available origins include:
 JDBC Query Consumer
 Directory
 Hadoop FS
 JDBC Multitable Consumer, etc.



JDBC QUERY CONSUMER
 Reads database data through a JDBC connection, using a user-defined SQL query.

JDBC Property - Description
JDBC Connection String - Connection string used to connect to the database.
Incremental Mode - Determines whether the origin performs incremental queries or full queries. Incremental mode is the default.
SQL Query - SQL query used to read data from the database.
Initial Offset - Offset value to use when the pipeline starts. Required in incremental mode.
Offset Column - Column to use for the offset value.
Query Interval - Amount of time to wait between queries.
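Incremental mode is easiest to see in code: each query resumes from the stored offset, which SDC exposes to the SQL query as the ${OFFSET} placeholder. A rough sketch using Python's sqlite3 as a stand-in for a JDBC connection (table and column names are invented for the example):

```python
import sqlite3

# Sketch of incremental mode: every query resumes from the last stored
# offset, mirroring SDC's ${OFFSET} placeholder in the SQL query.
# sqlite3 stands in for a JDBC connection here.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "pen"), (2, "ink"), (3, "pad")])

offset = 0  # corresponds to the Initial Offset property

def read_batch(last_offset):
    # SDC equivalent: SELECT * FROM orders WHERE id > ${OFFSET} ORDER BY id
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id",
        (last_offset,)).fetchall()
    return rows, (rows[-1][0] if rows else last_offset)

batch, offset = read_batch(offset)    # reads ids 1-3; offset becomes 3
conn.execute("INSERT INTO orders VALUES (4, 'clip')")
batch2, offset = read_batch(offset)   # next query reads only id 4
```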



JDBC MULTITABLE CONSUMER
 Reads database data from multiple tables through a JDBC connection.
 Creates multiple threads to enable parallel processing in a multithreaded pipeline.

Tables Property - Description
Schema - Pattern of the schema names included in this table configuration.
Table Name Pattern - Pattern of the table names to read.

JDBC Property - Description
Number of Threads - Number of threads the origin generates and uses for multithreaded processing.
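The idea can be approximated in a few lines: select tables by name pattern, then read them on a fixed-size thread pool. An illustrative sketch (plain dicts stand in for database tables; all names are invented):

```python
import fnmatch
from concurrent.futures import ThreadPoolExecutor

# Sketch of the multitable idea: match tables by a name pattern, then
# read the matches in parallel on a fixed number of threads.

tables = {
    "sales_2023": [{"id": 1}],
    "sales_2024": [{"id": 2}, {"id": 3}],
    "audit_log": [{"id": 9}],
}

table_name_pattern = "sales_%"  # SQL LIKE-style pattern from the stage config
# Translate the SQL '%' wildcard to the shell-style '*' that fnmatch expects
matching = [t for t in tables
            if fnmatch.fnmatch(t, table_name_pattern.replace("%", "*"))]

def read_table(name):
    return name, len(tables[name])  # pretend to read every row of the table

with ThreadPoolExecutor(max_workers=2) as pool:  # "Number of Threads"
    counts = dict(pool.map(read_table, sorted(matching)))
```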



Directory
 Reads fully written files from a directory.

Property - Description
Name - Stage name.
Description - Optional description.
Produce Events - Generates event records when events occur. Use for event handling.
On Record Error - Error record handling for the stage:
• Discard - Discards the record.
• Send to Error - Sends the record to the pipeline for error handling.
• Stop Pipeline - Stops the pipeline.
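The three On Record Error modes amount to a per-record policy choice. A hedged sketch of a Directory-style origin applying them (the validity check and all names are invented for the example):

```python
import os
import tempfile

# Sketch of a Directory-style origin with the three On Record Error modes.

DISCARD, SEND_TO_ERROR, STOP_PIPELINE = "discard", "send_to_error", "stop"

def read_directory(path, on_record_error=SEND_TO_ERROR):
    good, errors = [], []
    for fname in sorted(os.listdir(path)):
        with open(os.path.join(path, fname)) as f:
            for line in f:
                line = line.strip()
                if line.isdigit():               # "valid" record for this demo
                    good.append(int(line))
                elif on_record_error == DISCARD:
                    continue                     # silently drop the record
                elif on_record_error == SEND_TO_ERROR:
                    errors.append(line)          # route to error handling
                else:                            # STOP_PIPELINE
                    raise RuntimeError(f"bad record in {fname}: {line!r}")
    return good, errors

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("1\noops\n2\n")
    good, errors = read_directory(d)
```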



Features of SDC
 Build adaptable pipelines with minimal coding and maximum flexibility: easy-to-use GUI, built-in transformations, rapid troubleshooting

 Operate continuously in the face of constant change: zero downtime for upgrades, direct integration with big data governance tools

 Execute pipelines wherever you need to: flexible deployment, 100% in-memory operation for high throughput and low latency

 Monitor pipeline performance and data quality: customizable runtime metrics, real-time early warning of anomalies and outliers



2) StreamSets Dataflow Performance Manager (DPM)

 Management console for data in motion

 Multiple dataflows can be mapped in a single visual topology, and changes to the dataflows over time can be tracked

 Provides real-time statistics to measure dataflow performance across each topology, end-to-end or point-to-point



Features of DPM

 Live metrics for dataflow topologies

 Point-in-flow Key Performance Indicators (KPIs) for data availability and accuracy

 Detect and remediate violations to support Data SLAs

 Simplified problem diagnosis as historical metrics are used for comparing dataflow
performance over time



3) StreamSets Control Hub

 Central point of control for all dataflow pipelines

 Allows users to build and execute large numbers of complex dataflows at scale

 A shared repository allows teams to publish, subscribe, and collaborate on pipeline development

 Offers full automation and provisioning capabilities regardless of system location



Features of Control Hub

 Cloud based design tool & shared pipeline repository

 Architecture-wide visibility and control

 End-to-end topology view

 Integrates with StreamSets Dataflow Performance Manager for live dataflow metrics and SLA
enforcement

 Automated deployment and provisioning

 Data governance support



4) StreamSets Data Protector

 Provides software as a service to discover, secure, and govern the movement of sensitive data as it arrives from a source or moves between compute platforms

 Enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies



Features of Data Protector

 Discover: In-stream detection of sensitive data

 Secure: Rules-based data protection

 Govern: Data protection policy management



5) StreamSets Data Collector Edge

 An ultralight yet powerful data ingestion solution for constrained systems

 Used to read data from an edge device or to receive data from another pipeline and then act
on that data to control an edge device

 Written in Go, SDC Edge provides a single solution across a broad range of edge
hardware platforms

 Can act as a simple data forwarder or can be configured to perform transformations and
analytics on edge



Features of SDC Edge
 Lightweight agent runs anywhere

 Less than 5 MB installation footprint, low memory and CPU (1-2%) utilization

 On-edge transformations and bi-directional dataflows

 Supports structured and semi-structured data

 Complete operational control

 Automate deployment and maintenance of pipelines at scale

 Manage real-time pipeline and dataflow topology performance with StreamSets DPM

 Platform and protocol agnostic

 Support for a broad range of communications protocols



6) StreamSets Transformer
 An execution engine within the StreamSets DataOps platform that allows any developer to
create data processing pipelines that execute on Spark

 Enables users to solve their core business problems by abstracting away the complexity of
operating the Spark cluster

 Can execute both batch and streaming operations, mixing and matching as required



Features of StreamSets Transformer

 Perform next-generation ETL and machine learning with no hand coding

 Easy-to-use interface and rich tools democratize the process of data transformation

 Achieve continuous data and continuous monitoring

 Extend Spark capabilities to the entire data team

 Take advantage of rich data processing capabilities



Conclusion
 The StreamSets Data Operations Platform is designed to simplify the entire dataflow lifecycle, including how to build, execute, and operate enterprise dataflows at scale.

 Developers can design batch and streaming pipelines with a minimum of code, while operators can aggregate dataflows into topologies for centralized provisioning and performance management.



THANK YOU!

