
STREAMSETS

By: Avleen Kaur

Sensitivity: Internal & Restricted


What is StreamSets?

 StreamSets is an open source, enterprise-grade, continuous big data ingest infrastructure that accelerates time to analysis by bringing unprecedented transparency and processing to data in motion.

 It is a system for creating, executing, and operating continuous dataflows that connect various parts of the data infrastructure.

 It is essentially a “data operations platform” that approaches ETL differently, performing ETL upon ingestion of streaming data rather than as a separate three-step process.



Components of StreamSets

 Data Collector

 Dataflow Performance Manager (DPM)

 Control Hub

 Data Protector

 Data Collector Edge

 Transformer



1) StreamSets Data Collector (SDC)

 An execution engine that streams data in real time

 Like a pipe for a data stream

 Provides the crucial connection between the hops in the stream of data that is moved, collected, and processed on the way to its destination



Pipeline:
 Flow of data from origin to destination
 Three stages of pipeline
1. Origin
2. Processor
3. Destination

ORIGIN → PROCESSOR → DESTINATION

 A single origin stage is used to represent the origin system.
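In code terms, the three stages behave like a small function chain. A minimal, illustrative Python sketch (the record fields and function names are invented for this example; real SDC pipelines are assembled in the web UI, not written as code):

```python
# Illustrative sketch of the three pipeline stages. SDC pipelines are
# built in a web UI; this only models the origin -> processor -> destination flow.

def origin():
    """Origin: produces a batch of records from the source system."""
    return [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

def processor(batch):
    """Processor: transforms records as they pass through."""
    return [{**r, "name": r["name"].upper()} for r in batch]

destination = []  # Destination: where processed records are written

def run_pipeline():
    destination.extend(processor(origin()))

run_pipeline()
```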



Sample pipeline:

 Data passes through the pipeline in batches.


 Streams can be merged and branched within a pipeline.
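Branching and merging can be pictured the same way: one batch is routed down two streams, each processed independently, and the results flow into a single destination. A hedged sketch with invented routing logic:

```python
# Illustrative sketch of branching and merging streams in one pipeline.

def origin():
    return [{"id": i} for i in range(6)]

def run_batch():
    batch = origin()
    # Branch: route records down two streams based on a field value
    evens = [{**r, "branch": "even"} for r in batch if r["id"] % 2 == 0]
    odds = [{**r, "branch": "odd"} for r in batch if r["id"] % 2 == 1]
    # Merge: both streams feed the same destination
    return evens + odds

merged = run_batch()
```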



Origin:
 An origin represents the source of the pipeline. Available origins include:
 JDBC Query Consumer
 Directory
 Hadoop FS
 JDBC Multitable Consumer, etc.



JDBC QUERY CONSUMER
 Reads database data through a JDBC connection, using a user-defined SQL query.

JDBC Property - Description
JDBC Connection String - Connection string used to connect to the database.
Incremental Mode - Determines whether the origin performs incremental queries or full queries. Incremental mode is the default.
SQL Query - SQL query used to read data from the database.
Initial Offset - Offset value to use when the pipeline starts. Required in incremental mode.
Offset Column - Column to use for the offset value.
Query Interval - Amount of time to wait between queries.
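Incremental mode is easiest to see in code: each query resumes from the stored offset, which SDC exposes to the SQL query as the ${OFFSET} placeholder. A rough sketch using Python's sqlite3 as a stand-in for a JDBC connection (table and column names are invented for the example):

```python
import sqlite3

# Sketch of incremental mode: every query resumes from the last stored
# offset, mirroring SDC's ${OFFSET} placeholder in the SQL query.
# sqlite3 stands in for a JDBC connection here.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "pen"), (2, "ink"), (3, "pad")])

offset = 0  # corresponds to the Initial Offset property

def read_batch(last_offset):
    # SDC equivalent: SELECT * FROM orders WHERE id > ${OFFSET} ORDER BY id
    rows = conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id",
        (last_offset,)).fetchall()
    return rows, (rows[-1][0] if rows else last_offset)

batch, offset = read_batch(offset)    # reads ids 1-3; offset becomes 3
conn.execute("INSERT INTO orders VALUES (4, 'clip')")
batch2, offset = read_batch(offset)   # next query reads only id 4
```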



JDBC MULTITABLE CONSUMER
 Reads database data from multiple tables through a JDBC connection.
 Creates multiple threads to enable parallel processing in a multithreaded pipeline.

Tables Property - Description
Schema - Pattern of the schema names included in this table configuration.
Table Name Pattern - Pattern of the table names to read.

JDBC Property - Description
Number of Threads - Number of threads the origin generates and uses for multithreaded processing.
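The idea can be approximated in a few lines: select tables by name pattern, then read them on a fixed-size thread pool. An illustrative sketch (plain dicts stand in for database tables; all names are invented):

```python
import fnmatch
from concurrent.futures import ThreadPoolExecutor

# Sketch of the multitable idea: match tables by a name pattern, then
# read the matches in parallel on a fixed number of threads.

tables = {
    "sales_2023": [{"id": 1}],
    "sales_2024": [{"id": 2}, {"id": 3}],
    "audit_log": [{"id": 9}],
}

table_name_pattern = "sales_%"  # SQL LIKE-style pattern from the stage config
# Translate the SQL '%' wildcard to the shell-style '*' that fnmatch expects
matching = [t for t in tables
            if fnmatch.fnmatch(t, table_name_pattern.replace("%", "*"))]

def read_table(name):
    return name, len(tables[name])  # pretend to read every row of the table

with ThreadPoolExecutor(max_workers=2) as pool:  # "Number of Threads"
    counts = dict(pool.map(read_table, sorted(matching)))
```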



Directory
 Reads fully written files from a directory.

Property - Description
Name - Stage name.
Description - Optional description.
Produce Events - Generates event records when events occur. Use for event handling.
On Record Error - Error record handling for the stage:
• Discard - Discards the record.
• Send to Error - Sends the record to the pipeline for error handling.
• Stop Pipeline - Stops the pipeline.
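The three On Record Error modes amount to a per-record policy choice. A hedged sketch of a Directory-style origin applying them (the validity check and all names are invented for the example):

```python
import os
import tempfile

# Sketch of a Directory-style origin with the three On Record Error modes.

DISCARD, SEND_TO_ERROR, STOP_PIPELINE = "discard", "send_to_error", "stop"

def read_directory(path, on_record_error=SEND_TO_ERROR):
    good, errors = [], []
    for fname in sorted(os.listdir(path)):
        with open(os.path.join(path, fname)) as f:
            for line in f:
                line = line.strip()
                if line.isdigit():               # "valid" record for this demo
                    good.append(int(line))
                elif on_record_error == DISCARD:
                    continue                     # silently drop the record
                elif on_record_error == SEND_TO_ERROR:
                    errors.append(line)          # route to error handling
                else:                            # STOP_PIPELINE
                    raise RuntimeError(f"bad record in {fname}: {line!r}")
    return good, errors

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("1\noops\n2\n")
    good, errors = read_directory(d)
```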



Features of SDC
 Build adaptable pipelines with minimal coding and maximum flexibility: easy-to-use GUI, built-in transformations, rapid troubleshooting

 Operate continuously in the face of constant change: zero downtime for upgrades, direct integration with big data governance tools

 Execute pipelines wherever you need to: flexible deployment, 100% in-memory operation for high throughput and low latency

 Monitor pipeline performance and data quality: customizable runtime metrics, real-time early warning of anomalies and outliers



2) StreamSets Dataflow Performance Manager (DPM)

 Management console for data in motion

 Multiple dataflows can be mapped in a single visual topology, and changes to the dataflows over time can be tracked

 Provides real-time statistics to measure dataflow performance across each topology, end-to-end or point-to-point



Features of DPM

 Live metrics for dataflow topologies

 Point-in-flow Key Performance Indicators (KPIs) for data availability and accuracy

 Detect and remediate violations to support Data SLAs

 Simplified problem diagnosis as historical metrics are used for comparing dataflow
performance over time



3) StreamSets Control Hub

 Central point of control for all dataflow pipelines

 Allows users to build and execute large numbers of complex dataflows at scale

 A shared repository allows teams to publish, subscribe, and collaborate on pipeline development

 Offers full automation and provisioning capabilities regardless of system location



Features of Control Hub

 Cloud based design tool & shared pipeline repository

 Architecture-wide visibility and control

 End-to-end topology view

 Integrates with StreamSets Dataflow Performance Manager for live dataflow metrics and SLA
enforcement

 Automated deployment and provisioning

 Data governance support



4) StreamSets Data Protector

 Provides software as a service to discover, secure, and govern the movement of sensitive data as it arrives from a source or moves between compute platforms

 Enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies



Features of Data Protector

 Discover: In-stream detection of sensitive data

 Secure: Rules-based data protection

 Govern: Data protection policy management



5) StreamSets Data Collector Edge

 An ultralight yet powerful data ingestion solution for constrained systems

 Used to read data from an edge device or to receive data from another pipeline and then act
on that data to control an edge device

 Written in Go, SDC Edge provides a single solution across a broad range of edge
hardware platforms

 Can act as a simple data forwarder or can be configured to perform transformations and
analytics on edge



Features of SDC Edge
 Lightweight agent runs anywhere

 Less than 5 MB installation footprint, low memory and CPU (1-2%) utilization

 On-edge transformations and bi-directional dataflows

 Supports structured and semi-structured data

 Complete operational control

 Automate deployment and maintenance of pipelines at scale

 Manage real-time pipeline and dataflow topology performance with StreamSets DPM

 Platform and protocol agnostic

 Support for a broad range of communications protocols



6) StreamSets Transformer
 An execution engine within the StreamSets DataOps platform that allows any developer to
create data processing pipelines that execute on Spark

 Enables users to solve their core business problems by abstracting away the complexity of
operating the Spark cluster

 Can execute both batch and streaming operations, mixing and matching as required



Features of StreamSets Transformer

 Perform next-generation ETL and machine learning with no hand coding

 Easy-to-use interface and rich tools democratize the process of data transformation

 Achieve continuous data and continuous monitoring

 Extend Spark capabilities to the entire data team

 Take advantage of rich data processing capabilities



Conclusion
 The StreamSets Data Operations Platform is designed to simplify the entire dataflow lifecycle, including how to build, execute, and operate enterprise dataflows at scale.

 Developers can design batch and streaming pipelines with a minimum of code, while operators can aggregate dataflows into topologies for centralized provisioning and performance management.



THANK YOU!

