Kettle ETL Tool
Kettle ETL Tool
Kettle ETL Tool
Sreenivas K
04/17/14
Introduction
ETL Process
Pentahos Kettle
Transformations
Jobs
04/17/14
4 major components:
Extracting
Gathering raw data from source systems and storing it in ETL
staging environment
Data profiling
Identifying data that changed since last load
Transforming- Cleaning and Conforming
Processing data to improve its quality, format it, merge from
multiple sources, enforce conformed dimensions
Data cleansing
Recording error events
Audit dimensions
Creating and maintaining conformed dimensions and facts
04/17/14
Data filtering
Is not null, greater than, less than, includes
Field manipulation
Trimming, padding, upper and lowercase conversion
Data calculations
+ - X / , average, absolute value, arctangent, natural logarithm
Date manipulation
First day of month, Last day of month, add months, week of year,
day of year
Looking up date
Look up in a database, in a text file, an excel sheet,
04/17/14
Loading
Loading data into data warehouse tables
04/17/14
04/17/14
04/17/14
Custom Code
Low upfront cost
Support grows as business requirements changes
04/17/14
Tool
Vendor
Oracle
Business Objects
IBM
SAS Institute
PowerCenter
Informatica
Oracle
Data Migrator
Information Builders
Integration Services
Microsoft
Talend
DataFlow
Data Integrator
Pervasive
Transformation Server
DataMirror
Transformation Manager
Data Manager
Cognos
DT/Studio
Embarcadero Technologies
ETL4ALL
IKAN
IBM
Jitterbit
Jitterbit
Pentaho
MaxQDPro: Kettle- ETL Tool
04/17/14
04/17/14
10
Data is everywhere
Data is inconsistent
Performance issues
Running queries to summarize data for stipulated
long period takes operating system for task
Brings the OS on max load
04/17/14
11
04/17/14
12
Prerequisites
Java Runtime
Environment 1.5 and
above
Compatible with
almost any platform
Compatible with wide
range of Databases
technologies.
Recent Releases
04/17/14
13
Pan
A program to execute transformations designed by Spoon
in XML or database repository.
Transformations are scheduled in batch mode to be run
automatically at regular intervals
Carte
Simple web server to execute transformations and jobs
remotely.
Accept an XML (small servlet) that contains
transformation to execute and the execution
configuration.
Allows to remotely monitor, start and stop the
transformations and jobs
Server running in Carte is a Slave Server
MaxQDPro: Kettle- ETL Tool
04/17/14
14
Spoon
GUI that allows you to design transformations and
jobs that can be run with the Kettle tools Pan and
Kitchen
Transformations and Jobs can describe themselves
using an XML file or can be put in a Kettle database
repository.
Spoon is available as executable script and batch file
to make use of tool in heterogeneous environment.
Latest version of Spoon is 3.2 beta version.
Kitchen
Execute jobs designed by Spoon in XML or database
repository
MaxQDPro: Kettle- ETL Tool
04/17/14
15
Installing
Ensure JRE 1.5 is
installed.
Unzip the binary
distribution in any folder
Launching
spoon.bat in windows
platform
spoon.sh in Unix like
platform
Supported platform
Microsoft Windows
including Vista
Linux GTK: on i386 and
x86_64 processors
Apple's OSX: works both
on PowerPC and Intel
machines
Solaris: using a Motif
interface
AIX, HP-UX, FreeBSD
Works on most of OS
MaxQDPro: Kettle- ETL Tool
04/17/14
16
JDBC -Database
connectivity Java tool.
Comes in four different
types
Type1: JDBC-ODBC Bridge
Type 2 : Native API partial
Java driver
Type 3 : Middleware Java
Drivers
Type 4: Direct to DB Java
Drivers
Microsoft Based DB
like MS Access rely on
Type 1drivers
Oracle, Mysql can be
connected with other
types. But traditionally
used is the Type 4
driver.
JDBC can also operate
in Distributed
environment.
04/17/14
17
04/17/14
18
04/17/14
19
Key Improvement
Execution Results Pane for logs, metrics and
performance graph
Improved Database Connection dialog
Snap to grid (graphical workspace)
Zoom (Graphical Workspace)
Easier to use left panel for the objects palette
Over 30 new or improved Transformation Steps
13 new or improved Job Entries
Support for four new database types - MonetDB,
KingbaseES, Vertica, and HP NeoView
Improved translations
MaxQDPro: Kettle- ETL Tool
04/17/14
20
Login
By default PDI provides login username and
password ad admin.
It strictly advised to change default password to
avoid any security vulnerablity.
04/17/14
21
04/17/14
22
04/17/14
23
04/17/14
24
Transformation
Value: Values are part of a row
and can contain any type of
data
Row: a row exists of 0 or more
values
Output stream: an output
stream is a stack of rows that
leaves a step.
Input stream: an input
stream is a stack of rows that
enters a step.
Hop: A hop is a graphical
representation of one or more
data streams between 2 steps.
Note: A note is a piece of
information that can be added
to a transformation
04/17/14
25
Jobs
Job Entry: A job entry is
one part of a job and
performs a certain
Hop: A hop is a
graphical representation
of one or more data
streams between 2 steps
04/17/14
26
Input Steps
Output Steps
Lookup Steps
Transformation
Steps
Join Steps
Job Steps
DW Steps
Mapping Steps
04/17/14
27
04/17/14
28
04/17/14
29
04/17/14
30
04/17/14
31
04/17/14
32
04/17/14
33
04/17/14
34
04/17/14
35
04/17/14
36
Besides the execution order, it specifies the condition for next job entry
Unconditional - next job entry will be executed regardless of the result
of the originating job entry.
Follow when result is true - next job entry will only be executed when
the result of the originating job entry is true,
Follow when result is false - next job entry will only be executed when
the result of the originating job entry was false
04/17/14
37
04/17/14
38
04/17/14
39
04/17/14
40
Components
Pan
Carte
Kitchen
Spoon
04/17/14
41
kettle.pentaho.org
Kettle project homepage
kettle.javaforge.com
Kettle community website: forum, source, documentation, tech tips,
samples,
www.pentaho.org/download/
All Pentaho modules, pre-configured with sample data
Developer forums, documentation
Ventana Research Open Source BI Survey
www.mysql.com
White paper http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.htm
l
Kettle Webinar http://www.mysql.com/news-and-events/on-demand-webinars/penta
ho-2006-09-19.php
Roland Bouman blog on Pentaho Data Integration and MySQL
http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-k
ettle-turns.html
MaxQDPro: Kettle- ETL Tool
04/17/14
42