Pdi Lab Guide

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

Lab Guide for Pentaho Data Integration (V-4.0.

1)
Popularly known as Kettle

PDI Lab Guide

HOW TO INSTALL AND OPEN THE PENTAHO DATA INTEGRATION IDE: ................................. 2
HOW TO CREATE A NEW REPOSITORY:.......................................................................................... 2
HOW TO CONNECT TO A REPOSITORY: .......................................................................................... 8
CREATION OF AN ODBC CONNECTION: ....................................................................................... 9
CREATING THE FIRST TRANSFORMATION: ................................................................................. 10
USING THE SELECT VALUES TRANSFORMATION: .. ........................................................ 29
CREATING A JOB:.............................................................................................................................. 42
QUESTIONS: ....................................................................................................................................... 43

How to Install and Open the Pentaho Data Integration IDE


Step 1: First install Java Runtime Environment 1.4 or a higher version in your system.
Step 2: Go to the website http://www.pentaho.com and download Pentaho Data Integration 4.0.1.
Step 3: Unzip the downloaded PDI zip file, open the data integration folder and double click on spoon.bat
file.

How to Create a New Repository


Concept of Repository: The Kettle repository is a workspace that the data integrator works on. This
workspace is a physical region of the hard-drive that is designated exclusively for Kettle. In the repository,
all information about transformations, jobs, schedules, etc. is stored. The repository concept promotes reusability, which in turn saves time and effort. We can create a new repository in two ways:
1. Kettle database repository: This repository uses a central relational database to store the ETL
(Extraction Transformation Load) metadata.
2. Kettle file repository: This is a repository stored in a file in a certain folder.

Creating a repository through Kettle database repository


Step 1: In the Repository Connection dialog box click on the
button. The Select the repository type
dialog box will appear.
Step 2: Select Kettle database repository and click on OK.
Step 3: Now in the Repository information dialog box click on the New button.

Step 4: In the Database Connection dialog box give a name in the Connection Name textbox, select MS
SQL Server in the Connection Type list, and select Native (JDBC) in the Access list. In the
Settings level, provide Host Name (Example: localhost/machine name/servername), Database
Name, Instance Name (optional), User Name and Password.
Step 5: Now click on the Test button to see the connection status.

Step 6: In the Repository information dialog box, select the database connection from the drop down
box, provide ID (to be displayed in spoon title bar) and Name (the actual name of the repository)
and click on the Create and Upgrade button.
Step 7: In the OK dialog box, in response to the question, Are you sure you want to create the
repository on the specified database connection?, if you click on the No button, everything will
be created in the database and if you click on the Yes button, it will ask for a dry run.
Step 8: In the Dry run dialog box click on Yes, if you want to see the SQL statements as well as make
some modification. Else click on the No button.

Creating a repository using the Kettle file repository


Step 1: In the Repository Connection dialog box click on the [
] button. The Select the repository type
dialog box will appear.
Step 2: Select Kettle file repository and click on OK.
Step 3: Now in the File repository settings dialog box click on the Browse button, select a folder, fill the ID
and Name, and click on the OK button.

Click on the + button to create a new repository.

Select the second option (kettle file repository) and click ok.

How to Connect to a Repository


Step 1: Normally the Repository Connection dialog box automatically opens with the opening of the Spoon
window. If the Repository Connection dialog box does not open with the Spoon window or suppose
you want to open it manually or you want to change your Repository, then:
In the Spoon window from menu bar, click on Tools and select the Repository option and click on
Connect to open the Repository Connection dialog box.
OR
8

Press CTRL+R from your keyboard to open the Repository Connection dialog box.
Step 2: To connect with your desired repository, select the corresponding repository name from the
Repository Connection dialog box and click on OK.
Note: In case of a database repository connection, you have to give User Name and Password (both
are by default admin) and in case of a file repository connection, User Name and Password are not
required.

Creation of an ODBC Connection


Step 1: Click on Start
Control Panel
Administrative Tools
Data Sources (ODBC). Then in the
ODBC Data Source Administrator dialog box select the User DSN tab.
Step 2: Click on the Add button, select SQL Server Native Client, and click on Finish. Then in the Create
a New Data Source to SQL Server dialog box provide a name for the data source, select server (.
for local host), click on the Next button twice, select the check box and select the Northwind
database. Then click on the Next button followed by the Finish button.

Creating the First Transformation


Learning Objective: To create a simple PDI transformation for transferring data from a database table to a
flat file.
Source: The Products table from the Northwind database (MS SQL SERVER 2008).
Destination: The Products flat file.
Step 1: First connect to a repository (either to the database repository or file repository).
Step 2: From the menu bar click on File
New
Transformation.

10

Step 3: In the Design mode click on the input folder, drag the Table input item and drop it on the center
pane/Transformation page.
Step 4: On the Transformation page, double click on the Table input to open its properties/configuration
dialog box.
Step 5: In the name text box rename Table input to PRODUCTS_TABLE and click on the New button for
establishing the connection with the MS SQL server.
Step 6: Then in the Database Connection dialog box, select the General tab from the left pane, type a name
in the Connection Name text box, select MS SQL Server (Native) from the list in the Connection
Type list box, select ODBC from the Access list box, type the DSN name in the ODBC DSN
Source Name, provide the User Name and Password, click on the Test button to check the status
and click on the OK button twice to return to the Database Connection pane. (You can also
establish database connection through Native (JDBC).)

Step 7: Now click on the Get SQL select statement button, select the Products table from the database
explorer, click on OK to close the database explorer, click on Yes from question dialog box to
11

include the field-names in SQL and click on OK to close the properties dialog box of the Table
input.

12

Click on the Preview button to see the contents of the table.

13

Step 8: Click on the Output folder in left pane, drag the Text file output item and drop it on the
transformation pane.
Step 9: For transferring data from source to destination, Hop is used.
To create a hop:
a)

Click on the PRODUCTS_TABLE and press the <SHIFT> key and draw a line to the Text
file output.
OR

b)

Place your mouse pointer on the PRODUCTS_TABLE until the hover menu appears. Then
drag the hop output connector to Text file output.

OR
c)

Place your mouse pointer on PRODUCTS_TABLE, press the middle button of the mouse and
drag the hop pointer and release on Text file output.

Step 10: Double click on Text file output. The following window will appear:

14

Step 11: In the Text file output properties window, type PRODUCTS_TEXTFILE in the Step name text
box, and click on the Browse button to set the path of the text file.
Step 12: Click on the Content tab (fields under the Content tab allow you to define how your data is
formatted).

Step 13: Click on the Fields tab to get fields and set their length, type, format, etc., and finally click on OK.

15

Step 14: Save the transformation by clicking File


Save from the menu bar. The Transformation
Properties window will appear.
Step 15: Type a name in the Transformation name text box, click on
(folder icon) to select a repository
folder (you can also create a new directory by right clicking on the root directory) where you will
save your transformation.Then click on OK twice.

16

Step 16: After saving, execute by pressing F9 or click on Action


a transformation window will appear. Click on the Launch button.
Click on this button for execution of your transformation

17

Run from menu bar. Now the Execute

This panel shows the details of your transformation.

Transformation has finished.

Step 17: Open the text file and you will observe the following output.
18

Learning Objective: To create a simple PDI transformation for transferring data from a database table to
an Excel file.
Source: The Products table from the Northwind database (MS SQL SERVER 2008)
Destination: The Products Excel file.
Step 1: Follow steps 16 from the previous transformation for the PRODUCTS_TABLE source.
Step 2: After executing steps 16, click on the output folder in the left pane, drag the Excel output item and
drop it on the transformation pane.
Step 3: Now drag the hop from PRODUCTS_TABLE to the Excel output and double click on Excel output.
Step 4: In the Excel output configuration window, type PRODUCTS_EXCEL in the step name text field,
then click on the browse button to set the path of the Excel file.

19

You can also give a name to the sheet in the Excel file (by default, it is sheet1).

Step 5: Now click on the Fields tab to get fields and set their type, format, etc., and click on OK.

20

Step 6: Save your transformation and go for execution.


Step 7: Open the Excel file and you will observe the following output:

Learning Objective: To create a simple PDI transformation for transferring data from an Excel file to a
database table.
Source: The PRODUCTS_EXCEL Excel file.
21

Destination: The TBL_PRODUCTS table in the Northwind database (MS SQL SERVER 2008).
Step 1: From the menu bar click on File
New
Transformation. In the Design tab click on the input
folder, drag the Excel input item and drop it in the transformation window.
Step 2: Double click on Excel input to open the Excel input configuration window.
Step 3: In the Excel configuration window, type PRODUCTS_EXCELFILE in the Step name text box,
click on the Browse button to select the PRODUCTS_EXCEL.xls and then click on the Add button.

Step 4: Now click on the Get sheetnames(s) button under the Sheets tab, select PRODUCTS from the
available list and click on OK. Then, to get fieldnames from the header row, click on the Get fields
from header row button under the Fields tab.
Step 5: Under the design tab, expand the output folder, drag the Table output item and drop it in the
transformation pane.
Step 6: Create a hop in between PRODUCTS_EXCELFILE and Table output then double click on Table
output.

22

23

Click on this button to see the contents of the file

Step 7: In the Table output configuration dialog box, type PRODUCTS in the Step name text box, establish
the connection with the MS SQL server in the Connection text box, type TBL_PRODUCTS in the
Target table text box, click on the SQL button to open the Simple SQL editor dialog box, click on
the Execute button for the creation of a table in the MS SQL server and click on the Close button.
Step 8: Now click on the OK button in table output configuration dialog box, save your transformation and
go for execution.

24

Learning Objective: To create a simple PDI transformation for transferring data from a text file to a
database table.
Source: PRODUCTS.txt
Destination: TABLE_PRODUCTS table in the Northwind database.
Step 1: Create a new transformation by expanding the input folder under the Design Tab. Drag the Text file
input item and drop it in the transformation pane.
Step 2: Double click on Text file input to open its configuration window, rename Text file input to
PRODUCTS_TEXTFILE in the Step name text box, then click on the Browse button to select the
PRODUCTS.txt file and click on the Add button.

Step 3: Now click on the Content tab and observe the Separator and Enclosure text fields.

25

Step 4: Then open the Fields tab, click on Get fields and enter 0 to see the scan results of the flat file and
click on the Close button.
Step 5: You can also see the text file contents by clicking on the Preview rows button.
Step 6: Now from the Design tab expand the output folder, drag and drop the Table output in the
transformation pane and so on.
Step 7: Finally save the transformation and execute it.

26

27

These are not the desired outputs.

Well the transformation has been successfully executed but the output is not the desired output!!!

28

Using the Select Values Transformation


Step 1: Now delete the hop in between PRODUCTS_TEXTFILE and TABLE_PRODUCTS by right
clicking on the hop and selecting Delete hop.
Step 2: From the design tab expand the Transform folder, drag the Select values item and drop it in the
transformation pane.
Step 3: Create a hop in between PRODUCTS_TEXTFILE and select values then double click on Select
values to open its configuration dialog box.
Step 4: In select/rename dialog box rename select values to CHANGE_TYPE and click on meta-data tab.
Step 5: In meta-data tab click on Get fields to change, then set each fieldname type accordingly and click
on OK button.
Step 6: Now create a hop in between CHANGE_TYPE item and TABLE_PRODUCTS item.
Step 7: Then delete the TABLE_PRODUCTS table from the database and configure the
TABLE_PRODUCTS once again.
Step 8: Finally save the transformation, execute it and open the TABLE_PRODUCTS table from the
database.

29

Learning Objective: To use the Filter rows for dividing the source data into two destinations.

Step 1: Here, we will split the data coming from products table and place it in two different files. We will
divide products based on price.
Step 2: From the Design tab expand the Flow folder, drag the Filter rows item and drop it in the database
to Excel transformation.
Step 3: Join the hop in between PRODUCTS_EXCEL and Filter rows. Then double click on Filter rows to
open its configuration dialog box.
Step 4: In the Filter rows dialog box, rename Filter rows to filter rows on UnitPrice, then in the condition
pane set the condition shown in the following snapshot.

30

Step 5: Now from the Design tab expand the Output folder, drag the two text file outputs and drop in
the transformation pane.
Step 6: Drag the hop from Filter rows on UnitPrice and drop on a text file output with the result being
TRUE. Similarly drag another hop from same Filter rows on UnitPrice and drop it on second
text file with result being FALSE.
Step 7: Now open the text file configuration dialog box for both these files and set their properties.
Step 8: Then save your transformation and go for execution.
Note: In the Filter rows item either both or none of the TRUE and the FALSE steps need to be
provided, otherwise it will show an error.
31

Learning Objective: To use the Add constants for adding another field in the destination file.
Step 1: In the transformation pane add PRODUCTS_TABLE as source and Excel output as destination.
Step 2: Then from the Design pane expand the Transformation folder, drag the Add constants item and
drop it in the transformation window.
Step 3: Join the hop in between the PRODUCTS_TABLE and Add constants.
Step 4: Now double click on the Add constants item to open its configuration window.
Step 5: In Add constants value configuration window, rename Add constants to ADD_NEW_FIELD and
type QUALITY in the Name column, String in the Type column, 7 in Length, GOOD in Value and
click on OK.
Step 6: Join the hop in between ADD_NEW_FIELD and Excel file output.
Step 7: Now configure the Excel output item, save the transformation and go for execution.
32

33

Learning Objective: To use the calculator for calculation of source data and to display it in destinations.
Step 1: Drag a Table input from the input folder, drop it in the transformation pane and configure it with the
order details table from the Northwind database.
Step 2: From the Design pane expand the Transform folder, drag the calculator item and drop it in the
Transformation pane.
Step 3: Join the hop in between ORDER_DETAILS_INPUT and calculator. Then double click on the
calculator item to open its configuration window.
Step 4: In the calculator dialog box, type TOTAL PRICE under New Field and click on the calculator field.
Step 5: Now, the Select the calculation type dialog box will appear. Here, select A*B and click on OK.
Step 6: Next in the Field A column, select UnitPrice, and in the Field B column select Quantity. In the
Value type column select Number and click on OK.

34

Step 7: Now drag an Excel output item,drop it in the transformation pane, join the hop in between the
calculator and Excel output item and configure it.
Step 8: Finally save the trnasformation and execute it.
35

Learning Objective: To implement a lookup using the Lookup Transformation


To implement a lookup transformation:
Step 1: Create a new Transformation with Table input as its source.
Step 2: Connect the Table input with the order details table through its properties dialog box.

36

Step 3: Now, in the Design pane expand the Lookup folder, drag the Database lookup item and drop it in
the Transformation pane.
Step 4: Then connect the hop in between ORDER_DETAILS and Database lookup. Now double click on
the Database lookup.
Step 5: In the Database value lookup editor establish the ODBC connection in the Connection text box. Go
to the Lookup table text box, click on the Browse button to select the Orders table for lookup, then
in the key(s) to look up select OrderID in both Table field and Field1 with comparator as =.

37

Step 6: Now in the Values to return from the lookup table area, select the Field column and set the
data types in Type field or click on the Get lookup fields button and click on OK.
Step 7: Then join the hop in between the Database lookup and Excel output with selection of
Main output of step.
Step 8: Now configure the Excel output, save the transformation and go for execution.

Learning Objective: Merge two record sets.


Source 1: Emp.xls
Source 2: Emp.txt
Destination: Employee table in MS SQL SERVER 2008
Emp_ID F_Name
M_Name L_Name Gender
12345
Shailendra Narayan Nayak
M
12346
Jairam
K
M
38

12347
12348

Vandana
Hema

Sharma
A.

F
F

Emp.xls

Emp.txt

Employee Table

Step 1: Create a new Transformation with one Text file input (for Emp.txt) and another Excel input
(Emp.xls).
Step 2: Configure both the inputs, drag the Select values item from the Transform folder and drop it in
the transformation pane.
Step 3: Connect the hops and configure both the Select Values (Field Name, Data types).
Step 4: Using Sort rows from the Transform folder, sort both the input records (because before merging
both the records must be sorted).
Step 5: Drag one Sorted Merge item from the Joins folder, place it in the transformation pane and
configure it accordingly.
Step 6: For changing M to Male or F to Female, use one Value Mapper item from the
Transform folder and configure it accordingly.
Step 7: Now put one Table output item from the Output folder and configure it.
Step 8: Save the transformation in the repository and go for execution.

39

[Select Values]

[Value Mapper]

Learning Objective: Take data from the flat file and perform basic data quality operation on the same and
load it back to a flat file (data quality operations such as trimming spaces and checking for null values).
Source file: Account.txt

[Account.txt]

Step 1: Create a new transformation with Text file input as its source.
Step 2: Connect the Text file input with Account.txt through its properties dialog box.
Step 3: Expand the scripting folder, drag the formula item and drop it in the Transformation pane.
Step 4: Join the hop in between the text file input and formula, and double click on formula to open its
dialog box.

40

Step 5: In the formula dialog box type Account_ID in New field, click on the formula column, type TRIM
[acct_id] in the upper part of the dialog box, click on OK and also set the value type as String.
Similarly, do this for TXN_ID and AMOUNT as shown in the below snapshots:

Delete these three fields(acct_id,txn_id,amt)

Step 6: Add a Text file output in the pane and configure it, save the transformation and go for execution.

41

Creating a Job
Step 1: First connect to a repository (either to a database repository or a file repository).
Step 2: From the menu bar click on File
New
Job.
Step 3: In the Design mode, click on the General folder, drag the START item and drop it on the centre
pane.
Step 4: In the centre pane, double click on START to open its scheduling dialog box and configure the Job
Scheduling.

Step 5: Click on the General folder in the left pane, drag the Transformation item and drop it on the
Transformation pane.
Step 6: Now drag the hop from START to Transformation and double click on the transformation item to
open its configuration dialog box.
Step 7: In the Job entry details for this transformation dialog box, select the right radio button and select the
transformation file or directory and click on OK.

42

Step 8: Save your job in a database repository or file repository (depending on your repository connection).
Step 9: Now press F5 or click on the execute button. Depending on your schedule as already set in START
the job will be executed (either once or multiple times).

Questions
1. What is the difference between a Database Repository and a File Repository in Kettle?
2. What is the difference between a Transformation and a Job in Kettle (PDI)?
3. What is the difference between connecting a database through JDBC and connecting it through
ODBC?
4. Create a transformation to import a table from MS SQL Server database to MS Access database.
5. Create a transformation to concatenate FirstName and LastName from the EMPLOYEES table in
the Northwind database and store it in a database in a field EMP_NAME.

43

You might also like