ETL Notes

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 25
At a glance
Powered by AI
The key takeaways are about ETL testing, data warehousing, and data warehouse testing processes.

ETL testing is testing performed on transformed or staged data before loading it into a data warehouse to ensure it meets requirements.

The different types of ETL testing include constraint testing, field to field testing, duplicate testing, etc.

What is ETL Testing?

ETL – ExtractTransformLoad. After extracting and applying transformations and


before loading data into data warehouse ETL testing comes into picture. We apply different
types of ETL testing types on the transformed or staged data mart(s) or table(s) to meet the
requirements of the client to analyse the business but not for business transactional
processing. This phenomenon is called ETL testing.

What is Data Warehouse?

We have two fathers to definethe above question.

Ralph Kimball – Data warehouse is a data base which is specifically designed for
analysing the business but not for business transactional processing.

WH. Inmon – Data Warehouse is a subject-oriented, integrated, non-volatile, time


variant collection of data in support of management decisions.

What is Data Warehousing?

ETL process: Extracting the data from different kinds of data bases, transforming or
staging the extracted data into a single required data format and loading the data into a
data warehouse is called ETL process.

Reporting:Since the data in the data warehouse is in the form of tables, we present
data to the end user in the form of graphs i.e. graphical representation of data, using
reporting tools to analyse the business.

The complete cycle of ETL process and Reporting is called Data Warehousing.

Note: The first data warehousing system is implemented in 1987 by W.H.Inmon

ETL Testing tools:

ETL

TESTING TOOLS

DEVELOPMENT TOOLS No tools

Informatica

Data stage

Ab-inition
What about QTP win runner, etc. tools?

These are GUI based testing tools this are not used for ETL testing. It is used to test the
data in the excel sheets.

What is the difference between ETL testing and general testing?

ETL Testing General Testing

It is CUI testing GUI Testing (or) page by page testing

Heterogeneous data Homogeneous data

Note: - ETL Testing does not support QTP.

What is the difference DB Testing and ETL Testing?

Data base Testing ETL Testing


Small in scale Large scale voluminous data
Usually used to test data at the source Includes several facets. Extraction,
instead of testing using the GUI Transformation and Loading mechanisms
being the major ones
Usually Homogeneous data Heterogeneous data
Normalized data De normalized data
CRUD Operations Usually Read-only operations
Consistent data Temporal data inconsistency

ETL testing challenges:There are many challenges we faced while performing data
warehouse testing.Here is the list of few ETL testing challenges.

 Incompatible and duplicate data.


 Loss of data during ETL process.
 Unavailability of inclusive test bed.
 Testers have no privileges to execute ETL mapping by their own.
 Volume and complexity of data is very huge.
 Fault in business process and procedures.
 Trouble acquiring and building test data.
 Missing business flow information.

ETL Testing Process

 Business and requirementunderstanding
 Validating
 Test estimation
 Test planning based on the inputs from test estimation and business requirements.
 Check list, tc design, tc exe, bug reporting ,sta report ,test close
 Designing test cases and test scenarios from all the available inputs.
 Once all the test cases are ready and are approved, testing team proceed to perform
pre-execution check and test data preparation for testing.
 Lastly execution is performed till exit criteria met.
 Upon successful completion summary report is prepared and closure process is
done.

Why we need Data warehouse?

 To store large volumes of historical detail data from Mission Critical Applications.
 Better business intelligence for end-users.
 Data Security - To prevent unauthorized access to sensitive data.
 Decision support systems.
 Reduction in time and analyse information.

ETL or Data warehouse testing is categorized into four different engagements irrespective


of technology or ETL tools used:
 New Data Warehouse Testing – New DW is built and verified from scratch. Data
input is taken from customer requirements and different data sources and new data
warehouse is build and verified with the help of ETL tools.
 Migration Testing – In this type of project, customer will have an existing DW and
ETL performing the job but they are looking to bag new tool in order to improve
efficiency.
 Change Request – In this type of project new data is added from different sources to
an existing DW. Also, there might be a condition where customer needs to change
their existing business rule or they might integrate the new rule.
 Supporting Testing –

Advantages of ETL Testing:-


 High query performance
 Queries not visible outside warehouse
 Can operate when sources unavailable
 Query data not stored in a DBMS
 Extra information at warehouse
o Modify, summarize (store aggregates)
o Add historical information
 Improves the quality and accessibility of data.
 Reduce the requirements of users to access operational data.
 Allows new reports and studies to be introduced without disrupting operational
systems.
 Increases the amount of information available to users.
Data moving different stages in DWH

Mainframe sources
(USA)

E
ERPsources (UK)
Extraction Transformation Loading
Data Warehouse

Relationalsources
(IND)

Extraction:
It is a process of reading the data from various types of sources
1. Such as relational sources,
2. ERP sources,
3. Mainframe sources
4. Cloud sources
5. Hadoop source
6. XML file
7. Flat files etc.

Data Transformation:
It is a process of cleaning the data and transforming the data into a required business
format.

Data Loading:
It is a process of inserting the data into a target system. There are two types of data loads.

Initial or Full Load:It is a process of loading all the required data at very first load.

Incremental or Delta Load: It is a process of loading only new records after initial load.

According to W.H.Inmon characteristic features of a Data warehouse:

• SubjectOriented
• Integrated
• Non-volatile
• TimeVariant
Integrated:A data warehouse is an integrated database which contains the business
information collected from various operational data sources.

OLTP System

Saving acc

E Extraction Loading
Integrated data
Current acc Account subject

E acc
Joint

Note:We integrated all accounts into one account type.

Time Variant:AData warehouse is a time variant database which allows you to analyse and
compare the business with respect to various time periods (Year, Quarter, Month, Week,
Daily) because which maintains historical data.

Transactional Storage Data Warehouse Storage

Data

Operational Warehouse

Current Value data Snapshot data


Time horizon: 60-90 days Time horizon: 5-10 years
Data warehouse stores
Historical data
Non-volatile:Adata warehouse is a non-volatile database. That means once the data entered
into data warehouse that cannot be changed. It doesn’t reflect to the changes taken place in
operational database. Hence the data is static.

VolatileNon- Volatile

Data

Operational Warehouse

Subject Oriented:The data warehouses are designed as a Subject-oriented, these are used
to analyse the business by top level management, or middle level management, or for an
individual department in an enterprise.

account SALE

HR loans

What is the difference OLTP and Data Warehouse?

OLTP Data Warehouse


It is design to support business transation It is design to support a decision making
processing processing

Design for office operation Design for managerial operation

Current Data Historical Data


Detailed Data Summarized Data
Dynamic Static
Relatively smaller database Large database size
Data is volatile Data is non-volatile
Isolated data Integrated data
Read &write Read only

DML Select
application –oriented Subject-oriented

More number of tables Less number of tables

DataAcquisition:It is a process of Extracting the relevant business information,


Transformingthe data into a required business format and Loadinginto the Data
Warehouse.It is defined with the following processes.

 Data Extraction
 Data Transformation
 Data Loading

Mainframe sources
(USA)

Extraction Transformation Loading


Data Warehouse
ERPsources (UK)

Relationalsources
(IND)

Data Extraction: It is a process of reading the data from various types of sources such as,

1. Such as relational sources,

2. ERP sources,

3. Mainframe sources

4. Cloud sources

5. Hadoop source

6. XML file

7. Flat files etc.


Data Transformation:It is a process of cleaning the data and transforming the data into a
required business format.The following data transformation activities take place in staging
area.

 Data Merging
 Data Cleansing
 Data Scrubbing
 Data Aggregation

Data Merging:It is a process of combining the data from multiple inputs and load into a
single output. There are two types of Data Merging Activities.

1. Join
2. Union

Data Merging(JOIN)
EMP

E EMP DEPT

T
EMPNO ENAME SAL DEPTNO
JOIN
7369 SMITH 800 10 L
EMPNOENAMESALDEPTNODNAMELOC
7369 SMITH 800 10 ETLTESTING DALLAR
DEPT

DEPTNO DNAME LOC


10 ETLTESTING DALLAR

Data Merging(union)
EMP

E T
EMPNOENAMESALDEPTNO EMP
UNION
7369 SMITH 800 10 L
EMPNOENAMESALDEPTNO
E 7369 SMITH 800 10
7456 ALLEN 400 20

EMP
EMPNOENAMESALDEPTNO
7456 ALLEN 400 20
Data Cleansing:It is a process of removing unwanted data from stagingorit is a process of
changing inconsistencies and inaccuracies

Example: InitCap() and Round() functions

DATA CLEANSING

SOURCESTAGINGTARGET

T
SALESAMOUNT
SALESAMOUNT E L
$ 1.88 ROUND $ 1.88

$ 2.00 $ 2.00

$ 3.4563 $ 3.45

Data Scrubbing:It is a process of deriving new data definitions using existing data.

Example: Concat (First Name+ Last Name), Sal Amount=QTY*Price

Data Aggregation:It’s process of calculating the summaries for a group of records using
aggregate functions.

E SUM ( ) L
Detailed Summarized
MAX ( )
Data Data

Example: Average, Max, Min etc….

Marts Data:Data Mart is known as High Performance Query Structure. There are two types
of DM

 Dependent DM(Top Down Approach)


 Independent DM(Bottom-Up Approach)
Top Down Approach or Dependent Data Marts (W.H.Inmon)

According to W.H.Inmon first we need to design an enterprise data warehouse then design
a small form of subject oriented department design specific DB known as Data Marts
Bottom-Up Approach or Independent Data Marts (Ralph Kimball)

According to Ralph Kimball first we need to design department specific database known as
Data Marts then integrate all data marts into Enterprise Data Marts.
What is a Dimension Table:
If a table contains primary keys and it gives the detailed information about business then
such a table is called dimension table.
Dimension Table -> Text + Primary Key
A Dimension table is a table which holds a list of attributes or qualities of the dimension
most often used in queries and reports.

What is a Fact table:A fact table which contains foreign keys to dimension tables and
numeric facts (called as measurements).
Fact Table -> Metrics + Foreign Key

Product ID Customer City date product Qty revenue profit


001 SONY JAPAN 31/1/201 IC 50 units $1000 100
3

Slowly Changing Dimensions (SCD):SCD captures the changes which takes place over the
period of time. There are three types of SCD’s.

 SCD Type 1
 SCD Type 2
 SCD Type 3
SCD Type1:Type 1 dimension keeps only the current values. Doesn’t maintain historical
data.
Example:

Customer_i Customer Address


d Name
100 RAVI HYD
200 KRISHNA PUNE
300 RAM MUMBAI

Customer_i Customer Address


d Name
100 RAVI CHENNAI
200 KRISHNA PUNE
300 RAM MUMBAI

Note: Ravi was working in Hyderabadpreviously/currently workingin Chennai. SCD 1


maintain only the current value i.e.Chennai).

SCD Type2:Type 2 dimension maintain the full history in the target. For each update it
inserts a new record in the target tables.There are 3 different methods (INDICATORS)
maintain in history.
1. Version
2. Flag
3. Date
VersionP k
Ex:

Customer_i Customer Address


d Name
100 RAVI HYD
200 KRISHNA PUNE
300 RAM MUMBAI

PK

S key Customer_id Customer Address Version


Name
1 100 RAVI HYD 0
2 200 KRISHNA PUNE 1
3 300 RAM MUMBAI 1
4 100 RAVI CHENNAI 1
5 100 RAVI BANG 2

NOTE: S KEY, it is system generatedkey

Flag

S key Customer_id Customer Address Flag


Name
1 100 RAVI HYD 0
2 200 KRISHNA PUNE 1
3 300 RAM MUMBAI 1
4 100 RAVI CHENNAI 0
5 100 RAVI BANG 1

DATE

S key Customer_id Customer Address SART DATE END DATE


Name
1 100 RAVI HYD 08/JAN/10 23/DEC/11
2 200 KRISHNA PUNE 08/JAN/12 1
3 300 RAM MUMBAI 08/JAN/12 1
4 100 RAVI CHENNAI 24/DEC/11 15/SEP/12
5 100 RAVI BANG 16/SEP/12 TILLDATE

SCD Type 3:Type 3 dimension maintains current and previous information (Partial History)

S key Customer_id Customer Address


Name
1 100 RAVI HYD
2 200 KRISHNA PUNE
3 300 RAM MUMBAI
4 100 RAVI CHEANNAI
5 100 RAVI BANG
ETL Test Life Cycle
Team ETL Testing Documents
High level document Project document Source, transformation, target,
Architecture, ETL Tools
Project level Test requirement Target (Tables)
document study document
(P.M, T.L)
Project level Check list (T.L) Mapping sheet, H.R, Queries, number
document of hours, team size etc. is mentioned.
Module level Test case design Team member, based on test
document requirement document, checklist
(dynamic & static), template (QA)
Module level Test case review(T.L)
document
Module level Test Execution SQL
document (Tester)
Module level Bug/Defect Report Test Review
document (Tester)
Module level Status Report &
document Traceability matrix
(Tester, Sr. tester,
Test Lead)
High level document Result Report (T.L,
P.M)
High Level document Test Closure (T.L,
P.M)
Sign off

 What is project document?


This document is prepared by project manager or Business Analyst. The main agenda of
project document is to cover the source information, type of transformations done at
source and target.And what type of ETL development, designing tools will be used in the
project is been chosen. Finally architecture diagram is provided.

 Test Requirement Study Document


It is prepared by project manager or team lead. The main agenda of this document is to
provide target information with tables.
 Checklist
Based on transformations, developers prepare mapping sheet document. This document is
forwarded to the test team. Based on the mapping sheet document test lead prepares the
checklist. Checklists are of two types, static and dynamic check lists.

Test case Design:


Dynamic Test Case: These are prepared based on mapping sheets (Transformations), a
relation of mapping sheet and check list.
Dynamic Test Case -> mapping sheet + Check list

Static Test Case:These are prepared only based on the check list.
Static checklist -> Check list

Test Case Review:Team lead who is not part of the project conducts the test case review.

Peer Review:A review conducted between the team members by exchanging the test cases
is called Peer Review.

Test Execution: Based on the data base team members will execute the test cases with the
help of queries.

Defect Report:While executing the test cases, if the tester finds any mismatches in the
target table against the requirements, he raises that corresponding mismatch as defect.

Status Report:In this report we mention the status of the every test case we execute i.e. we
include RTM (Requirement Traceability Matrix).

Result Report:After completion of the status report team lead prepares the result report.
How many test cases were passed, failed and under hold is mentioned in this report.

Test Closure: After completion of test execution levels and fixation of bugs, test lead will
conduct a review meeting to estimate completeness and correctness of testing process.

Types of ETL Testing’s:


1) Table Data Validation: In the phase of table data validation, tester will check whether
the corresponding table(s) exists in the target data base.
Ex: select * from table name;

2) Table Structure Validation: Generally a test engineer validates the Meta data or
structure of the table, to check the columns and rows size.
Ex: desc table name;

3) Table Look and Feel: To check the order of the table (columns and rows) or look and feel
of the table structure (Meta data).
Ex: set line 100;

4) ConstraintsTesting: In constraint testing tester will perform major validations like Null,
Not Null, Unique, Primary key, Foreign Key, Check, and Default values, Duplicate values.

Not Null:Cells in the table should contain some kind of data.


Ex: select empno, ename from emp
Where comm is not null;

Null:Represents a cell with nothing i.e. cell must be empty (not even space).

Primary Key:Column which does not allow any duplicates and nulls.

Foreign Key:Column which is used to establish a relation between tables.

Check:declares the length of the string or field.

Note: Constraint testing is to be performed on every column in the table.

5) Target Count Testing:It is used to compare the data in the target table with the source
table. It doesn’t matter whether the data is in ascending or descending order here we
consider the count of rows. This type of testing is been done only due to the lack of
time.
Ex: select count(*) from emp;
select job, count(*) from emp group by job;
SELECT COUNT(MGR) FROM EMP;
select deptno, count (*) from emp group by deptno;
select deptno, count (*) from emp group by deptno having count(*) > 1;
select JOB, count (*) from emp group by JOB having count(*) > 1;
6) Target Data Validation Testing/Decimal Factor Testing:Here a tester will validate each
and every point of the target data. Most of the financial projects come under this
category. A tester identifies the decimal factors of the table data.
Ex: desc table_name;

7) Threshold / Data Integrated Testing:Used to identify the ranges of the data. A tester
usually uses this to identify the population calculations and share marketing and
business finance analysis (Quarterly, Half Yearly, Yearly).
Ex: select e.*, d.* from emp e, dept d
where e.deptno = d.deptno;
Note: Two table’s common column values must be matched.
Union: Combining two tables, only when all columns of both the tables match.
Ex: select * from emp1
UNION
Select * from emp2;
Note: All columns of the tables must match

8) Field to Field Testing:In the field to field testing a test engineer will identify how much
space is been occupied for the target table in the data base. The data is integrated in the
table with columns and data types.
Note: Used to check the order of the columns of source and target tables and to
verify default values are not been populated unless required.

5% excessspace to the data base will be taken as a play gap, which is


used for the freeness of the project data in the data base.
DDB
Transaction ID or Surrogate Key: This is system generated key. It is
the combination of time + date + system IP

9) Duplicate Check Testing:Used to identify the duplicates.


Ex: select distinct deptno from emp;
Select deptno, count(*) from emp group by deptno having count(*) > 1;
select distinct job from emp;
Select job, count(*) from emp group by job having count(*) > 1;

10) Initialization Testing:A combination of hardware and software installed in a platform is


called initialization testing.
11) Error Handling: Setting predefined errors to handle them in the coming future.

12) Connectivity Testing: Testing the connectivity of all the peripherals.

13) Negative Testing: This is called testing the other way. Suppose if we have emp table as
target table and we check for dept table what would be the result. This way of testing is
called negative testing.

14) Incremental and Historical Process Testing (Team Lead Question):

Data marts A is loaded B is loaded

Data Warehouse
Here the testing is done in three ways:

a) After loading the data mart A, we test for the existence of mart or table A.
b) After loading the data mart B, we test whether the marts A and B are viewable. Here
the data mart A is historical data and B is incremental data.
c) Finally we look for the relation between the two marts A and B.

Ex: select * from table_name;


Select e.*, d.* from emp e, dept d where e.deptno = d.deptno;

15) Regression Testing: Suppose if a tester finds a mismatch, he re-tests it to confirm as a


defect. After confirmation he raises the defect to the concerned developer to fix it. Then
developer accepts the defect as bug and fixes the bug. Later he sends it back to
tester.The tester receives the fixed bug and tests it to perfection to confirm it is been
fixed. And the tester will test the correctness of the data affected by bug as well.
The process of testing the raised bug and the corresponding data affected by it as
well is called regression testing.

Note: Testing the raised fault again and again to confirm it as a defect before
informing it to the developer is called Retesting.
16) Volume Testing:To verify, the system will process the maximum expected quantity of
data for a given cycle in the time expected.

ETL Testing Terminology


Test Harness (Ready to start test):It is a ready to go situation. Here the test environment
and the completely done test cases will be available to start test.

Test Bed:Collection of test cases

Test Environment (H/W & S/W): It is a complete project setup to work.

Note: Test Harness = Test Environment + Test Bed

Point to remember: Test bed is not possible in ETL testing because here we don’t
have end to end data. Here data flows continuously.

Bug Density: Average number of bug testing team found in one module.

Ex: module1 - 10%

Module2 - 40%

Module3 - 30%

Module4 - 20%

100%

Regression testing in module 2

Defect Age: The time gap between reporting date and resolved date of a defect.

Kick of Meeting: It is a first meeting conducted with in development camp (High level
management like M.D, Board of Directors, CEO, Head of Operations & all Managers)
participates to have over view on the project, know the customer nature and to finalize the
project team.

QA Manager:A resource who manages all the projects in the company

Project Manager:A resource who works for a specific project.

What is the difference between QA & QC?

QA QC
Quality Analyst Test Engineers
Responsible for prevention of defects Responsible for detection of defects.
QA defines process QC follows the process
They improves process They improves product
QA prepares all templates like test plan, test They us all the templates prepared by QA
case, defect template etc.
QA is for entire organization QC is for across the organization
Verification Validation

Test Case:A test case is a software testing document which consists of events, action, input,
output, expected result and actual result.

Test Script: A test script is a short program written in a programming language used to test
part of the categories.

Quality:It is a metaphor which satisfies the requirements of a client. Otherwise a problem


free project or defect free project.

Standalone Environment:It is known as one tier architecture.

Ex: Our PC

Distributed Environment:It is a n-tier architecture, where business layer is distributed


among the machines. Generally RMI, EJB, J2EE stands in this environment.

Peer Review: Reviews carried out by peers or persons at the same level in the hierarchy.

Peer reviews are done during design stage, coding stage and testing stage. Peer
reviews help in improving the product quality.

Metric:Tomeasure software, “metrics” are used, to estimate the project complexity as well
as to measure the productivity of the developers and test engineers. If you want to manage
effectively, you need to use metrics.

Change Request:It is a process of requesting the development team to incorporate some


new changes to the existing module. Usually the customers will use change request
template.

ETL Development Life Cycle:

Requirement Analysis

High Level Design

Low Level Design


Development

Testing
System Integration Testing
(SIT)

Review
Testing

Production

Maintenance

Smoke Testing & Sanity Testing:


Development

Smoke Testing
Testing

Sanity Testing
User Acceptance Test

After completion of development, senior people in the development team tests the
build to confirm whether it is ready for testing, before releasing the build to testing
department. This is called Smoke Testing.

After completion of testing the build, the higher authority tests the build before it is
released to UAT. This is called Sanity Testing.

Issue Log and Clarification log


ISSUE LOG and CLARIFICATION LOG templates are same. The log, tester forward to the
client is ISSUE LOG and the log that client sends as a response to the ISSUE LOG is called
CLARIFICATION LOG.

Bug Life cycle & Defect tracking process


Detect Defect: At the time of execution if the testers identify any mismatches of the target
table. A tester can raise the detected defect.

Reproduced Defect:After finding a defect,we test it thoroughly with different scenario’s


before we forward it to the developer. This process is called Reproduced Defect. Even after
thorough examination if the defect exists it’s been forwarded to developer.

Reporting the Defect:Weuse bug reporting template to reportthe bugs. Through template,
tester should forward the defect to the concerned person.

Severity:Severity of the application or data is provided by technical people like test lead, sr.
testers.

Priority: Importance of customer point of view, impact of the any business logic.

Bug fixing: When developer accepts the defect, it changes as bug. Developer will fix the bug.

Bug Resolving: Fixation of bug into working condition.

Bug Closing: After bug fixation developer will forward to Tester. Tester can do retesting or
regression testing. After completion of above testing’s if he satisfies,tester approves the
working status of the bug by closing it.

Note: If tester doesn't satisfy with the modified bug again the bug will be forwarded to the
developer. It is known as Re-open.

Defect ID: The sequence of defect numbers or unique number

Description: Summary of the defect

Build Id: Corresponding build number

Version ID: Corresponding build number

Severity: Seriousness of the data/table (technical aspect)

1. High 2.Medium 3.Low

Priority: Importance of customer point of view (Client perspective)

1. High 2.Medium 3.Low

Note: Low severity high priority:

Sometimes client will be given high priority

High severity low priority


This says functionality is got corrupted, an application test can raise problem
to thedevelopment team and they give low priority.

Status: New & Re-open

New: First time (newly) tester can detect defect that status has new.

Re-open: Tester send the newly detected defect to the developer. Developer sends it back
after fixing the defect. The tester tests it, if he is not satisfied with the developer work, the
tester Re-open’s the bug by sending it back to the developer.

Assigned to:Mention the name of the developer to whom the defect should be assigned.
Project Manager& Project Lead can do this work.

Detected by:Mention the name of the tester who detects the defect.

Types of Bugs:

1) Duplicate bugs

2) Database connectivity bugs

3) Load bugs

4) Transformation bugs

5) Logical bugs

6) Recovery bugs

Roles and Responsibilities of team member:

 Having 3 years 6 months of experience in ETL or Data Warehouse and manual


testing.
 Good exposure to Data Warehouse concepts
 Handsome writing and execution of SQL queries.
 Has testing experience in information power centre 8.6.0
 Involved in various manual testing types like system testing, regression testing.
 Good knowledge on verification and validation process.
 Work experience in types of ETL testing like constraint testing, field to field testing,
duplicate testing etc.
 Experience in preparing test cases based on mapping sheets.
 Experience in executing test cases and bug tracking.

Roles and Responsibilities of Test Lead including the above mentioned:


 To prepare all types of templates.
 To prepare the check list document based on ETL development transformations.
 Review the test cases.
 To review the bug report document.
 To review the results and preparing the Traceability Requirements Metrics (T.R.M).
 To conduct the system integration testing and review of the module.
 To prepare the results documentation.
 To prepare the check list document, test plan, mapping document.
 To prepare module level closure document.

You might also like