ETL Notes
ETL Notes
ETL Notes
Ralph Kimball – Data warehouse is a data base which is specifically designed for
analysing the business but not for business transactional processing.
ETL process: Extracting the data from different kinds of data bases, transforming or
staging the extracted data into a single required data format and loading the data into a
data warehouse is called ETL process.
Reporting:Since the data in the data warehouse is in the form of tables, we present
data to the end user in the form of graphs i.e. graphical representation of data, using
reporting tools to analyse the business.
The complete cycle of ETL process and Reporting is called Data Warehousing.
ETL
TESTING TOOLS
Informatica
Data stage
Ab-inition
What about QTP win runner, etc. tools?
These are GUI based testing tools this are not used for ETL testing. It is used to test the
data in the excel sheets.
ETL testing challenges:There are many challenges we faced while performing data
warehouse testing.Here is the list of few ETL testing challenges.
Business and requirementunderstanding
Validating
Test estimation
Test planning based on the inputs from test estimation and business requirements.
Check list, tc design, tc exe, bug reporting ,sta report ,test close
Designing test cases and test scenarios from all the available inputs.
Once all the test cases are ready and are approved, testing team proceed to perform
pre-execution check and test data preparation for testing.
Lastly execution is performed till exit criteria met.
Upon successful completion summary report is prepared and closure process is
done.
To store large volumes of historical detail data from Mission Critical Applications.
Better business intelligence for end-users.
Data Security - To prevent unauthorized access to sensitive data.
Decision support systems.
Reduction in time and analyse information.
Mainframe sources
(USA)
E
ERPsources (UK)
Extraction Transformation Loading
Data Warehouse
Relationalsources
(IND)
Extraction:
It is a process of reading the data from various types of sources
1. Such as relational sources,
2. ERP sources,
3. Mainframe sources
4. Cloud sources
5. Hadoop source
6. XML file
7. Flat files etc.
Data Transformation:
It is a process of cleaning the data and transforming the data into a required business
format.
Data Loading:
It is a process of inserting the data into a target system. There are two types of data loads.
Initial or Full Load:It is a process of loading all the required data at very first load.
Incremental or Delta Load: It is a process of loading only new records after initial load.
• SubjectOriented
• Integrated
• Non-volatile
• TimeVariant
Integrated:A data warehouse is an integrated database which contains the business
information collected from various operational data sources.
OLTP System
Saving acc
E Extraction Loading
Integrated data
Current acc Account subject
E acc
Joint
Time Variant:AData warehouse is a time variant database which allows you to analyse and
compare the business with respect to various time periods (Year, Quarter, Month, Week,
Daily) because which maintains historical data.
Data
Operational Warehouse
VolatileNon- Volatile
Data
Operational Warehouse
Subject Oriented:The data warehouses are designed as a Subject-oriented, these are used
to analyse the business by top level management, or middle level management, or for an
individual department in an enterprise.
account SALE
HR loans
DML Select
application –oriented Subject-oriented
Data Extraction
Data Transformation
Data Loading
Mainframe sources
(USA)
Relationalsources
(IND)
Data Extraction: It is a process of reading the data from various types of sources such as,
2. ERP sources,
3. Mainframe sources
4. Cloud sources
5. Hadoop source
6. XML file
Data Merging
Data Cleansing
Data Scrubbing
Data Aggregation
Data Merging:It is a process of combining the data from multiple inputs and load into a
single output. There are two types of Data Merging Activities.
1. Join
2. Union
Data Merging(JOIN)
EMP
E EMP DEPT
T
EMPNO ENAME SAL DEPTNO
JOIN
7369 SMITH 800 10 L
EMPNOENAMESALDEPTNODNAMELOC
7369 SMITH 800 10 ETLTESTING DALLAR
DEPT
Data Merging(union)
EMP
E T
EMPNOENAMESALDEPTNO EMP
UNION
7369 SMITH 800 10 L
EMPNOENAMESALDEPTNO
E 7369 SMITH 800 10
7456 ALLEN 400 20
EMP
EMPNOENAMESALDEPTNO
7456 ALLEN 400 20
Data Cleansing:It is a process of removing unwanted data from stagingorit is a process of
changing inconsistencies and inaccuracies
DATA CLEANSING
SOURCESTAGINGTARGET
T
SALESAMOUNT
SALESAMOUNT E L
$ 1.88 ROUND $ 1.88
$ 2.00 $ 2.00
$ 3.4563 $ 3.45
Data Scrubbing:It is a process of deriving new data definitions using existing data.
Data Aggregation:It’s process of calculating the summaries for a group of records using
aggregate functions.
E SUM ( ) L
Detailed Summarized
MAX ( )
Data Data
Marts Data:Data Mart is known as High Performance Query Structure. There are two types
of DM
According to W.H.Inmon first we need to design an enterprise data warehouse then design
a small form of subject oriented department design specific DB known as Data Marts
Bottom-Up Approach or Independent Data Marts (Ralph Kimball)
According to Ralph Kimball first we need to design department specific database known as
Data Marts then integrate all data marts into Enterprise Data Marts.
What is a Dimension Table:
If a table contains primary keys and it gives the detailed information about business then
such a table is called dimension table.
Dimension Table -> Text + Primary Key
A Dimension table is a table which holds a list of attributes or qualities of the dimension
most often used in queries and reports.
What is a Fact table:A fact table which contains foreign keys to dimension tables and
numeric facts (called as measurements).
Fact Table -> Metrics + Foreign Key
Slowly Changing Dimensions (SCD):SCD captures the changes which takes place over the
period of time. There are three types of SCD’s.
SCD Type 1
SCD Type 2
SCD Type 3
SCD Type1:Type 1 dimension keeps only the current values. Doesn’t maintain historical
data.
Example:
SCD Type2:Type 2 dimension maintain the full history in the target. For each update it
inserts a new record in the target tables.There are 3 different methods (INDICATORS)
maintain in history.
1. Version
2. Flag
3. Date
VersionP k
Ex:
PK
Flag
DATE
SCD Type 3:Type 3 dimension maintains current and previous information (Partial History)
Static Test Case:These are prepared only based on the check list.
Static checklist -> Check list
Test Case Review:Team lead who is not part of the project conducts the test case review.
Peer Review:A review conducted between the team members by exchanging the test cases
is called Peer Review.
Test Execution: Based on the data base team members will execute the test cases with the
help of queries.
Defect Report:While executing the test cases, if the tester finds any mismatches in the
target table against the requirements, he raises that corresponding mismatch as defect.
Status Report:In this report we mention the status of the every test case we execute i.e. we
include RTM (Requirement Traceability Matrix).
Result Report:After completion of the status report team lead prepares the result report.
How many test cases were passed, failed and under hold is mentioned in this report.
Test Closure: After completion of test execution levels and fixation of bugs, test lead will
conduct a review meeting to estimate completeness and correctness of testing process.
2) Table Structure Validation: Generally a test engineer validates the Meta data or
structure of the table, to check the columns and rows size.
Ex: desc table name;
3) Table Look and Feel: To check the order of the table (columns and rows) or look and feel
of the table structure (Meta data).
Ex: set line 100;
4) ConstraintsTesting: In constraint testing tester will perform major validations like Null,
Not Null, Unique, Primary key, Foreign Key, Check, and Default values, Duplicate values.
Null:Represents a cell with nothing i.e. cell must be empty (not even space).
Primary Key:Column which does not allow any duplicates and nulls.
5) Target Count Testing:It is used to compare the data in the target table with the source
table. It doesn’t matter whether the data is in ascending or descending order here we
consider the count of rows. This type of testing is been done only due to the lack of
time.
Ex: select count(*) from emp;
select job, count(*) from emp group by job;
SELECT COUNT(MGR) FROM EMP;
select deptno, count (*) from emp group by deptno;
select deptno, count (*) from emp group by deptno having count(*) > 1;
select JOB, count (*) from emp group by JOB having count(*) > 1;
6) Target Data Validation Testing/Decimal Factor Testing:Here a tester will validate each
and every point of the target data. Most of the financial projects come under this
category. A tester identifies the decimal factors of the table data.
Ex: desc table_name;
7) Threshold / Data Integrated Testing:Used to identify the ranges of the data. A tester
usually uses this to identify the population calculations and share marketing and
business finance analysis (Quarterly, Half Yearly, Yearly).
Ex: select e.*, d.* from emp e, dept d
where e.deptno = d.deptno;
Note: Two table’s common column values must be matched.
Union: Combining two tables, only when all columns of both the tables match.
Ex: select * from emp1
UNION
Select * from emp2;
Note: All columns of the tables must match
8) Field to Field Testing:In the field to field testing a test engineer will identify how much
space is been occupied for the target table in the data base. The data is integrated in the
table with columns and data types.
Note: Used to check the order of the columns of source and target tables and to
verify default values are not been populated unless required.
13) Negative Testing: This is called testing the other way. Suppose if we have emp table as
target table and we check for dept table what would be the result. This way of testing is
called negative testing.
Data Warehouse
Here the testing is done in three ways:
a) After loading the data mart A, we test for the existence of mart or table A.
b) After loading the data mart B, we test whether the marts A and B are viewable. Here
the data mart A is historical data and B is incremental data.
c) Finally we look for the relation between the two marts A and B.
Note: Testing the raised fault again and again to confirm it as a defect before
informing it to the developer is called Retesting.
16) Volume Testing:To verify, the system will process the maximum expected quantity of
data for a given cycle in the time expected.
Point to remember: Test bed is not possible in ETL testing because here we don’t
have end to end data. Here data flows continuously.
Bug Density: Average number of bug testing team found in one module.
Module2 - 40%
Module3 - 30%
Module4 - 20%
100%
Defect Age: The time gap between reporting date and resolved date of a defect.
Kick of Meeting: It is a first meeting conducted with in development camp (High level
management like M.D, Board of Directors, CEO, Head of Operations & all Managers)
participates to have over view on the project, know the customer nature and to finalize the
project team.
QA QC
Quality Analyst Test Engineers
Responsible for prevention of defects Responsible for detection of defects.
QA defines process QC follows the process
They improves process They improves product
QA prepares all templates like test plan, test They us all the templates prepared by QA
case, defect template etc.
QA is for entire organization QC is for across the organization
Verification Validation
Test Case:A test case is a software testing document which consists of events, action, input,
output, expected result and actual result.
Test Script: A test script is a short program written in a programming language used to test
part of the categories.
Ex: Our PC
Peer Review: Reviews carried out by peers or persons at the same level in the hierarchy.
Peer reviews are done during design stage, coding stage and testing stage. Peer
reviews help in improving the product quality.
Metric:Tomeasure software, “metrics” are used, to estimate the project complexity as well
as to measure the productivity of the developers and test engineers. If you want to manage
effectively, you need to use metrics.
Requirement Analysis
Testing
System Integration Testing
(SIT)
Review
Testing
Production
Maintenance
Smoke Testing
Testing
Sanity Testing
User Acceptance Test
After completion of development, senior people in the development team tests the
build to confirm whether it is ready for testing, before releasing the build to testing
department. This is called Smoke Testing.
After completion of testing the build, the higher authority tests the build before it is
released to UAT. This is called Sanity Testing.
Reporting the Defect:Weuse bug reporting template to reportthe bugs. Through template,
tester should forward the defect to the concerned person.
Severity:Severity of the application or data is provided by technical people like test lead, sr.
testers.
Priority: Importance of customer point of view, impact of the any business logic.
Bug fixing: When developer accepts the defect, it changes as bug. Developer will fix the bug.
Bug Closing: After bug fixation developer will forward to Tester. Tester can do retesting or
regression testing. After completion of above testing’s if he satisfies,tester approves the
working status of the bug by closing it.
Note: If tester doesn't satisfy with the modified bug again the bug will be forwarded to the
developer. It is known as Re-open.
New: First time (newly) tester can detect defect that status has new.
Re-open: Tester send the newly detected defect to the developer. Developer sends it back
after fixing the defect. The tester tests it, if he is not satisfied with the developer work, the
tester Re-open’s the bug by sending it back to the developer.
Assigned to:Mention the name of the developer to whom the defect should be assigned.
Project Manager& Project Lead can do this work.
Detected by:Mention the name of the tester who detects the defect.
Types of Bugs:
1) Duplicate bugs
3) Load bugs
4) Transformation bugs
5) Logical bugs
6) Recovery bugs