Bods Interview

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 61
At a glance
Powered by AI
SAP BODS is a data integration, data management and text analytics software that provides extraction, integration, transformation and loading functionalities. It has advantages like supporting multiple data sources, tight integration with SAP systems and in-memory processing capabilities.

The main components of SAP BODS are Designer, Repository, BODS Services, Data Quality components and BODS Management Console. Designer is used to develop jobs, Repository stores metadata, BODS Services include Job Server, Data Engine and Access Server for processing batch and real-time jobs.

The different types of repositories in SAP BODS are local repository, central repository and profiler repository. The local repository stores source and target metadata. The central repository provides a shared library for development. The profiler repository stores data quality information.

SAP BOBJ Dataservices

Data services is an end-to-end Data Integration, Data Management and


Text Analytics Software.
Any software which provides Extraction, Integration, Transformation and
Loading functionalities are called Data Integration Software.
Data Management:A software which helps to cleanse master data is called
Data Management software. Any software which provides parsing, correcting,
standardization, enhancement, match and consolidation functionalities are
called Data Management software.
Text Analytics: Converting the unstructured text data into structured data
such a software is called text analytics software.
Advantages of BODS: One product, multiple services ; single window
application; supports unstructured sources(big database); tight integration
with SAP; supports in memory database (HANA); migration tool; content
management tool; supports real time extractions; price; less maintenance
cost.
Architecture
1) Designer: It is a desktop application and client tool; it is a developer tool
to develop BODS jobs; it supports batch job and real time jobs; it is only
aapplication , it does not store data.
2) Repository: is a storage location in any database to store BODS objects.
Repository is a group of BODS metadata tables. DB support BODS
repository are oracle; MySQL; MySQL; DB2; Sybase; Hana(4.1 onwards).
There are three types of repositories:
A local repository(known in Designer as the Local Object Library) is
used by an application designer to store definitions of source and target
metadata and Data Services objects.
A central repository(known in Designer as the Central Object Library)
is an optional component that can be used to support multiuser
development. The Central Object Library provides a shared library that
allows developers to check objects in and out for development.
A profiler repositoryis used to store information that is used to
determine the quality of data.
3) BODS Services: Job Server; Data Engine; Access Server.
I. Job Server: is a service which process BODS batch jobs as part of
processing. Whenever a batch is executed or triggered by user
immediately the assigned job server will study the design of job and
estimate the number of threads and process to be allocated.
II.
Data Engines: are services which takes responsibility in generating
threads and process as per job server estimation; Threads will involve
in extraction & loading and process will involve in transformation of
data; Collectively the Job server and data engines will take
responsibility to processing batch jobs.
III. Access Server:it is also called messaging broker; Access server is a
service which will take responsibility in processing real time jobs; To

manage real time jobs an extra configuration called real time


configuration to be done both sender & receiver side. This
configuration sends a message alert whenever data got updated in
sender side;Access server is a service which is running 24/7and waits
for a message from sender to trigger real time job; Once the real time
job is successfully finish, access server will send a message back to
sender as acknowledgement.
4) Data quality components:
I. Directories:They are also one kind of repositories which stores
cleansing packages. They are physical folders in OS level with name
called reference data under LINK DIR/data quality /reference _data
which holds country specific data; Directories enable match and
consolidate function as part of data cleansing.
II.
Dictionaries:are storage location in database which consists of data
classification, which is a format of an entity; dictionary helps in
enabling standardization of data.
III. Address server: is a service which will process data quality
transformation.
5) BODS Management Console: It is a administrator tool; It is a web
application; URL http://<BODS server name>:<WAS port>/ data services;
Its a secured application 3.x (admin U/N, pwd) from 4.x cmccredentials.
I.
Administrator Scheduling, monitoring, and executing batch jobs;
Configuring, starting, and stopping real-time services; Configuring Job
Server, Access Server, and repository usage; Configuring and
managing adapters; Managing users; Publishing batch jobs and realtime services via web services; Reporting on metadata.
II.
Auto Documentation:View, analyze, and print graphical
representations of all objects as depicted in Data Services Designer,
including their relationships, properties, and more.
III.
Data Validation:Evaluate the reliability of your target data based on
the validation rules you create in your Data Services batch jobs to
quickly review, assess, and identify potential inconsistencies or errors
in source data.
IV.
Impact and Lineage Analysis: Analyze end-to-end impact and
lineage for Data Services tables and columns,and SAP Business
Objects Business Intelligence platform objects such asuniverses,
business views, and reports.
V.
Operational Dashboard:View dashboards of status and performance
execution statistics of DataServices jobs for one or more repositories
over a given time period.
VI.
Data Quality Reports:Use data quality reports to view and export
SAP Crystal Reports for batch
and real-time jobs that include statistics-generating transforms. Report
typesinclude job summaries, transform-specific reports, and transform
groupreports.

6) CMC / IPS (central management console/ Information platform


services)CMC is BO BI platform like netweaver for sap; cmc is a web
application; it is an administrator application; is a secured and role based
application; In cmc w.r.t. BODS we can do repository management, user
management, security mgmt., license management. URL: http://
<BOE/CMC server name>:<WAS port>/BOE/CMC.
Installation of BODS 4.0: We need database(MySQL) and web application
server(Tomcat)
I. Non BOE customer: IPSBODS Server BODS Client.
II.
BOE Customer: BO CMC BODS Server BODS client.
III.
BOE Customer with dual platform: BO CMC; IPS BODS
Server BODS Client.
BODS Tools
1. CMC/IPS tools
I. Central configuration manager:Helps in managing of BO
services; desktop tool/ server tool/ administrator tool; we can
manage EIM adoptive processing service which helps in the creation
of RFC connection for sap extraction.
II.
Central management console:
III. Life cycle management console (LCM):Helps in deploying BOE
content (CMC content); version management tool;web app; server
tool ; administrator tool; URL: http:// <IPS/CMC server>:<WAS
port /BOE/ LCM>; from BOE 4.0 sp 5 onwards in cmc we have two
new options instead of LCM : promotion mgmt., subversion mgmt.
IV.
Upgrade Mgmt tool(UMT):Helps in upgrading BOE content and
applications from lower(4.0) version to higher(4.1) version; it is a
desktop tool; server tool; administrator tool;
V.
Web deployment tool (WDT):Helps in deploying web applications
i.,e when our web applications get corrupted we can deploy using
WDT i.,e deploy .WAR files into Web application server.
2. BODS Server tools
I. Repository Manager:It s a tool to create repositories; It helps to
check the version of the repository; upgrade repository. If BODS
version gets upgraded we will upgrade the repository. It is
administrator tool; desktop tool; server tool.
II.
Server Manager: It is a tool to create Job server, Access server;
Assignment of repository to job server is done here; It is server tool ;
administrator tool; desktop tool; we can do email configurations; we
can define pageble memory path settings; we can do email
configurations; we can do real time job configuration at receiver
side.
III.
BODS mgmt console:
IV.
Metadata integrator: It is a tool which will integrate BODS and BO
BI; here in BODS we will get all metadata of (universes, reports etc.)
and so the impact and linage analysis; It is desktop tool; Server tool;

administrator tool; we can schedule the metadata integrator to get


the data from cmsdb to BODS repo with regular intervals.
V.
License manager: We can add, delete, modify licenses; It is a
desktop tool; Administrator tool; server tool.
3. BODS Client tools
I.
Designer: It is a desktop application and client tool; it is a
developer tool to develop BODS jobs; it supports batch job and real
time jobs; it is only aapplication , it does not store data.
II.
Locale Selector :It will help you to select a default language of
designer; Developer tool; it is client tool; it is desktop tool.
III.
Documentation: Technical manual of the tool.
BODS Objects
BODS objects are two types
I. Reusable objects: A reusable object has a single definition; all calls
to the object refer to that definition. Accessreusable objects through
the local object library. A data flow, work flow, datastore, function etc
are example of reusable objects.
II.
Single-use objects: Some objects are defined only within the context
of a single job or data flow, for example scripts, project, loop, try-catch,
conditionsetc are single use objects.
Projects: A project is a reusable object that allows you to group jobs;A
project is the highest level of organization offered by the software; You can
use a project to group jobs that have schedules that depend on one another
or that you want to monitor together.
Jobs: A job is the only object you can execute. You can manually execute and
test jobs in development.
Dataflow: It is an object to define a ETL definition in BODS (what to extract,
how to extract and where to load).
Workflow: It is an object which helps in grouping of data flows (or) grouping
of set of ETL operations
Datastores: Datastores represent connection configurations between the
software and databases or applications:
There are three kinds of Datastores:
I.
Database Datastores:provide a simple way to import metadata
directly froma RDBMS.
II.
Application Datastores:let users easily import metadata from
mostEnterprise Resource Planning (ERP) systems.
III.
Adapter Datastores:can provide access to an applications data and
metadataor just metadata. For example, if the data source is SQLcompatible, theadapter might be designed to access metadata, while
Data Services extractsdata from or loads data directly to the
application.
Formats: It is an object which helps in connecting between file type sources
and targets in BODS.
Scripts: It is an object which enables BODS customization; We write code in
BODS scripiting language (similar to sql).A script is a single-use object that

is used to call functions and assign values ina work flow.A script can
contain: Function calls If statements While statements Assignment
statements Operators.
Transforms:It is a readymade logic provide by SAP along with this tool to
perform data transformations.
There are more than 20 transforms (data integrator, platform, data quality).
Template table is a BODS object which helps in creating table in any
database; Template tables are used as target table in BODS; Once a job is
executed the template table object is created in target database.
BODS GUI
Project area : Contains the current project (and the job(s) and other objects
within it) available to you at agiven time.
a) Designer tab:wa can design objects / view objects; the objects are
shown in hierarchical fashion.
b) Monitor tab:it is a place where we can see current running jobs (redfail, yellow-currently running, green- successful).
c) Log tab:It is a place where we can see the logs of job( log history 120
days by default)
Workspace: The area of the application window in which you define,
display, and modify objects.
Local object library :Provides access to local repository objects including
built-in system objects, such as transforms, and the objects you build and
save, such as jobs and data flows; All are reusable objects.
Tool palette: Buttons on the tool palette enable you to add new objects to
the workspace by drag and drop functionality.
BODS Job Design Methodology
1. Identify the source and target system types. Systems are 2 types
a)Database type b)File type.
2. Import the source and target systems metadata (schema) into BODS;
File type Format, DB typeDatastore.
3. Create a BODS Job
a. Open an existing project / create a project.
b. Define ETL definition i.,e create dataflow (define source, define
target, define transformation).
c. Validate and save the job.
d. Execute and monitor the job.
Excel workbook Sheet extraction
BODS 3.x supports .xls; from BODS 4.0 onwards .xls and .xlsx
The add-on called Microsoft access database engine should be installed on
both server and client side to make designer application to read Microsoft
excel files; This add-on is provided by sap along withBODS software and is
placed in c:// program files/sapbo/data services/ ext/ Microsoft.
Multiple excel file extraction dynamically:
Recommendations:All fields structure should be unique; All fields data
should be in Named ranges/ sheets; All fields data should be in same sheet
name / sheet number; Maintain common prefix or suffix in file name.

Solutions:
1. Wild card character (* or ?); OR
2. using list of filenames with comma delimiter
Dynamic excel file selection:
Recommendations: Structure in all files should be unique; data should be
in sheet/ named range; Maintain unique sheet name or sheet number;
Naming convention of a file name should contain time stamp when it
distributed.
Solutions:
1. Variable is to pass the file name dynamically during runtime;
2. Script is used to define a file name
Multiple excel workbook sheets extraction dynamically:
Recommendations:Structure should be unique in all sheets; Extra sheet in
a file should have the list of sheet names where data is placed.
Solution:
1. Variable to pass the sheet name dynamically during runtime
2. Script to define the sheet name.
3. Loop to repeat the process to get the data from all the sheets.
Limitations of excel workbook in BODS: BODS is not capable to
distribute data in excel workbook; BODS 3.2 does not support .xlxs format;
We dont have any property to terminate extraction if we get n number of
errors in excel.
Flat file extraction
A flat file is a structured file; a flat file has no data limitations.
Types of flat files:Delimited:A structure text file with markers to identify
row, column and text id called delimited flat file. Columncomma, space,
tab, semicolon; Row windows new line, Unix new line, new line; Text
single quotes, double quotes. Custom row BODS will support special
markers also.Fixed width flat file:is a structured flat file where column
identified by some fixed with (size); Row windows new line, Unix new line,
new line; Text single quotes, double quotes. Custom row BODS will
support special markers also.
SAP transport flat file formatsupports to read a flat file from and write a
flat file to SAP appserver.
Unstructuredtextflat fileis used to read data from unstructured text files
ex: LinkedIn, face book; Here data is in the form of text (characters).
Unstructured binary: Here data is in the form of binary ex .doc,pdf, xml,
html etc.
Advantages of flat file are we can process parallel threads; we have the
option terminate flat file if it contains specific number of errors; with the help
of query transform we can generate flat file.
XML data extraction
It is a nested structure; we get header and data separately; Supports level of
extraction; Supports real time. We get xml data in two fashion .dtd format:
helps in importing xml metadata if a header file is provided in dtd fashion;

xml schema definition: helps in importing metadata if a header file is


provided in xml schema definition.
Make current is available under query transform in output schema which
allows to select the required columns from nested schema, we cant apply
this on multiple nodes at a time. Unnest with subschema: is available
under query transform in output schema which helps to convert the nested
schema structure into flat files.
XML_PIPELINE Transform:
Functionality: Helps in extracting nested structured data; It can be used to
extract the part of xml data Type: Data Integrator; Behaviour: Transform;
Input: Nested input (takes single input); Output : Multiple same outputs;
Properties: no create properties; no target properties; no source object editor
properties; no target object editor properties; Schema In: Source schema
(nested structure); Schema out: Under user control (wecan define schema
i., e create columns here); Limitation: Does not allows to select multiple
columns from same level nodes.
COBOL data extraction
Cobol copybook format: We will get header, file(fixed width flat file and
variable width flat file) in separate files; Supports nested structure; Supports
only one node column extraction at a time. Cobol copy books always acts as
sources only; If there are 3 nodes in Cobol copy book then we get 3 options,
in that we can select any one node.
Time Dimension Implementation
Time dimension is global standard dimension, no need to depend on source
system. Reference entity: based on reference entity we can derive new
entity; Day is the least reference entity in time dimension.
DATE_GENERATION is a transform which provides least reference time
attribute which is day, based on this we can derive new time entity
according to our requirement.
DATE_GENERATION Transform:
Functionality:This transform generates a column which holds the date
values based on the start & end dates provided as an input to the transform
by considering the increment provided to it.Type: Data Integrator transform;
Behavior: It acts like a source;Input: No input; Output: Multiple same
outputs;Schema in: No input schema because it acts as a source; Output
schema: Under system control; we get a single output column
DI_GENERATED_DATE.
Time dependent dimension Implementation
A dimension in which the attributes changes over period of time, if that
changes capture in a dimension then such dimensions are called Time
dependent dimension; We use them for Master data tracking or History
reports; We can define validity of an attribute (Effective to date and effective
from date).
EFFECTIVE_DATE Transform:
Functionality: Derives valid_to column based on valid_from column and
sequence column; Type: Data Integrator; Behaviour: Transform; Input:

Input should contain effective_from column and sequence column; Output:


Multiple same outputs; Schema out: Input schema + one additional date
column.
Slowly changing Dimensions
Dimensions are changing slowly such dimensions are called slowly changing
dimensions; SCD are 3 tpes SCD-T0, SCD-T1, SCD-T2; T3, T4 and T6(Hybrid
SCD-T2) are different representations od SCD-T2.
SCD-T0: It holds current data only; It should not maintain historical data and
it should not capture changes; Functionality to enable in ETL side is truncate
and reload; In BODS it is implemented by check box Delete data before
loading at database table target editor properties.
SCD-T1: It holds current data and it also holds historical data, but it should
not capture changes.Functionality to enable in ETL side is overwrite In BODS
it is implemented by Table Comparison transform and Map
Operation(optional).
SCD-T2: It holds current data and it also contain Historical data and also
capture change data.Functionality to enable in ETL side is every change to
be inserted( we add 4 additional colums SID,CUR_REC _IND,EFF_from,EFF_to)
TRANSFORMS
TABLE COMPARISION:
Functionality: It helps to identify the type of each input record which you
extracted; Type: Data Integrator transform; Behaviour: Transform;
Prerequistics: here should be atleast one common key (primary key) in
between input table and comparision table; Input table columns and
comparision table names and data types should be same(not all columns,
only common columns); Input: songle input; Output: Multiple single output;
Schema out: under system control.

Compares two data sets and produces the difference between them as
a data set with rows flagged as INSERT or UPDATE.
The Table_Comparison transform allows you to detect and forward
changes that have occurred since the last time a target was updated.
3)Table Comparison :
Compares two data sets and produces the difference between them
as a data
set with rows flagged as INSERT, UPDATE, or DELETE.
Row-by-row select Look up the target table using SQL every time
it receives an input row.
This option is best if the target table is large.
Cached comparison table To load the comparison table into
memory. This option is best when the table fits into memory and you
are comparing the entire target table.

Sorted input
read the comparison table in the order of the primary key column(s)
using sequential read.This option improves performance because
Data Integrator reads the comparison table only once.Add a query
between the source and the Table_Comparison transform. Then,
from the querys input schema, drag the primary key columns into
the Order By box of the query.
Input primary key column(s)
The input data set columns that uniquely identify each row. These
columns
must be present in the comparison table with the same column
names and
data types.
Input contains duplicate keys
Generated Key column
Detect Deleted row(s) from comparison table
Detect all rows
Detect row with largest generated key value.
If we are checking the option Input contains duplicate keys
then in the Generated Key column we will be selecting the Primary
key which retrieves the largest generated key value in the duplicate
data.
If we are selecting the option Detect Deleted row(s) from
comparison table
It will flag the record/records with DELETE.
MAP_OPERATION Transform
Functionality: Helps in changing the record modes of a record in BODS;
Type: Platform transform; Behaviour: Behaves like a transform; Input:
Single input; Output: Multiple single output; Schema out: Under system
control; output schema equals input schema.
HISTORY_PRESERVING Transform
Functionality: Helps in defining current record indicator flag and validity if
each input record as part of SCD-T2 implementation.;Type: Data Integrator;
Behaviour: Transform; Input: single input (Input should be table
comparision); Output: multiple single outputs; Schema out: under system
control; output schema structure equal to table comparison structure.

Allows you to produce a new row in your target rather than updating an
existing row.
You can indicate in which columns the transform identifies changes to
be preserved.

If the value of certain columns change, this transform creates a new


row for each row flagged as UPDATE in the input data set.

KEY_GENERATION Transform
Functionality: Helps in implementing SID (like sequence generator in
database); Type: Data Integrator; Behavior: Transform; Input: single input;
Output: Multiple single output.
Generates new keys for new rows in a data set.
The Key_Generation transform looks up the maximum existing key
value from a table and uses it as the starting value to generate new
keys.
MAP OPERATION
Allows conversions between data manipulation operations.
The Map_Operation transform allows you to change operation
codes on data sets to produce the desired output.
For example, if a row in the input data set has been
updated in some previous operation in the data flow, you
can use this transform to map the UPDATE operation to
an INSERT. The result could be to convert UPDATE rows to
INSERT rows to preserve the existing row in the target.
Data Integrator can push Map_Operation transforms to the
source database.
MERGE Transform
Functionality: It enables data integration functionality in BODS; Type:
Platform Transform; Behavior: Transform; Input: Multiple inputs: Output:
Multiple single outputs; Pre requisites: Before passing inputs to merge
transform all the sources should have unique structure (No of columns,
column names, data types order of columns); Schema out: Under system
control; output schema is equal to input schema.

Combines incoming data sets, producing a single output data set with
the same schema as the input data sets.

ROW_GENERATION Transform
Using this transform we can maintain one special character in dimension
table( which handle s null records DWH).Functionality: Helps in generating
the number of rows requesting; Type: Platform transform; Behavior:
Source: Input No input; Output: Multiple single outputs.
Objectives:

Produces a data set with a single column. The column values


start from zero and increment by one to a specified number of
rows.
Descriptions:
Data Inputs
None.
Options
Row count
A positive integer indicating the number of rows in the output
data set. For added flexibility, you can enter a variable for this
option.
Join rank
A positive integer indicating the weight of the output data set if
the data set is used in a join. Sources in the join are accessed in
order based on their join ranks. The highest ranked source is
accessed first to construct the join.
Cache
Select this check box to hold the output from the transform in
memory to be used in subsequent transforms. Select Cache only
if the resulting data set is small enough to fit in memory.
Editor
The Row Generation transform editor includes the target schema,
and transform options.
Data Outputs
The Row_Generation transform produces a data set with a single
column and the number of rows specified in the Row option. The
rows contain integer values in sequence starting from zero and
incrementing by one in each row.

PIVOT Transform
Functionality: Helps in applying pivoting on input data i.,e converting
columns into rows: Type: Data integrator; Behavior: Transform;Input: Single
input: Output: Multiple single output: Schema out: Is under system control;
Default output schema of a pivot transform:
pivot_seq(int);pivot_HDR(varchar); pivot_data (varchar); Output schema of
pivot = no of non pivotcolums + default schema; pivot sequence columns
should be primary key.
Pivot Sequence Column : For each row created from a pivot column,
Data Integrator increments and stores a sequence number.
Non-Pivot Columns : The columns in the source that are to appear in
the target without modification.
Pivot set : The number that identifies a pivot set. Each pivot set
must have a a group of pivot columns,
unique Data field column and the Header column. Data Integrator
automatically saves this information.

Data field column : Contains the pivoted data. This column contains
all of the Pivot columns values.
Header column : lists the names of the columns where the
corresponding data originated.

Creates a new row for each value in a column that you identify as a
pivot column.
The Pivot transform allows you to change how the relationship between
rows is displayed.
For each value in each pivot column, DI produces a row in the output
data set.
You can create pivot sets to specify more than one pivot column.

REVERSE_PIVOT Transform
Functionality: Helps in applying pivoting operation on input data i.,e
converting rows into columns; Type: Data Integrator; Behavior: Transform;
Input: Onlyone input of flat structure: output: Multiple single output;
Schema out: Under system control.
Non-pivot columns
The columns in the source table that will appear in the target table
without modification.
Pivoted columns A set of columns will be created for each unique
value in the Pivot axis column.
Pivot axis column
The column that determines what new columns are needed in the
output table. At run time, a new column is created for each Pivoted
column and each unique value in this column.
Duplicate value
Action taken when a collision occurs. A collision occurs when there
is more than one row with the same key and value in the Pivot axis
column. In this case, you can store either the first row or the last
row, or you can abort the transform process.
Axis value
The value of the pivot axis column that represents a particular set
of output columns.
Column Prefix
Text added to the front of the Pivoted column names when creating
new column names for the rotated data.

Creates one row of data from several existing rows.


The Reverse Pivot transform allows you to combine data from several
rows into one row by creating new columns.
For each unique value in a pivot axis column and each selected pivot
column, DI produces a column in the output data set.

CASE Transform

Functionality: Helps in distributing or categorization of data: Type:


Platform;Behavior: Transform: Input single flat structure input: Output
Multiple distinct output; Output schema structure equal to input schema
structure.
Hierarchies: Grouping of master data is hierarchies.
In Hierarchy Flattening data stores in two fashions; Horizontal fashion:
Enables drill down and drill up functionality (mandatory format); Vertical
fashion: enables global filtering (performance tuning option).
Horizontal Structure current leaf, level 0, level 1level n leaf level; Vertical
structure: Ancestor ID, Descendent ID, root flag, leaf flag, depth
HIERARCY_FLATTENING Transform
Functionality: Helps in handling hierarchy data in BODS; It stores in 2
fashion; Input: either 2 structured flat sources either 2 or 4 columns which
includes parent and childs; Output: Multiple single output.

Constructs a complete hierarchy from parent/child relationships, then


produces a description of the hierarchy in vertically or horizontally
flattened format.
Flattening Types
Horizontal
Vertical
Horizontal Flattening
Each row of the output describes a single node in the hierarchy and the
path to that node from the root. This mode requires that you specify
the maximum path length through the tree as the Maximum depth.
If there is more than one path to a node, all paths are described in the
result.
Vertical Flattening
Each row of the output describes a single relationship between
ancestor and descendent and the number of nodes the relationship
includes.
There is a row in the output for each node and all of the descendents of
that node.
Each node is considered its own descendent and therefore is listed one
time as both ancestor and descendent.

MAP_CDC_OPERATION Transform
Functionality: Helps in handling how source based CDC sources system
data should update in target i.,e it simple updates values (I,U,D) in column
DI_Operation_Type based on CDC column: Type Data integrator ;Behavior:

Transform; Input : Only on e input of type oracle/ Microsoft CDC table or SAP
extractor; Output: Multiple single outputs( with different records modes).
Using its input requirements performs three functions:
Sorts input data based on values in Sequencing column box
and the Additional Grouping Columns box.
Maps output data based on values in Row Operation Column
box.
Source table rows are mapped to INSERT, UPDATE, or DELETE
operations before passing them on to the target.
Resolves missing, separated, or multiple before- and afterimages for UPDATE rows.
While commonly used to support Oracle or mainframe changed-data
capture, this transform supports any data stream as long as its input
requirements are met.
This transform is typically the last object before the target in a data
flow because it produces INPUT, UPDATE and DELETE operation codes.
Data Integrator produces a warning if other objects are used.
VALIDATION Transform
Functionality: helps in enabling validation layer in BODS; Type: Platform;
Behavior: Transform; Input: Single input with flat structure(no xml);
Output: 3.x (2 set distinct output); 4.x (3 set distinct outputs); Schema out:
Pass output schema of pass set is equal to input schema; Fail output schema
os equal to onput schema plus 2 colums (DI_ERRORACTION and
DI_ERRORCOLUMN, if we enable the check box we also get DI_ROWID colum);
Rule violation set schema DI_ROWID,DI_ROWNAME and DI_COLUMNNAME.

One source in a data flow.


Qualifies a data set based on rules for input schema columns. Allows
one validation rule per column.
Filter out or replace data that fails your criteria.
Outputs two schemas: Pass and Fail.

DATA TRANSFER Transform


Functionality: Helps I pushing the source data connected to any database
or BODS job server location Or to cache memory level; Type: Data
Integrator; Behaviour: Transform; Input: Only one input of flat type;
Output: Multiple Single output; Output schema is equal to input schema.
SQL Transform
Functionality: A transform which helps to pushdown select operations to
database level; Type: Platform; Behavior: source; Input: No input; Output:
Multiple single output; Output schema equal to select statement output
schema.

Performs the indicated SQL query operation.


Use this transform to perform standard SQL operations for things that
cannot be performed using other built-in transforms.
The options for the SQL transform include specifying a datastore, join
rank, cache, array fetch size, and entering SQL text.

QUERY TRANSFORM
The Query transform can perform the following operations:
Choose (filter) the data to extract from sources
Join data from multiple sources
Map columns from input to output schemas
Perform transformations and functions on the data
Perform data nesting and unnesting
Add new columns, nested schemas, and function results
to the output schema
Assign primary keys to output columns
DATA ASSESSMENT using BODS
For analyzing the data and measuring the data;mainly about studying the
sources w.r.t data
Functions available in BODS View data function, Validation Transform
Sstatistics,Auditing and data profiling;
Profiling types: Column profiling and Relationship profiling;
Column profiling are two types Basic column profiling (min length, max
length, medium length, min length % max% medium%, no of null %, no of
blanks%) and Detail column profiling (distinct %, pattern %, median).
Relationship Profiling: Defining relationship between two objects; % of
common data, % of non-common data.
SAP has separate tool: Data insight or Information steward.
BODS Variables
Enables dynamism in etl jobs;
Global variable(G_XXX): Job dependent; Can be used any where inside the
job. (not across the job); Prossessing option: default value, User entry and
through script: Input value: single value.

Global variables are global within a job. Setting parameters is not


necessary when you use global variables.
However, once you use a name for a global variable in a job,
that name becomes reserved for the job. Global variables are
exclusive within the context of the job in which they are created.

Local Variable(L_XXX): Job dependent and workflow dependent; Can be


used only thwpalce where it is defined; Processing option: through script;
Input type vle: single value.
To pass a local variable to another object, define the local ariable, then from
the calling object, create a parameter and map the parameter to the local
variable by entering a parameter value.
Parameters (P_XXX):Generally parameters are used in functions to pass
the input and take the output like input parameters and output parameters ;
workflow and dataflow dependent; can be used only the place where it is
defined; processing options: Only through local variable assignment; Input
value type: single value.
Pass their values into and out of work flows
Pass their values into data flows Each parameter is assigned a
type: input, output, or input/output. The value passed by the parameter can
be used by any object called by the
work flow or data flow.
Substitution Parameters ($$_XXX): They hold file path information; they
are repository dependent; can be used anywhere inside the repository;
Processing options: default value, user entry and through script; Input value
type : single.
LOOKUP()
Lookups always work at column level, transform works at row level
Normal lookup: Derives only one column, lookup source is only db, supports
only equicondition(=); only one equi condition; performance tuning (preload
cache, demand load cache)
Lookup_ext(): It derives multiple columns from one lookuo_ext function;
lookup sources are db tables and flat files; supports all operators
(=,<,>..etc); supports multiple conditions/columns; supports scd t2 sources,
performance tuning(preload cache , demand load cache and run as separate
process).
Lookup_seq(): Derives only one colum; lookup sources only db tables;
supports only equi conditions; supports two equi condition; meant for scd t2
handling: no performance tuning options.
LOOKUP : Retrieves a value in a table or file based on the values in a
different source
table or file.
LOOKUP EXT :
Graphic editor in the function wizard.
Retrieve a value in a table or file based on the values in a different
source table or file,
but it also provides extended functionality allowing you to:

Return multiple columns from a single lookup


Choose from more operators to specify a lookup condition
Perform multiple (including recursive) lookups
Call lookup_ext in scripts and custom functions (which also lets
you
reuse the lookup(s) packaged inside scripts)
Define custom SQL, using the SQL_override parameter, to populate
the lookup cache, narrowing large quantities of data to only the
sections
relevant for your lookup(s)
Use lookup_ext to dynamically execute SQL
Call lookup_ext, using the function wizard, in the query output
mapping
to return multiple columns in a Query transform
Design jobs to use lookup_ext without having to hard-code the
name of
the translation file at design time.
Use lookup_ext with memory datastore tables
LOOKUP SEQ :
Retrieves a value in a table or file based on the values in a different
source
table or file and a particular sequence value.
NEW FUNCTION CALL :
** Works same as Normal Lookup Ext.
** The only difference is this can be editable in the wizard form
where as in normal Lookup Ext which is not possible.
** No Need of Global/Local Variables usage in the output ports.
BODS Project Deployment
Repository content: Two methods Direct method,(repo-to-repo); good for
first time deployment. Export/ import method (atl/xml) (good for version
management).
BODS mgmt. console content: All the configuration which we do in BODS
mgmt. console are not stored in repository. They are stored in the files
admin.xml, as.xml and sapconnection.xml which are in <LINK_DIR>/cong
and <LINK_DIR>/bin
CMC content: Use LCM or promotion management to deploy source cms
database to target.
Substitution parameter configuration and System configuration
content: Using export /import method inside BODS designer option we can
deploy to target system.
CENTRAL REPOSITORY

Limitation of local repository: no multi user access, no auto lock feature,


no version managmentand no security.
Central repository enables multiuser environment and version management
in BODS..
Two types of central repository Unsecured central repository and secured
central repository.
Exporting a new job to Central Repository
1. Right click on the new job that you created we will get an object
with name
Export to central Repository just click on that then it will
export the job to
the path that mentioned in the Repository.
2. For doing some modifications and to maintain versions in central
repository Follow these steps mentioned below.
Checking out objects from the central repository
To check out an object and its dependent objects
1. Open the central object library.
2. In the central object library, right-click the job_<job name>.
3. Click Check Out > Object and dependents.
Checking in objects to the central repository
To check in a single object
1. Open the central object library.
2. Right-click the DF_EmpLoc data flow in the central object library and
click Check In > Object. A Comment window opens.
3. Type required comment with which we will get a version to find out
what modification we have done on the job which is already present.

BODS Performance Tuning Techniques


Environmental performance tuning techniques: OS leve I/O ratio;
Application level Increase the app level cache memory; Database level
primary indexes, regular db statistics collection; Network level increase
the bandwidth; BODS app side check boxes in execution properties such
asmonitor sample rate,print all trace message, disable data validation
statistics collection, enable auditing
Multi-Threading: Partitioning the source table, falt file allocate parallel
threads.
Multi-processing: Transform level run as a separate process; Data flow
level Degree of parallelism
Paarallel Processing: parallel data flow is parallel processing
Push down: Pushing BODS operations to other system; BODS automatically
push operation if the source and target are from same schema; if different
schema can useDB link OR Data transfer transform OR SQL Transform.

Bulk Loading: Only enabled for databse type targets; performed with the
help of SQL loader tool in BODS; Target object editor properties check box
bulk load
Through put
Dataflow levelcache type (In memory cache, pageble cache); Transform
level Table comparison (cached comparison method); lookup function
pre load cache and demand load cache.
Other: Source level Array fetch size (best practice 1000-5000); Join rank
(normal, ranking and cache algorithms); Target Level Rows per commit
1000-5000; Advanced->general-> Number of loaders.
Error Handling: Target editor property-> error handling-> use overflow file.
Source-based performance options
Using array fetch size
Caching data
Join ordering
Minimizing extracted data
Target-based performance options
Loading method and rows per commit
Staging tables to speed up auto-correct loads
Job design performance options
Improving throughput
Maximizing the number of pushed-down operations
Minimizing data type conversion
Minimizing locale conversion
Improving Informix repository performance
1 Utilize a database (like Oracle / Sybase / Informix / DB2 etc...)
for significant data handling operations (such as sorts, groups,
aggregates). In other words, staging tables can be a huge benefit to
parallelism of operations. In parallel design - simply defined by
mathematics, nearly always cuts your execution time. Staging tables
have many benefits. Please see the staging table discussion in the
methodologies section for full details.
2 Localize. Localize all target tables on to the SAME instance of
Oracle (same SID), or same instance of Sybase. Try not to use
Synonyms (remote database links) for anything (including: lookups,
stored procedures, target tables, sources, functions, privileges, etc...).
Utilizing remote links will most certainly slow things down. For Sybase
users, remote mounting of databases can definitely be a hindrance to
performance.
3 If you can - localize all target tables, stored procedures,
functions, views, sequences in the SOURCE database. Again, try
not to connect across synonyms. Synonyms (remote database tables)

could potentially affect performance by as much as a factor of 3 times


or more.
4 Remove external registered modules. Perform pre-processing /
post-processing utilizing PERL, SED, AWK, GREP instead. The
Application Programmers Interface (API) which calls externals is
inherently slow (as of: 1/1/2000). Hopefully Informatica will speed this
up in the future. The external module which exhibits speed problems is
the regular expression module (Unix: Sun Solaris E450, 4 CPU's 2 GIGS
RAM, Oracle 8i and Informatica). It broke speed from 1500+ rows per
second without the module - to 486 rows per second with the module.
No other sessions were running. (This was a SPECIFIC case - with a
SPECIFIC map - it's not like this for all maps).
5 Remember that Informatica suggests that each session takes
roughly 1 to 1 1/2 CPU's. In keeping with this - Informatica play's
well with RDBMS engines on the same machine, but does NOT get
along (performance wise) with ANY other engine (reporting engine,
java engine, OLAP engine, java virtual machine, etc...)
6 Remove any database based sequence generators. This requires
a wrapper function / stored procedure call. Utilizing these stored
procedures has caused performance to drop by a factor of 3 times. This
slowness is not easily debugged - it can only be spotted in the Write
Throughput column. Copy the map, replace the stored proc call with an
internal sequence generator for a test run - this is how fast you COULD
run your map. If you must use a database generated sequence
number, then follow the instructions for the staging table usage. If
you're dealing with GIG's or Terabytes of information - this should save
you lot's of hours tuning. IF YOU MUST - have a shared sequence
generator, then build a staging table from the flat file, add a
SEQUENCE ID column, and call a POST TARGET LOAD stored procedure
to populate that column. Place the post target load procedure in to the
flat file to staging table load map. A single call to inside the database,
followed by a batch operation to assign sequences is the fastest
method for utilizing shared sequence generators.
7 TURN OFF VERBOSE LOGGING. The session log has a tremendous
impact on the overall performance of the map. Force over-ride in the
session, setting it to NORMAL logging mode. Unfortunately the logging
mechanism is not "parallel" in the internal core, it is embedded directly
in to the operations.
8 Turn off 'collect performance statistics'. This also has an
impact - although minimal at times - it writes a series of
performance data to the performance log. Removing this operation

reduces reliance on the flat file operations. However, it may be


necessary to have this turned on DURING your tuning exercise. It can
reveal a lot about the speed of the reader, and writer threads.
9 If your source is a flat file - utilize a staging table (see the
staging table slides in the presentations section of this web site). This
way - you can also use SQL*Loader, BCP, or some other database BulkLoad utility. Place basic logic in the source load map, remove all
potential lookups from the code. At this point - if your reader is slow,
then check two things: 1) if you have an item in your registry or
configuration file which sets the "ThrottleReader" to a specific
maximum number of blocks, it will limit your read throughput (this only
needs to be set if the sessions have a demonstrated problems with
constraint based loads) 2) Move the flat file to local internal disk (if at
all possible). Try not to read a file across the network, or from a RAID
device. Most RAID array's are fast, but Informatica seems to top out,
where internal disk continues to be much faster. Here - a link will NOT
work to increase speed - it must be the full file itself - stored locally.
10 Try to eliminate the use of non-cached lookups. By issuing a noncached lookup, you're performance will be impacted significantly.
Particularly if the lookup table is also a "growing" or "updated" target
table - this generally means the indexes are changing during operation,
and the optimizer looses track of the index statistics. Again - utilize
staging tables if possible. In utilizing staging tables, views in the
database can be built which join the data together; or Informatica's
joiner object can be used to join data together - either one will help
dramatically increase speed.
11 Separate complex maps - try to break the maps out in to logical
threaded sections of processing. Re-arrange the architecture if
necessary to allow for parallel processing. There may be more smaller
components doing individual tasks, however the throughput will be
proportionate to the degree of parallelism that is applied. A discussion
on HOW to perform this task is posted on the methodologies page,
please see this discussion for further details.
12 BALANCE. Balance between Informatica and the power of SQL
and the database. Try to utilize the DBMS for what it was built for:
reading/writing/sorting/grouping/filtering data en-masse. Use
Informatica for the more complex logic, outside joins, data integration,
multiple source feeds, etc... The balancing act is difficult without DBA
knowledge. In order to achieve a balance, you must be able to
recognize what operations are best in the database, and which ones
are best in Informatica. This does not degrade from the use of the ETL

tool, rather it enhances it - it's a MUST if you are performance tuning


for high-volume throughput.
13 TUNE the DATABASE. Don't be afraid to estimate: small, medium,
large, and extra large source data set sizes (in terms of: numbers of
rows, average number of bytes per row), expected throughput for
each, turnaround time for load, is it a trickle feed? Give this information
to your DBA's and ask them to tune the database for "wost case". Help
them assess which tables are expected to be high read/high write,
which operations will sort, (order by), etc... Moving disks, assigning the
right table to the right disk space could make all the difference. Utilize
a PERL script to generate "fake" data for small, medium, large, and
extra large data sets. Run each of these through your mappings - in
this manner, the DBA can watch or monitor throughput as a real load
size occurs.
14 Be sure there is enough SWAP, and TEMP space on your
PMSERVER machine. Not having enough disk space could potentially
slow down your entire server during processing (in an exponential
fashion). Sometimes this means watching the disk space as while your
session runs. Otherwise you may not get a good picture of the space
available during operation. Particularly if your maps contain
aggregates, or lookups that flow to disk Cache directory - or if you have
a JOINER object with heterogeneous sources.
15 Place some good server load monitoring tools on your
PMServer in development - watch it closely to understand how the
resources are being utilized, and where the hot spots are. Try to follow
the recommendations - it may mean upgrading the hardware to
achieve throughput. Look in to EMC's disk storage array - while
expensive, it appears to be extremely fast, I've heard (but not verified)
that it has improved performance in some cases by up to 50%
16 SESSION SETTINGS. In the session, there is only so much
tuning you can do. Balancing the throughput is important - by
turning on "Collect Performance Statistics" you can get a good feel for
what needs to be set in the session - or what needs to be changed in
the database. Read the performance section carefully in the
Informatica manuals. Basically what you should try to achieve is:
OPTIMAL READ, OPTIMIAL THROUGHPUT, OPTIMAL WRITE. Over-tuning
one of these three pieces can result in ultimately slowing down your
session. For example: your write throughput is governed by your read
and transformation speed, likewise, your read throughput is governed
by your transformation and write speed. The best method to tune a
problematic map, is to break it in to components for testing: 1) Read
Throughput, tune for the reader, see what the settings are, send the

write output to a flat file for less contention - Check the


"ThrottleReader" setting (which is not configured by default),
increase the Default Buffer Size by a factor of 64k each shot - ignore
the warning above 128k. If the Reader still appears to increase during
the session, then stabilize (after a few thousand rows), then try
increasing the Shared Session Memory from 12MB to 24MB. If the
reader still stabilizes, then you have a slow source, slow lookups, or
your CACHE directory is not on internal disk. If the reader's throughput
continues to climb above where it stabilized, make note of the session
settings. Check the Performance Statistics to make sure the writer
throughput is NOT the bottleneck - you are attempting to tune the
reader here, and don't want the writer threads to slow you down.
Change the map target back to the database targets - run the session
again. This time, make note of how much the reader slows down, it's
optimal performance was reached with a flat file(s). This time - slow
targets are the cause. NOTE: if your reader session to flat file
just doesn't ever "get fast", then you've got some basic map
tuning to do. Try to merge expression objects, set your lookups to
unconnected (for re-use if possible), check your Index and Data cache
settings if you have aggregation, or lookups being performed. Etc... If
you have a slow writer, change the map to a single target table at a
time - see which target is causing the "slowness" and tune it. Make
copies of the original map, and break down the copies. Once the
"slower" of the N targets is discovered, talk to your DBA about
partitioning the table, updating statistics, removing indexes during
load, etc... There are many database things you can do here.
17 Remove all other "applications" on the PMServer. Except for the
database / staging database or Data Warehouse itself. PMServer plays
well with RDBMS (relational database management system) - but
doesn't play well with application servers, particularly JAVA Virtual
Machines, Web Servers, Security Servers, application, and Report
servers. All of these items should be broken out to other machines.
This is critical to improving performance on the PMServer machine.

BODS Scripting language


Keywords Beginend; ifelse; while; trycatch;
Special Charecters: # comment ; end of statement; () to define a
function; {} to take variable values as text string; [] to take the value of
integer; , to differentiate parameter values; ! not; || for concatenation;
string representation; \ escape character, $ variable; *
Multipilication.
BODS Recovery Mechanism

Tool based recovery mech: This tool is acailable in BODS ,but it is


switched off , we can enable when required; go to job execute enable
recovery; recover from last failed execution; recover as a unit (workflow
level)
Limitations: Not at dataflow level; job should be sequential
Custom based recovery: should be done once the last job run should be
successfully done without any
failure.
Automatically recovering jobs An DI feature that allows you to run
unsuccessful jobs in recovery mode.
Manually recovering jobs A design technique that allows you to
rerun jobs without regard to partial results in a previous run.
Manually Recovering Jobs
A job designed for manual recovery must:
Be such that it can be run repeatedly
Implement special steps to recover data when a step did
not complete successfully during a previous run.
BODS Debugging
Debugging is used to trace the job at data level; used to identify issues
(record by record); debugd only dataflow; Pre requisites: without having
breakpoints we cant debug(mandatory); conditions & filters (optional);
Breakpoint is a point where interface debugging starts working
BODS interface debugger: available in designer; Tools->options>designer-> environment->interactive debugger
Interactive Debugger :
It allows to examine and modify data row by row by placing
filters and break points on
lines in the data flow during a
debug mode job execution.
Using the debugger we can know what happen to the data after
each transform or object in the flow.
Designer displays four additional windows:
1. Call stack
2. Trace
3. Variables
4. View Data
The left View Data pane shows the data in the source table, and
the right pane shows one row at a time (the default when you set
a breakpoint) that has passed to the Query
Filters :

A debug filter functions as a simple Query transform with a


WHERE clause; however, complex expressions are not supported

in a debug filter. Use a filter if you want to reduce a data set in a


debug job execution.
Breakpoints :
We can set a breakpoint between a source and
transform or two transforms. A
breakpoint is the location
where a debug job execution pauses and returns
control to
you.
A breakpoint condition applies after UPDATE,
NORMAL and INSERT row
types and to the before DELETE.
Steps to set filters and breakpoints :
1. Open the Job you want to debug with filters and breakpoints
in the workspace.
2. Open one of its Data Flows.
3. Right-click the connecting line that you want to examine, and
select Set Filter/Breakpoint.
4. In the breakpoint window, under the Filter or Breakpoint
columns, select the Set check box.
Complete the Column, Operator, and Value columns accordingly.

TRY CATCH
A try/catch block is a combination of one try object and one or more
catch objects that allow you to specify alternative work flows if errors
occur while DI is executing a job.
Try/catch blocks:
Catch classes of exceptions thrown by DI, the DBMS, or the
operating system
Apply solutions that you provide
Continue execution
Try and catch objects are single-use objects.
Scripts are single-use objects used to call functions and assign values
to variables in a work flow.
A script can contain the following statements:
Function calls
If statements
While statements
Assignment statements
Operators

MIRGRATION AND REPOSITORIES


DI supports a number of environments, including large
enterprises with many developers working on multiple projects.
DI supports multi-site architectures whether centralized or not.
The development process you use to create your ETL application
involves 3 distinct phases:

design,
Test, and
production
Each phase may require a different computer in a different
environment, and different security settings for each.

Data Integrator provides two migration mechanisms:


Export/import migration
works best with small to medium-sized projects where a
small number of developers work on somewhat
independent Data Integrator applications through all
phases of development.
Multi-user development
works best in larger projects where two or more
developers or multiple teams are working on
interdependent parts of Data Integrator applications
through all phases of development.
Exporting Objects to a Database
You can export objects from the current repository to another
repository.
However, the other repository must be the same version as the
current one.
The export process allows you to change environment-specific
information defined in datastores and file formats to match the new
environment

Exporting/Importing Objects to/from a File


You can also export objects to a file.
If you choose a file as the export destination, DI does not
provide options to change environment specific information.
Importing objects or an entire repository from a file overwrites
existing objects with the same names in the destination
repository. You must restart DI after the import process
completes.

Using built-in functions :


DI provides over 60 built-in functions.
Aggregate : avg, count, count_distinct, max, min, sum
Conversion : cast, extract_from_xml, interval_to_char, julian_to_date,
load_to_xml, long_to_varchar, num_to_interval , to_char, to_date, to_decimal,
to_decimal_ext, varchar_to_long

Database : total_rows, key_generation, sql


Date : week_in_year, week_in_month, sysdate, quarter, month,
last_date, julian, isweekend, fiscal_day, day_in_year, day_in_week,
day_in_month, date_part, date_diff, concat_date_time, add_months
Environment : get_env, get_error_filename, get_monitor_filename,
get_trace_filename, is_set_env, set_env
Lookup : lookup, lookup_ext, lookup_seq
Math : ceil, floor, In, log, mod, power, rand, rand_ext, round, sqrt,
trunc
Miscellaneous : current_configuration, current_system_configuration,
dataflow_name, datastore_field_value, db_type, db_version,
db_database_name, db_owner, decode, file_exists, gen_row_num_by_group,
gen_row_num, get_domain_description, get_file_attribute, greatest,
host_name, ifthenelse, is_group_changed, isempty, job_name, least, nvl,
previous_row_value, pushdown_sql, raise_exception, raise_exception_ext,
repository_name, sleep, system_user_name, table_attribute, truncate_table,
wait_for_file, workflow_name
DATA TRANSFER

XML_PIPELINE

VALIDATION TRANSFORM

QUERY TRANSFORM

TABLE COMPARISION

KEY GENERATION

Hierarcy FLATNENNING

Lookup_ext

List of functions

PIVOT TRANSFORM

EXECUTION PROPERTIES

SAP BusinessObjects Data Services 4.0 features.....


Documentation..................................................................................................
......................9
2.2 SAP
integration.........................................................................................................
..............9
2.3
Security..............................................................................................................
..................11
2.4 Text Data
Processing.........................................................................................................
....12
2.5
Architecture.......................................................................................................
...................12

2.6
Transforms.........................................................................................................
...................13
2.7 Operational
excellence.........................................................................................................
..14
2.8
Functions...........................................................................................................
...................14
2.9 Source and target
support.....................................................................................................15
2.10 Data
Quality...............................................................................................................
...........15
2.10.1 Data Cleanse
transform.........................................................................................................1
7
2.10.2 Geocoder
transform...........................................................................................................
....18
2.10.3 Global Address Cleanse
transform.........................................................................................19
2.10.4 Match
transform...........................................................................................................
..........21
2.10.5
USA
Regulatory
Address
Cleanse
transform.........................................................................
What are the benefits of Datawarehousing?

Facilitate integration in an environment characterized by unintegrated


applications.

Integrate enterprise data across a variety of functions.

Integrate external as well as internal data.

Support strategic and longterm business planning.

Support daytoday tactical decisions.

Enable insight into business trends and business opportunities

What is the difference between dimensional table and fact


table?

A dimension table consists of tuples of attributes of the dimension. A fact


table can be thought of as having tuples, one per a recorded fact. This fact
contains some measured or observed variables and identifies them with
pointers to dimension tables.
Data Warehousing helps you store the data while business intelligence helps
you to control the data for decision making, forecasting etc.
Dimension Table Examples
Retail -- store name, zip code, product name, product
category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch,
account officer
Insurance -- policy type, insured party
Dimension Table Characteristics
Dimension tables have the following characteristics:
- Contain textual information that represents the attributes of the
business
- Contain relatively static data
- Are joined to a fact table through foreign key reference
- They are hierarchical in nature and provide the ability to view data at
varying levels of details.
Fact Table Examples
Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average
number of calls
Banking -- average monthly balance
Insurance -- claims amount
Fact Table Characteristics
Fact table have the following characteristics
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly
Are joined to dimension table through foreign keys

that reference primary keys in the dimension tables


Identifying Measures and Dimensions
The attribute is perceived
as constant or discrete:
Dimension
Product
Location
Time
Size
The attribute varies
continuously:
Measures
Balance
Units Sold
Cost
Sales
Business intelligence usually refers to the information that is available for
the enterprise to make decisions on. A data warehousing (or data mart)
system is the backend, or the infrastructural, component for achieving
business intellignce. Business intelligence also includes the insight gained
from doing data mining analysis, as well as unstrctured data (thus the need
fo content management systems). For our purposes here, we will discuss
business intelligence in the context of using a data warehouse infrastructure.
Business Intelligence Tools:
Excel, Reporting tool, OLAP tool, Data mining tool
Business intelligence users:
Business operations reporting, Forecasting, Dashboard,
Multidimensional analysis, Finding correlation among different
factors

When should you use a STAR and when a SNOW-FLAKE schema?

The star schema is the simplest data warehouse schema. Snow flake schema
is similar to the star schema. It normalizes dimension table to save data
storage space. It can be used to represent hierarchies of information.
What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the
basic constructs used to design a data warehouse and a data mart. In
general a Data Warehouse is used on an enterprise level, while Data Marts is
used on a business division/department level. A data mart only contains the
required subject specific data for local analysis.
Data Warehouse
A data warehouse is a central repository for all or significant parts of the
data that an enterprise's various business systems collect. Typically, a data
warehouse is housed on an enterprise mainframe server. Data from various
online transaction processing (OLTP) applications and other sources is
selectively extracted and organized on the data warehouse database for use
by analytical applications and user queries. Data warehousing emphasizes
the capture of data from diverse sources for useful analysis and access.
Data Marts
A data mart is a repository of data gathered from operational data and
other sources that is designed to serve a particular community of knowledge
workers. In scope, the data may derive from an enterprise-wide database or
data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in
terms of analysis, content, presentation, and ease-of-use. Users of a data
mart can expect to have data presented in terms that are familiar

We have two schema models which are suitable for Data Warehousing
in most of the cases.
Star Schema A star schema is a set of tables comprised of a single,
central fact table surrounded by de-normalized dimensions. Each
dimension is represented in a single table.
Snowflake Schema If you normalize the star schema dimensions to
separate tables and link them together, you will have a snowflake
schema

Fact Tables
Types of Measures
Additive facts
Non-additive facts
Semi-additive facts
Additive Facts

Additive facts are facts that can be summed up through


all of the dimensions in the fact table.
Non - Additive Facts
Non-additive facts are facts that cannot be summed up for
any of the dimensions present in the fact table
Semi - Additive Facts
- Semi-additive facts are facts that can be summed up
for some of the dimensions in the fact table, but not
the others.
Initial Load and Incremental/Refresh
Initial Load:
Single event that populates the database with historical
data
Involves large volumes of data
Refresh/Incremental:
Performed according to a business cycle
Less data to load than first-time load
OLTP:
1. Captures transactional information necessary to run business
operations
2. Need performance
3. More DML operations
4. OLTP systems often use fully normalized schemas to optimize
update/insert/delete performance and data consistency
5. A typical OLTP operation accesses only a handful of records
6. OLTP systems usually store data for only a few weeks or months if
needed
OLAP:
1. Analyze transaction information at an aggregate level to improve the
decision-making process
2. Need flexibility and broad scope
3. Very rare DML operations
4. Denormalized or partially denormalized (star schema) to optimize
query performance
5. A typical DW query scans thousands/millions of rows.
6. Considerable historical data is maintained for analysis

Data mart - A logical subset of the complete data warehouse.

A data mart is a complete pie-wedge of the overall data warehouse


pie.
A data warehouse is made up of the union of all its data marts.
A data warehouse is fed from the staging area.
Every data mart must be represented by a dimensional model and,
within a single data warehouse, all such data mart must be built from
dimensions and facts.

Mapping
The definition of the relationship and data flow between source and target
objects.
Metadata
Data that describes data and other structures, such as objects, business
rules, and processes. For example, the schema design of a data warehouse is
typically stored in a repository as metadata, which is used to generate
scripts used to build and populate the data warehouse. A repository contains
metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source
data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating
data from multiple sources.

hat is Entity Relationship (E-R) Modeling


Entity Relation (E-R) model is developed to answers the following issues of
conventional Data Base Management System (DBMS).
(i)
Redundancy of data
(ii)
Lack of integratioWn and
(iii) Lack of flexibility,
This modeling is based on the relational theory and abides by the 13 rules
proposed by E.F. Codd that a DBMS implementation must follow to be
qualified as truly relational. The data in E-R model is presented in a simple
form of two-dimensional tables.
Normalization

Normalization is a process of decomposing the tables to prevent redundancy,


insert & update anomalies.
First Normal Form (INF):- A table is said to be in a First Normal Form (1NF)
if it satisfy the below three conditions:1) If the columns of the table only contain atomic values (Single, indivisible).
2) Primary key is defined for the table
3) All the columns of the table are defined on the primary key.
Second Normal Form (2NF):- A table is said to be in its Second Normal
Form if it satisfied the following conditions:1) It satisfies the condition for the First Normal Form (1NF),
2) It do not includes and partial dependencies where a column is dependent
only a part of a primary key.
Third Normal Form (3NF):- A table is said to be in the Third Normal form
(3NF) if it satisfy the following conditions:1) It should be in the 2NF
2) It should not contain any transitive dependency which means that any non
key column of the table should not be dependent on another non key
column.
Denormalization:- Denormalization can be defined as the process of
moving from higher normal form to a lower normal forms in order to speed
up the database access.

time stamp:Stores a database wide unique number that gets updated every
time row gets updated
SQL Join Types:
There are different type of joins available in SQL:

INNER JOIN:
returns rows when there is a match in both tables.

LEFT JOIN:
returns all rows from the left table, even if there are no matches

in the right table.

RIGHT JOIN:
returns all rows from the right table, even if there are no matches in the left
table.

FULL JOIN:
returns rows when there is a match in one of the tables.

SELF JOIN:
is used to join a table to itself, as if the table were tw
o tables, temporarily renaming at least one
table in the SQL statement.

CARTESIAN JOIN:
returns the cartesian product of the sets of records from the two or more
joined tables
TIMESTAMP(expr), TIMESTAMP(expr1,expr2)
With a single argument, this function returns the date or
datetime expression expr as a datetime value. With two
arguments, it adds the time expression expr2 to the date or datetime
expression expr1 and returns the result as a
datetime value.
mysql
DELTA LOAD
A delta load, by definition, is loading incremental changes to the data. When
doing a delta load to a fact table, for example, you perform inserts only...
appending the change data to the existing table.
Load types:1)Bulk Load
2)Normal Load
Normal load:1)in case of less data.
2)we can get its log details
3)we can rollback and commit.
4)Session recovery possible.
5)performance may be low .
Bulk load :-

1)In case of large data


2)no log details are available.
3)can't rollback and commit
4)session recovery not possible.
5)performance improves.
Following are commonly used constraints available in SQL:
NOT NULL Constraint: Ensures that a column cannot have NULL value.
DEFAULT Constraint : Provides a default value for a column when none is
specified.
UNIQUE Constraint: Ensures that all values in a column are different.
PRIMARY Key: Uniquely identified each rows/records in a database table.
FOREIGN Key: Uniquely identified a rows/records in any another database
table.
CHECK Constraint: The CHECK constraint ensures that all values in a
column satisfy certain conditions.
INDEX: Use to create and retrieve data from the database very quickly.
NOT NULL Constraint:
By default, a column can hold NULL values. If you do not want a column to
have a NULL value then you need to define such constraint on this column
specifying that NULL is now not allowed for that column.
A NULL is not the same as no data, rather, it represents unknown data.
DEFAULT Constraint:
The DEFAULT constraint provides a default value to a column when the
INSERT INTO statement does not provide a specific value.
UNIQUE Constraint:
The UNIQUE Constraint prevents two records from having identical values in
a particular column. In the CUSTOMERS table, for example, you might want
to prevent two or more people from having identical age.
PRIMARY Key:
A primary key is a field in a table which uniquely identifies the each
rows/records in a database table. Primary keys must contain unique values. A
primary key column cannot have NULL values.
A table can have only one primary key which may consist of single or
multiple fields. When multiple fields are used as a primary key, they are
called a composite key.
If a table has a primary key defined on any field(s) then you can not have
two records having the same value of that field(s).
FOREIGN Key:
A foreign key is a key used to link two tables together. This is sometimes
called a referencing key.
Primary key field from one table and insert it into the other table where it
becomes a foreign key ie. Foreign Key is a column or a combination of
columns whose values match a Primary Key in a different table.

The relationship between 2 tables matches the Primary Key in one


of the tables with a Foreign Key in the second table.
If a table has a primary key defined on any field(s) then you can not have
two records having the same value of that field(s).
CHECK Constraint:
The CHECK Constraint enables a condition to check the value being entered
into a record. If the condition evaluates to false, the record violates the
constraint and isn.t entered into the table.
INDEX:
The INDEX is used to create and retrieve data from the database very
quickly. Index can be created by using single or group of columns in a table.
When index is created it is assigned a ROWID for each rows before it sort out
the data.
Proper indexes are good for performance in large databases but you need to
be careful while creating index. Selection of fields depends on what you are
using in your SQL queries.
Database Normalization
Database normalization is the process of efficiently organizing data in a
database. There are two reasons of the normalization process:
Eliminating redundant data, for example, storing the same data in more
than one tables
DELETE
1. DELETE is a DML Command.
2. DELETE statement is executed using a row lock, each row in the table is
locked for deletion.
3. We can specify filters in where clause
4. It deletes specified data if where condition exists.
5. Delete activates a trigger because the operation are logged individually.
6. Slower than truncate because, it keeps logs.
7. Rollback is possible.
TRUNCATE
1. TRUNCATE is a DDL command.
2. TRUNCATE TABLE always locks the table and page but not each row.
3. Cannot use Where Condition.
4. It Removes all the data.
5. TRUNCATE TABLE cannot activate a trigger because the operation does not
log individual row deletions.
6. Faster in performance wise, because it doesn't keep any logs.
7. Rollback is not possible.
DELETE and TRUNCATE both can be rolled back when used with
TRANSACTION.
Primary Key

Unique Key

It will not accept null values

One and only one Null values are


accepted.
There will be only one primary key More than one unique key will be
in a table
there in a table.
Clustered index is created in
Non-Clustered index is created in
Primary key
unique key.
Primary key allows each row in a
Unique key constraint is used to
table to be uniquely identified and prevent the duplication of key
ensures that no duplicate rows
values within the rows of a table
exist.
and allow null values.
Primary Key:
-1 Primary key is used to avoid the duplication of records under same column
-2 Primary key is used to uniquely identify the each record from table
-3 You can use only one primary key at a time on single table
-4 It does not allow any null value or duplicate value
Foreign Key
it is used to give the reference of another table primary key..when i update
the child table record then it does not allow me..first i need to update from
parent table then it is updated in child table
Primary Key:
It will not allow "Null values" and "Duplicate values"
Foreign Key:
It will allow "Null values" and "Duplicte values" and it refers to a primary key
in anoter table
SELECT column1, column2, columnN FROM table_name;

What is surrogate key? Explain it with an example.


Answer
Data warehouses commonly use a surrogate key to uniquely identify an
entity. A surrogate is not generated by the user but by the system. A primary
difference between a primary key and surrogate key in few databases is that
PK uniquely identifies a record while a SK uniquely identifies an entity.
E.g. an employee may be recruited before the year 2000 while another
employee with the same name may be recruited after the year 2000. Here,
the primary key will uniquely identify the record while the surrogate key will
be generated by the system (say a serial number) since the SK is NOT
derived from the data.

Data warehousing - What is surrogate key? - May 11, 2009 at 14:40


pm by Vidya Sagar
What is surrogate key? Explain it with an example.
A surrogate key is a unique identifier in database either for an entity in the
modeled word or an object in the database. Application data is not used to
derive surrogate key. Surrogate key is an internally generated key by the
current system and is invisible to the user. As several objects are available in
the database corresponding to surrogate, surrogate key can not be utilized
as primary key.
Advantages of surrogate keys include:
Control over data
Reduced fact table size
22.

What is a Stored Procedure?


Its nothing but a set of T-SQL statements combined to perform a single task
of several tasks. Its basically like a Macro so when you invoke the Stored
procedure, you actually run a set of statements
To COMMIT means to make changes to data
COMMIT - save work done
SAVEPOINT - identify a point in a transaction to which you can later roll
back
ROLLBACK - restore database to original since the last COMMIT
SET TRANSACTION - Change transaction options like what rollback
segment to use
What is ods (operation data source)
ODS is nothing but operational data source. It contains most recent data .it
contains around 30-60 days data It is placed in between source system and
staging area.. reports can also be taken in ODS .
Operational Data Store (ODS): An ODS is an integrated database of
operational data. Its sources include legacy systems and it contains current
or near term data. An ODS may contain 30 to 60 days of information, while a
data warehouse typically contains years of data.
it is before staging
Full Load: completely erasing the contents of one or more tables and
reloading with fresh data.

Incremental Load: applying ongoing changes to one or more tables based


on a predefined schedule
ODS (Operational Data Source) is the first point in the Datawarehouse. Its
store the real time data of daily transactions as the first instance of Date.
Staging Area, is the later part which comes after the ODS. Here the Data is
cleansed and temporarily stored before loaded into the Datawarehouse.

Interview Questions
12. TRUNCATE TABLE EMP; DELETE FROM EMP; Will the outputs of
the above two commands
Delete Command:
1. Its a DML Command
2. Data can be rolled back.
3. Its slower than Truncate command bcoz it logs each row deletion.
4. With delete command trigger can be fire.
Truncate Command:
1. Its a DDL Command
2. Data Can not be rolled back.
3. Its is faster than delete bcoz it does not log rows.
With Truncate command trigger can not be fire.
both cases only the table data is removed, not the table structure.
13. What is the use of the DROP option in the ALTER TABLE
command
Drop option in the ALTER TABLE command is used to drop columns you no
longer need from the table.
The column may or may not contain data
Using alter column statement only one column can be dropped at a
time.
The table must have at least one column remaining in it after it is
altered.
Once a column is dropped, it cannot be recovered.
20. What is Data warehousing Hierarchy?
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data aggregation. For
example, in a time dimension, a hierarchy might aggregate data from the

month level to the quarter level to the year level. A hierarchy can also be
used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and
below it. Data values at lower levels aggregate into the data values at higher
levels. A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--one for
product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools
use hierarchies to enable you to drill down into your data to view different
levels of granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular
level value, a value at the next higher level is its parent, and values at the
next lower level are its children. These familial relationships enable analysts
to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension
might have a hierarchy that represents data at the month, quarter, and year
levels. Levels range from general to specific, with the root level as the
highest or most general level. The levels in a dimension are organized into
one or more hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most
general (the root) to most specific information. They define the parent-child
relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex
rewrites. For example, the database can aggregate an existing sales revenue
on a quarterly base to a yearly aggregation when the dimensional
dependencies between quarter and year are known..
22. What is surrogate key ? where we use it explain with examples
surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.

Data warehouses typically use a surrogate, (also known as artificial or


identity key), key for the dimension tables primary keys. They can use Infa
sequence generator, or Oracle sequence, or SQL Server Identity values for
the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in
Customer table) can change and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are
stated as the primary keys (according to the business users) but ,not only
can these change, indexing on a numerical value is probably better and you
could consider creating a surrogate key called, say, AIRPORT_ID. This would
be internal to the system and as far as the client is concerned you may
display only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SID) is :
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1'
(that's what would be in your Employee Dimension). This employee has a
turnover allocated to him on the Business Unit 'BU1' But on the 2nd of June
the Employee 'E1' is muted from Business Unit 'BU1' to Business Unit 'BU2.'
All the new turnover have to belong to the new Business Unit 'BU2' but the
old one should Belong to the Business Unit 'BU1.'
If you used the natural business key 'E1' for your employee within your
datawarehouse everything would be allocated to Business Unit 'BU2' even
what actualy belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record
for the Employee 'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with
the SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would
take the SID of the employee 'E1' + 'BU2.'
You could consider Slowly Changing Dimension as an enlargement of your
natural key: natural key of the Employee was Employee Code 'E1' but for you
it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the
difference with the natural key enlargement process, is that you might not
have all part of your new key within your fact table, so you might not be able
to do the join on the new enlarge key -> so you need another id.

25. What is the main difference between schema in RDBMS and


schemas in Data Warehouse....?
RDBMS Schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot solve extract and complex problems
* Poorly modeled
DWH Schema
* Used for OLAP systems
* New generation schema
* De Normalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model
26. What is Dimensional Modelling?
Dimensional Modelling is a design concept used by many data warehouse
designers to build their data-warehouse. In this design model all the data is
stored in two types of tables - Facts table and Dimension table. Fact table
contains the facts/measurements of the business and the dimension table
contains the context of measurements i.e., the dimensions on which the
facts are calculated.
27. What is real time data-warehousing?
In real-time data warehousing, your warehouse contains completely up-todate data and is synchronized with the source systems that provide the
source data. In near-real-time data warehousing, there is a minimal delay
between source data being generated and being available in the data
warehouse. Therefore, if you want to achieve real-time or near-real-time
updates to your data warehouse, youll need to do three things:
1. Reduce or eliminate the time taken to get new and changed data out of
your source systems.
2. Eliminate, or reduce as much as possible, the time required to cleanse,
transform and load your data.
3. Reduce as much as possible the time required to update your
aggregates.
Starting with version 9i, and continuing with the latest 10g release, Oracle
has gradually introduced features into the database to support real-time, and
near-real-time, data warehousing. These features include:

Change Data Capture

External tables, table functions, pipelining, and the MERGE command,


and
Fast refresh materialized views

28. What is a lookup table?


When a table is used to check for some data for its presence prior to loading
of some other data or the same data to another table, the table is called a
LOOKUP Table.
40. What are the Different methods of loading Dimension tables?
Conventional Load:
Before loading the data, all the Table constraints will be checked against the
data.
Direct load:(Faster Loading)
All the Constraints will be disabled. Data will be loaded directly. Later the
data will be checked against the table constraints and the bad data won't be
indexed.

DataWareHousing - ETL Project Life Cycle ( Simple to understand )


-> Datawarehousing projects are categorized into 4 types.
1)
2)
3)
4)

Development Projects.
Enhancement Projects
Migration Projects
Production support Projects.

-> The following are the different phases involved in a ETL project
development life cycle.
1) Business Requirement Collection ( BRD )
2) System Requirement Collection ( SRD )
3) Design Phase
a) High Level Design Document ( HRD )
b) Low level Design Document ( LLD )
c) Mapping Design
4) Code Review
5) Peer Review
6) Testing
a) Unit Testing
b) System Integration Testing.
c) USer Acceptance Testing ( UAT )
7) Pre - Production
8) Production ( Go-Live )

Business Requirement Collection :---------------------------------------------> The business requirement gathering start by business Analyst, onsite
technical lead and client business users.
-> In this phase,a Business Analyst prepares Business Requirement
Document ( BRD ) (or) Business Requirement Specifications ( BRS )
-> BR collection takes place at client location.
-> The o/p from BR Analysis are
-> BRS :- Business Analyst will gather the Business Requirement and
document in BRS
-> SRS :- Senior technical people (or) ETL architect will prepare the SRS
which contains s/w and h/w requirements.
The SRS will includes
a) O/S to be used ( windows or unix )
b) RDBMS required to build database ( oracle, Teradata etc )
c) ETL tools required ( Informatica,Datastage )
d) OLAP tools required ( Cognos ,BO )
The SRS is also called as Technical Requirement Specifications ( TRS )
Designing and Planning the solutions :------------------------------------------------> The o/p from design and planning phase is
a) HLD ( High Level Design ) Document
b)LLD ( Low Level Design ) Document
HLD ( High Level Design ) Document : An ETL Architect and DWH Architect participate in designing a solution to
build a DWH.
An HLD document is prepared based on Business Requirement.
LLD ( Low Level Design ) Document : Based on HLD,a senior ETL developer prepare Low Level Design Document
The LLD contains more technical details of an ETL System.
An LLD contains data flow diagram ( DFD ), details of source and targets of
each mapping.

An LLD also contains information about full and incremental load.


After LLD then Development Phase will start
Development Phase ( Coding ) :--------------------------------------------------> Based an LLD, the ETL team will create mapping ( ETL Code )
-> After designing the mappings, the code ( Mappings ) will be reviewed by
developers.
Code Review :-> Code Review will be done by developer.
-> In code review,the developer will review the code and the logic but not
the data.
-> The following activities takes place in code review
-> You have to check the naming standards of transformation,mappings of
data etc.
-> Source and target mapping ( Placed the correct logic or not in mapping )
Peer Review :-> The code will reviewed by your team member ( third party developer )
Testing:-------------------------------The following various types testing carried out in testing environment.
1) Unit Testing
2) Development Integration Testing
3) System Integration Testing
4) User Acceptance Testing
The typical real-life ETL cycle consists of the following execution
steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create
aggregates or disaggregates)

6. Stage (load into staging tables, if used)


7. Audit reports (for example, on compliance with business rules. Also, in
case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
10.

Clean up

Unit Testing :-> A unit test for the DWH is a white Box testing,It should check the ETL
procedure and Mappings.
-> The following are the test cases can be executed by an ETL developer.
1) Verify data loss
2) No.of records in the source and target
3) Dataload/Insert
4) Dataload/Update
5) Incremental load
6) Data accuracy
7) verify Naming standards.
8) Verify column Mapping
-> The Unit Test will be carried by ETL developer in development phase.
-> ETL developer has to do the data validations also in this phase.
Development Integration Testing -> Run all the mappings in the sequence order.
-> First Run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.
System Integration Testing :->
->
->
->

After development phase,we have to move our code to QA environment.


In this environment,we are giving read-only permission to testing people.
They will test all the workflows.
And they will test our code according to their standards.

User Acceptance Testing ( UAT ) :-

-> This test is carried out in the presence of client side technical users to
verify the data migration from source to destination.
Production Environment :---------------------------------> Migrate the code into the Go-Live environment from test environment
( QA Environment ).

Worked as a ETL Developer to develop BODS objects &


involved in the project from realization phase.

Good knowledge in creating Workflow, Dataflow, Data store for


complex ETL transforms

Developed a standard BODS ETL Job to load the Time


Dimension table using Data Integrator Transform DATE_GENERATION

Worked with Debug, Break Points, Filters to Evaluate the data


according to business requirements

Involved in creating Local and Global Variables with Parameters

Used Source As files like Xl work books and COBOL copybooks,.CSV


files to load the data from these sources to target.

Sound knowledge in Table comparison, History preserving, Key


generation, Validation, Merge, Case, Data transfer Date generation, Pivot, Sql
Transforms

Having experience in to import and exporting the project from SYST


to PROD.

Design and implement ETL processes, programs and scripts.

Involved in working with Scripts and Try Catch Mechanisms

Monitored and observed Jobs Execution flow in Management


Console

Design and Prepared High level and Detailed ETL Test cases,
Executed BODS ETL Test Case Scenarios

Roles and Responsibilities:

Created Local Repository using Repository Manager and


tested the repository assigned by Job Server.

Created SAP Data Store to connect SAP ERP system and


created Multiple Excel Workbook Format to extract Excel Workbook data.


Extracted multiple Excel Workbook sheet data dynamically by
implementing custom logic in BODS.

Involved in SCD Types using (Table Comparison, History


Preserving, Key Generation)

Designed and ETL Job which distributes the multiple tables


data in different formats(MERGE,CASE,VALIDATION) according to business
requirements

Implemented a solution to extract multiple XML files data


dynamically. Involved in handling tickets, monitoring and Production Support.

Prepared Technical & Dataflow documents as part of project


deliverable.

Performed System Test Batch Monitoring and ETL Testing

You might also like