BODS Interview
Script: A script is used to call functions and assign values to variables in a work flow. A script can
contain: function calls, if statements, while statements, assignment statements, and operators.
Transforms: A transform is ready-made logic provided by SAP along with the tool to
perform data transformations.
There are more than 20 transforms, grouped into Data Integrator, Platform, and Data Quality categories.
Template table: a BODS object which helps in creating a table in any
database; template tables are used as target tables in BODS; once the job is
executed, the template table object is created in the target database.
BODS GUI
Project area: Contains the current project (and the job(s) and other objects
within it) available to you at a given time.
a) Designer tab: where we can design and view objects; the objects are
shown in hierarchical fashion.
b) Monitor tab: a place where we can see currently running jobs (red - failed, yellow - currently running, green - successful).
c) Log tab: a place where we can see the logs of jobs (log history is 120
days by default).
Workspace: The area of the application window in which you define,
display, and modify objects.
Local object library: Provides access to local repository objects, including
built-in system objects such as transforms, and the objects you build and
save, such as jobs and data flows. All are reusable objects.
Tool palette: Buttons on the tool palette enable you to add new objects to
the workspace by drag and drop functionality.
BODS Job Design Methodology
1. Identify the source and target system types. Systems are of 2 types:
a) database type, b) file type.
2. Import the source and target systems' metadata (schema) into BODS;
file type -> file format, DB type -> datastore.
3. Create a BODS Job
a. Open an existing project / create a project.
b. Define the ETL definition, i.e. create a dataflow (define source, define
target, define transformation).
c. Validate and save the job.
d. Execute and monitor the job.
Excel workbook Sheet extraction
BODS 3.x supports .xls; from BODS 4.0 onwards both .xls and .xlsx are supported.
The add-on called Microsoft Access Database Engine should be installed on
both the server and client side so that the Designer application can read Microsoft
Excel files. This add-on is provided by SAP along with the BODS software and is
placed in C:/Program Files/SAPBO/Data Services/ext/Microsoft.
Multiple Excel file extraction dynamically:
Recommendations: the field structure should be identical in all files; all files'
data should be in named ranges / sheets; all files' data should be in the same sheet
name / sheet number; maintain a common prefix or suffix in the file names.
Solutions:
1. Wild card character (* or ?); OR
2. Use a list of file names with a comma delimiter.
Dynamic Excel file selection:
Recommendations: the structure in all files should be identical; data should be
in a sheet / named range; maintain a unique sheet name or sheet number;
the naming convention of the file name should contain a timestamp indicating when it
was distributed.
Solutions:
1. A variable to pass the file name dynamically during runtime;
2. A script to define the file name (see the sketch below).
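A minimal BODS script sketch for solution 2, assuming a hypothetical global variable $G_FILENAME and a date-stamped naming convention (the Excel file format's file name would then reference $G_FILENAME):
  # Hypothetical: build the workbook name for the current day's load
  $G_FILENAME = 'SALES_' || to_char(sysdate(), 'YYYYMMDD') || '.xls';
  print('Excel workbook to be processed: [$G_FILENAME]');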
Multiple Excel workbook sheets extraction dynamically:
Recommendations: the structure should be identical in all sheets; an extra sheet in
the file should hold the list of sheet names where data is placed.
Solution:
1. A variable to pass the sheet name dynamically during runtime.
2. A script to define the sheet name.
3. A loop to repeat the process to get the data from all the sheets (see the sketch below).
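A rough sketch of how the variable, script, and loop could fit together, assuming a While Loop work flow object with the condition ($L_POS <= $G_SHEET_COUNT) and hypothetical variables $G_SHEET_LIST (comma-separated sheet names, e.g. read from the extra index sheet), $G_SHEET_COUNT, $L_POS, and $G_SHEET (assumed to be used as the worksheet name in the Excel format):
  # Initialization script placed before the While Loop
  $L_POS = 1;
  # Script placed inside the While Loop, before the data flow
  # that reads the sheet named $G_SHEET
  $G_SHEET = word_ext($G_SHEET_LIST, $L_POS, ',');
  print('Processing sheet: [$G_SHEET]');
  $L_POS = $L_POS + 1;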
Limitations of Excel workbooks in BODS: BODS is not capable of
distributing (writing) data to an Excel workbook; BODS 3.2 does not support the .xlsx format;
we do not have any property to terminate extraction if we get n number of
errors in an Excel file.
Flat file extraction
A flat file is a structured file; a flat file has no limitation on data volume.
Types of flat files:
Delimited: a structured text file with markers to identify row, column, and text is
called a delimited flat file. Column markers: comma, space, tab, semicolon; row markers:
Windows new line, Unix new line, new line; text markers: single quotes, double quotes.
With a custom row marker, BODS also supports special markers.
Fixed width: a structured flat file where columns are identified by a fixed width (size);
row markers: Windows new line, Unix new line, new line; text markers: single quotes,
double quotes. With a custom row marker, BODS also supports special markers.
SAP transport flat file format: supports reading a flat file from and writing a
flat file to an SAP application server.
Unstructured text flat file: used to read data from unstructured text files,
e.g. LinkedIn, Facebook feeds; here the data is in the form of text (characters).
Unstructured binary: here the data is in binary form, e.g. .doc, .pdf, .xml,
.html etc.
Advantages of flat files: we can process them with parallel threads; we have the
option to terminate the flat file load if it contains a specific number of errors; with the help
of the Query transform we can generate a flat file.
XML data extraction
XML is a nested structure; we get the header (metadata) and the data separately; it supports
extraction at different levels of nesting; it supports real time. We get XML metadata in two
fashions: .dtd format, which helps in importing the XML metadata when a header file is
provided as a DTD, and XML Schema (.xsd) format.
TABLE_COMPARISON Transform
Compares two data sets and produces the difference between them as a data
set with rows flagged as INSERT, UPDATE, or DELETE.
The Table_Comparison transform allows you to detect and forward
changes that have occurred since the last time a target was updated.
Row-by-row select: looks up the target table using SQL every time
it receives an input row. This option is best if the target table is large.
Cached comparison table: loads the comparison table into
memory. This option is best when the table fits into memory and you
are comparing the entire target table.
Sorted input: reads the comparison table in the order of the primary key column(s)
using a sequential read. This option improves performance because
Data Integrator reads the comparison table only once. Add a query
between the source and the Table_Comparison transform; then,
from the query's input schema, drag the primary key columns into
the Order By box of the query.
Input primary key column(s): the input data set columns that uniquely identify
each row. These columns must be present in the comparison table with the same
column names and data types.
Input contains duplicate keys
Generated key column
Detect deleted row(s) from comparison table: detect all rows / detect row with
largest generated key value.
If we check the option 'Input contains duplicate keys', then in 'Generated key
column' we select the key column whose largest generated key value is used
among the duplicate data.
If we select the option 'Detect deleted row(s) from comparison table', rows that
exist in the comparison table but not in the input are flagged with DELETE.
MAP_OPERATION Transform
Functionality: helps in changing the record modes (operation codes) of records in BODS;
Type: Platform transform; Behaviour: transform; Input:
single input; Output: multiple single outputs; Schema out: under system
control; the output schema equals the input schema.
HISTORY_PRESERVING Transform
Functionality: helps in defining the current-record indicator flag and the validity (effective dates)
of each input record as part of SCD Type 2 implementation; Type: Data Integrator;
Behaviour: transform; Input: single input (the input should come from a Table_Comparison
transform); Output: multiple single outputs; Schema out: under system
control; the output schema structure equals the Table_Comparison structure.
Allows you to produce a new row in your target rather than updating an
existing row.
You can indicate in which columns the transform identifies changes to
be preserved.
KEY_GENERATION Transform
Functionality: helps in implementing surrogate keys (SIDs), like a sequence generator in a
database; Type: Data Integrator; Behavior: transform; Input: single input;
Output: multiple single outputs.
Generates new keys for new rows in a data set.
The Key_Generation transform looks up the maximum existing key
value from a table and uses it as the starting value to generate new
keys.
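Conceptually, the transform behaves like the SQL sketch below (hypothetical DIM_CUSTOMER target with surrogate key CUST_SID and staging table STG_CUSTOMER):
  -- Hypothetical illustration of the Key_Generation behaviour: read the current
  -- maximum key once, then number the incoming rows from that starting value.
  SELECT s.*,
         (SELECT COALESCE(MAX(d.CUST_SID), 0) FROM DIM_CUSTOMER d)
           + ROW_NUMBER() OVER (ORDER BY s.CUST_ID) AS CUST_SID
  FROM   STG_CUSTOMER s;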
MAP OPERATION
Allows conversions between data manipulation operations.
The Map_Operation transform allows you to change operation
codes on data sets to produce the desired output.
For example, if a row in the input data set has been
updated in some previous operation in the data flow, you
can use this transform to map the UPDATE operation to
an INSERT. The result could be to convert UPDATE rows to
INSERT rows to preserve the existing row in the target.
Data Integrator can push Map_Operation transforms to the
source database.
MERGE Transform
Functionality: enables data integration (union) functionality in BODS; Type:
Platform transform; Behavior: transform; Input: multiple inputs; Output:
multiple single outputs; Prerequisites: before passing inputs to the Merge
transform, all the sources should have an identical structure (number of columns,
column names, data types, order of columns); Schema out: under system
control; the output schema equals the input schema.
Combines incoming data sets, producing a single output data set with
the same schema as the input data sets.
ROW_GENERATION Transform
Using this transform we can maintain a special record in a dimension
table (which handles null/unmatched records in the DWH). Functionality: helps in generating
the requested number of rows; Type: Platform transform; Behavior:
source; Input: no input; Output: multiple single outputs.
PIVOT Transform
Functionality: helps in applying pivoting to the input data, i.e. converting
columns into rows; Type: Data Integrator; Behavior: transform; Input: single
input; Output: multiple single outputs; Schema out: under system control.
Default output schema of a Pivot transform:
PIVOT_SEQ (int), PIVOT_HDR (varchar), PIVOT_DATA (varchar); output schema of
Pivot = number of non-pivot columns + default schema; the pivot sequence column
should be the primary key.
Pivot Sequence Column : For each row created from a pivot column,
Data Integrator increments and stores a sequence number.
Non-Pivot Columns : The columns in the source that are to appear in
the target without modification.
Pivot set: the number that identifies a pivot set. Each pivot set
must have a group of pivot columns, a unique Data field column, and a
Header column. Data Integrator automatically saves this information.
Data field column: contains the pivoted data. This column contains
all of the pivot columns' values.
Header column : lists the names of the columns where the
corresponding data originated.
Creates a new row for each value in a column that you identify as a
pivot column.
The Pivot transform allows you to change how the relationship between
rows is displayed.
For each value in each pivot column, DI produces a row in the output
data set.
You can create pivot sets to specify more than one pivot column.
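For intuition, rotating columns into rows is roughly what the following SQL sketch does with a hypothetical SALES table that has one column per quarter:
  -- Hypothetical illustration: Q1..Q4 columns become rows, producing
  -- PIVOT_SEQ, PIVOT_HDR (original column name) and PIVOT_DATA (its value).
  SELECT REGION, 1 AS PIVOT_SEQ, 'Q1' AS PIVOT_HDR, Q1 AS PIVOT_DATA FROM SALES
  UNION ALL
  SELECT REGION, 2, 'Q2', Q2 FROM SALES
  UNION ALL
  SELECT REGION, 3, 'Q3', Q3 FROM SALES
  UNION ALL
  SELECT REGION, 4, 'Q4', Q4 FROM SALES;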
REVERSE_PIVOT Transform
Functionality: helps in applying a reverse pivoting operation to the input data, i.e.
converting rows into columns; Type: Data Integrator; Behavior: transform;
Input: only one input, of flat structure; Output: multiple single outputs;
Schema out: under system control.
Non-pivot columns
The columns in the source table that will appear in the target table
without modification.
Pivoted columns: a set of columns will be created for each unique
value in the Pivot axis column.
Pivot axis column
The column that determines what new columns are needed in the
output table. At run time, a new column is created for each Pivoted
column and each unique value in this column.
Duplicate value
Action taken when a collision occurs. A collision occurs when there
is more than one row with the same key and value in the Pivot axis
column. In this case, you can store either the first row or the last
row, or you can abort the transform process.
Axis value
The value of the pivot axis column that represents a particular set
of output columns.
Column Prefix
Text added to the front of the Pivoted column names when creating
new column names for the rotated data.
CASE Transform
MAP_CDC_OPERATION Transform
Functionality: helps in handling how data from source-based CDC source systems
should be applied to the target, i.e. it simply maps the values (I, U, D) in the
DI_OPERATION_TYPE column of the CDC source to the corresponding operation codes;
Type: Data Integrator; Behavior: transform; Input: only one input, of type Oracle/Microsoft
CDC table or SAP extractor; Output: multiple single outputs (with different record modes).
Using its input requirements, the transform performs three functions:
Sorts input data based on values in Sequencing column box
and the Additional Grouping Columns box.
Maps output data based on values in Row Operation Column
box.
Source table rows are mapped to INSERT, UPDATE, or DELETE
operations before passing them on to the target.
Resolves missing, separated, or multiple before- and after-images for UPDATE rows.
While commonly used to support Oracle or mainframe changed-data
capture, this transform supports any data stream as long as its input
requirements are met.
This transform is typically the last object before the target in a data
flow because it produces INSERT, UPDATE, and DELETE operation codes.
Data Integrator produces a warning if other objects are used after it.
VALIDATION Transform
Functionality: helps in enabling a validation layer in BODS; Type: Platform;
Behavior: transform; Input: single input with flat structure (no XML);
Output: 3.x (2 distinct output sets); 4.x (3 distinct output sets); Schema out:
the Pass output schema equals the input schema; the Fail output schema
equals the input schema plus 2 columns (DI_ERRORACTION and
DI_ERRORCOLUMN; if we enable the check box we also get the DI_ROWID column);
the Rule Violation set schema is DI_ROWID, DI_RULENAME, and DI_COLUMNNAME.
QUERY TRANSFORM
The Query transform can perform the following operations:
Choose (filter) the data to extract from sources
Join data from multiple sources
Map columns from input to output schemas
Perform transformations and functions on the data
Perform data nesting and unnesting
Add new columns, nested schemas, and function results
to the output schema
Assign primary keys to output columns
DATA ASSESSMENT using BODS
For analyzing and measuring the data; mainly about studying the sources with
respect to the data.
Functions available in BODS: the View Data function, Validation transform
statistics, auditing, and data profiling.
Profiling types: column profiling and relationship profiling.
Column profiling is of two types: basic column profiling (min length, max
length, medium length, min length %, max length %, medium length %, % of nulls,
% of blanks) and detailed column profiling (distinct %, pattern %, median).
Relationship profiling: defining the relationship between two objects; % of
common data, % of non-common data.
SAP has a separate tool for this: Data Insight / Information Steward.
BODS Variables
Variables enable dynamism in ETL jobs.
Global variable ($G_XXX): job dependent; can be used anywhere inside the
job (not across jobs); processing options: default value, user entry at execution, or
through a script (see the sketch below); input value: single value.
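A minimal script sketch for the 'through a script' option, assuming a hypothetical global variable $G_LOAD_DATE that a Query transform later uses in its WHERE clause:
  # Hypothetical: set the load date for this run (default value and
  # user entry at execution are the alternative processing options)
  $G_LOAD_DATE = to_char(sysdate(), 'YYYY.MM.DD');
  print('Load date for this run: [$G_LOAD_DATE]');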
Bulk Loading: only enabled for database-type targets; performed with the
help of the database's bulk loader (e.g. SQL*Loader) through BODS; enabled via the
target object editor properties -> 'Bulk load' check box.
Throughput tuning:
Dataflow level -> cache type (in-memory cache, pageable cache); transform
level -> Table_Comparison (cached comparison table method); lookup_ext function ->
pre-load cache and demand-load cache.
Other: source level -> array fetch size (best practice 1000-5000); join rank
(normal, ranking, and cache algorithms); target level -> rows per commit
1000-5000; Advanced -> General -> number of loaders.
Error handling: target editor property -> Error handling -> Use overflow file.
Source-based performance options
Using array fetch size
Caching data
Join ordering
Minimizing extracted data
Target-based performance options
Loading method and rows per commit
Staging tables to speed up auto-correct loads
Job design performance options
Improving throughput
Maximizing the number of pushed-down operations
Minimizing data type conversion
Minimizing locale conversion
Improving Informix repository performance
1 Utilize a database (like Oracle / Sybase / Informix / DB2 etc.)
for significant data handling operations (such as sorts, groups,
aggregates). In other words, staging tables can be a huge benefit to
parallelism of operations. Parallel design, simply put, nearly always cuts
your execution time. Staging tables have many benefits; please see the
staging table discussion in the methodologies section for full details.
2 Localize. Localize all target tables on to the SAME instance of
Oracle (same SID), or same instance of Sybase. Try not to use
Synonyms (remote database links) for anything (including: lookups,
stored procedures, target tables, sources, functions, privileges, etc...).
Utilizing remote links will most certainly slow things down. For Sybase
users, remote mounting of databases can definitely be a hindrance to
performance.
3 If you can, localize all target tables, stored procedures,
functions, views, and sequences in the SOURCE database. Again, try
not to connect across synonyms. Synonyms (remote database tables)
will slow processing down.
TRY CATCH
A try/catch block is a combination of one try object and one or more
catch objects that allow you to specify alternative work flows if errors
occur while DI is executing a job.
Try/catch blocks:
Catch classes of exceptions thrown by DI, the DBMS, or the
operating system
Apply solutions that you provide
Continue execution
Try and catch objects are single-use objects.
Scripts are single-use objects used to call functions and assign values
to variables in a work flow.
A script can contain the following statements (illustrated in the sketch after this list):
Function calls
If statements
While statements
Assignment statements
Operators
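A small illustrative script combining the statement types listed above (hypothetical variables and table name; the sql() call assumes a datastore named TGT_DS):
  # Assignment statement and function call
  $G_ROWCOUNT = sql('TGT_DS', 'SELECT COUNT(*) FROM SALES_FACT');
  # If statement with a comparison operator
  if ($G_ROWCOUNT > 0)
  begin
     print('SALES_FACT already contains [$G_ROWCOUNT] rows.');
  end
  else
  begin
     print('SALES_FACT is empty - a full load will run.');
  end
  # While statement
  $L_I = 1;
  while ($L_I <= 3)
  begin
     print('Iteration [$L_I]');
     $L_I = $L_I + 1;
  end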
Design, test, and production phases: each phase may require a different
computer in a different environment, and different security settings for each.
XML_PIPELINE
VALIDATION TRANSFORM
QUERY TRANSFORM
TABLE COMPARISON
KEY GENERATION
HIERARCHY FLATTENING
Lookup_ext
List of functions
PIVOT TRANSFORM
EXECUTION PROPERTIES
2.6 Transforms
2.7 Operational excellence
2.8 Functions
2.9 Source and target support
2.10 Data Quality
2.10.1 Data Cleanse transform
2.10.2 Geocoder transform
2.10.3 Global Address Cleanse transform
2.10.4 Match transform
2.10.5 USA Regulatory Address Cleanse transform
What are the benefits of Datawarehousing?
The star schema is the simplest data warehouse schema. The snowflake schema
is similar to the star schema; it normalizes the dimension tables to save data
storage space and can be used to represent hierarchies of information.
What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the
basic constructs used to design a data warehouse and a data mart. In
general, a data warehouse is used at an enterprise level, while a data mart is
used at a business division/department level. A data mart contains only the
required subject-specific data for local analysis.
Data Warehouse
A data warehouse is a central repository for all or significant parts of the
data that an enterprise's various business systems collect. Typically, a data
warehouse is housed on an enterprise mainframe server. Data from various
online transaction processing (OLTP) applications and other sources is
selectively extracted and organized on the data warehouse database for use
by analytical applications and user queries. Data warehousing emphasizes
the capture of data from diverse sources for useful analysis and access.
Data Marts
A data mart is a repository of data gathered from operational data and
other sources that is designed to serve a particular community of knowledge
workers. In scope, the data may derive from an enterprise-wide database or
data warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge users in
terms of analysis, content, presentation, and ease of use. Users of a data
mart can expect to have data presented in terms that are familiar to them.
We have two schema models which are suitable for Data Warehousing
in most of the cases.
Star Schema: A star schema is a set of tables comprised of a single,
central fact table surrounded by de-normalized dimensions. Each
dimension is represented in a single table (see the DDL sketch below).
Snowflake Schema: If you normalize the star schema dimensions into
separate tables and link them together, you have a snowflake
schema.
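A minimal DDL sketch of a star schema, with hypothetical table names (one central fact table referencing two de-normalized dimension tables):
  CREATE TABLE dim_date    (date_sid INTEGER PRIMARY KEY, calendar_date DATE, month_name VARCHAR(10), year_no INTEGER);
  CREATE TABLE dim_product (product_sid INTEGER PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(30));
  -- Central fact table: foreign keys to the dimensions plus the measures
  CREATE TABLE fact_sales (
      date_sid    INTEGER REFERENCES dim_date(date_sid),
      product_sid INTEGER REFERENCES dim_product(product_sid),
      sales_amt   DECIMAL(12,2),   -- additive measure
      qty_sold    INTEGER
  );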
Fact Tables
Types of Measures
Additive facts
Non-additive facts
Semi-additive facts
Mapping
The definition of the relationship and data flow between source and target
objects.
Metadata
Data that describes data and other structures, such as objects, business
rules, and processes. For example, the schema design of a data warehouse is
typically stored in a repository as metadata, which is used to generate
scripts used to build and populate the data warehouse. A repository contains
metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source
data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a
transformation. Examples include cleansing, aggregating, and integrating
data from multiple sources.
Timestamp: stores a database-wide unique number that gets updated every
time a row gets updated.
SQL Join Types:
There are different types of joins available in SQL (short sketches follow the list):
INNER JOIN:
returns rows when there is a match in both tables.
LEFT JOIN:
returns all rows from the left table, even if there are no matches in the right table.
RIGHT JOIN:
returns all rows from the right table, even if there are no matches in the left
table.
FULL JOIN:
returns rows when there is a match in one of the tables.
SELF JOIN:
is used to join a table to itself, as if the table were two tables, temporarily
renaming at least one table in the SQL statement.
CARTESIAN JOIN:
returns the Cartesian product of the sets of records from the two or more
joined tables.
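Brief SQL sketches of the join types, using hypothetical EMP and DEPT tables:
  -- INNER JOIN: only matching rows
  SELECT e.ename, d.dname FROM emp e INNER JOIN dept d ON e.deptno = d.deptno;
  -- LEFT JOIN: all employees, with the department when it exists
  SELECT e.ename, d.dname FROM emp e LEFT JOIN dept d ON e.deptno = d.deptno;
  -- FULL JOIN: all rows from both sides
  SELECT e.ename, d.dname FROM emp e FULL OUTER JOIN dept d ON e.deptno = d.deptno;
  -- SELF JOIN: employees with their managers (emp joined to itself)
  SELECT w.ename AS worker, m.ename AS manager FROM emp w JOIN emp m ON w.mgr = m.empno;
  -- CARTESIAN JOIN: every combination of rows from both tables
  SELECT e.ename, d.dname FROM emp e CROSS JOIN dept d;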
TIMESTAMP(expr), TIMESTAMP(expr1, expr2) (MySQL):
With a single argument, this function returns the date or datetime expression
expr as a datetime value. With two arguments, it adds the time expression expr2
to the date or datetime expression expr1 and returns the result as a datetime value.
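A quick MySQL example:
  -- Single argument: returned as a datetime value
  SELECT TIMESTAMP('2019-07-23');              -- 2019-07-23 00:00:00
  -- Two arguments: the time expression is added to the date expression
  SELECT TIMESTAMP('2019-07-23', '12:30:00');  -- 2019-07-23 12:30:00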
DELTA LOAD
A delta load, by definition, is loading incremental changes to the data. When
doing a delta load to a fact table, for example, you perform inserts only...
appending the change data to the existing table.
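A minimal SQL sketch of a delta load, assuming a hypothetical last_modified column on the source and a :last_load_time value captured from the previous run:
  -- Append only the rows that changed since the previous load
  INSERT INTO sales_fact (order_id, amount, last_modified)
  SELECT order_id, amount, last_modified
  FROM   sales_source
  WHERE  last_modified > :last_load_time;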
Load types: 1) Bulk load, 2) Normal load.
Normal load:
1) used in case of less data;
2) we can get its log details;
3) we can rollback and commit;
4) session recovery is possible;
5) performance may be low.
Bulk load: used for large data volumes; minimal row-level logging, so detailed log
information and rollback are limited; loading is done through the database's bulk
loader; performance is high.
Unique Key
Interview Questions
12. TRUNCATE TABLE EMP; DELETE FROM EMP; Will the outputs of
the above two commands be the same?
Delete command:
1. It is a DML command.
2. Data can be rolled back.
3. It is slower than the Truncate command because it logs each row deletion.
4. With the Delete command, triggers can be fired.
Truncate command:
1. It is a DDL command.
2. Data cannot be rolled back.
3. It is faster than Delete because it does not log rows.
4. With the Truncate command, triggers cannot be fired.
In both cases only the table data is removed, not the table structure (see the example below).
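For example:
  DELETE FROM EMP;      -- DML: logs each row, can be rolled back, fires DELETE triggers
  ROLLBACK;             -- restores the deleted rows
  TRUNCATE TABLE EMP;   -- DDL: minimal logging, cannot be rolled back, no row-level triggers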
13. What is the use of the DROP option in the ALTER TABLE command?
The DROP option in the ALTER TABLE command is used to drop columns you no
longer need from the table (a short example follows these notes).
The column may or may not contain data.
Using the DROP COLUMN clause, only one column can be dropped at a
time.
The table must have at least one column remaining in it after it is
altered.
Once a column is dropped, it cannot be recovered.
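For example (hypothetical EMP table):
  -- Drop a single column that is no longer needed; its data is lost permanently
  ALTER TABLE EMP DROP COLUMN comm;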
20. What is Data warehousing Hierarchy?
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data aggregation. For
example, in a time dimension, a hierarchy might aggregate data from the
month level to the quarter level to the year level. A hierarchy can also be
used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and
below it. Data values at lower levels aggregate into the data values at higher
levels. A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--one for
product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools
use hierarchies to enable you to drill down into your data to view different
levels of granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business
structures. For example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular
level value, a value at the next higher level is its parent, and values at the
next lower level are its children. These familial relationships enable analysts
to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension
might have a hierarchy that represents data at the month, quarter, and year
levels. Levels range from general to specific, with the root level as the
highest or most general level. The levels in a dimension are organized into
one or more hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most
general (the root) to most specific information. They define the parent-child
relationship between the levels in a hierarchy.
Hierarchies are also essential components in enabling more complex
rewrites. For example, the database can aggregate an existing sales revenue
on a quarterly base to a yearly aggregation when the dimensional
dependencies between quarter and year are known.
22. What is a surrogate key? Where do we use it? Explain with examples.
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the
primary key to the table. The only requirement for a surrogate primary key is
that it is unique for each row in the table.
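A small sketch with a hypothetical customer dimension, where CUST_SID is the surrogate key and CUST_ID is the natural/business key from the source system:
  CREATE TABLE dim_customer (
      cust_sid   INTEGER PRIMARY KEY,   -- surrogate key: meaningless, unique per row
      cust_id    VARCHAR(20),           -- natural key from the source system
      cust_name  VARCHAR(100),
      valid_from DATE,                  -- useful for SCD Type 2 history
      valid_to   DATE
  );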
Development Projects.
Enhancement Projects
Migration Projects
Production support Projects.
-> The following are the different phases involved in a ETL project
development life cycle.
1) Business Requirement Collection ( BRD )
2) System Requirement Collection ( SRD )
3) Design Phase
a) High Level Design Document ( HRD )
b) Low level Design Document ( LLD )
c) Mapping Design
4) Code Review
5) Peer Review
6) Testing
a) Unit Testing
b) System Integration Testing.
c) User Acceptance Testing ( UAT )
7) Pre - Production
8) Production ( Go-Live )
Business Requirement Collection:
-> The business requirement gathering is started by the business analyst, onsite
technical lead, and client business users.
-> In this phase, a Business Analyst prepares the Business Requirement
Document ( BRD ) (or) Business Requirement Specifications ( BRS ).
-> BR collection takes place at client location.
-> The o/p from BR Analysis are
-> BRS :- Business Analyst will gather the Business Requirement and
document in BRS
-> SRS :- Senior technical people (or) ETL architect will prepare the SRS
which contains s/w and h/w requirements.
The SRS includes:
a) O/S to be used ( windows or unix )
b) RDBMS required to build database ( oracle, Teradata etc )
c) ETL tools required ( Informatica,Datastage )
d) OLAP tools required ( Cognos ,BO )
The SRS is also called as Technical Requirement Specifications ( TRS )
Designing and Planning the Solutions:
-> The o/p from the design and planning phase is:
a) HLD ( High Level Design ) Document
b) LLD ( Low Level Design ) Document
HLD ( High Level Design ) Document:
-> An ETL Architect and DWH Architect participate in designing a solution to build a DWH.
-> An HLD document is prepared based on the Business Requirement.
LLD ( Low Level Design ) Document:
-> Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
-> The LLD contains more technical details of an ETL system.
-> An LLD contains a data flow diagram ( DFD ) and details of the source and targets of
each mapping.
Clean up
Unit Testing:
-> A unit test for the DWH is white-box testing; it should check the ETL
procedures and mappings.
-> The following are the test cases that can be executed by an ETL developer:
1) Verify data loss
2) No.of records in the source and target
3) Dataload/Insert
4) Dataload/Update
5) Incremental load
6) Data accuracy
7) verify Naming standards.
8) Verify column Mapping
-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
Development Integration Testing:
-> Run all the mappings in sequence order.
-> First Run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.
System Integration Testing:
-> This test is carried out in the presence of client-side technical users to
verify the data migration from source to destination.
Production Environment:
-> Migrate the code into the Go-Live environment from the test environment
( QA Environment ).
-> Designed and prepared high-level and detailed ETL test cases; executed
BODS ETL test case scenarios.
-> Extracted multiple Excel workbook sheet data dynamically by
implementing custom logic in BODS.