DataStage Matter
What is a Centralized Data Warehouse
A Centralized Data Warehouse is a data warehousing implementation in which a single data warehouse serves several separate business units simultaneously, using a single data model that spans the needs of multiple business divisions.
What is a Central Data Warehouse
A Central Data Warehouse is a repository of company data where a database is created from operational data extracts. This database adheres to a single, consistent enterprise data model to ensure consistency in decision-making support across the company. A Central Data Warehouse is a single physical database which contains business data for a specific functional area, department, branch, division or the whole enterprise. The central data warehouse is commonly placed where there is the largest common need for informational data and where the largest number of end users are already connected to a central computer or network. A central data warehouse employs the computing style of having all the information systems located and managed from one physical location, even if there are many data sources spread around the globe.
What is an Active Data Warehouse
An Active Data Warehouse is a repository of any form of captured transactional data, kept so that it can be used to find trends and patterns for future decision making.
What is an Active Metadata Warehouse
An Active Metadata Warehouse is a repository of metadata that helps speed up data reporting and analysis from an active data warehouse. In its simplest definition, metadata is data describing data.
What is an Enterprise Data Warehouse
An Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse is in essence a large repository of historical and current transaction data of an organization. An Enterprise Data Warehouse is a specialized data warehouse which may have several interpretations. In order to give a clear picture of an Enterprise Data Warehouse and how it differs from an ordinary data warehouse, five attributes are considered. These are not really exclusive, but they bring people closer to a focused meaning of the Enterprise Data Warehouse from among the many interpretations of the term. These attributes mainly pertain to the overall philosophy as well as the underlying infrastructure of an Enterprise Data Warehouse. The first attribute of an Enterprise Data Warehouse is that it should have a single version of the truth: the entire goal of the warehouse's design is to come up with a definitive representation of the organization's business data as well as the corresponding rules. Given the number and variety of systems and silos of company data that exist within any business organization, many business warehouses may not qualify as an Enterprise Data Warehouse. The second attribute is that an Enterprise Data Warehouse should have multiple subject areas. In order to have a unified version of the truth for an organization, an Enterprise Data Warehouse should contain all subject areas related to the enterprise, such as marketing, sales, finance, human resources and others. The third attribute is that an Enterprise Data Warehouse should have a normalized design. This may be an arguable attribute, as both normalized and denormalized databases have their own advantages for a data warehouse. In fact, many data warehouse designers have used denormalized models such as star or snowflake schemas for implementing data marts, but many also go for normalized databases for an Enterprise Data Warehouse, considering flexibility first and performance second. The fourth attribute is that an Enterprise Data Warehouse should be implemented as a mission-critical environment. The entire underlying infrastructure should be able to handle any unforeseen critical conditions, because failure of the data warehouse means stoppage of business operations and loss of income and revenue. An Enterprise Data Warehouse should have high-availability features such as online parameter or database structural changes, business continuance features such as failover and disaster recovery, and security features. Finally, an Enterprise Data Warehouse should be scalable across several dimensions. It should expect that a company's main objective is to grow, and the warehouse should be able to handle the growth of data as well as the growing complexity of processes which will come with the evolution of the business enterprise.
What is a Functional Data Warehouse
Today's business environment is very data driven, and more companies are hoping to create a competitive advantage over their competitors by creating a system whereby they can assess the current status of their operations at any given moment and, at the same time, analyze trends and patterns within the company's operation and its relation to the trends and patterns of the industry in a truly up-to-date fashion. Breaking down the Enterprise Data Warehouse into several Functional Data Warehouses can have many big benefits.
Since the organization, as a data driven enterprise, deals with very high volumes of data, having separate Functional Data Warehouses distributes the load and compartmentalizes the processes. With this setup, the whole information system will not break down: if there is a glitch in one of the functional data warehouses, only that point has to be temporarily halted while it is fixed. With one monolithic data warehouse, by contrast, if the central database breaks down the whole system suffers.
What is an Operational Data Store (ODS)
An Operational Data Store (ODS) is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. An operational data store is basically a database that is used as an interim area for a data warehouse. As such, its primary purpose is to handle data which is progressively in use, such as transactions, inventory and data collected from point of sale. It works with a data warehouse but, unlike a data warehouse, an operational data store does not contain static data. Instead, an operational data store contains data which is constantly updated through the course of the business operations.
What is an Operational Database
An operational database contains enterprise data which is up to date and modifiable. In an enterprise data management system, an operational database can be said to be the opposite counterpart of a decision support database, which contains non-modifiable data that is extracted for the purpose of statistical analysis. An example use of a decision support database is to provide data so that the average salary of many different kinds of workers can be determined, while the operational database contains the same data which would be used to calculate the amounts of the workers' pay checks depending on the number of days they have worked in a given period of time.
Data profiling
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
Find out whether existing data can easily be used for other purposes
Give metrics on data quality, including whether the data conforms to company standards
Assess the risk involved in integrating data for new applications, including the challenges of joins
Track data quality
Assess whether metadata accurately describes the actual values in the source database
Understand data challenges early in any data-intensive project, so that late project surprises are avoided; finding data problems late in the project can incur time delays and project cost overruns
Have an enterprise view of all data, for uses such as Master Data Management, where key data is needed, or data governance, for improving data quality
Data governance
Data governance is a quality control discipline for assessing, managing, using, improving, monitoring, maintaining, and protecting organizational information.[1] It is a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.[2]
What are derived facts and cumulative facts?
There are two kinds of derived facts. The first kind is additive and can be calculated entirely from the other facts in the same fact table row; it can be shown in a user view as if it existed in the real data, and the user will never know the difference. The second kind of derived fact is a non-additive calculation, such as a ratio or a cumulative fact, that is typically expressed at a different level of detail than the base facts themselves. A cumulative fact might be a year-to-date or month-to-date fact. In any case, these kinds of derived facts cannot be presented in a simple view at the DBMS level because they violate the grain of the fact table; they need to be calculated at query time by the BI tool.
Question :
What is the data type of the surrogate key?
Answer :
The data type of the surrogate key is integer or numeric (number).
Question :
What is a hybrid slowly changing dimension?
Answer :
Hybrid SCDs are a combination of both SCD 1 and SCD 2. It may happen that in a table some columns are important and we need to track changes for them, i.e. capture their historical data, whereas for other columns we do not care even if the data changes. For such tables we implement hybrid SCDs, where some columns are Type 1 and some are Type 2.
Question :
Can a dimension table contain numeric values?
Answer :
Yes, but those columns are typically of character data type (the values themselves may be numeric or character).
Question :
What are Data Marts?
Answer :
A data mart is a focused subset of a data warehouse that deals with a single area of data (for example, one department) and is organized for quick analysis.
Question :
Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better than Star Schema to use and when the opposite is true?
Answer :
A star schema contains the dimension tables mapped around one or more fact tables. It is a denormalized model, so there is no need to use complicated joins and queries return results quickly. A snowflake schema is the normalized form of the star schema: it contains deeper joins because the tables are split into many pieces. We can easily make modifications directly in the tables, but we have to use complicated joins since there are more tables, so there is some delay in processing queries.
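As a rough illustration (hypothetical table and column names, plain SQL, not taken from any particular project), the same product dimension might be modeled the two ways like this:

-- Star schema: one denormalized dimension table joined straight to the fact
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key
    product_name  VARCHAR(100),
    category_name VARCHAR(50),           -- category attributes kept in the same row
    supplier_name VARCHAR(100)
);

-- Snowflake schema: the same dimension normalized into separate tables
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);
CREATE TABLE dim_product_sf (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_key  INTEGER REFERENCES dim_category(category_key)  -- extra join needed at query time
);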
Question :
What is an ER Diagram?
Answer :
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is that it maps well to the relational model: the constructs used in the ER model can easily be transformed into relational tables. It is simple and easy to understand with a minimum of training; therefore, the model can be used by the database designer to communicate the design to the end user. In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.
Question :
What are degenerate dimensions?
Answer :
If a table contains values which are neither dimensions nor measures, they are called degenerate dimensions, e.g. invoice ID, employee number.
Question :
What is a VLDB?
Answer :
The perception of what constitutes a VLDB (Very Large Database) continues to grow. A one-terabyte database would normally be considered a VLDB.
Question :
What are the various ETL tools in the market?
Answer :
Various ETL tools used in the market are:
Question :
What is the main difference between a schema in an RDBMS and schemas in a Data Warehouse?
Answer :
RDBMS Schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot easily solve extract and complex problems
* Poorly modeled
DWH Schema
* Used for OLAP systems
* New generation schema
* Denormalized
* Easy to understand and navigate
* Extract and complex problems can be easily solved
* Very good model
Question :
What are the possible data marts in retail sales?
Answer :
Product information and sales information.
Question :
1. What is incremental loading? 2. What is batch processing? 3. What is a cross-reference table? 4. What is an aggregate fact table?
Answer :
Incremental loading means loading only the ongoing changes from the OLTP system. An aggregate table contains measure values aggregated (grouped or summed up) to some level of the hierarchy.
Question :
What is metadata?
Answer :
Metadata is data about data. A business analyst or data modeler usually captures information about data - the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived and its life cycle) - in a data dictionary, a.k.a. metadata. Metadata is also present at the data mart level, for subsets, facts and dimensions, the ODS, etc. For a DW user, metadata provides vital information for analysis / DSS.
Question :
What is a linked cube?
Answer :
A linked cube is one in which a subset of the data can be analyzed in great detail. The linking ensures that the data in the cubes remains consistent.
Question :
What is a surrogate key? Where do we use it? Explain with examples.
Answer :
A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key. It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult. Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is probably better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may display only the AIRPORT_NAME.
2. Adapted from a response by Vincent on Thursday, March 13, 2003: Another benefit you can get from surrogate keys (SIDs) is tracking slowly changing dimensions. A simple, classical example: on the 1st of January 2002, employee 'E1' belongs to business unit 'BU1' (that is what would be in your Employee dimension). This employee has turnover allocated to him on business unit 'BU1'. But on the 2nd of June the employee 'E1' is moved from business unit 'BU1' to business unit 'BU2'. All the new turnover has to belong to the new business unit 'BU2', but the old turnover should belong to business unit 'BU1'. If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to business unit 'BU2', even what actually belongs to 'BU1'. If you use surrogate keys, you can create on the 2nd of June a new record for the employee 'E1' in your Employee dimension with a new surrogate key. This way, in your fact table, your old data (before the 2nd of June) carries the SID of the employee 'E1' + 'BU1', and all new data (after the 2nd of June) takes the SID of the employee 'E1' + 'BU2'. You could consider a slowly changing dimension as an enlargement of your natural key: the natural key of the employee was employee code 'E1', but it becomes employee code + business unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural key enlargement process is that you might not have every part of the new key within your fact table, so you might not be able to join on the new enlarged key - so you need another id.
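A minimal sketch of the idea (hypothetical table and column names, plain SQL, not tied to any particular database or project):

-- Employee dimension keyed by a surrogate key, not the natural key (emp_code)
CREATE TABLE dim_employee (
    emp_sid        INTEGER PRIMARY KEY,  -- surrogate key (SID)
    emp_code       VARCHAR(10),          -- natural business key, e.g. 'E1'
    business_unit  VARCHAR(10),          -- e.g. 'BU1' or 'BU2'
    effective_date DATE,
    end_date       DATE                  -- NULL for the current row
);

-- Before 2nd June 2002: E1 belongs to BU1
INSERT INTO dim_employee VALUES (1, 'E1', 'BU1', DATE '2002-01-01', DATE '2002-06-01');
-- From 2nd June 2002: a new row (new SID) for the same employee, now in BU2
INSERT INTO dim_employee VALUES (2, 'E1', 'BU2', DATE '2002-06-02', NULL);

-- Fact rows reference the SID, so old turnover stays tied to BU1
-- and new turnover is tied to BU2 automatically.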
Question :
What are the data types present in BO, and what happens if we implement a view in the designer and report?
Answer :
There are three different data types: Dimension, Measure and Detail. A view is nothing but an alias, and it can be used to resolve loops in the universe.
Question :
What are data validation strategies for data mart validation after the loading process?
Answer :
Data validation makes sure that the loaded data is accurate and meets the business requirements. Strategies are the different methods followed to meet the validation requirements.
Question :
What is a Data Warehousing hierarchy?
Answer :
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies - one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity; this is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures, for example a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values: for a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.
Levels: A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.
Level relationships: Level relationships specify the top-to-bottom ordering of levels from the most general (the root) to the most specific information. They define the parent-child relationship between the levels in a hierarchy. Hierarchies are also essential components in enabling more complex query rewrites: for example, the database can rewrite an existing sales revenue aggregation at the quarterly level to a yearly aggregation when the dimensional dependencies between quarter and year are known.
Question :
What is a BUS schema?
Answer :
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.
Question :
What are the methodologies of Data Warehousing?
Answer :
Every company has a methodology of its own, but to name a few, the SDLC methodology and the AIM methodology are commonly used. Other methodologies are AMM, the World Class methodology and many more.
Question :
What is a conformed dimension?
Answer :
Conformed dimensions are dimensions which can be used across multiple data marts in combination with multiple fact tables accordingly.
Question :
What is the difference between E-R Modeling and Dimensional Modeling?
Answer :
Data modeling is designing the database using database normalization techniques (1NF, 2NF, 3NF, etc.). Data modeling is of two types: 1. ER Modeling 2. Dimensional Modeling. ER Modeling is for OLTP databases and uses any of the forms 1NF, 2NF or 3NF; it contains normalized data. Dimensional Modeling is for data warehouses and contains denormalized data. The main difference between these two modeling techniques is the degree of normalization used to design the databases. Both modeling techniques are represented using ER diagrams, so the choice depends upon the client requirement.
Question :
Why is the fact table in normal form?
Answer :
The fact table is the central table in a star schema. It is kept normalized because it is very large, so we should avoid redundant data in it. That is why we split the descriptive attributes out into separate dimension tables, giving a star schema model which helps query performance and eliminates redundant data from the fact table.
Question :
What is the definition of normalized and denormalized views and what are the differences between them?
Answer :
Normalization is the process of removing redundancies; denormalization is the process of introducing controlled redundancies to improve query performance.
Question :
What is a junk dimension? What is the difference between a junk dimension and a degenerated dimension?
Answer :
Junk dimension: a grouping of random flags and text attributes in a dimension, moved out into a separate sub-dimension.
Degenerate dimension: keeping control information on the fact table. For example, consider a dimension table with fields like order number and order line number that has a 1:1 relationship with the fact table; in this case the dimension is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins while retrieving order information.
Question :
What is the difference between a view and a materialized view?
Answer :
View: stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
Materialized view: stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that every time you run the query the stored result set is used. Pros include quick query results.
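A small illustration using Oracle/PostgreSQL-style syntax (materialized view syntax varies by database; the table and view names are made up for the example):

-- Ordinary view: only the query text is stored; it runs on every access
CREATE VIEW v_sales_by_product AS
SELECT product_id, SUM(sale_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_id;

-- Materialized view: the result set itself is stored and can be refreshed periodically
CREATE MATERIALIZED VIEW mv_sales_by_product AS
SELECT product_id, SUM(sale_amount) AS total_sales
FROM   sales_fact
GROUP  BY product_id;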
Question :
What is the main difference between the Inmon and Kimball philosophies of data warehousing?
Answer :
Both differ in the concept of building the data warehouse. Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is built from the data marts through conformed dimensions. Hence a unified view of the enterprise can be obtained from dimensional modeling done at a local, departmental level. Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store, and other subject areas can be added to the data warehouse as the need arises; point-of-sale (POS) data can be added later if management decides it is necessary. In short: Kimball - first the data marts, then combined into a data warehouse; Inmon - first the data warehouse, later the data marts.
Question :
What are the advantages of data mining over traditional approaches?
Answer :
Data mining is used for estimating the future. For example, for a company or business organization, using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, etc. Traditional approaches use simple algorithms for estimating the future and do not give results as accurate as data mining.
Question :
What are the different architectures of a data warehouse?
Answer :
There are two main architectures.
Question :
What are the steps involved in designing a data warehouse / dimensional model?
Answer :
As far as I know:
Gathering business requirements
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organizing the attribute hierarchy and defining relationships
Assigning unique identifiers
Additional conventions: cardinality / adding ratios
Question :
What is a factless fact table? Where have you used it in your project?
Answer :
A factless fact table contains only the keys of the dimensions; there are no measures available in it.
Question :
What is the difference between ODS and OLTP?
Answer :
An ODS is a collection of tables created in the data warehouse environment that maintains only current data, whereas an OLTP system maintains the data of transactions; OLTP systems are designed for recording the daily operations and transactions of a business.
Question :
What is the difference between a data warehouse and BI?
Answer :
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of that business. A BI tool analyzes the data of a data warehouse and arrives at business decisions depending on the results of the analysis.
Question :
What is the difference between OLAP and a data warehouse?
Answer :
A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data: managing aggregations and partitioning information into cubes for in-depth visualization.
Question :
What is an aggregate table and an aggregate fact table? Any examples of both?
Answer :
An aggregate table contains summarized data; a materialized view is one form of aggregate table. For example, suppose sales are recorded per date (transaction level) and we want to create a report like sales by product per year. In such cases we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use the @aggregate function.
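A minimal sketch of building such an aggregate (hypothetical table names, plain SQL; in practice this might be a materialized view or a separate ETL job):

-- Yearly aggregate built from a date-level sales fact table
CREATE TABLE sales_year_agg AS
SELECT product_id,
       EXTRACT(YEAR FROM sale_date) AS sale_year,
       SUM(sale_amount)             AS total_sales
FROM   sales_fact
GROUP  BY product_id, EXTRACT(YEAR FROM sale_date);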
Question :
What are non-additive facts, in detail?
Answer :
A fact may be a measure, a metric or a dollar value. Measures and metrics here are non-additive facts, while the dollar value is an additive fact: if we want to find the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with a total. For a non-additive fact, e.g. a measure such as height for 'citizens by geographical location', when we roll up 'city' data to 'state' level we should not add the heights of the citizens; rather we may want to use the data to derive a 'count'.
Question :
Why is denormalization promoted in Universe designing?
Answer :
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called a DIMENSION table, for performance and for slicing data. Because the tables are merged into one large dimension table, complex intermediate joins are avoided: dimension tables are joined directly to fact tables. Though redundancy of data occurs in the DIMENSION table, the size of the DIMENSION table is only about 15% of the FACT table. That is why denormalization is promoted in Universe designing.
Question :
What is a snapshot?
Answer :
You can disconnect a report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.
ETL Basic Questions ******************************
Question :
What is a data warehouse?
Answer :
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for on-line transaction processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
Question :
What are Data Marts?
Answer :
Data marts are designed to help managers make strategic decisions about their business. Data marts are subsets of the corporate-wide data that are of value to a specific group of users. There are two types of data marts: 1. Independent data marts are sourced from data captured from OLTP systems, external providers, or from data generated locally within a particular department or geographic area. 2. Dependent data marts are sourced directly from enterprise data warehouses.
Question :
What is an ER Diagram?
Answer :
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is that it maps well to the relational model: the constructs used in the ER model can easily be transformed into relational tables. It is simple and easy to understand with a minimum of training; therefore, the model can be used by the database designer to communicate the design to the end user. In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.
Question :
What is a Star Schema?
Answer :
A star schema is a way of organizing the tables such that we can retrieve results from the database easily and quickly in the warehouse environment. Usually a star schema consists of one or more dimension tables arranged around a fact table; the layout looks like a star, which is how it got its name.
Question :
What is Dimensional Modeling? Why is it important?
Answer :
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.
Why is data modeling important? Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints. The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term: without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.
Question :
What is a Snowflake Schema?
Answer :
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.
Question :
What are Aggregate tables?
Answer :
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual tables, which may have millions of records, takes more time and also affects performance. To avoid this we can aggregate the table to the required level and use it. Such a table reduces the load on the database server, increases the performance of the query and returns results very quickly.
Question :
What is the difference between OLTP and OLAP?
Answer :
The main differences between OLTP and OLAP are:
1. User and System Orientation. OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals. OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data Contents. OLTP: manages current data, very detail-oriented. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.
3. Database Design. OLTP: adopts an entity-relationship (ER) model and an application-oriented database design. OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. View. OLTP: focuses on the current data within an enterprise or department. OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.
Question :
What is ETL?
Answer :
ETL stands for Extraction, Transformation and Loading. ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters. Extraction: take data from an external source and move it to the warehouse pre-processor database. Transformation: the transform data task allows point-to-point generating, modifying and transforming of data. Loading: the load data task adds records to a database table in the warehouse.
Question :
What are the various Reporting tools in the Market?
Answer :
1. MS Excel 2. Business Objects (Crystal Reports) 3. Cognos (Impromptu, PowerPlay) 4. MicroStrategy 5. MS Reporting Services 6. Informatica PowerAnalyzer 7. Actuate 8. Hyperion (BRIO) 9. Oracle Express OLAP 10. ProClarity
Question :
What is a Fact table?
Answer :
A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.
Question :
What is a dimension table?
Answer :
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.
Question :
What is a lookup table?
Answer :
A lookup table is one which is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it updates the table by allowing only new records or updated records, based on the lookup condition.
Question :
What is a general purpose scheduling tool?
Answer :
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
Question :
What are modeling tools available in the Market?
Answer :
There are a number of data modeling tools:
Tool Name - Company Name
ERwin - Computer Associates
Embarcadero - Embarcadero Technologies
Rational Rose - IBM Corporation
PowerDesigner - Sybase Corporation
Oracle Designer - Oracle Corporation
Question :
What is real-time data warehousing?
Answer :
Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.
Question :
What is data mining?
Answer :
Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be used to mine data for the highest-risk people to insure in a certain geographical area.
Question :
What are Normalization, First Normal Form, Second Normal Form, and Third Normal Form?
Answer :
1. Normalization is the process of assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependencies between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF: Repeating groups are eliminated, dependencies can be identified, and all key attributes are defined; there are no repeating groups in the table.
2NF: The table is already in 1NF and includes no partial dependencies (no attribute depends on only a portion of the primary key). It may still exhibit transitive dependency, i.e. attributes may be functionally dependent on non-key attributes.
3NF: The table is already in 2NF and contains no transitive dependencies.
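A tiny illustration of removing a transitive dependency to reach 3NF (hypothetical tables, plain SQL):

-- Not in 3NF: city -> state is a transitive dependency on the key (customer_id)
CREATE TABLE customer_flat (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(50),
    state       VARCHAR(50)   -- depends on city, not directly on customer_id
);

-- 3NF: move the city -> state dependency into its own table
CREATE TABLE city (
    city  VARCHAR(50) PRIMARY KEY,
    state VARCHAR(50)
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    city        VARCHAR(50) REFERENCES city(city)
);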
Question :
What is an Operational Data Store (ODS)?
Answer :
An ODS is a collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure that summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.
Question :
What type of indexing mechanism do we need to use for a typical data warehouse?
Answer :
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or other types of clustered/non-clustered, unique/non-unique indexes. To my knowledge, SQL Server does not support bitmap indexes; only Oracle supports bitmaps.
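For example, in Oracle (hypothetical fact table and column names):

-- Bitmap index on a low-cardinality foreign key column of the fact table (Oracle syntax)
CREATE BITMAP INDEX bix_sales_product
    ON sales_fact (product_key);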
Question :
Which columns go to the fact table and which columns go to the dimension table?
Answer :
The primary key columns of the entity tables become keys of the dimension tables, and the primary key columns of the dimension tables go to the fact table as foreign keys.
Question :
What is the level of granularity of a fact table?
Answer :
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means what detail you are willing to keep for each transactional fact: product sales for every individual transaction, or aggregated up to, say, the minute.
Question :
What does the level of granularity of a fact table signify?
Answer :
The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps: determine which dimensions will be included, and determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements.
Question :
How are the dimension tables designed?
Answer :
Most dimension tables are designed using normalization principles up to 2NF; in some instances they are further normalized to 3NF. Find where the data for the dimension is located, figure out how to extract it, determine how to maintain changes to the dimension (see more on this in the next section), and change the fact table and DW population routines accordingly.
Question :
What are slowly changing dimensions?
Answer :
SCD stands for slowly changing dimension. Slowly changing dimensions are of three types:
SCD1: only the updated values are maintained. Example: when a customer address is modified, we update the existing record with the new address.
SCD2: historical information and current information are maintained by using (a) effective dates, (b) versions, (c) flags, or a combination of these.
SCD3: by adding new columns to the target table we maintain historical information and current information.
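A hedged sketch of how Type 1 and Type 3 changes might be applied (hypothetical dim_customer table; exact ALTER TABLE syntax varies by database). A Type 2 change would instead insert a new row with a new surrogate key, as in the surrogate key example earlier.

-- SCD Type 1: overwrite in place, history is lost
UPDATE dim_customer
SET    address = '12 New Street'
WHERE  customer_code = 'C100';

-- SCD Type 3: keep limited history in an extra column
ALTER TABLE dim_customer ADD COLUMN previous_address VARCHAR(200);
UPDATE dim_customer
SET    previous_address = address,
       address          = '12 New Street'
WHERE  customer_code = 'C100';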
Question :
What are non-additive facts?
Answer :
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Question :
What are conformed dimensions?
Answer :
Conformed dimensions are dimensions which are common to multiple cubes (cubes are the schemas containing fact and dimension tables). For example, if Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4 (where F are facts and D are dimensions), then D1 and D2 are the conformed dimensions.
Question :
What is a VLDB?
Answer :
It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. Another view is that VLDB does not refer to the size of the database or the vast amount of information stored; it refers to the window of opportunity to take a backup of the database. The window of opportunity refers to a time interval, and if the DBA is unable to take a backup in the specified time then the database is considered a VLDB.
Question :
What are semi-additive and factless facts, and in which scenario will you use such kinds of fact tables?
Answer :
Snapshot facts are semi-additive; we go for semi-additive facts when we maintain aggregated facts such as an average daily balance. A fact table without numeric fact columns is called a factless fact table. Example: a promotion facts table that records the promotion events of transactions (e.g. product samples), because this table does not contain any measures.
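For instance, with a hypothetical daily account_balance_snapshot table, a balance can be summed across accounts but not across days; across the time dimension you average (or take the last value) instead:

-- Semi-additive: additive across accounts, not across the date dimension
SELECT balance_date, SUM(balance) AS total_balance_that_day
FROM   account_balance_snapshot
GROUP  BY balance_date;

-- Across time, use an average instead (e.g. average daily balance per account)
SELECT account_id, AVG(balance) AS avg_daily_balance
FROM   account_balance_snapshot
GROUP  BY account_id;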
Question :
How do you load the time dimension?
Answer :
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.
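One possible sketch of such a loader using a recursive common table expression (PostgreSQL-style syntax shown here; the exact syntax, date functions and recursion limits vary by database, and the dim_date table and date range are hypothetical):

-- Generate one row per day for a chosen range and load the time dimension
WITH RECURSIVE dates (d) AS (
    SELECT DATE '2000-01-01'
    UNION ALL
    SELECT d + INTERVAL '1' DAY FROM dates WHERE d < DATE '2030-12-31'
)
INSERT INTO dim_date (date_key, calendar_date, year_num, month_num, day_num)
SELECT CAST(TO_CHAR(d, 'YYYYMMDD') AS INTEGER),  -- smart key of the form 20000101
       d,
       EXTRACT(YEAR FROM d),
       EXTRACT(MONTH FROM d),
       EXTRACT(DAY FROM d)
FROM   dates;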
Question :
Why are OLTP database designs not generally a good idea for a data warehouse?
Answer :
In OLTP the tables are normalized, so query response will be slow for end users, and an OLTP system does not contain years of data, so long-term trends cannot be analyzed.
Question :
Why should you put your data warehouse on a different system than your OLTP system?
Answer :
An OLTP system is basically data-oriented (ER model) and not subject-oriented (dimensional model); that is why we design a separate, subject-oriented OLAP system. Moreover, if a complex query is fired on an OLTP system it causes a heavy overhead on the OLTP server, which affects the day-to-day business directly. The loading of a warehouse will also likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource-intensive because of the potentially large amount of data available. Such loads and resource needs would conflict with the OLTP systems' need for resources and would negatively impact those production systems.
Question :
What is a full load and an incremental (refresh) load?
Answer :
Full load: completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental load: applying ongoing changes to one or more tables based on a predefined schedule.
Question :
What are snapshots? What are materialized views and where do we use them? What is a materialized view log?
Answer :
A materialized view is a view in which the data is also stored physically. With the ordinary view concept, the database only stores the query, and when we call the view it extracts the data from the base tables; in a materialized view, however, the data itself is stored in a table.
Question :
What is the difference between an ETL tool and an OLAP tool?
Answer :
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database with some process of cleansing the data. Examples: Informatica, DataStage, etc. OLAP is meant for reporting purposes: in OLAP the data is available in a multidimensional model, so you can write simple queries to extract data from the database. Examples: Business Objects, Cognos, etc.
Question :
Where do we use semi-additive and non-additive facts?
Answer :
Additive: a measure that can participate in arithmetic calculations using all or any dimensions. Example: sales profit.
Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions. Example: sales amount.
Non-additive: a measure that cannot participate in arithmetic calculations using dimensions. Example: temperature.
Question :
What is a staging area? Do we need it? What is the purpose of a staging area?
Answer :
Data staging is a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps: source data extraction, data transformation (restructuring), data transformation (data cleansing, value transformations), and surrogate key assignments.
Question :
What are the various methods of getting incremental records or delta records from the source systems?
Answer :
One foolproof method is to maintain a field called 'Last Extraction Date' and then impose a condition in the code saying 'current_extraction_date > last_extraction_date'.
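Roughly, assuming the source table carries a last-updated timestamp column and the ETL stores the previous run's extraction date in a control table (all names here are hypothetical):

-- Pull only rows changed since the previous extraction run
SELECT *
FROM   source_orders
WHERE  last_update_ts > (SELECT MAX(extraction_date)
                         FROM   etl_control
                         WHERE  table_name = 'source_orders');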
Question :
What is a three-tier data warehouse?
Answer :
A data warehouse can be thought of as a three-tier system in which a middle system provides usable data in a secure way to end users. On either side of this middle system are the end users and the back-end data stores.
Question :
What are active transformations / passive transformations?
Answer :
An active transformation can change the number of rows that pass through it (decrease or increase the rows). A passive transformation cannot change the number of rows that pass through it.
Question :
Compare ETL and manual development.
Answer :
ETL is the process of extracting data from multiple sources (e.g. flat files, XML, COBOL, SAP, etc.) and is much simpler with the help of tools. Manual: loading data from anything other than flat files and Oracle tables needs more effort.
ETL: high and clear visibility of the logic. Manual: complex and not so user-friendly visibility of the logic.
ETL: contains metadata, and changes can be made easily. Manual: no metadata concept, and changes need more effort.
ETL: error handling, log summaries and load progress make life easier for the developer and maintainer. Manual: needs maximum effort from a maintenance point of view.
ETL: can handle historic data very well. Manual: as data grows, the processing time degrades.
These are some differences between manual and ETL development.
Defining the data warehouse: A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management decisions. Subject-oriented: focus on natural data groups, not application boundaries. Integrated: provide consistent formats and encodings. Time-variant: data is organized by time and is stored in diverse time slices. Nonvolatile: no updates are allowed; only load and retrieval operations.
Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity of needs. This was a radical departure from serving only the vertical application views of data (supply-side) or the overlapping departmental needs for data (demand side). The integration goal was taken from the realm of enterprise rhetoric down to something attainable. Integration is not the act of wiring applications together. Nor is it simply commingling data from a variety of sources. Integration is the process of mapping dissimilar codes to a common base, developing consistent data element presentations and delivering this standardized data as broadly as possible. Time variance is the most confusing Inmon concept but also a most pivotal one. At its essence, it calls for storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames. You might have detail for seven years along with weekly, monthly and quarterly aggregates of differing duration. The time variant strategy is essential; not only for performance but also for maintaining the consistency of reported summaries across departments and over time.5 Non-volatile design is essential. It is also the principle most often violated or poorly implemented. Non-volatility literally means that once a row is written, it is never modified. This is necessary to preserve incremental net change history. This, in turn, is required to represent data as of any point in time. When you update a data row, you destroy information. You can never recreate a fact or total that included the unmodified data. Maintaining "institutional memory" is one of the higher goals of data warehousing. Inmon lays out several other principles that are not a component of his definition. Some of these principles were initially controversial but are commonly accepted now. Others are still in dispute or have fallen into disfavor. Modification of one principle is the basis for the next leap forward. Data Warehouse 2000: Real-Time Data Warehousing Our next step in the data warehouse saga is to eliminate the snapshot concept and the batch ETL mentality that has dominated since the very beginning. The majority of our developmental dollars and a massive amount
of processing time go into retrieving data from operational databases. What if we eliminated this whole write then detect then extract process? What if the data warehouse read the same data stream that courses into and between the operational system modules? What if data that was meaningful to the data warehouse environment was written by the operational system to a queue as it was created? This is the beginning of a real-time data warehouse model. But, there is more. What if we had a map for every operational instance that defined its initial transformation and home location in the data warehouse detail/history layer? What if we also had publish-and-subscribe rules that defined downstream demands for this instance in either raw form or as a part of some derivation or aggregation? What if this instance was propagated from its operational origin through its initial transformation then into the detail/history layer and to each of the recipient sites in parallel and in real time?
1. Which DataStage EE client application is used to manage roles for DataStage projects?
A. Director B. Manager C. Designer D. Administrator
2. Importing metadata from data modeling tools like ERwin is accomplished by which facility?
A. MetaMerge B. MetaExtract C. MetaBrokers D. MetaMappers
3. Which two statements are true of writing intermediate results between parallel jobs to persistent data sets? (Choose two.)
A. Datasets are pre-indexed. B. Datasets are stored in native internal format. C. Datasets retain data partitioning and sort order. D. Datasets can only use RCP when a schema file is specified.
4. You are reading customer data using a Sequential File stage and sorting it by customer ID using the Sort stage. Then the sorted data is to be sent to an Aggregator stage which will count the number of records for each customer. Which partitioning method is more likely to yield optimal performance without violating the business requirements?
A. Entire B. Random C. Round Robin D. Hash by customer ID
5. A customer wants to create a parallel job to append to an existing Teradata table with an input file of over 30 gigabytes. The input data also needs to be transformed and combined with two additional flat files. The first has State codes and is about 1 gigabyte in size. The second file is a complete view of the current data, which is roughly 40 gigabytes in size. Each of these files will have a one-to-one match and ultimately be combined into the original file. Which DataStage stage will communicate with Teradata using the maximum parallel performance to write the results to an existing Teradata table?
A. Teradata API B. Teradata Enterprise C. Teradata TPump D. Teradata MultiLoad
6. Which column attribute could you use to avoid rejection of a record with a NULL when it is written to a nullable field in a target Sequential File?
A. null field value B. bytes to skip C. out format D. pad char
7. You are reading customer records from a sequential file. In addition to the customer ID, each record has a field named Rep ID that contains the ID of the company representative assigned to the customer. When this field is blank, you want to retrieve the customer's representative from the REP table. Which stage has this functionality?
A. Join Stage B. Merge Stage C. Lookup Stage D. No stage has this functionality.
8. You want to ensure that you package all the jobs that are used in a Job Sequence for deployment to a production server. Which command line interface utility will let you search for jobs that are used in a specified Job Sequence?
A. dsjob B. dsinfo C. dsadmin D. dssearch
9. Your job is running in a grid environment consisting of 50 computers each having two processors. You need to add a job parameter that will allow you to run the job using different sets of resources and computers on different job runs. Which environment variable should you add to your job parameters?
A. APT_CONFIG_FILE B. APT_DUMP_SCORE C. APT_EXECUTION_MODE D. APT_RECORD_COUNTS
10. Which two statements are valid about Job Templates? (Choose two.)
A. Job Templates can be created from any parallel job or Job Sequence. B. Job Templates should include recommended environment variables including APT_CONFIG_FILE. C. Job Templates are stored on the DataStage development server where they can be shared among developers. D. The location where Job Templates are stored can be changed within the DataStage Designer Tools - Options menu.
Answer to 10: A and B
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.
Transform
The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. However, in other cases any combination of the following transformation types may be required (a SQL sketch of a few of them follows the list):
Selecting only certain columns to load (or, if you prefer, selecting null columns not to load)
Translating coded values (e.g. if the source system stores M for male and F for female but the warehouse stores 1 for male and 2 for female)
Encoding free-form values (e.g. mapping "Male", "M" and "Mr" onto 1)
Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
Joining together data from multiple sources (e.g. lookup, merge, etc.)
Summarizing multiple rows of data (e.g. total sales for each region)
Generating surrogate key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
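For instance, a few of these transformations expressed as a single hypothetical SQL statement (the table and column names are made up; in a tool such as DataStage the same logic would typically live in Transformer, Lookup and Aggregator stages):

-- Translate coded values, derive a calculated value and join in a lookup source
SELECT o.order_id,
       CASE o.gender_code WHEN 'M' THEN 1 WHEN 'F' THEN 2 END AS gender_key,   -- translating coded values
       o.qty * o.unit_price                                   AS sale_amount,  -- deriving a calculated value
       r.region_name                                                           -- joining data from another source
FROM   staging_orders o
LEFT JOIN region_lookup r ON r.region_code = o.region_code;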
Load
The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.
Challenges
ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rule specifications. The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within Service Level Agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues for continuous transformation and update. A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data. There are three main types of parallelism as implemented in ETL applications:
Data: splitting a single sequential file into smaller data files to provide parallel access.
Pipeline: allowing the simultaneous running of several components on the same data stream, e.g. looking up a value on record 1 (step 2) at the same time as adding two fields together on record 2 (step 1).
Component: the simultaneous running of multiple processes on different data streams in the same job, e.g. sorting input file 1 at the same time that the contents of input file 2 are deduplicated.
All three types of parallelism are usually combined in a single job. An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents of a source system or with the general ledger, establishing synchronization and reconciliation points is necessary.
Tools
While an ETL process can be created using almost any programming language, creating one from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
Question:
What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
Added: 7/26/2006
Minimise the usage of Transformers (instead use Copy, Modify, Filter and Row Generator stages); use SQL code while extracting the data; handle the nulls; minimise the warnings; reduce the number of lookups in a job design; use an IPC stage between two passive stages to reduce processing time; drop indexes before loading data and recreate them after the data has been loaded into the tables (a sketch of this appears after this answer). Generally we cannot avoid lookups if our requirements compel us to do them, and there is no hard limit on the number of stages such as 20 or 30, but we can break a job into smaller jobs and use Dataset stages to store intermediate data. The IPC stage is provided in server jobs, not in parallel jobs. Check the write-cache option of the hash file: if the same hash file is used for lookup as well as target, disable this option; if the hash file is used only for lookup, then enable "Preload to memory", which will improve the performance. Also check the order of execution of the routines. Don't use more than 7 lookups in the same Transformer; introduce a new Transformer if it exceeds 7 lookups. Use the "Preload to memory" option on the hash file output and "Write to cache" on the hash file input. Write into the error tables only after all the Transformer stages. Reduce the width of the input record by removing the columns you will not use. Cache the hash files you are reading from and writing into, and make sure your cache is big enough to hold them; use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files, which also minimizes overflow. If possible, break the input into multiple threads and run multiple instances of the job. Stage the data coming from ODBC/OCI/DB2UDB stages, or any database on the server, using hash/sequential files for optimum performance and also for data recovery in case the job aborts. Tune the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects. Tune the 'Project Tunables' in the Administrator for better performance. Use sorted data for the Aggregator; sort the data as much as possible in the database and reduce the use of DS sorts for better job performance. Remove unused data from the source as early as possible in the job. Work with the DB admin to create appropriate indexes on tables for better performance of DS queries. Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. Before writing a routine or a transform, make sure the required functionality is not already in one of the standard routines supplied in the SDK or DS Utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process; this may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the jobs itself; this will eliminate unnecessary records from even getting in before joins are made. Tuning should occur on a job-by-job basis. Use the power of the DBMS: try not to use a Sort stage when you can use an ORDER BY clause in the database, and note that using a constraint to filter a record set is much slower than performing a SELECT ... WHERE. Make every attempt to use the bulk loader for your particular database; bulk loaders are generally faster than ODBC or OLE.
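As one illustration of the index-handling tip above, here is a minimal sketch (not from the original answer) of a script that could be attached as a before-job ExecSH routine with the argument "drop" and as an after-job ExecSH routine with the argument "build"; the index, table and connection details are hypothetical placeholders.

    #!/bin/sh
    # manage_index.sh  drop|build
    # Drop the index before a bulk load and rebuild it afterwards so the
    # load itself runs faster.
    ACTION=$1

    if [ "$ACTION" = "drop" ]; then
        SQL="DROP INDEX idx_sales_dt;"
    else
        SQL="CREATE INDEX idx_sales_dt ON stg_sales (sale_date);"
    fi

    {
        echo "WHENEVER SQLERROR EXIT FAILURE"
        echo "$SQL"
        echo "EXIT;"
    } | sqlplus -s "$ORA_USER/$ORA_PASS@$ORA_SID"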
Question: How can I extract data from DB2 (on an IBM iSeries) to the data warehouse using DataStage as the ETL tool? I mean, do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much. Added: 7/26/2006
Answer:
You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers we normally install; use the CD provided with the DB2 installation, which includes the ODBC drivers for DB2) and then try the connection.
Question:
Added: 7/26/2006
Answer:
You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links. CoolInterview.com
Question:
How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC 1st to open the database connectivity and then use an adapter for just connecting between the two? Thanks alot of any replies.
Added: 7/26/2006
Answer:
You need to configure the ODBC connectivity for the database (DB2 or AS400) in DataStage.

Question: How to improve the performance of a hash file?
Answer:
You can improve the performance of a hashed file by: 1. Preloading the hash file into memory - this can be done by enabling the preload option in the hash file output stage. 2. Write-caching options - data is written into a cache before being flushed to disk; you can enable this to ensure that hash files are written onto the cache in order and then flushed to disk, instead of in the order in which individual rows are written. 3. Preallocating - estimating the approximate size of the hash file so that the file does not need to be split too often after write operations. Added: 7/26/2006
Answer:
Added: 7/26/2006
You can do this by passing parameters from a UNIX file and then calling the execution of the DataStage job; the DS job has the parameters defined, and their values are passed in by UNIX.

Question: What is a project? Specify its various components. Added: 7/26/2006
Answer:
You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains: DataStage jobs; built-in components (predefined components used in a job); and user-defined components (customized components created using the DataStage Manager or DataStage Designer).

Question: How can you implement slowly changing dimensions in DataStage? Explain. 2) Can you join a flat file and a database in DataStage? How? Added: 7/26/2006
Answer:
Yes, we can do it in an indirect way. First create a job which populates the data from the database into a sequential file, and name it Seq_First1. Take the flat file you have and use a Merge stage to join the two files. You have various join types in the Merge stage, such as Pure Inner Join, Left Outer Join, Right Outer Join, etc.; you can use whichever suits your requirements.

Question: Can anyone tell me how to extract data from more than one heterogeneous source - for example a sequential file, Sybase and Oracle - in a single job? Added: 7/26/2006
Answer:
Yes, you can extract the data from two heterogeneous sources in DataStage using the Transformer stage; you simply form a link between the two sources in the Transformer stage.

Question: Will DataStage consider the second constraint in the Transformer once the first condition is satisfied (if link ordering is given)? Added: 7/26/2006
Answer:
Yes.
Question:
Added: 7/26/2006
When we say "Validating a Job", we are talking about running the Job in the "check only" mode. The following checks are made : - Connections are made to the data sources or data warehouse. - SQL SELECT statements are prepared. - Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist. CoolInterview.com Question: Answer: Question: Why do you use SQL LOADER or OCI STAGE? Added: 7/26/2006
Answer:
When the source data is enormous, or for bulk data, we can use OCI or SQL*Loader depending upon the source.

Question: Where do we use the Link Partitioner in a DataStage job? Explain with an example. Added: 7/26/2006
Answer:
We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links. Through the Link Partitioner, Link Collector and IPC stages we can achieve parallelism in server jobs.

What is the purpose of using keys, and what is the difference between surrogate keys and natural keys? Added: 7/26/2006
Question:
We use keys to provide relationships between the entities (tables). By using primary and foreign key relationships, we can maintain the integrity of the data. The natural key is the one coming from the OLTP system. The surrogate key is an artificial key which we create in the target data warehouse; we can use these surrogate keys instead of the natural key. In SCD Type 2 scenarios surrogate keys play a major role.
Added: 7/26/2006
we have to create users in the Administrators and give the necessary priviliges to users. CoolInterview.com How can I specify a filter command for processing data while defining sequential file output data? Added: 7/26/2006
Answer:
We have something called an after-job subroutine and a before-job subroutine, with which we can execute UNIX commands; here we can use the sort command or another filter command.

Question: How to parameterise a field in a sequential file? I am using DataStage as the ETL tool, with a sequential file as the source. Added: 7/26/2006
Answer:
We cannot parameterize a particular field in a sequential file, instead we can parameterize the source file name in a sequential file. CoolInterview.com Is it possible to move the data from oracle ware house to SAP Warehouse using with DATASTAGE Tool. Added: 7/26/2006
Answer:
We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to a SAP warehouse. These plug-in packs are available with DataStage version 7.5.

Question: How to implement Type 2 slowly changing dimensions in DataStage? Explain with an example. Added: 7/26/2006
Answer:
We can handle SCDs in the following ways. Type 1: just use "Insert rows Else Update rows" or "Update rows Else Insert rows" in the update action of the target. Type 2: use the following steps: a) Use one hash file to look up the target.
b) Take 3 instances of the target. c) Give different conditions depending on the process. d) Give different update actions in the target. e) Use system variables like Sysdate and Null.
CoolInterview.com Question: How to handle the rejected rows in datastage? Added: 7/26/2006
Answer:
We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by ticking the Reject checkbox where we write our constraints in the properties of the Transformer, or 2) by using REJECTED in the expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use it as one of the outputs of the Transformer, and apply either of the two steps above on that link; all the rows rejected by all the constraints will go to the hash file.
Question:
Added: 7/26/2006
Answer:
We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt; this shell script can then be called from any of the schedulers available in the market. The second option is to schedule these jobs using the DataStage Director.

Question: What is the Hash File stage and what is it used for? Added: 7/26/2006
Answer:
We can also use the Hash File stage to avoid/remove duplicate rows by specifying the hash key on a particular field.

Question: What is Version Control?
Answer: Version Control stores different versions of DS jobs and runs different versions of the same job. Added: 7/26/2006
Question: How to find the number of rows in a sequential file?
Answer: Using the row-count system variable.

Question: Suppose there are a million records - did you use OCI? If not, which stage do you prefer?
Answer: Use Orabulk.

Question: How to run a job from the command prompt in UNIX?
Answer: Using the dsjob command and its options; a minimal example follows below.
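The following is a minimal sketch of a dsjob invocation, assuming DataStage 7.x command-line syntax; the project name, job name and parameter names are hypothetical placeholders.

    #!/bin/sh
    # Run a DataStage job from UNIX and wait for it to finish.
    # -jobstatus makes dsjob wait and return an exit code that reflects
    # the job's finishing status.
    dsjob -run -jobstatus \
          -param SourceDir=/data/in \
          -param RunDate=`date +%Y%m%d` \
          MyProject LoadSales
    echo "dsjob returned $?"

    # Optionally pull a summary of the job log afterwards.
    dsjob -logsum MyProject LoadSales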
Added: 7/26/2006
Added: 7/26/2006
Added: 8/18/2006
Answer:
How to find errors in a job sequence? Using the DataStage Director we can find the errors in a job sequence.

Question: How do you eliminate duplicate rows?
Added: 7/26/2006
Added: 7/26/2006
Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set. Without the Remove Duplicates stage:
1. In the target, make the column the key column and run the job.
2. Go to the partitioning tab, select hash partitioning, select "perform sort", select "unique", select the column on which you want to remove duplicates, and run; it will work.
3. For example, if the source is a database table you can write a user-defined query at source level, such as SELECT DISTINCT; if the source data comes from a sequential file, you can pass it to a Sort stage with the "allow duplicates" option set to false, or use the Filter property of the Sequential File stage to run sort -u, which removes the duplicates and gives the output (a sketch follows below).
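A minimal sketch of the sort -u approach, assuming the server Sequential File stage feeds its Filter command the file on standard input; the file path used in the before-job variant is a hypothetical placeholder.

    # As the Filter command of a Sequential File stage: the stage pipes the
    # file through the command, so no file name is needed.
    sort -u

    # As a before-job ExecSH command: de-duplicate into a new file that the
    # job then reads.
    sort -u /data/in/customers.psv > /data/in/customers_dedup.psv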
Question:
Added: 7/26/2006
Answer:
U can view all the environment variables in designer. U can check it in Job properties. U can add and access the environment variables from Job properties CoolInterview.com
Question: How do you pass the parameter to the job sequence if the job is running at night? Added: 7/26/2006
Answer:
Two ways: 1. Set the default values of the parameters in the job sequencer and map these parameters to the job. 2. Run the job in the sequencer using the dsjob utility, where we can specify the value to be taken for each parameter; for a night run this is typically wrapped in a script launched by a scheduler (a sketch follows below).
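A minimal sketch of scheduling such a parameterised run overnight with cron; the schedule, paths, project, job and parameter names are all hypothetical placeholders.

    # crontab entry: run the wrapper at 01:30 every night
    # 30 1 * * * /opt/etl/scripts/nightly_load.sh >> /opt/etl/logs/nightly.log 2>&1

    #!/bin/sh
    # nightly_load.sh - supply parameter values to the sequence at run time.
    RUN_DATE=`date +%Y-%m-%d`
    dsjob -run -jobstatus \
          -param ProcessDate=$RUN_DATE \
          -param SourceDir=/data/in \
          MyProject seq_NightlyLoad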
Question:
What is the transaction size and array size in OCI stage? how these can be used?
Added: 7/26/2006
Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and later of the Plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab on the Input page. Rows per transaction - The number of rows written before a commit are executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table. Array Size - The number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement. CoolInterview.com Question: What is the difference between drs and odbc stage Added: 7/26/2006
Answer:
To answer your question, the DRS stage should be faster than the ODBC stage as it uses native database connectivity; you will need to install and configure the required database clients on your DataStage server for it to work. The Dynamic Relational Stage was leveraged for PeopleSoft so that a job could run on any of the supported databases; it supports ODBC connections too (read more about that in the plug-in documentation). ODBC uses the ODBC driver for a particular database, whereas DRS is a stage that tries to make switching from one database to another seamless; it uses the native connectivity for the chosen target.

Question: How do you track performance statistics and enhance them?
Answer: Through the Monitor we can view the performance statistics. Added: 7/26/2006
Question:
what is the mean of Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made?
Added: 7/26/2006
Answer:
This means: try to improve performance by avoiding the use of constraints wherever possible and instead filtering while selecting the data itself, using a WHERE clause. This improves performance.

Question: My requirement is like this. Here is the codification suggested: SALE_HEADER_XXXXX_YYYYMMDD.PSV and SALE_LINE_XXXXX_YYYYMMDD.PSV, where XXXXX is an LVM sequence to ensure unicity and continuity of file exchanges (caution, there will be an increment to implement) and YYYYMMDD is the LVM date of file creation, with compression and delivery to SALE_HEADER_XXXXX_YYYYMMDD.ZIP and SALE_LINE_XXXXX_YYYYMMDD.ZIP. If we run that job the target file names are sale_header_1_20060206 and sale_line_1_20060206; if we run it again they should be sale_header_2_20060206 and sale_line_2_20060206, and if we run it the next day they should be sale_header_3_20060306 and sale_line_3_20060306. That is, whenever we run the same job the target file name should automatically change to filename_(previous number + 1)_currentdate. Added: 7/26/2006
Answer:
This can be done using a UNIX script: 1. Keep the target filename as a constant name, xxx.psv. 2. Once the job has completed, invoke the UNIX script through an after-job routine (ExecSH). 3. The script should get the number used in the previous file, increment it by 1, move the file from xxx.psv to filename_(previous number + 1)_currentdate.psv and then delete xxx.psv. This is the easiest way to implement it; a sketch of such a script appears after the next two questions.

Question: How to drop the index before loading data into the target, and how to rebuild it, in DataStage?
Answer: This can be achieved with the "Direct Load" option of the SQL*Loader utility.

Question: What are the Job parameters? Added: 7/26/2006 Added: 7/26/2006
Answer:
These parameters are used to provide administrative access and to change run-time values of the job. Under Edit > Job Properties, on the Parameters tab, we can define the name, prompt, type and default value of each parameter.

Question: There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of using each type?
Answer: The three different stage types are: i) Custom, ii) Build, iii) Wrapped. Added: 7/26/2006
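A minimal sketch (not from the original answer) of the after-job renaming script described two questions above, shown for the header file only; the directory, base name and counter file are hypothetical placeholders.

    #!/bin/sh
    # rename_sale_files.sh - attach as an after-job ExecSH routine.
    # Renames the constant output file to sale_header_<n+1>_<yyyymmdd>.psv.
    DIR=/data/out
    COUNTER_FILE=$DIR/.sale_seq        # holds the last sequence number used
    TODAY=`date +%Y%m%d`

    LAST=`cat $COUNTER_FILE 2>/dev/null`
    [ -z "$LAST" ] && LAST=0
    NEXT=`expr $LAST + 1`

    mv $DIR/xxx.psv $DIR/sale_header_${NEXT}_${TODAY}.psv
    echo $NEXT > $COUNTER_FILE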
Question: What are the different types of lookups available in DataStage?
Added: 7/26/2006
Answer:
There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: a lookup references another stage or database to get data from it and transform it to the other database. Lookup File Set: this allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link; the output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition; the individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

Question: How can we create containers?
Answer: There are two types of containers: 1. Local container, 2. Shared container. A local container is available only to that particular job, whereas shared containers can be used anywhere in the project. Added: 7/26/2006
Local container: Step 1: select the stages required. Step 2: Edit > Construct Container > Local. Shared container: Step 1: select the stages required. Step 2: Edit > Construct Container > Shared. Shared containers are stored in the Shared Containers branch of the tree structure.
Question:
Added: 7/26/2006
Answer:
DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each job specifies the data sources, the transforms required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run by the Server. DataStage Director. A user interface used to validate, schedule, run, and monitor DataStage jobs. DataStage Manager. A user interface used to view and edit the contents of the Repository. DataStage Administrator. A user interface used to configure DataStage projects and users. CoolInterview.com Question: Types of views in Datastage Director? Added: 7/27/2006
Answer:
There are 4 types of views in Datastage Director a) Job Status View - Dates of Jobs Compiled,Finished,Start time,End time and Elapsedtime b)Job Scheduler View-It Displays whar are the jobs are scheduled. c) Log View - Status of Job last run d) Detail View - Warning Messages, Event Messages, Program Generated Messages. CoolInterview.com
Question:
How to implement routines in DataStage? There are 3 kinds of routines in DataStage: 1. Server routines, which are used in server jobs; these routines are written in the BASIC language.
Added: 7/26/2006
Answer:
2. Parallel routines, which are used in parallel jobs; these routines are written in C/C++. 3. Mainframe routines, which are used in mainframe jobs.
Question:
Added: 7/26/2006
These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run your parallel job) or increase the sequential-file or dataset read/write buffer; an example is $APT_CONFIG_FILE. There are many more environment variables like this: go to Job Properties and click on "Add Environment Variable" to see most of them (a small run-time example follows below).
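A minimal sketch, assuming $APT_CONFIG_FILE has been added as a job parameter in the job's properties so that it can be overridden at run time; the project, job and file path are hypothetical placeholders.

    # Override the parallel configuration file for a single run of the job.
    dsjob -run -jobstatus \
          -param '$APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/4node.apt' \
          MyProject px_LoadSales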
Question:
What are the Steps involved in development of a job in DataStage? The steps required are:
Added: 7/26/2006
select the datasource stage depending upon the sources for ex:flatfile,database, xml etc select the required stages for transformation logic such as transformer,link collector,link partitioner, Aggregator, merge etc select the final target stage where u want to load the data either it is datawatehouse, datamart, ODS,staging etc
Answer:
Question:
Added: 7/26/2006
What is Modulus and Splitting in Dynamic Hashed File? The modulus size can be increased by contacting your Unix Admin. What are Static Hash files and Dynamic Hash files?
Added: 7/27/2006
Added: 7/26/2006
The hashed files have the default size established by their modulus and separation when you create them, and this can be static or dynamic. Answer: Overflow space is only used when data grows over the reserved size for someone of the groups (sectors) within the file. There are many groups as the specified by the modulus.
Question:
What is the exact difference betwwen Join,Merge and Lookup Stage? The exact difference between Join,Merge and lookup is The three stages differ mainly in the memory they use
Added: 7/26/2006
DataStage doesn't know how large your data is, so cannot make an informed choice whether to combine data using a join stage or a lookup stage. Here's how to decide which to use: Answer: if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over the join processing is very fast and never involves paging or other I/O Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links as many as input links.
Question:
Added: 7/26/2006
The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record ids. When a hashed file is created, Separation and Modulo respectively specifies the group buffer size and the number of buffers allocated for a file. When a Static Hashfile is created, DATASTAGE creates a file that contains the number of groups specified by modulo. Size of Hashfile = modulus(no. groups) * Separations (buffer size)
Answer:
What is the DataStage Administrator used for - did you use it?
Question:
Added: 7/26/2006
Answer:
The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.
Question:
What is the max capacity of Hash file in DataStage? Take a look at the uvconfig file: # 64BIT_FILES - This sets the default mode used to # create static hashed and dynamic files. # A value of 0 results in the creation of 32-bit # files. 32-bit files have a maximum file size of # 2 gigabytes. A value of 1 results in the creation # of 64-bit files (ONLY valid on 64-bit capable platforms). # The maximum file size for 64-bit # files is system dependent. The default behavior # may be overridden by keywords on certain commands. 64BIT_FILES 0
Added: 7/26/2006
Answer:
Question:
Added: 7/26/2006
Symmetric Multiprocessing (SMP) - some hardware resources may be shared between processors; the processors communicate via shared memory and have a single operating system. Cluster or Massively Parallel Processing (MPP) - known as "shared nothing", in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed; the processors have their own operating system and communicate via a high-speed network.
Answer:
Question:
What is the order of execution done internally in the Transformer, with the stage editor having input links on the left-hand side and output links? Stage variables, then constraints, then column derivations or expressions. What are Stage Variables, Derivations and Constraints?
Added: 7/26/2006
Answer:
Added: 7/27/2006
Stage variable - an intermediate processing variable that retains its value during a read and does not pass the value into a target column. Derivation - an expression that specifies the value to be passed on to the target column. Constraint - a condition that is either true or false and that controls the flow of data on a link.
Question:
What is SQL tuning? how do you do it ? Sql tunning can be done using cost based optimization this parameters are very important of pfile
Added: 7/26/2006
Answer:
Question:
How to implement type2 slowly changing dimenstion in datastage? give me with example?
Added: 7/26/2006
Slow changing dimension is a common problem in Dataware housing. For example: There exists a customer called lisa in a company ABC and she lives in New York. Later she she moved to Florida. The company must modify her address now. In general 3 ways to solve this problem Type 1: The new record replaces the original record, no trace of the old record at all, Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two different people. Type 3: The original record is modified to reflect the changes. In Type1 the new one will over write the existing one that means no history is maintained, History of the person where she stayed last is lost, simple to use. Answer: In Type2 New record is added, therefore both the original and the new record Will be present, the new record will get its own primary key, Advantage of using this type2 is, Historical information is maintained But size of the dimension table grows, storage and performance can become a concern. Type2 should only be used if it is necessary for the data warehouse to track the historical changes. In Type3 there will be 2 columns one to indicate the original value and the other to indicate the current value. example a new column will be added which shows the original address as New york and the current address as Florida. Helps in keeping some part of the history and table size is not increased. But one problem is when the customer moves from Florida to Texas the new york information is lost. so Type 3 should only be used if the changes will only occur for a finite number of time. Question: Functionality of Link Partitioner and Link Collector? Added: 7/27/2006
Answer:
Server jobs mainly execute in a sequential fashion; the IPC stage, as well as the Link Partitioner and Link Collector, simulate a parallel mode of execution over server jobs running on a single CPU. Link Partitioner: it receives data on a single input link and diverts the data to a maximum of 64 output links, and the data is processed by the same stage having the same metadata. Link Collector: it collects the data from up to 64 input links, merges it into a single data flow and loads it to the target. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer.

Question: What is the difference between a sequential file and a data set? When to use the Copy stage? Added: 7/26/2006
Question:
Answer:
The Sequential File stage stores small amounts of data, with any extension, in order to access the file, whereas a Data Set is used to store huge amounts of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets; each record of the input data set is copied to every output data set, and records can be copied without modification or you can drop or change the order of columns.

Question: What happens if RCP is disabled? Added: 7/26/2006
Question:
Answer:
Runtime column propagation (RCP): If RCP is enabled for any job, and specifically for those stage whose output connects to the shared container input, then meta data will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, in such case OSH has to perform Import and export every time when the job runs and the processing time job is also increased.
Question:
What are Routines and where/how are they written and have you written any routines before?
Added: 7/26/2006
Answer:
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The following program components are classified as routines: Transform functions - functions that you can use when defining custom transforms; DataStage has a number of built-in transform functions, located in the Routines > Examples > Functions branch of the Repository, and you can also define your own transform functions in the Routine dialog box. Before/After subroutines - when designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage; DataStage has a number of built-in before/after subroutines, located in the Routines > Built-in > Before/After branch in the Repository, and you can also define your own before/after subroutines using the Routine dialog box. Custom UniVerse functions - specialized BASIC functions that have been defined outside DataStage; using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository, and you specify the category when you create the routine.

Question: If NLS is enabled, how can we call a routine in a DataStage job? Explain with steps. Added: 7/26/2006
Question:
Answer:
Routines are used for implementing business logic. They are of two types: 1) before-subroutines and 2) after-subroutines. Steps: double-click on the Transformer stage, right-click on any one of the mapping fields, select the [ds routines] option, give the business logic within the edit window, and select either of the options (before / after subroutine).
Types of Parallel Processing? Parallel Processing is broadly classified into 2 types. a) SMP - Symmetrical Multi Processing. b) MPP - Massive Parallel Processing. What are orabulk and bcp stages?
Added: 7/27/2006
Added: 7/26/2006
ORABULK is used to load bulk data into a single table of a target Oracle database. BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase. What is OCI, and how are the ETL tools used?
Answer:
Added: 7/26/2006
OCI doesn't mean the orabulk data. It actually uses the "Oracle Call Interface" of the oracle to load the data. It is kind of the lowest level of Oracle being used for loading the data.
Added: 7/26/2006
No, it is not possible to run parallel jobs in server jobs, but server jobs can be executed in parallel jobs through server shared containers. Is it possible for two users to access the same job at a time in DataStage? Added: 7/26/2006
No, it is not possible for two users to access the same job at the same time; DataStage will produce the error "Job is accessed by other user". Explain the differences between Oracle 8i and 9i? Multiprocessing, and more dimensional modelling. Do you know about MetaStage? Added: 7/26/2006 Added: 7/26/2006
MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling; these data definitions are stored in the repository and can be accessed with the use of MetaStage. What is a merge and how can it be done? Please explain with a simple example taking 2 tables. Added: 7/26/2006
Answer:
Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept: if we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order and join the two tables. What is a merge and how do you use it? Merge is a stage that is available in both parallel and server jobs. Added: 7/26/2006
Question:
Answer:
The Merge stage is used to join two tables (server/parallel) or two tables/datasets (parallel). Merge requires that the master table/dataset and the update table/dataset be sorted. The merge is performed on a key field, and the key field is mandatory in both the master and the update dataset/table. What is the difference between the Merge stage and the Join stage? Merge and Join stage differences: Added: 7/26/2006
Question: Answer:
1. Merge has reject links. 2. It can take multiple update links. 3. If you use it for comparison, then the first matching data will be the output, because it uses the update links to extend the primary details which are coming from the master link.

Question: What are the enhancements made in DataStage 7.5 compared with 7.0? Added: 7/26/2006
Answer:
Many new stages were introduced compared to DataStage version 7.0. In server jobs we have the Stored Procedure stage and the Command stage, and a generate-report option in the File tab. In job sequences many activities such as Start Loop, End Loop, Terminate Loop and User Variables were introduced. In parallel jobs the Surrogate Key stage and the Stored Procedure stage were introduced.

Question: What is NLS in DataStage? How do we use NLS in DataStage? What advantages does it give? At the time of installation I did not choose the NLS option; now I want to use it - what can I do? Reinstall DataStage, or uninstall first and then install again?
Answer: Just reinstall; you can then select the option to include NLS.

Question: How can we join one Oracle source and a sequential file?
Answer: A Join or a Lookup can be used to join an Oracle source and a sequential file. Added: 7/26/2006 Added: 7/26/2006
Question:
What is job control? How can it be used? Explain with steps.
Added: 7/26/2006
Answer:
Job control (JCL) is used to run a number of jobs at a time, with or without using loops. Steps: click on Edit in the menu bar and select 'Job Properties', then enter the parameters, for example (parameter / prompt / type): STEP_ID / STEP_ID / string; Source / SRC / string; DSN / DSN / string; Username / unm / string; Password / pwd / string. After editing the above, go to the job control settings, select the jobs from the list box and run the job. What is the difference between DataStage and DataStage TX? Added: 7/26/2006
Question:
Answer:
It is a tricky question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool and it is not a new version of DataStage 7.5; TX is used for ODS sources - this much I know.
If the size of the Hash file exceeds 2GB..What happens? Does it overwrite the current rows? It overwrites the file Do you know about INTEGRITY/QUALITY stage?
Added: 7/26/2006
Added: 7/26/2006
The Integrity/Quality Stage is a data integration tool from Ascential which is used to standardize/integrate data from different sources. How much would be the size of the database in DataStage? What is the difference between in-process and inter-process?
Added: 7/26/2006
In-process You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row. Note: You cannot use in-process row-buffering if your job uses COMMON blocks in transform
functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks. Inter-process: use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, which will run simultaneously on a separate processor. Note: you cannot use inter-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages; again, this is not recommended practice, and it is advisable to redesign your job to use row buffering instead.

Question: How can you do an incremental load in DataStage?
Answer: Incremental load means daily load. Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it in through a job parameter, and supply the current date and time as the second parameter (a sketch follows below).
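A minimal sketch (one possible way, not the only one) of a wrapper script that keeps the last successful run timestamp in a state file and passes the window boundaries as job parameters; paths, project, job and parameter names are hypothetical placeholders, and the exit-code handling assumes dsjob -jobstatus returns 1 for "finished OK".

    #!/bin/sh
    # incremental_load.sh - run the delta job for the window (last run, now].
    STATE=/opt/etl/state/last_run.txt
    NOW=`date '+%Y-%m-%d %H:%M:%S'`
    LAST=`cat $STATE 2>/dev/null`
    [ -z "$LAST" ] && LAST='1900-01-01 00:00:00'   # first ever run

    dsjob -run -jobstatus \
          -param "FromTimestamp=$LAST" \
          -param "ToTimestamp=$NOW" \
          MyProject job_IncrementalLoad
    RC=$?

    # Advance the watermark only if the job finished cleanly.
    if [ $RC -eq 1 ]; then
        echo "$NOW" > $STATE
    fi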
Question:
Added: 7/26/2006
Question:
What is the meaning of the following? 1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis; use the power of the DBMS.
Added: 7/26/2006
Question:
I want to process 3 files sequentially, one by one - how can I do that? While processing, it should fetch the files automatically. If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time; or you can create a sequencer to use the job (a shell sketch of the looping approach follows below).
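A minimal sketch of the looping approach driven from the shell, assuming the job takes the input file name as a parameter; file, project and job names are hypothetical placeholders, and the exit-code check assumes dsjob -jobstatus returns 1 or 2 for a clean or warning finish.

    #!/bin/sh
    # Process three files one by one with the same parameterised job.
    for F in /data/in/file1.txt /data/in/file2.txt /data/in/file3.txt
    do
        dsjob -run -jobstatus -param InputFile=$F MyProject job_LoadFile
        RC=$?
        if [ $RC -gt 2 ]; then
            echo "Load failed for $F (dsjob status $RC)" >&2
            exit 1
        fi
    done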
Added: 7/26/2006
Question: What happens if the output of a hash file is connected to a Transformer? What error does it throw? Added: 7/26/2006
Answer:
If a hash file's output is connected to a Transformer stage, the hash file will be treated as a lookup file when there is a primary link to the same Transformer stage; if there is no primary link, then this link will be treated as the primary link itself. You can do SCDs in server jobs by using this lookup functionality. This will not return any error code. What are the Iconv and Oconv functions? Added: 7/26/2006
Question:
Answer:
Iconv( )-----converts string to internal storage format Oconv( )----converts an expression to an output format
Question: Answer:
Added: 7/26/2006
I have never tried doing this, however, I have some information which will help you in saving a lot of time. You can convert your server job into a server shared container. The server shared container
can also be used in parallel jobs as shared container. Question: Can we use shared container as lookup in datastage server jobs? Added: 7/26/2006
Answer: Question:
I am using DataStage 7.5, Unix. we can use shared container more than one time in the job.There is any limit to use it. why because in my job i used the Shared container at 6 flows. At any time only 2 flows are working. can you please share the info on this. DataStage from Staging to MDW is only running at 1 row per second! What do we do to remedy? Added: 7/26/2006
I am assuming that there are too many stages, which is causing problem and providing the solution. In general. if you too many stages (especially transformers , hash look up), there would be a lot of overhead and the performance would degrade drastically. I would suggest you to write a query instead of doing several look ups. It seems as though embarassing to have a tool and still write a query but that is best at times. If there are too many look ups that are being done, ensure that you have appropriate indexes while querying. If you do not want to write the query and use intermediate stages, ensure that you use proper elimination of data between stages so that data volumes do not cause overhead. So, there might be a re-ordering of stages needed for good performance. Other things in general that could be looked in: 1) for massive transaction set hashing size and buffer size to appropriate values to perform as much as possible in memory and there is no I/O overhead to disk. 2) Enable row buffering and set appropriate size for row buffering 3) It is important to use appropriate objects between stages for performance CoolInterview.com Question: What is the flow of loading data into fact & dimensional tables? Here is the sequence of loading a datawarehouse. 1. The source data is first loading into the staging area, where data cleansing takes place. Answer: 2. The data from staging area is then loaded into dimensions/lookups. 3.Finally the Fact tables are loaded from the corresponding source tables from the staging area. Added: 7/26/2006
Answer:
Question: How to handle Date convertions in Datastage? Convert a mm/dd/yyyy format to yyyydd-mm? Added: 7/26/2006
Answer: Here is the right conversion. The function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]").

Question: What is the difference between server jobs and parallel jobs? Added: 7/26/2006
Question:
Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to other data sources as necessary. Answer: Parallel jobs. These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
Question:
What are the most important aspects that a beginner must consider doin his first DS project ?
Added: 7/26/2006
Answer: Question:
He should be good at DataWareHousing Concepts and he should be familiar with all stages What is hashing algorithm and explain breafly how it works?
Added: 7/26/2006
Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and anchor point within the block. It's closely related to statistical probability as to how well the algorithms work. Answer: It sounds fancy but these algorithms are usually quite simple and use division and remainder techniques. Any good book on database systems will have information on these techniques. Interesting to note that these approaches are called "Monte Carlo Techniques" because the behavior of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records (on this roulette wheel there are many balls not just one). Question: How the hash file is doing lookup in serverjobs?How is it comparing the key values? Added: 7/26/2006
Answer:
A hashed file is used for two purposes: 1. to remove duplicate records, and 2. as a reference for lookups. A hashed file record has three parts: the hashed key, the key header and the data portion. By using the hashing algorithm and the key value, the lookup is faster. What are the types of hashed file? Hashed files are classified broadly into 2 types: a) Static - subdivided into 17 types based on the primary key pattern; b) Dynamic - subdivided into 2 types: i) Generic, ii) Specific. The default hashed file is Dynamic Type 30.
Question:
Added: 7/26/2006
Answer:
Question: Answer:
Added: 7/26/2006
Hash file stores the data based on hash algorithm and on a key value. A sequential file is just a file with no key column. Hash file used as a reference for look up. Sequential file cannot
Added: 7/26/2006
Flat files stores the data and the path can be given in general tab of the sequential file stage What is data set? and what is file set? Added: 7/26/2006
Answer:
File set: it allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link, and it only executes in parallel mode. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. Data sets are used to import the data in parallel jobs, like ODBC in server jobs.
Question:
what is meaning of file extender in data stage server jobs. can we run the data stage job from one job to another job that file data where it is stored and what is the file extender in ds jobs.
Added: 7/26/2006
Answer:
File extender means adding columns or records to an already existing file in DataStage.
We can run a DataStage job from one job to another job in DataStage.

Question: How do you merge two files in DS? Added: 7/26/2006
Either use a copy/concatenate command (such as cat) as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different (a one-line sketch of the first approach follows below).
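A minimal sketch of the before-job approach, assuming the two files have identical layouts; the paths are hypothetical placeholders.

    # Attach as a before-job ExecSH command: concatenate the two extracts
    # into the single file the job actually reads.
    cat /data/in/part1.txt /data/in/part2.txt > /data/in/combined.txt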
What is the default cache size? How do you change the cache size if needed?
Added: 7/27/2006
Default read cache size is 128MB. We can increase it by going into Datastage Administrator and selecting the Tunable Tab and specify the cache size over there. What about System variables? Added: 7/26/2006
DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only. @DATE The internal date when the program started. See the Date function. @DAY The day of the month extracted from the value in @DATE. @FALSE The compiler replaces the value with 0. @FM A field mark, Char(254). @IM An item mark, Char(255). @INROWNUM Input row counter. For use in constrains and derivations in Transformer stages. @OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages. @LOGNAME The user login name. @MONTH The current extracted from the value in @DATE. @NULL The null value. @NULL.STR The internal representation of the null value, Char(128). Answer: @PATH The pathname of the current DataStage project. @SCHEMA The schema name of the current DataStage project. @SM A subvalue mark (a delimiter used in UniVerse files), Char(252). @SYSTEM.RETURN.CODE Status codes returned by system processes or commands. @TIME The internal time when the program started. See the Time function. @TM A text mark (a delimiter used in UniVerse files), Char(251). @TRUE The compiler replaces the value with 1. @USERNO The user number. @VM A value mark (a delimiter used in UniVerse files), Char(253). @WHO The name of the current DataStage project directory. @YEAR The current year extracted from @DATE. REJECTED Can be used in the constraint expression of a Transformer stage of an output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written. Question: Where does unix script of datastage executes weather in clinet machine or in server.suppose if it eexcutes on server then it will Added: 7/26/2006
Datastage jobs are executed in the server machines only. There is nothing that is stored in the client machine. What is DS Director used for - did u use it?
Added: 7/26/2006
Datastage Director is GUI to monitor, run, validate & schedule datastage server jobs. What's the difference between Datastage Developers and Datastage Designers. What are the skill's required for this. Added: 7/26/2006
Answer:
A DataStage developer is one who codes the jobs; a DataStage designer is one who designs the job, i.e. he deals with the blueprints and designs the jobs and the stages that are required in developing the code. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
Question:
Added: 7/26/2006
Answer:
The data will be partitioned on both the keys, and it will hardly take any more time for execution.

Question: Dimension modelling types along with their significance? Data modelling: 1) E-R diagrams, Added: 7/26/2006
2) Dimensional modelling: 2.a) logical modelling, 2.b) physical modelling.

Question: What is job control? How is it developed? Explain with steps. Added: 7/26/2006
Question:
Controlling DataStage jobs through other DataStage jobs. For example, consider two jobs XXX and YYY. Job YYY can be executed from job XXX by using DataStage macros in routines. To execute one job from another, the following steps need to be followed in the routine: 1. attach the job using the DSAttachJob function; 2. run the other job using the DSRunJob function; 3. stop the job using the DSStopJob function.

Question: Containers: usage and types? Added: 7/27/2006
Answer:
A container is a collection of stages used for the purpose of reusability. There are 2 types of containers: a) Local container - job specific; b) Shared container - used in any job within a project. There are two types of shared container: 1. Server shared container, used in server jobs (it can also be used in parallel jobs). 2. Parallel shared container, used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).

Question: What are constraints and derivations? Explain the process of taking a backup in DataStage. What are the different types of lookups available in DataStage? Added: 7/26/2006
Question:
Answer:
Constraints are used to check for a condition and filter the data. Example: Cust_Id<>0 is set as a constraint and it means and only those records meeting this will be processed further.
Derivation is a method of deriving fields, for example if you need to get some SUM, AVG, etc.

Question: What does a config file in Parallel Extender consist of?
Answer: The config file consists of the following: a) the number of processes or nodes; b) the actual disk storage locations.

Question: How can you implement complex jobs in DataStage? Added: 7/26/2006 Added: 7/27/2006
Answer:
Complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, in terms of increasing performance as well. There is no hard limitation on the number of stages in a job, but for better performance use at most about 20 stages in each job; if it exceeds 20 stages then go for another job. Use no more than 7 lookups per Transformer; otherwise include one more Transformer.
Question:
What are validations you perform after creating jobs in designer. What r the different type of errors u faced during loading and how u solve them
Added: 7/26/2006
Answer:
Check for Parameters. and check for inputfiles are existed or not and also check for input tables existed or not and also usernames, datasource names, passwords like that
How do you fix the error "OCI has fetched truncated data" in DataStage
Added: 7/26/2006
We can use the Change Capture stage to get the truncated data (members, please confirm). What is the User Variables activity? When is it used and how is it used? Where is it used? Give a real example. Added: 7/26/2006
Answer:
By using This User variable activity we can create some variables in the job sequnce,this variables r available for all the activities in that sequnce. Most probablly this activity is @ starting of the job sequnce
Question:
How do we use the NLS function in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly. By using NLS we can: process data in a wide range of languages; use local formats for dates, times and money; and sort data according to local rules. If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage Server engine; for parallel jobs, NLS is implemented using the ICU library. If a DataStage job aborts after, say, 1000 records, how do you continue the job from the 1000th record after fixing the error?
Added: 7/26/2006
Answer:
Question: Answer:
Added: 7/26/2006
By specifying Checkpointing in job sequence properties, if we restart the job. Then job will start by skipping upto the failed record.this option is available in 7.5 edition. Differentiate Database data and Data warehouse data?
Question: Answer:
Added: 7/26/2006
By Database, one means OLTP (On Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.
Added: 8/18/2006
Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level; once we set a specific variable, it is available in the project/job. We can also define new environment variables; for that we can go to the DS Administrator. What are the third-party tools used with DataStage? Autosys, TNG, Event Coordinator. What is APT_CONFIG in DataStage? Added: 7/26/2006 Added: 7/26/2006
APT_CONFIG is just an environment variable used to identify the *.apt file; don't confuse it with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server. If you are running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create? Added: 7/26/2006
Answer: Question:
Answer is 40 You have 10 stages and each stage can be partitioned and run on 4 nodes which makes total number of processes generated are 40 Did you Parameterize the job or hard-coded the values in the jobs? Added: 7/26/2006
Answer:
Always parameterize the job: either the values come from Job Properties or from a parameter manager (a third-party tool). There is no way you would hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked at. What are the default nodes for DataStage Parallel Edition? Added: 7/26/2006
Question: Answer:
Actually the Number of Nodes depend on the number of processors in your system.If your system is supporting two processors we will get two nodes by default.
Question:
What will you in a situation where somebody wants to send you a file and use that file as an input or reference and then run job.
Added: 7/26/2006
A. Under Windows: use the 'WaitForFileActivity' under the sequencers and then run the job; you could schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file, and once the file has arrived, start the job or sequencer depending on the file. What are the command-line functions that import and export DS jobs? A. dsimport.exe imports DataStage components. B. dsexport.exe exports DataStage components. Dimensional modelling is again subdivided into 2 types: Added: 7/26/2006 Added: 7/26/2006
a) Star schema - simple and much faster; denormalized form. b) Snowflake schema - complex, with more granularity; a more normalized form. What are Sequencers? Added: 7/26/2006
Answer:
A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire, and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE. What are the repository tables in DataStage, and what are they? Added: 7/26/2006
Question: Answer:
A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data; examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository; it insulates the data warehouse from changes in the schema of operational systems. In DataStage, on the I/O and Transfer interface tabs (Input, Output and Transfer pages) you will find four tabs, the last one being Build, under which you can find the table name. The DataStage client components are: Administrator - administers DataStage projects and conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled into executable programs; Director - used to run and monitor the DataStage jobs; Manager - allows you to view and edit the contents of the repository.

Question: What is the difference between an operational data store (ODS) and a data warehouse? Added: 7/26/2006
Answer:
A dataware house is a decision support database for organisational needs.It is subject oriented,non volatile,integrated ,time varient collect of data. ODS(Operational Data Source) is a integrated collection of related information . it contains maximum 90 days information.
Question:
Added: 7/26/2006
Answer: 1. If you want to know whether a job is part of a sequence, then in the Manager right-click the job and select Usage Analysis; it will show all the job's dependents. 2. The same approach shows how many jobs are using a particular table, and 3. how many jobs are using a particular routine. In this way you can find all the dependents of a particular object; the analysis is nested, so you can move forward and backward and see all the dependents. Question: How do you pass a filename as a parameter for a job? Added: 7/26/2006
Answer: 1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here you can see a grid where you can enter your parameter name and the corresponding path of the file. 2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you entered above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job, and keep the project default in the text box. Question: How do you remove duplicates in a server job? Added: 7/26/2006
Answer:
1) Use a hashed file stage, or 2) if you use the sort command in UNIX (in a before-job subroutine) you can reject duplicated records using the -u option, or 3) use a Sort stage.
Question: What are the difficulties faced in using DataStage, or what are the constraints in using DataStage? Added: 7/26/2006
Answer: 1) when the number of lookups is very high; 2) when a job aborts part-way through loading the data for some regions and has to be recovered.
Question: Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in the Enterprise Edition only? Added: 7/26/2006
Answer:
DataStage Standard Edition was previously called DataStage and DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. DataStage Enterprise: Server jobs, sequence jobs, parallel jobs. The enterprise edition offers parallel processing features for scalable high volume solutions. Designed
originally for Unix, it now supports Windows, Linux and Unix System Services on mainframes. DataStage Enterprise MVS: Server jobs, sequence jobs, parallel jobs, mvs jobs. MVS jobs are jobs designed using an alternative set of stages that are generated into cobol/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server transferred to the mainframe to be compiled and run. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages, MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation) but they tend to have different fields and options within that stage. Question: Answer: Question: What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director? Added: 7/26/2006
"AUTOSYS": Thru autosys u can automate the job by invoking the shell script written to schedule the datastage jobs. It is possible to call one job in another job in server jobs? Added: 7/26/2006
Answer: Yes, we can call one job from another job. In fact "calling" is not quite the right word, because you attach/add the other job through the job properties, and you can attach zero or more jobs. The steps are: Edit --> Job Properties --> Job Control, then click on Add Job and select the desired job. Question: How do you clean the DataStage repository? Added: 8/18/2006
Answer:
Universe Commands from DS Administrator Here is the process:1 Telnet onto the Datastage server E.g. > telnet pdccal05 2 Logon with the datastage username and password as if you were logging onto DS Director or an Administrator of the Server. 3. when prompted to the account name choose the project name or hit enter Welcome to the DataStage Telnet Server. Enter user name: dstage Enter password: ******* Account name or path(live): live DataStage Command Language 7.5 Copyright (c) 1997 - 2004 Ascential Software Corporation. All Rights Reserved live logged on: Monday, June 12, 2006 12:48 Type >"DS.TOOLS" You then get DataStage Tools Menu 1. Report on project licenses 2. Rebuild Repository indices 3. Set up server-side tracing >> 4. Administer processes/locks >> 5. Adjust job tunable properties Which would you like? ( 1 - 5 ) ? "select # 4" You then get DataStage Process and Lock Administration 1. List all processes
2. List state of a process 3. List state of all processes in a job Stopping and Restarting the Server Engine From time to time you may need to stop or restart the DataStage server engine manually, for example, when you wish to shut down the physical server. A script called uv is provided for these purposes. To stop the server engine, use: # dshome/bin/uv -admin -stop This shuts down the server engine and frees any resources held by server engine processes. To restart the server engine, use: # dshome/bin/uv -admin -start This ensures that all the server engine processes are started correctly. You should leave some time between stopping and restarting. A minimum of 30 seconds is recommended.
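Since the note above recommends leaving at least 30 seconds between stopping and restarting the engine, the two commands can be wrapped in a small script such as the following sketch (using the same dshome placeholder as above):
# stop the server engine, wait, then start it again
dshome/bin/uv -admin -stop
sleep 30
dshome/bin/uv -admin -start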
Data Modelling is broadly classified into 2 types: a) E-R Diagrams (Entity-Relationship) and b) Dimensional Modelling. 2. What is the flow of loading data into fact & dimensional tables?
Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; it consists of fields with numeric measure values. Dimension table - a table with a unique primary key. Load the dimension tables first and then the fact tables. 3. Orchestrate Vs Datastage Parallel Extender? Orchestrate itself is an ETL tool with extensive parallel processing capabilities running on the UNIX
platform. Datastage used Orchestrate with Datastage XE (beta version of 6.0) to incorporate the parallel processing capabilities that later became the Parallel Extender.
4. Differentiate Primary Key and Partition Key? A Primary Key is a combination of unique and not null; it can be a collection of key values, called a composite primary key. A Partition Key is just a part of the Primary Key, and there are several methods of partitioning. 5. How do you execute a datastage job from the command line prompt? Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname 6. What are Stage Variables, Derivations and Constants?
Stage Variable - an intermediate processing variable that retains its value during the read and does not pass the value into a target column. Derivation - an expression that specifies the value to be passed on to the target column. Constant - a condition that is either true or false and that controls the flow of data down a link. 7. What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there. 8. Containers: Usage and Types?
A container is a collection of stages used for the purpose of reusability. There are 2 types of containers: a) Local Container, which is job specific, and b) Shared Container, which can be used in any job within a project. 9. Compare and Contrast ODBC and Plug-In stages?
ODBC: a) poorer performance, b) can be used for a variety of databases, c) can handle stored procedures. Plug-In: a) good performance, b) database specific (only one database), c) cannot handle stored procedures. 10. How to run a Shell Script within the scope of a DataStage job? By using the "ExecSH" command in the Before/After job properties. 11. Types of Parallel Processing? Parallel processing is broadly classified into 2 types: a) SMP - Symmetric Multi Processing, and b) MPP - Massively Parallel Processing.
12. What does a Config File in parallel extender consist of? Config file consists of the following. a) Number of Processes or Nodes. b) Actual Disk Storage Location. 13. Functionality of Link Partitioner and Link Collector?
Link Partitioner: it splits data into various partitions or data flows using various partition methods. Link Collector: it collects the data coming from the partitions and merges it into a single partition. 14. What is Modulus and Splitting in a Dynamic Hashed File?
In a dynamic hashed file the size of the file keeps changing as data is added and removed. The modulus is the current number of groups in the file; when the file grows, groups are split and the modulus increases ("splitting"), and when the file shrinks, groups are merged back. 15. Types of views in Datastage Director?
There are 3 types of views in Datastage Director: a) Job View - dates of jobs compiled, b) Log View - status of the job's last run, c) Status View - warning messages, event messages and program-generated messages. 16. Differentiate Database data and Data warehouse data? Data in a database is a) detailed or transactional, b) both readable and writable, c) current; data in a data warehouse is summarized and historical, and is mostly read-only. 37. What are Static Hash files and Dynamic Hash files?
As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size limit of 2 GB and the overflow file is used if the data exceeds that size. 62. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB, or does it do some kind of delete logic?
There is no TRUNCATE on ODBC stages. 'Clear the table then insert rows' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options, and they are radically different. 63. How do you rename all of the jobs to support your new file-naming conventions? Create an Excel
spreadsheet with the new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings by looking up the Excel file. Then import the new dsx file.
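As a command-line alternative to the Perl program suggested above, the renaming can also be sketched with sed; the export file name and the old/new job names here are hypothetical placeholders.
# keep a backup of the export, then substitute the job-name strings in the .dsx
cp MyProject.dsx MyProject_backup.dsx
sed -e 's/OldJobName1/NewJobName1/g' -e 's/OldJobName2/NewJobName2/g' MyProject_backup.dsx > MyProject_renamed.dsx
# import MyProject_renamed.dsx back into the project through the Manager (or the import command line)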
How do you implement routines in DataStage? There are 3 kinds of routines in DataStage: 1. server routines, which are used in server jobs and are written in the BASIC language; 2. parallel routines, which are used in parallel jobs
and are written in C/C++; 3. mainframe routines, which are used in mainframe jobs.
SQL Interview Questions
Question : How to find the second maximum value from a table?
Answer : select max(field1) from tname1 where field1 = (select max(field1) from tname1 where field1 < (select max(field1) from tname1)); here field1 is the salary field and tname1 is the table name. (Equivalently: select max(field1) from tname1 where field1 < (select max(field1) from tname1);)
Question : How to retrieve the data from the 11th row to the nth row of a table?
Answer : select * from emp where rowid in (select rowid from emp where rownum <= &upto minus select rowid from emp where rownum < &startfrom); from this you can select between any range.
Question : I have a table with duplicate names in it. Write me a query which returns only the duplicated values and the number of times they are repeated.
Answer : SELECT COL1, COUNT(COL1) FROM TAB1 GROUP BY COL1 HAVING COUNT(COL1) > 1;
Question : What is the difference between a stored procedure and a trigger?
Answer : Information related to stored procedures can be seen in the USER_SOURCE and USER_OBJECTS (current user) tables; information related to triggers is stored in the USER_SOURCE and USER_TRIGGERS (current user) tables. A stored procedure cannot be made inactive, but a trigger can be.
Question : When using a count(distinct), is it better to use a self-join or a temp table to find redundant data? Provide an example.
Answer : Instead of either, we can use a GROUP BY clause with a HAVING condition. For example: select count(*), lastname from tblUsers group by lastname having count(*) > 1; this query returns the duplicated lastname values in the lastname column of the tblUsers table.
Question : What is the advantage of using a trigger in your PL/SQL?
Answer : Triggers are fired implicitly on the tables/views on which they are created. There are various advantages of using a trigger. Some of them are: - Suppose we need to validate a DML statement (insert/update/delete) that modifies a table; then we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed
on that table. - Another reason for using triggers can be automatic updating of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created. - Triggers can be used to enforce constraints, for example that insert/update/delete statements should not be allowed on a particular table after office hours. - Triggers can be used to publish information about database events to subscribers; a database event can be a system event like database startup or shutdown, or a user event like a user logging in or logging off.
Question : What is the difference between UNION and UNION ALL?
Answer : UNION removes the duplicate rows from the result set while UNION ALL does not.
Question : How to display duplicate rows in a table?
Answer : select empid, count(*) from emp group by empid having count(empid) > 1
Question : What is the difference between the TRUNCATE and DELETE commands?
Answer : Both result in deleting all the rows in the table. TRUNCATE cannot be rolled back as it is a DDL command, all memory space for that table is released back to the server, and it is much faster; DELETE is a DML command and can be rolled back.
Question : Which system table contains information on constraints on all the tables created?
Answer : USER_CONSTRAINTS is the system table that contains information on constraints on all the tables created.
Question : What is the difference between a single-row subquery and a scalar subquery?
Answer : A single-row subquery returns a value which is used by the WHERE clause, whereas a scalar subquery is a SELECT statement used in the column list and can be thought of as an inline function in the SELECT column list.
Question : How to copy an SQL table?
Answer : COPY FROM database TO database action destination_table (column_name, column_name...) USING query, e.g. copy from scott/tiger@ORCL92 to scott/tiger@ORCL92 create new_emp using select * from emp;
Question : What is table space
Answer :
Table-space is a physical concept.it has pages where the records of the database is stored with a logical perception of tables.so tablespace contains tables. What is difference between Co-related sub query and nested sub query?? Correlated subquery runs once for each row selected by the outer query. It contains a reference to a value from the row selected by the outer query. Nested subquery runs only once for the entire nesting (outer) query. It does not contain any reference to the outer query row. For example, Correlated Subquery: select e1.empname, e1.basicsal, e1.deptno from emp e1 where e1.basicsal = (select max(basicsal) from emp e2 where e2.deptno = e1.deptno) Nested Subquery: select empname, basicsal, deptno from emp where (deptno, basicsal) in (select deptno, max(basicsal) from emp group by deptno)
Answer :
Question :
There are an eno and a gender column in a table. Eno is the primary key and gender has a check constraint allowing the values 'M' and 'F'. While inserting the data into the table, M was misspelled as F and F as M. What is the update statement to replace F with M and M with F? update <TableName> set gender = decode(gender, 'F', 'M', 'M', 'F'); (or, in standard SQL, set gender = case gender when 'F' then 'M' when 'M' then 'F' end).
Answer :
Question :
When we give SELECT * FROM EMP; how does Oracle respond?
Answer : Question :
The server checks all the data in the EMP table and displays the data of the EMP table. Question : What is a reference cursor? Answer : A reference cursor is a dynamic cursor that is associated with a SQL statement at run time, for example: for select * from emp;
Answer : Question : WHAT OPERATOR PERFORMS PATTERN MATCHING? Pattern matching operator is LIKE and it has to used with two attributes Answer : 1. % and 2. _ ( underscore ) % means matches zero or more characters and under score means matching exactly one character Question : There are 2 tables, Employee and Department. There are few records in employee table, for which, the department is not assigned. The output of the query should contain all th employees names and their corresponding departments, if the department is assigned otherwise employee names and null value in the place department name. What is the query?
Answer :
What you want to use here is called a left outer join, with the Employee table on the left side. A left outer join, as the name says, picks up all the records from the left table and, based on the join column, picks the matching records from the right table; when there are no matching records in the right table, it shows null for the selected columns of the right table. The ANSI form uses the keyword LEFT OUTER JOIN, as in DB2/UDB, though the syntax varies across databases: in Oracle the legacy connector is Employee_table.Dept_id = Dept_table.Dept_id(+), while in SQL Server/Sybase the legacy form is Employee_table.Dept_id *= Dept_table.Dept_id.
Question :
What is normalization, and what are its types, with examples and queries? There are 5 normal forms. It is generally enough for a database to be in the third normal form to maintain referential integrity and non-redundancy. First Normal Form: every field of a table (row, column) must contain an atomic value. Second Normal Form: every non-key column must depend on the whole primary key (no partial dependency on part of a composite key). Third Normal Form: non-key columns must depend only on the key, not on other non-key columns (no transitive dependencies). Fourth Normal Form: a table must not contain two or more independent multi-valued facts; this normal form is often avoided for maintenance reasons. Fifth Normal Form: is about symmetric (join) dependencies. Each normal form assumes that the table is already in the earlier normal form.
Answer :
Question :
What is the difference between an equijoin and a union? Indeed, an equijoin and a UNION are very different. An equijoin is used to establish a condition between two tables to select data from them, e.g. select a.employeeid, a.employeename, b.dept_name from employeemaster a, DepartmentMaster b where a.employeeid = b.employeeid; this is an example of an equijoin. A UNION, on the other hand, allows you to select similar data based on different conditions, e.g. select a.employeeid, a.employeename from employeemaster a where a.employeeid > 100 union select a.employeeid, a.employeename from employeemaster a where a.employeename like 'B%'; here we select the employee name and id for two different conditions into the same result set.
Answer :
Question :
What is a database? A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated. Difference between VARCHAR and VARCHAR2?
Answer : Question :
Varchar means fixed length character data(size) ie., min size-1 and max-2000 Answer : Question : Varchar2 means variable length character data ie., min-1 to max-4000 How to write a sql statement to find the first occurrence of a non zero value? There is a slight chance the column "a" has a value of 0 which is not null. In that case, You'll loose the information. There is another way of searching the first not null value of a column: select column_name from table_name where column_name is not null and rownum<2; Question : How can i hide a particular table name of our schema. you can hide the table name by creating synonyms. Answer : e.g) you can create a synonym y for table x create synonym y for x; Question : What is the main difference between the IN and EXISTS clause in subqueries?? The main difference between the IN and EXISTS predicate in subquery is the way in which the query gets executed. IN -- The inner query is executed first and the list of values obtained as its result is used by the outer query.The inner query is executed for only once. EXISTS -- The first row from the outer query is selected ,then the inner query is executed and , the outer query output uses this result for checking.This process of inner query execution repeats as many no.of times as there are outer query rows. That is, if there are ten rows that can result from outer query, the inner query is executed that many no.of times. Question : In subqueries, which is efficient ,the IN clause or EXISTS clause? Does they produce the same result????? EXISTS is efficient because, Answer : 1.Exists is faster than IN clause. 2.IN check returns values to main query where as EXISTS returns Boolean (T or F). Question : What is difference between DBMS and RDBMS? 1.RDBMS=DBMS+Refrential Integrity Answer : Question : 2. An RDBMS ia one that follows 12 rules of CODD. What is Materialized View? A materialized view is a database object that contains the results of a query. They are local copies of data located remotely or used to create summary tables based on aggregation of a tables data. Materialized views which store data based on the remote tables are also know as snapshots. How to find out the 10th highest salary in SQL query? Table - Tbl_Test_Salary Column - int_salary
Answer :
Answer :
Answer :
Question :
Answer :
select max(int_salary) from Tbl_Test_Salary where int_salary in (select top 10 int_Salary from Tbl_Test_Salary order by int_salary) Question : How to find second maximum value from a table? select max(field1) from tname1 where field1=(select max(field1) from tname1 where field1<(select max(field1) from tname1); Field1- Salary field Tname= Table name. Question : What are the advantages and disadvantages of primary key and foreign key in SQL? Primary key Answer : Advantages 1) It is a unique key on which all the other candidate keys are functionally dependent Disadvantage 1) There can be more than one keys on which all the other attributes are dependent on. Foreign Key Advantage 1)It allows referencing another table using the primary key for the other table Question : What operator tests column for the absence of data? IS NULL operator. Answer : Question : What is the parameter substitution symbol used with INSERT INTO command? & Answer : Question : Which command displays the SQL command in the SQL buffer, and then executes it? RUN. Answer : Question : What are the wildcards used for pattern matching. _ for single character substitution and % for multi-character substitution. Answer : Question : What are the privileges that can be granted on a table by a user to others? Insert, update, delete, select, references, index, execute, alter, all. Answer : Question : What command is used to get back the privileges offered by the GRANT command? REVOKE. Answer : Question : Which system tables contain information on privileges granted and privileges obtained? USER_TAB_PRIVS_MADE, USER_TAB_PRIVS_RECD.
Answer :
Answer : Question : What command is used to create a table by copying the structure of another table? CREATE TABLE .. AS SELECT command Explanation: To copy only the structure, the WHERE clause of the SELECT command should contain a FALSE statement as in the following. CREATE TABLE NEWTABLE AS SELECT * FROM EXISTINGTABLE WHERE 1=2; If the WHERE condition is true, then all the rows or rows satisfying the condition will be copied to the new table. Which date function is used to find the difference between two dates? MONTHS_BETWEEN. Answer : Question : Why does the following command give a compilation error? DROP TABLE &TABLE_NAME; Variable names should start with an alphabet. Here the table name starts with an '&' symbol. Answer : Question : What is the advantage of specifying WITH GRANT OPTION in the GRANT command? The privilege receiver can further grant the privileges he/she has obtained from the owner to any other user. What is the use of the DROP option in the ALTER TABLE command? It is used to drop constraints specified on the table. Answer : Question : What is the use of DESC in SQL? DESC has two purposes. It is used to describe a schema as well as to retrieve rows from table in descending order. Explanation : The query SELECT * FROM EMP ORDER BY ENAME DESC will display the output sorted on ENAME in descending order. What is the use of CASCADE CONSTRAINTS? When this clause is used with the DROP command, a parent table can be dropped even when a child table exists. Which function is used to find the largest integer less than or equal to a specific value? FLOOR. Answer : Question : How we can count duplicate entry in particular table against Primary Key ? What are constraints? The syntax in the previous answer (where count(*) > 1) is very questionable. suppose you think that you have duplicate employee numbers. there's no need to count them to find out which values were duplicate but the followin SQL will show only the empnos that are duplicate and how many exist in the table: Select empno, count(*) from employee group by empno having count(*) > 1
Answer :
Question :
Answer : Question :
Answer :
Question :
Answer : Question :
Answer :
Generally speaking aggregate functions (count, sum, avg etc.) go in the HAVING clause. I know some systems allow them in the WHERE clause but you must be very careful in interpreting the result. WHERE COUNT(*) > 1 will absolutely NOT work in DB2 or ORACLE. Sybase and SQLServer is a different animal. Question : How to display nth highest record in a table for example How to display 4th highest (salary) record from customer table. Query: SELECT sal FROM `emp` order by sal desc limit (n-1),1If the question: "how to display 4th highest (salary) record from customer table."The query will SELECT sal FROM `emp` order by sal desc limit 3,1
Answer :
SELECT table_name, comments FROM dictionary WHERE table_name LIKE 'user_%' ORDER BY table_name;
View and description:
ALL_CATALOG - contains every table, view and synonym.
ALL_OBJECT_TABLES - all object-oriented tables.
ALL_TAB_COMMENTS - comments for every table, which are usually a short description of the table.
ALL_TAB_GRANTS - all owner privileges, including PUBLIC ('everyone').
ALL_TYPES - all object types.
ALL_TYPES_METHODS - all object types that are methods.
ALL_USERS - names and create dates for all users.
DATA_DICT ('DICT') - very important: if you know nothing, you can start with 'SELECT * FROM DATA_DICT'; it is a top-level record of everything in the data dictionary.
DBA_FREE_SPACE - remaining free space in each tablespace; a DBA typically spends a lot of time increasing free space as databases have to hold more data than originally planned.
USER_CATALOG - all tables, views and synonyms that the currently logged-in user can see.
USER_INDEXES - details of user-specific indexes.
USER_TABLES - details of user-specific tables, with some statistics such as record counts.
USER_TAB_COLUMNS - details of columns within each user table, which is very useful when you want to find the structure of your tables.
USER_TAB_GRANTS - details of user-specific privileges for table access.
USER_VIEWS - details of user-specific views, and the SQL for each view (which is essential).
Q. How do you rename a file in the after-job routine with today's date? A. In the after-job subroutine select ExecSH and write the following command: mv <oldfilename.csv> <newfilename_`date '+%d_%m_%Y_%H%M%S'`.csv> Q. Explain the difference between a server transformer and a parallel transformer. A. The main differences are: the server transformer supports basic transforms only, whereas the parallel transformer supports both basic and parallel transforms; the server transformer is BASIC-language compatible, while the parallel transformer is C++-language compatible; the server transformer accepts multiple input links, the parallel transformer a single input link; and server transformers accept routines written in BASIC, parallel transformers routines written in C/C++.
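Expanding the rename answer above into a concrete command that could be invoked from ExecSH in the after-job subroutine; the directory and file name are hypothetical, only the date format string comes from the answer itself.
# rename the job's output file with a timestamp suffix (example path only)
mv /data/out/daily_extract.csv /data/out/daily_extract_`date '+%d_%m_%Y_%H%M%S'`.csv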
Q.How to see the data in the Dataset in UNIX. What command we have to use to see the data in Dataset in UNIX?
Ans : orchadmin dump <datasetname>.ds What is the Diff between Change Capture and Change Apply Stages? A. the 2 stages compares two data set(after and before) and makes a record of the differences. change apply stage combine the changes from the change capture stage with the original before data set to reproduce the after data set
Orchadmin is a command line utility provided by datastage to research on data sets. The general callable format is : $orchadmin <command> [options] [descriptor file] 1. Before using orchadmin, you should make sure that either the working directory or the $APT_ORCHHOME/etc contains the file config.apt OR The environment variable $APT_CONFIG_FILE should be defined for your session. Orchadmin commands The various commands available with orchadmin are 1. CHECK: $orchadmin check Validates the configuration file contents like , accesibility of all nodes defined in the configuration file, scratch disk definitions and accesibility of all the nodes etc. Throws an error when config file is not found or not defined properly 2. COPY : $orchadmin copy <source.ds> <destination.ds> Makes a complete copy of the datasets of source with new destination descriptor file name. Please note that a. You cannot use UNIX cp command as it justs copies the config file to a new name. The data is not copied. b. The new datasets will be arranged in the form of the config file that is in use but not according to the old confing file that was in use with the source. 3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles. The unix rm utility cannot be used to delete the datasets. The orchadmin delete or rm command should be used to delete one or more persistent data sets. -f options makes a force delete. If some nodes are not accesible then -f forces to delete the dataset partitions from accessible nodes and leave the other partitions in inaccesible nodes as orphans. -x forces to use the current config file to be used while deleting than the one stored in data set. 4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds This is the single most important command. 1. Without any option lists the no.of.partitions, no.of.segments, valid segments, and preserve partitioning flag details of the persistent dataset. -c : Print the configuration file that is written in the dataset if any
-p: Lists down the partition level information. -f: Lists down the file level information in each partition -e: List down the segment level information . -s: List down the meta-data schema of the information. -v: Lists all segemnts , valid or otherwise -l : Long listing. Equivalent to -f -p -s -v -e 5. DUMP: $orchadmin dump [options] descriptorfile.ds The dump command is used to dump(extract) the records from the dataset. Without any options the dump command lists down all the records starting from first record from first partition till last record in last partition. -delim <string> : Uses the given string as delimtor for fields instead of space. -field <name> : Lists only the given field instead of all fields. -name : List all the values preceded by field name and a colon -n numrecs : List only the given number of records per partition. -p period(N) : Lists every Nth record from each partition starting from first record. -skip N: Skip the first N records from each partition. -x : Use the current system configuration file rather than the one stored in dataset. 6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds Without options deletes all the data(ie Segments) from the dataset. -f: Uses force truncate. Truncate accessible segments and leave the inaccesible ones. -x: Uses current system config file rather than the default one stored in the dataset. -n N: Leaves the first N segments in each partition and truncates the remaining. 7. HELP: $orchadmin -help OR $orchadmin <command> -help Help manual about the usage of orchadmin or orchadmin commands. Datastage: Job design tips
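A short worked session using the orchadmin commands and options described above; the dataset name and the configuration file path are hypothetical, while the Datasets directory follows the configuration-file example shown later in this document.
# make sure a configuration file is defined for the session
export APT_CONFIG_FILE=/local/datastage/config/default.apt
# list partition, file and schema details of a persistent dataset
orchadmin describe -p -f -s /local/datastage/Datasets/customer.ds
# dump the first 10 records of each partition, pipe-delimited
orchadmin dump -delim '|' -n 10 /local/datastage/Datasets/customer.ds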
I am just collecting the general design tips that helps the developers to build clean & effective jobs. 1. Turn off Runtime Column propagation wherever its not required.
2.Make use of Modify, Filter, and Aggregation, Col. Generator etc stages instead of Transformer stage only if the anticipated volumes are high and performance becomes a problem. Otherwise use Transformer. Its very easy to code a transformer than a modify stage. 3. Avoid propagation of unnecessary metadata between the stages. Use Modify stage and drop the metadata. Modify stage will drop the metadata only when explicitey specified using DROP clause. 4. One of the most important mistake that developers often make is not to have a volumetric analyses done before you decide to use Join or Lookup or Merge stages. Estimate the volumes and then decide which stage to go for. 5.Add reject files wherever you need reprocessing of rejected records or you think considerable data loss may happen. Try to keep reject file at least at Sequential file stages and writing to Database stages. 6.Make use of Order By clause when a DB stage is being used in join. The intention is to make use of Database power for sorting instead of datastage reources. Keep the join partitioning as Auto. Indicate dont sort option between DB stage and join stage using sort stage when using order by clause. 7. While doing Outer joins, you can make use of Dummy variables for just Null checking instead of fetching an explicit column from table. 8. Use Sort stages instead of Remove duplicate stages. Sort stage has got more grouping options and sort indicator options. 9. One of the most frequent mistakes that developers face is lookup failures by not taking care of String padchar that datastage appends when converting strings of lower precision to higher precision.Try to decide on the APT_STRING_PADCHAR, APT_CONFIG_FILE parameters from the beginning. Ideally APT_STRING_PADCHAR should be set to OxOO (C/C++ end of string) and Configuration file to the maximum number of nodes available. 10. Data Partitioning is very important part of Parallel job design. Its always advisable to have the data partitioning as Auto unless you are comfortable with partitioning, since all DataStage stages are designed to perform in the required way with Auto partitioning. 11.Do remember that Modify drops the Metadata only when it is explicitly asked to do so using KEEP/DROP clauses. which partitioning follows in join,merge and lookup? Join Stage follows Modulus partitioning method.Merge follows same partitioning method as well as Auto partitioning method.Lookup follows Entire partitioning method. These functions can be used in a job control routine, which is defined as part of a jobs properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines. To do this ... Use this function ... 
Specify the job you want to control: DSAttachJob
Set parameters for the job you want to control: DSSetParam
Set limits for the job you want to control: DSSetJobLimit
Request that a job is run: DSRunJob
Wait for a called job to finish: DSWaitForJob
Get the meta data details for the specified link: DSGetLinkMetaData
Get information about the current project: DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage: DSGetIPCStageProps
Get information about the controlled job or current job: DSGetJobInfo
Get information about the metabag properties associated with the named job: DSGetJobMetaBag
Get information about a stage in the controlled job or current job: DSGetStageInfo
Get the names of the links attached to the specified stage: DSGetStageLinks
Get a list of stages of a particular type in a job: DSGetStagesOfType
Get information about the types of stage in a job: DSGetStageTypes
Get information about a link in a controlled job or current job: DSGetLinkInfo
Get information about a controlled job's parameters: DSGetParamInfo
Get the log event from the job log: DSGetLogEntry
Get a number of log events on the specified subject from the job log: DSGetLogSummary
Get the newest log event, of a specified type, from the job log: DSGetNewestLogId
Log an event to the job log of a different job: DSLogEvent
Stop a controlled job: DSStopJob
Return a job handle previously obtained from DSAttachJob: DSDetachJob
Log a fatal error message in a job's log file and abort the job: DSLogFatal
Log an information message in a job's log file: DSLogInfo
Put an info message in the job log of a job controlling the current job: DSLogToController
Log a warning message in a job's log file: DSLogWarn
Generate a string describing the complete status of a valid attached job: DSMakeJobReport
Insert arguments into the message template: DSMakeMsg
Ensure a job is in the correct state to be run or validated: DSPrepareJob
Interface to the system send mail facility: DSSendMail
Log a warning message to a job log file: DSTransformError
Convert a job control status or error code into an explanatory text message: DSTranslateCode
Suspend a job until a named file either exists or does not exist: DSWaitForFile
Check if a BASIC routine is cataloged, either in VOC as a callable item or in the catalog space: DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine: DSExecute
Set a status message for a job to return as a termination message when it finishes: DSSetUserStatus
A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and the links and stages belonging to the current job. These can be used in expressions (for example in Transformer stages), job control routines, filenames and table names, and before/after subroutines. The available macros are: DSHostName, DSProjectName, DSJobStatus, DSJobName, DSJobController, DSJobStartDate, DSJobStartTime, DSJobStartTimestamp, DSJobWaveNo, DSJobInvocations, DSJobInvocationId, DSStageName, DSStageLastErr, DSStageType, DSStageInRowNum, DSStageVarList, DSLinkRowCount, DSLinkLastErr, DSLinkName. These macros provide the functionality of using the DSGetProjectInfo, DSGetJobInfo, DSGetStageInfo and DSGetLinkInfo functions with the DSJ.ME token as the JobHandle, and can be used in all active stages and before/after subroutines. The macros provide the functionality for all the possible InfoType arguments of the DSGetInfo functions. See the Function call help topics for more details.
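Much of the same run-time information can also be pulled from the command line with the dsjob client; a hedged sketch follows, where the project, job, stage and link names are hypothetical and the exact option set should be checked against the dsjob usage on your own installation.
# overall information about a job (status, start time, wave number)
dsjob -jobinfo MyProject MyLoadJob
# row count and status for one link of one stage
dsjob -linkinfo MyProject MyLoadJob Transformer_1 DSLink3
# summary of the log entries for the job
dsjob -logsum MyProject MyLoadJob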
Datastage Parallel jobs Vs Datastage Server jobs -------------------------------------------------------------------------------1) The basic difference between server and parallel jobs is the degree of parallelism Server job stages do not have in built partitoning and parallelism mechanism for extracting and loading data between different stages. All you can do to enhance the speed and performance in server jobs is to enable inter process row buffering through the administrator. This helps stages to exchange data as soon as it is available in the link. You could use IPC stage too which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. Link practitioner and link collector stages can be used to achieve a certain degree of partitioning parallelism. All of the above features which have to be explored in server jobs are built in Datastage Px. 2) The Px engine runs on a multiprocessor system and takes full advantage of the processing nodes defined in the configuration file. Both SMP and MMP architecture is supported by datastage Px. 3) Px takes advantage of both pipeline parallelism and partitoning paralellism. Pipeline parallelism means that as soon as data is available between stages( in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. Partitioning parallelism means that entire record set is partitioned into small sets and processed on different nodes(logical processors). For example if there are 100 records, then if there are 4 logical nodes then each node would process 25 records each. This enhances the speed at which loading takes place to an amazing degree. Imagine situations where billions of records have to be loaded daily. This is where datastage PX comes as a boon for ETL process and surpasses all other ETL tools in the market. 4) In parallel we have Dataset which acts as the intermediate data storage in the linked list, it is the best storage option it stores the data in datastage internal format. 5) In parallel we can choose to display OSH , which gives information about the how job works. 6) In Parallel Transformer there is no reference link possibility, in server stage reference could be given to transformer. Parallel stage can use both basic and parallel oriented functions. 7) Datastage server executed by datastage server environment but parallel executed under control of datastage runtime environment 8) Datastage compiled in to BASIC(interpreted pseudo code) and Parallel compiled to OSH(Orchestrate Scripting Language). 9) Debugging and Testing Stages are available only in the Parallel Extender. 10) More Processing stages are not included in Server example, Join, CDC, Lookup etc.. 11) In File stages, Hash file available only in Server and Complex flat file , dataset , lookup file set avail in parallel only. 12) Server Transformer supports basic transforms only, but in parallel both basic and parallel transforms. 13) Server transformer is basic language compatability, pararllel transformer is c++ language compatabillity 14) Look up of sequntial file is possible in parallel jobs 15) . In parallel we can specify more file paths to fetch data from using file pattern similar to Folder stage in Server, while in server we can specify one file name in one O/P link. 16). We can simulteneously give input as well as output link to a seq. file stage in Server. 
But an output link in parallel means a reject link, that is a link that collects records that fail to load into the sequential file for some reasons. 17). The difference is file size Restriction. Sequential file size in server is : 2GB Sequential file size in parallel is : No Limitation.. 18). Parallel sequential file has filter options too. Where you can specify the file pattern.
Datastage: Join or Lookup or Merge or CDC Many times this question pops up in the minds of DataStage developers. All of the above stages can be used to do the same task: match one set of data (say the primary input) with another set of data (the reference) and see the results. DataStage normally uses different execution plans (I should ignore my Oracle legacy when posting on DataStage), but since DataStage is not as nice as Oracle in showing its execution plan easily, we need to fill in the gap of the optimiser and analyse our requirements ourselves. I have come up with a table of volume combinations; most importantly it is the primary/reference ratio that needs to be considered, not the actual counts.
Primary Source Volume: Little (< 5 million) - Reference Volume: Very Huge (> 50 million)
Primary Source Volume: Little (< 5 million) - Reference Volume: Little (< 5 million)
Primary Source Volume: Huge (> 10 million) - Reference Volume: Little (< 5 million)
Primary Source Volume: Little (< 5 million) - Reference Volume: Huge (> 10 million)
Primary Source Volume: Huge (> 10 million) - Reference Volume: Huge (> 10 million)
Primary Source Volume: Huge (> 10 million) - Reference Volume: Huge (> 10 million)
Datastage: Warning removals Jump to Comments Here I am collecting most of the warnings developers encounter when coding datastage jobs and trying to resolve them. 1. Warning: Possible truncation of input string when converting from a higher length string to lower length string in Modify. Resolution: In the Modify stage explicitly specify the length of output column. Ex: CARD_ACCOUNT:string[max=16] = string_trim[" ", end, begin](CARD_ACCOUNT) instead of just CARD_ACCOUNT = string_trim[" ", end, begin](CARD_ACCOUNT); 2. Warning: A Null result can come up in Non Nullable field. Mostly thrown by DataStage when aggregate functions are used in Oracle DB stage. Resolution: Use a Modify or Transformer stage in between lookup and Oracle stage. When using a modify stage, use the handle_null clause. EX: CARD_ACCOUNT:string[max=19] = handle_null(CARD_ACCOUNT,-128); -128 will be replaced in CARD_ACCOUNT wherever CARD_ACCOUNT is Null. 3. Warning: Some Decimal precision error converting from decimal [p,s] to decimal[x,y]. Resolution: Specify the exact scale and precision of the output column in the Modfiy stage specification and use trunc_zero (the default r_type with decimal_from_decimal conversion)
Ex: CREDIT_LIMIT:decimal[10,2] = decimal_from_decimal[trunc_zero](CREDIT_LIMIT);instead of just CREDIT_LIMIT = decimal_from_decimal[trunc_zero](CREDIT_LIMIT); For further information on where to specify the output column type explicitly and where not necessary, refer to the data type default/manual conversion guide For default data type conversion (d) size specification is not required. For manual conversion (m) explicit size specification is required. The table is available in parallel job developers guide 4. Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0 Resolution: Clear the preserve partition flag before Sequential file stages. 5. Warning: A user defined sort operator does not satisfy the requirements. Resolution: In the job flow just notice the columns on which sort is happening . The order of the columns also must be the same. i.e if you specify sort on columns in the order X,Y in sort stage and specify join on the columns Y,X in order then join stage will throw the warning, since it cannot recognise the change in order. Also , i guess DataStage throws this warning at compile time . So if you rename a column in between stages, then also this warning is thrown. Say i have sorted on Column X in sort stage, but the column name is changed to Y at the output interface, then also the warning is thrown. Just revent the output interface column to X and the warning disappears. What is Bit Mapped Index Bit Map Indexing is a technique commonly used in relational databases where the application uses binary coding in representing data. This technique was originally used for low cardinality data but recent applications like the Sybase IQ have used this technique efficiently what is the difference between datasatge and datastage TX? 1. IBM DATASTAGE TX is the Any to Any message Transformation tool. It accepts message of any format ( xml, fixed length ) and can convert it to any desired format. DataStage is an ETL tool Datastage TX is an EAI tool. Datastage used in DWH ,TX used in EDI(Entrprise Data Interchange). The application of both tools is vary from both. What does a Config File in parallel extender consist of? Config file consists of the following. a) Number of Processes or Nodes. b) Actual Disk Storage Location. node "node1" { fastname "stvsauxpac01" pools "" resource disk "/local/datastage/Datasets" {pools ""} resource scratchdisk "/local/datastage/Scratch" {pools ""} } node "node2" { fastname "stvsauxpac01" pools "" resource disk "/local/datastage/Datasets" {pools ""} resource scratchdisk "/local/datastage/Scratch" {pools ""} }} The APT_Configuration file is having the information of resource disk,node pool,and scratch information, node information in the sence it contatins the how many nodes we given to run the jobs, because based on the nodes only datastage will create processors at backend while running the jobs,resource disk means this is the place where exactly jobs will be loading,scrach information will be useful whenever we using the lookups in the jobs What is exact difference between Parallel Jobs and server Jobs..
Server jobs do not support pipelining and partitioning, and loading happens only when one job finishes; parallel jobs support pipelining and partitioning, so loads are synchronized. Server jobs use symmetric multiprocessing (SMP) only, while parallel jobs use both massively parallel processing (MPP) and symmetric multiprocessing (SMP).
what is data set? and what is file set? I assume you are referring Lookup fileset only.It is only used for lookup stages only. Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs. FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns. What is the max size of Data set stage? The Max size of a Dataset is equal to the summation of the space available in the Resource disk specified in the configuration file. What are the different types of data warehousing? There are four types: Native data warehouse Software data warehouse package data warehouse Data management fetching last row from a particular column of sequential file Develop a job source seq file--> Transformer--> output stage In the transformer write a stage variable as rowcount with the following derivation Goto DSfunctions click on DSGetLinkInfo.. you will get "DSGetLinkInfo(DSJ.ME,%Arg2%,%Arg3%,%Arg4%)" Arg 2 is your source stage name Arg 3 is your source link name Arg 4 --> Click DS Constant and select DSJ.LINKROWCOUNT. Now ur derivation is "DSGetLinkInfo(DSJ.ME,"source","link", DSJ.LINKROWCOUNT)" Create a constraint as @INROWNUM =rowcount and map the required column to output link.
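As an alternative to the DSGetLinkInfo derivation above, when the source is a plain sequential file the last row can also be captured at the UNIX level, for example from a before- or after-job ExecSH call; the file path below is hypothetical.
# write only the last record of the source file to a separate file
tail -1 /data/in/source_file.txt > /data/in/source_last_row.txt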
What is a BUS Schema? A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts. In a BUS schema we would eventually have conformed dimensions and facts defined to be shared across all enterprise data marts. This way all data marts can use the conformed dimensions and facts without having them locally. This is the first step towards building an enterprise data warehouse from Kimball's perspective. For example, we may have different data marts for Sales, Inventory and Marketing and we need common entities like Customer,
Product etc. to be seen across these data marts, and hence it would be ideal to have these as conformed objects. The challenge here is that sometimes each line of business may have different definitions for these conformed objects, and hence choosing conformed objects has to be designed with some extra care. What is a linked cube? A linked cube is one in which a sub-set of the data can be analysed in great detail. The linking ensures that the data in the cubes remains consistent. What is Data Modeling
Description: Data Modeling is a method used to define and analyze data requirements needed to support the business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and structures. What is Data Cluster
Description: Clustering in the computer science world is the classification of data or objects into different groups. It can also be referred to as partitioning of a data set into different subsets. Each data item in a subset ideally shares some common traits. Data clusters are created to meet specific requirements that cannot be created using any of the categorical levels. What is Data Aggregation
Description: In Data Aggregation, value is derived from the aggregation of two or more contributing data characteristics. Aggregation can be made from different data occurrences within the same data subject, business transactions and a de-normalized database and between the real world and detailed data resource design within the common data What is Data Dissemination
Description: The best example of dissemination is the ubiquitous internet. Every single second throughout the year, data gets disseminated to millions of users around the world. Data could sit on the millions of servers located in scattered geographical locations. Data dissemination on the internet is possible through many different kinds of communications protocols. What is Data Distribution
Description: Often, data warehouses are being managed by more than just one computer server. This is because the high volume of data cannot be handled by one computer alone. In the past, mainframes were used for processes involving big bulks of data. Mainframes were giant computers housed in big rooms to be used for critical applications involving lots of data What is Data Administration
Description: Data administration refers to the way in which data integrity is maintained within the data warehouse. Data warehouses are a very large repository of all sorts of data, and these data may be of different formats. To make these data useful to the company, the database running the data warehouse has to be configured so that it obeys the business What is Data Collection Frequency
Description: Data Collection Frequency, just as the name suggests, refers to the time frequency at which data is collected at regular intervals. This often refers to whatever time of the day or the year in any given length of period. In a data warehouse, the relational database management systems continually gather, extract, transform and load data onto the storage What is Data Duplication
Description: The definition of what constitutes a duplicate has somewhat different interpretations. For instance, some define a duplicate as having the exact syntactic terms and sequence, whether having formatting differences or not. In effect, there are either no differences or only formatting differences and the contents of the data are exactly the same. What is Data Integrity Rule
Description: In the past, Data Integrity was defined and enforced with data edits but this method did not cope up with the growth of technology and data value quality greatly suffered at the cost of business operations. Organizations were starting to realize that the rules were no longer appropriate for the business. The concept of business rules is already widely
Top ten features in DataStage Hawk(Datastage 8.0) 1) The metadata server. To borrow a simile from that judge on American Idol "Using MetaStage is kind of like bathing in the ocean on a cold morning. You know it's good for you but that doesn't stop it from freezing the crown jewels." MetaStage is good for ETL projects but none of the projects I've been on has actually used it. Too much effort required to install the software, setup the metabrokers, migrate the metadata, and learn how the product works and write reports. Hawk brings the common repository and improved metadata reporting and we can get the positive effectives of bathing in sea water without the shrinkage that comes with it. 2) QualityStage overhaul. Data Quality reporting can be another forgotten aspect of data integration projects. Like MetaStage the QualityStage server and client had an additional install, training and implementation overhead so many DataStage projects did not use it. I am looking forward to more integration projects using standardisation, matching and survivorship to improve quality once these features are more accessible and easier to use. 3) Frictionless Connectivity and Connection Objects. I've called DB2 every rude name under the sun. Not because it's a bad database but because setting up remote access takes me anywhere from five minutes to five weeks depending on how obscure the error message and how hard it is to find the obscure setup step that was missed during installation. Anything that makes connecting to database easier gets a big tick from me. 4) Parallel job range lookup. I am looking forward to this one because it will stop people asking for it on forums. It looks good, it's been merged into the existing lookup form and seems easy to use. Will be interested to see the performance. 5) Slowly Changing Dimension Stage. This is one of those things that Informatica were able to trumpet at product comparisons, that they have more out of the box DW support. There are a few enhancements to make updates to dimension tables easier, there is the improved surrogate key generator, there is the slowly changing dimension stage and updates passed to in memory lookups. That's it for me with DBMS generated keys, I'm only doing the keys in the ETL job from now on! DataStage server jobs have the hash file lookup where you can read and write to it at the same time, parallel jobs will have the updateable lookup. 6) Collaboration: better developer collaboration. Everyone hates opening a job and being told it is locked. "Bloody what his name has gone to lunch, locked the job and now his password protected screen saver is up! Unplug his PC!" Under Hawk you can open a read only copy of a locked job plus you get told who has locked the job so you know whom to curse. 7) Session Disconnection. Accompanied by the metallic cry of "exterminate! exterminate!" an administrator can disconnect sessions and unlock jobs. 8) Improved SQL Builder. I know a lot of people cross the street when they see the SQL Builder coming. Getting the SQL builder to build complex SQL is a bit like teaching a monkey how to play chess. What I do like about the current SQL builder is that it synchronises your SQL select list with your ETL column list to avoid column mismatches. I am hoping the next version is more flexible and can build complex SQL. 9) Improved job startup times. Small parallel jobs will run faster. 
I call it the death of a thousand cuts: your very large parallel job takes too long to run because a thousand smaller jobs are starting and stopping at the same time and cutting into CPU and memory. Hawk makes these cuts less painful.
10) Common logging. Log views that work across jobs, log searches, log date constraints, wildcard message filters, saved queries. It's all good. You no longer need to send out a search party to find an error message.
How does Relational Data Modeling differ from Dimensional Data Modeling? In relational models data is normalized to 1st, 2nd or 3rd normal form. In dimensional models data is denormalized. The typical design is that of a star, where there is a central fact table containing additive, measurable facts, and this central fact table is related to dimension tables which generally contain descriptive text attributes that normally occur in the WHERE clause of a query.
What is data sparsity and how does it affect aggregation? Data sparsity is a term used for how much data we have for a particular dimension/entity of the model. It affects aggregation depending on how deep the member combinations of the sparse dimensions go. If there are many combinations and those combinations have no factual data, then creating space to store those aggregations is a waste and, as a result, the database will become huge.
What is a weak entity?
A weak entity is part of a one-to-many relationship, with the identifying entity on the one side of the relationship and with total participation on the many side. The weak entity relies on the identifying entity for its identification. The primary key of a weak entity is a composite key (the PK of the identifying entity, i.e. the identifier, plus a discriminator). More than one value of the discriminator can exist for each identifier.
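As a minimal SQL sketch (the table and column names are invented for illustration), an order line is a weak entity: it is identified only together with its owning order, so its primary key is the composite of the order's key and a line-number discriminator.

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    order_date DATE
);

CREATE TABLE order_item (
    order_id INTEGER NOT NULL REFERENCES orders(order_id),  -- identifier taken from the identifying entity
    line_no  INTEGER NOT NULL,                              -- discriminator
    quantity INTEGER,
    PRIMARY KEY (order_id, line_no)                         -- composite key: identifier + discriminator
);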
How do we decide which schema to implement in the data warehouse?
Pro star schema: users find it easier to query and query times are faster. The space cost of repeated data makes little difference, as dimension tables hold relatively few rows compared with the large fact tables.
Pro snowflake schema: DBAs find it easier to maintain. It may exist in staging but should be de-normalized in production. The cleanest designs use surrogate keys for all dimension levels. Storage size is smaller because normalized data takes less space.
In which normal form are the dimension table and fact table in the schema? Unlike OLTP, the goal of dimension and fact modeling is not to achieve the highest normal form but rather to make key performance indicators (the often sought-after measures) readily accessible to ad-hoc queries. That being said, dimensions can strive to be in Boyce-Codd 3rd normal form, while fact tables may be in 1st normal form, having only a unique primary key. Normalized dimension tables have the advantage of lower storage space, while de-normalized 1st normal form dimension tables take more space but perform faster. Every schema deals with facts and dimensions. The fact table is the central table in the schema, whereas the dimension tables surround it. A dimension table describes the business transactions and always has a primary key; a fact table deals with the measures and carries foreign keys to the dimension tables.
What is the difference between a logical data model and a physical data model in Erwin? The Logical Data Model (LDM) is derived from the Conceptual Data Model (CDM). The CDM consists of the major entity sets and the relationship sets, and does not say anything about the attributes of the entity sets. The LDM consists of the entity sets, their attributes, the relationship sets, the cardinality, the type of relationship, etc. The Physical Data Model (PDM) consists of the entity sets (tables), their attributes (columns of tables), the relationship sets (whose attributes are also mapped to columns of tables), along with the datatypes of the columns, the various integrity constraints, etc. Erwin calls the conversion/transformation of LDM => PDM Forward Engineering, which further leads to the actual code generation, and the conversion of Code => PDM => LDM Reverse Engineering.
Unix Programming Interview Questions
What is the UNIX command to wait for a specified number of seconds before exit? sleep seconds
DataStage Q & A
Configuration Files
APT_CONFIG_FILE is the environment variable through which DataStage determines the configuration file to be used (one can have many configuration files for a project). In fact, this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use? If the APT_CONFIG_FILE environment variable is not defined then DataStage looks for the default configuration file (config.apt) in the following paths: the current working directory, and INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of the DataStage installation.
What are the different options a logical node can have in the configuration file?
fastname - the fastname is the physical node name that stages use to open connections for high volume data transfers. The attribute of this option is often the network name. Typically, you can get this name by using the Unix command uname -n.
pools - the names of the pools to which the node is assigned. Based on the characteristics of the processing nodes you can group nodes into sets of pools. A pool can be associated with many nodes and a node can be part of many pools. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list. A parallel job or a specific stage in the parallel job can be constrained to run on a pool (a set of processing nodes). If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes that are common to the stage and the job.
resource - resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (to read/write persistent data to this directory), scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
How does DataStage decide on which processing node a stage should run? If a job or stage is not constrained to run on specific nodes then the parallel engine executes a parallel stage on all nodes defined in the default node pool (the default behavior). If the node is constrained then the constrained processing nodes are chosen while executing the parallel stage (refer to 2.2.3 for more detail).
When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. This is called the conductor node. For the other nodes, you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches while the other nodes are connected to each other using very high-speed network switches. How do you configure your system so that you will be able to achieve optimized parallelism? Make sure that none of the stages are specified to run on the conductor node. Use the conductor node just to start the execution of the parallel job. Make sure that the conductor node is not part of the default pool.
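As a rough illustration of the options above, a minimal two-node configuration file might look like the following sketch; the node names, fastname and directory paths are invented for the example and are not taken from any particular installation:

{
  node "node1"
  {
    fastname "etlserver"
    pools ""
    resource disk "/data/datastage/disk1" {pools ""}
    resource scratchdisk "/data/datastage/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etlserver"
    pools "" "sort"
    resource disk "/data/datastage/disk2" {pools ""}
    resource scratchdisk "/data/datastage/scratch2" {pools ""}
  }
}

Here both logical nodes run on the same physical server (the same fastname); node2 additionally belongs to a "sort" pool, so a stage constrained to that pool would run only on node2.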
Although parallelization increases the throughput and speed of the process, why is maximum parallelization not necessarily the optimal parallelization? DataStage creates one process for every stage on each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with 4 stages. We have 3 logical nodes (one corresponding to each physical node, say a CPU). Now DataStage will start 3*4 = 12 processes, which have to be managed by a single operating system, and significant time will be spent on context switching and process scheduling.
Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So, how do we decide which node will be suitable for which stage? If a stage is performing a memory-intensive task then it should be run on a node which has more memory and scratch disk space available to it; sorting data, for example, is a memory-intensive task and should be run on such nodes. If a stage depends on a licensed version of software (e.g. the SAS stage, RDBMS-related stages, etc.) then you need to associate that stage with the processing node which is physically mapped to the machine on which the licensed software is installed (assumption: the machine on which the licensed software is installed is connected to the other machines by a high-speed network). If a job contains stages which exchange large amounts of data then they should be assigned to nodes where the stages communicate by either shared memory (SMP) or a high-speed link (MPP) in the most optimized manner.
Basically, nodes are nothing but a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes) and copies the same environment to them. However, it is possible to create a startup script which will selectively change the environment on a specific node. This script has a default name of startup.apt. However, like the main configuration file, we can also have many startup configuration files. The appropriate startup file can be picked up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?
Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shell.
What are the generic things one must follow while creating a configuration file so that optimal parallelization can be achieved? Consider avoiding the disk/disks that your input files reside on. Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they're located on a RAID (Redundant Array of Inexpensive Disks) system. Know what is real and what is NFS: real disks are directly attached, or are reachable over a SAN (storage-area network: dedicated, just for storage, low-level protocols). Never use NFS file systems for scratchdisk resources; remember that scratchdisks are also used for temporary storage of files/data during processing. If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a final disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS. Know which data points are striped (RAID) and which are not. Where possible, avoid striping across data points that are already striped at the spindle level.
What is a conductor node? Every parallel job run contains a conductor process (where the execution was started), a section leader process for each processing node, a player process for each set of combined operators and an individual player process for each uncombined operator. Whenever we want to kill a run we should destroy the player processes, then the section leader processes and then the conductor process.
Relational vs Dimensional
Relational Data Modeling: data is stored in an RDBMS; tables are the units of storage; data is normalized and used for OLTP; optimized for OLTP processing; several tables and chains of relationships among them; volatile (several updates) and time variant; SQL is used to manipulate data; detailed level of transactional data; normal reports.
Dimensional Data Modeling: data is stored in an RDBMS or multidimensional databases; cubes are the units of storage; data is denormalized and used in data warehouses and data marts; optimized for OLAP; few tables, with fact tables connected to dimension tables; non-volatile and time invariant; MDX is used to manipulate data; summary of bulky transactional data (aggregates and measures) used in business decisions; user-friendly, interactive, drag-and-drop multidimensional OLAP reports.
ETL process and concepts ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:
- extracting data from source operational or archive systems, which are the primary source of data for the data warehouse
- transforming the data, which may involve cleaning, filtering, validating and applying business rules
- loading the data into a data warehouse or any other database or application that houses data
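As a minimal sketch of the transform and load steps expressed in SQL (the staging and target table names stg_customers and dim_customer, and the business rule shown, are invented for illustration):

INSERT INTO dim_customer (customer_id, customer_name, region)
SELECT customer_id,
       TRIM(UPPER(customer_name)),       -- cleaning / standardising
       COALESCE(region, 'UNKNOWN')       -- applying a simple business rule
  FROM stg_customers
 WHERE customer_id IS NOT NULL;          -- filtering / validating

In an ETL tool such as DataStage these steps would normally be implemented with extract, transformer and load stages rather than hand-written SQL.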
The ETL process is also very often referred to as Data Integration process and ETL tool as a Data Integration platform. The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation. The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts. What is Data Modeling
Description: Data Modeling is a method used to define and analyze data requirements needed to support the
business functions of an enterprise. These data requirements are recorded as a conceptual data model with associated data definitions. Data modeling defines the relationships between data elements and structures. What is Data Cluster
Description: Clustering in the computer science world is the classification of data or objects into different groups. It can also be referred to as partitioning of a data set into different subsets. The data in each subset ideally share some common traits. Data clusters are created to meet specific requirements that cannot be created using any of the categorical levels. What is Data Aggregation
Description: In Data Aggregation, value is derived from the aggregation of two or more contributing data characteristics. Aggregation can be made from different data occurrences within the same data subject, business transactions and a de-normalized database and between the real world and detailed data resource design within the common data What is Data Dissemination
Description: The best example of dissemination is the ubiquitous internet. Every single second throughout the year, data gets disseminated to millions of users around the world. Data could sit on millions of servers located in scattered geographical locations. Data dissemination on the internet is possible through many different kinds of communications protocols. What is Data Distribution
Description: Often, data warehouses are being managed by more than just one computer server. This is because the high volume of data cannot be handled by one computer alone. In the past, mainframes were used for processes involving big bulks of data. Mainframes were giant computers housed in big rooms to be used for critical applications involving lots of data What is Data Administration
Description: Data administration refers to the way in which data integrity is maintained within the data warehouse. Data warehouses are very large repositories of all sorts of data. These data may be of different formats. To make these data useful to the company, the database running the data warehouse has to be configured so that it obeys the business What is Data Collection Frequency
Description: Data Collection Frequency, just as the name suggests refers to the time frequency at which data is collected at regular intervals. This often refers to whatever time of the day or the year in any given length of period. In a data warehouse, the relational database management systems continually gather, extract, transform and load data onto the storage What is Data Duplication
Description: The definition of what constitutes a duplicate has somewhat different interpretations. For instance, some define a duplicate as having the exact syntactic terms and sequence, whether having formatting differences or not. In effect, there are either no differences or only formatting differences and the contents of the data are exactly the same. What is Data Integrity Rule
Description: In the past, Data Integrity was defined and enforced with data edits but this method did not cope with the growth of technology, and data value quality greatly suffered at the cost of business operations. Organizations were starting to realize that the rules were no longer appropriate for the business. The concept of business rules is already widely
What is Data Migration? Data migration is the translation of data from one format to another format or from one storage device to another storage device. Data migration typically has four phases: analysis of the source data, extraction and transformation of the data, validation and repair of the data, and use of the data in the new program. What is Data Management?
Data management is the development and execution of architectures, policies, practices and procedures in order to manage the information lifecycle needs of an enterprise in an effective manner.
What is Data Synchronization? And why is it important? When suppliers and retailers attempt to communicate with one another using unsynchronized data, there is confusion. Neither party completely understands what the other is requesting. The inaccuracies cause costly errors in a variety of business systems. By synchronizing item and supplier data, each organization works from identical information, therefore minimizing miscommunication. Data synchronization is the key to accurate and timely exchange of item and supplier data across enterprises and organizations.
What is Data Cleansing? Also referred to as data scrubbing, it is the act of detecting and removing and/or correcting a database's dirty data (i.e., data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly). The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data that have been merged from separate databases.
What is Data Consolidation? Data consolidation is usually associated with moving data from remote locations to a central location or combining data due to an acquisition or merger.
Data Warehouse Architecture
The main difference between the database architecture in a standard, on-line transaction processing oriented system (usually an ERP or CRM system) and a Data Warehouse is that the system's relational model is usually de-normalized into dimension and fact tables, which are typical of a data warehouse database design. The differences in the database architectures are caused by the different purposes of their existence. In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining the usefulness of the application. That kind of database needs to handle inserting thousands of new records every hour. To achieve this the database is usually optimized for the speed of inserts, updates and deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE statements. Opposite to OLTP systems, a Data Warehouse is a system that should give a response to almost any question regarding company performance measures. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions. So the information should be accessible quickly and easily, but it doesn't need to be the most recent possible and at the lowest detail level. Usually data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month (data is available for the end users around the 5th working day of a new month). Very often the two approaches are combined. The main challenge of a Data Warehouse architecture is to enable the business to access historical, summarized data with read-only access for the end users. Again, from a technical standpoint most SQL queries would start with a SELECT statement. In Data Warehouse environments, the relational model can be transformed into the following architectures:
- Star schema
- Snowflake schema
- Constellation schema
Star schema architecture
Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the center, called the fact table, and the dimension tables which allow browsing of specific categories, summarizing, drill-downs and specifying criteria. Typically, most of the fact tables in a star schema are in database third normal form, while dimension tables are de-normalized (second normal form). Despite the fact that the star schema is the simplest data warehouse architecture, it is the most commonly used in data warehouse implementations across the world today (about 90-95% of cases). Fact table
The fact table is not a typical relational database table as it is de-normalized on purpose, to enhance query response times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due to the time-variant nature of a data warehouse environment. The primary key for the fact table is a composite of all the columns except the numeric values/scores (like QUANTITY, TURNOVER, exact invoice date and time). Typical fact tables in a global enterprise data warehouse are (usually there may be additional company- or business-specific fact tables):
- sales fact table - contains all details regarding sales
- orders fact table - in some cases the table can be split into open orders and historical orders; sometimes the values for historical orders are stored in a sales fact table
- budget fact table - usually grouped by month and loaded once at the end of a year
- forecast fact table - usually grouped by month and loaded daily, weekly or monthly
- inventory fact table - reports stocks, usually refreshed daily
Dimension table
Nearly all of the information in a typical fact table is also present in one or more dimension tables. The main purpose of maintaining dimension tables is to allow browsing the categories quickly and easily. The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table. In a star schema design, there is only one de-normalized table for a given dimension. Typical dimension tables in a data warehouse are:
- time dimension table
- customers dimension table
- products dimension table
- key account managers (KAM) dimension table
- sales office dimension table
Star schema example
An example of a star schema architecture is depicted below. [Diagram: Star schema DW architecture]
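Since the original diagram is not reproduced here, the following SQL sketch shows the kind of structure it depicts; all table and column names are invented for illustration and are not taken from the original figure:

CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, calendar_date DATE, month_name VARCHAR(20), year_no INTEGER);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, product_name VARCHAR(100), brand VARCHAR(50));
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name VARCHAR(100), region VARCHAR(50));

-- Central fact table: the composite primary key is made of the dimension keys, the measures stay out of the key
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    quantity     INTEGER,
    turnover     DECIMAL(15,2),
    PRIMARY KEY (time_key, product_key, customer_key)
);

-- Typical browse/summarize query: dimension attributes drive the WHERE and GROUP BY clauses
SELECT d.year_no, p.brand, SUM(f.turnover) AS total_turnover
  FROM sales_fact f
  JOIN time_dim d    ON f.time_key    = d.time_key
  JOIN product_dim p ON f.product_key = p.product_key
 WHERE d.year_no = 2006
 GROUP BY d.year_no, p.brand;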
Snowflake schema architecture
Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that the dimension tables in a snowflake schema are normalized, so they have a typical relational database design. Snowflake schemas are generally used when a dimension table becomes very big and when a star schema can't represent the complexity of a data structure. For example, if a PRODUCT dimension table contains millions of rows, the use of snowflake schemas should significantly improve performance by moving out some data to another table (with BRANDS for instance). The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query it. This is because in order for a query to be answered, many tables need to be joined and aggregates generated.
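To contrast with the star-schema sketch above, a snowflaked product dimension might move the brand attributes out into their own table (again, the table and column names are invented for illustration):

CREATE TABLE brand_dim (
    brand_key  INTEGER PRIMARY KEY,
    brand_name VARCHAR(50)
);

-- Normalized product dimension: querying or filtering by brand now needs an extra join
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    brand_key    INTEGER REFERENCES brand_dim(brand_key)
);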
Fact constellation schema architecture
For each star schema or snowflake schema it is possible to construct a fact constellation schema. This schema is more complex than the star or snowflake architecture because it contains multiple fact tables. This allows dimension tables to be shared amongst many fact tables. That solution is very flexible; however, it may be hard to manage and support. The main disadvantage of the fact constellation schema is a more complicated design, because many variants of aggregation must be considered. In a fact constellation schema, different fact tables are explicitly assigned to the dimensions which are relevant for the given facts. This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level. Use of this model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with a sales forecast which is calculated based on month, client id and product id. In that case, using two different fact tables with a different level of grouping is realized through a fact constellation model. An example of a constellation schema architecture is depicted below.
Question ************* I am getting the records from the source table and, after doing the lookup with the target, I have to insert the new records and also the updated records into the target table. If I use the Change Capture stage or the Difference stage it will give only the updated records in the output, but I have to maintain the history in the target table, i.e. I want the existing record and also the updated record in the target table. I don't have flag columns in the target tables. Is this possible in parallel jobs without using transformer logic? Answers ************* Use a Compare stage; as well as the result you get each source row as a single-column subrecord (you can promote these later with Promote Subrecord stages). There are two ways to get and apply your change capture. You start with a Before set of data and an After set of data. If you use the Change Capture stage alone it gives you new/changed/deleted/unchanged records. There is a flag to turn each one of these on or off, and an extra field is written out that indicates what type of change it is. You can now either apply this to a target database table using insert/update/delete/load stages, transformers and filters, OR you can merge it with your Before set of data using the Change Apply stage. The Change Apply will give you the new changes and the old history.
Questions
**************** Suppose I have to create a reject file with today's date and time stamp; what is the solution? In the Job Properties, in the after-job routine, type the file names and concatenate the date and time as follows: cat <filename1> <file name2> >> <file name>`date +"%Y%m%d_%H%M%S"`.txt
Question ***************** Hi, I have a problem using the iconv and oconv functions for date conversion in DataStage 8.0.1. The format in which I entered the date was mm/dd/yyyy and the desired output was yyyy/mm/dd, but I got an output of yyyy mm dd. How can I add the slashes to my output? And I would also like to know how I could convert the date to the form yy MON dd. The function that I used is as follows: Oconv(Iconv(InputFieldName,"D DMY[2,2,4]"),"D YMD[4,2,2]")
Solution: Code: Oconv(Iconv(InputFieldName,"DMDY"),"D/YMD[4,2,2]") For the second question, try Code: Oconv(Iconv(InputFieldName,"DMDY"),"D YMD[2,A3,2]")
Difference between FTP and SMTP? FTP is a file transfer protocol used to send files across the network, whereas SMTP is used to send email across the network (i.e. for mailing purposes); both FTP and SMTP are supported by IIS.
Difference between SUBSTR and INSTR. The INSTR function finds the numeric starting position of a string within a string. For example: Select INSTR('Mississippi', 'i', 3, 3) test1, INSTR('Mississippi', 'i', 1, 3) test2, INSTR('Mississippi', 'i', -2, 3) test3 from dual; Its output would be:
Test1  Test2  Test3
11     8      2
The SUBSTR function returns the selection of the specified string defined by numeric character positions.
For example: Select SUBSTR('The Three Musketeers', 1, 3) from dual; will return 'The'.
What is the difference between a Filter and a Switch? A Filter stage is used to filter the incoming data. Suppose you want to get the details of customer 20: if you give customer 20 as the constraint in the Filter it will output only the customer 20 records, and you can also define a reject link so the rest of the records go down the reject link. In the Switch stage we define cases, e.g. case1 = 10; case2 = 20; it will give the output of the customer 10 and customer 20 records. Switch will check the cases and execute them. In the Filter stage we can give multiple conditions on multiple columns, but every time data comes from the source system the filtered data is loaded into the target; whereas in the Switch stage we can give multiple conditions on a single column, the data comes only once from the source, all the conditions are checked in the Switch stage and the data is loaded into the target.
What is the difference between the Filter and External Filter stage? Filter is used to pass records based on some condition. We can specify multiple where conditions and send the records to different output links. External Filter is used to execute Unix commands. You can give the command in the specified box with the corresponding arguments.
cardinality - From an OLTP perspective, this refers to the number of rows in a table. From a data warehousing perspective, this typically refers to the number of distinct values in a column. For most data warehouse DBAs, a more important issue is the degree of cardinality.
degree of cardinality - The number of unique values of a column divided by the total number of rows in the table. This is particularly important when deciding which indexes to build. You typically want to use bitmap indexes on low degree of cardinality columns and B-tree indexes on high degree of cardinality columns. As a general rule, a cardinality of under 1% makes a good candidate for a bitmap index.
fact - Data, usually numeric and additive, that can be examined and analyzed. Examples include sales, cost, and profit. Fact and measure are synonymous; fact is more commonly used in relational environments, measure is more commonly used in multidimensional environments.
derived fact (or measure) - A fact (or measure) that is generated from existing data using a mathematical operation or a data transformation. Examples include averages, totals, percentages, and differences.
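A small SQL sketch of the degree-of-cardinality calculation described above; the table and column (customer_dim, region) are invented for illustration:

-- degree of cardinality = number of distinct values / total number of rows
SELECT COUNT(DISTINCT region) / COUNT(*) AS degree_of_cardinality
  FROM customer_dim;

-- In databases that use integer division, multiply the numerator by 1.0 first.
-- A result well below 0.01 (1%) suggests the column is a good bitmap index candidate;
-- a value close to 1 suggests a B-tree index instead.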
10 Ways to Make DataStage Run Slower
Everyone wants to tell you how to make your ETL jobs run faster; well, here is how to make them slower!
The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, that is a great chat about bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?
1. Use the same configuration file for all your jobs. You may have two nodes configured for each CPU on your DataStage server and this allows your high volume jobs to run quickly, but this works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed but once you are there it is lightning fast. If you are processing a handful of rows the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
2. Use a sparse database lookup on high volumes. This is a great way to slow down any ETL tool; it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, and for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!
3. Keep resorting your data. Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!
4. Design single threaded bottlenecks. This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume server job that hasn't been made parallel by multiple-instance or interprocess functionality; a scripted FTP of a file that keeps an entire DataStage parallel engine waiting; a bulk database load via a single update stream; reading a large sequential file from a parallel job without using multiple readers per node.
5. Turn on debugging and forget that it's on. In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever!
Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!
6. Let the disks look after themselves. Never look at what is happening with your disk I/O - that's a Pandora's box of better performance! You can get some beautiful drag and slow-down by ignoring your disk I/O, as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.
7. Keep writing that data to disk. Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, it can give you a modular job design. It can also slow down parallel jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a dumb ASCII file stripped of metadata, and then import and repartition to pick it up and process it again. Sequential files are the Forrest Gump of file storage: simple and practical but dumb as all hell. It costs time to write to them and time to read and parse them, so designing an end-to-end process that writes data to sequential files repeatedly will give you massive slow-down times.
8. Validate every field. A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before it is processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion, such as numbers in VARCHAR fields, dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.
9. Write extra steps in database code. The same phrase gets uttered on many an ETL project: "I can write that in SQL", or "I can write that in Java", or "I can do that in an awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a sure-fire way to end up with a step in the end-to-end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".
10. Don't do performance testing. Do not take your highest volume jobs into performance testing; just keep the default settings, the default partitioning and your first draft design, throw that into production and get the hell out of there.
How to check DataStage internal error descriptions
To check the description of an error number, go to the datastage shell (from the Administrator client or by telnet to the server machine) and invoke the following command:
SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in this case the number 081021 is an error number. The command will produce a brief error description which probably will not be helpful in resolving the issue, but it can be a good starting point for further analysis.
How to stop a job when its status is running? To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select the process which holds a lock and click Logout. If it still doesn't help, go to the datastage shell and invoke the following command: ds.tools It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).
How to run and schedule a job from the command line? To run a job from the command line use the dsjob command. Command syntax: dsjob [-file <file> <server> | [-server <server>][-user <user>][-password <password>]] <command> [<arguments>] The command can be placed in a batch file and run in a system scheduler.
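A hedged example of the kind of dsjob calls used in practice; the project name (MYPROJ), job name (LoadSales) and the parameter are invented for illustration:

dsjob -run -mode NORMAL -param SourceDir=/data/in -jobstatus MYPROJ LoadSales
dsjob -jobinfo MYPROJ LoadSales
dsjob -logsum MYPROJ LoadSales

The -jobstatus option makes dsjob wait for the job to finish and return an exit code derived from the job status, which is what a system scheduler usually checks.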
Is it possible to run two versions of DataStage on the same PC? Yes, even though different versions of DataStage use different system dll libraries. To dynamically switch between DataStage versions, install and run the DataStage Multi-Client Manager. That application can unregister and register the system libraries used by DataStage.
How to release a lock held by jobs? Go to the datastage shell and invoke the following command: ds.tools It will open an administration panel. Go to 4. Administer processes/locks, then try invoking one of the clear locks commands (options 7-10).
What is a command to analyze a hashed file? There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are: the FILE.STAT command and the ANALYZE.FILE command.
What is the difference between the logging text and the final text message in a Terminator stage?
Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started. Informational is a green line, a DSLogInfo() type message. The Final Warning Text is the red fatal message, which is included in the sequence abort message.
How to invoke an Oracle PL/SQL stored procedure from a server job: to run a PL/SQL procedure from DataStage a Stored Procedure (STP) stage can be used. However, it needs a flow of at least one record to run. It can be designed in the following way:
- a source ODBC stage which fetches one record from the database and maps it to one column, for example: select sysdate from dual
- a Transformer which passes that record through. If required, add the PL/SQL procedure parameters as columns on the right-hand side of the transformer's mapping
- a Stored Procedure (STP) stage as the destination. Fill in the connection parameters, type in the procedure name and select Transform as the procedure type. In the input tab select 'execute procedure for each row' (it will be run once).
Datastage routine to open a text file with error catching
Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END
Datastage routine which reads the first line from a text file
Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
   CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
   CALL DSLogInfo("Unable to open file", "JobControl")
   ABORT
END
READSEQ FILE1.RECORD FROM H.FILE1 ELSE
   Call DSLogWarn("******************** File is empty", "JobControl")
END
* read the first 32 chars
firstline = Trim(FILE1.RECORD[1,32]," ","A")
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1
How to adjust the commit interval when loading data to the database? In earlier versions of DataStage the commit interval could be set up in: General -> Transaction size (in version 7.x it's obsolete). Starting from DataStage 7.x it can be set up in the properties of the ODBC or ORACLE stage in Transaction handling -> Rows per transaction. If set to 0, the commit will be issued at the end of a successful transaction.
Database update actions in the ORACLE stage
The destination table can be updated using various update actions in the Oracle stage. Be aware of the fact that it's crucial to select the key columns properly, as this determines which columns will appear in the WHERE part of the SQL statement. Available actions:
- Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
- Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
- Insert rows without clearing - only adds new rows (INSERT statement).
- Delete existing rows only - deletes matched rows (issues only the DELETE statement).
- Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
- Update existing rows only - updates existing rows (UPDATE statement).
- Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
- Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
- User-defined SQL - the data is written using a user-defined SQL statement.
- User-defined SQL file - the data is written using a user-defined SQL statement from a file.
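As a rough sketch (the table, columns and key column are invented for illustration), the 'Update existing rows or insert new rows' action resolves to parameterised statements of roughly this shape, with the selected key column driving the WHERE clause:

UPDATE customers SET customer_name = ?, region = ? WHERE customer_id = ?;
-- only if the UPDATE matches no existing row does the stage then issue:
INSERT INTO customers (customer_id, customer_name, region) VALUES (?, ?, ?);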
Use and examples of ICONV and OCONV functions. ICONV and OCONV functions are quite often used to handle data in DataStage. ICONV converts a string to an internal storage format and OCONV converts an expression to an output format. Syntax: Iconv (string, conversion code) Oconv (expression, conversion) Some useful iconv and oconv examples: Iconv("10/14/06", "D2/") = 14167 Oconv(14167, "D-E") = "14-10-2006" Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006" Oconv(12003005, "MD2$,") = "$120,030.05" This expression formats a number and rounds it to 2 decimal places: Oconv(L01.TURNOVER_VALUE*100,"MD2") Iconv and oconv can be combined in one expression to reformat a date format easily: Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"
Can DataStage use Excel files as a data input? Microsoft Excel spreadsheets can be used as a data input in DataStage. Basically there are two possible approaches available:
- Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the DataStage server machine and using an ODBC stage in DataStage. The main disadvantage is that it is impossible to do this on a Unix machine. On DataStage servers operating in Windows it can be set up here: Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Microsoft Excel Driver (*.xls) -> provide a data source name -> select the workbook -> OK
- Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a Sequential File stage in DataStage to read the data.
Error timeout waiting for mutex
The error message usually looks as follows: ... ds_ipcgetnext() - timeout waiting for mutex There may be several reasons for the error and thus several solutions to get rid of it. The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file, or if a job is very complex, with the use of many transformers. There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters: get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it's very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).