Resume BI DS
Business intelligence systems are designed to help businesses make better decisions by providing them with real-
time, accurate, and relevant information. By analyzing data from various sources, such as sales, customer
feedback, and market trends, BI systems enable organizations to spot patterns and trends, monitor
performance, and find opportunities for growth.
The BI process
1. Data collection: The first stage of the BI process is data collection, which involves gathering data from
various sources such as databases, spreadsheets, files, and other data repositories. The data collected in this
stage can include structured data, such as sales records and customer data, as well as unstructured data,
such as customer feedback and social media posts.
2. ETL (Extraction, Transformation, and Loading): The next stage of the BI process is ETL, which involves
extracting data from its source, transforming it into a format that can be easily analyzed, and loading it into a
data warehouse or data lake. ETL involves a number of sub-steps, such as data profiling, data cleaning, data
mapping, and data transformation, which are used to ensure the quality and consistency of the data.
3. Data integration: Once the data is loaded into the data warehouse or data lake, the next stage is data
integration. This stage involves integrating the data from different sources into a single, unified view, which
can be easily analyzed. Data integration may involve merging data from different databases, reconciling
differences in data definitions, and identifying and resolving data quality issues.
4. Data distribution (diffusion): The next stage of the BI process is data distribution, which involves making the data available
to users across the organization. This may involve creating user-friendly interfaces that enable users to
access and analyze the data, as well as setting up security protocols to ensure that data is only accessible to
authorized users.
5. Presentation: The final stage of the BI process is presentation, which involves presenting the data in a
format that is easy to understand and use. This may involve creating dashboards, reports, and other
visualizations that enable users to analyze the data and make informed decisions. The presentation stage
may also involve setting up alerts and notifications that inform users of changes in the data or the occurrence
of certain events.
Terminology
OLAP systems are specifically designed for analytical processing. These systems use multidimensional data
models, which are optimized for fast query performance, even on large volumes of data. OLAP systems enable
users to analyze data from multiple dimensions, such as time, geography, product, or customer, and perform
complex calculations, such as averages, percentages, or trends.
The purpose of the staging area is to provide a buffer between the source systems and the data warehouse/ODS,
and to facilitate the ETL process. In the staging area, data can be validated, cleaned, transformed, and
consolidated before it is loaded into the target system. The staging area is typically used to handle high volumes
of data and to ensure the quality and consistency of the data before it is loaded into the data warehouse or ODS.
An Operational Data Store (ODS) is a type of data repository that is designed to support operational reporting
and decision-making. The ODS typically holds current, integrated data from the operational systems, with little or
no history, and it is updated frequently (often in near real time) to reflect changes in those systems. The purpose
of the ODS is to provide users with timely, accurate, and up-to-date data that can be used to support operational
decision-making.
Data warehouse
A data warehouse is a large, centralized repository of data that is used to support business intelligence (BI)
activities. A data warehouse is designed to store data from various sources and transform it into a format that is
optimized for querying and analysis.
The data in a data warehouse is typically organized into subject areas, such as sales, customers, or products, and
is optimized for querying and reporting. The data is also structured in a way that is easy to understand and use,
and is often modeled using a dimensional model, such as a star schema or snowflake schema.
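As an illustration, a star schema can be sketched with an in-memory SQLite database: one fact table in the middle, joined to the dimension tables around it. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are invented for this sketch and do not come from any particular system.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
""")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Laptop", "Hardware"), (2, "Mouse", "Hardware"), (3, "Antivirus", "Software")],
)
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [(10, 2023, 1), (11, 2023, 2)])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 10, 1000.0), (2, 10, 50.0), (3, 11, 200.0), (1, 11, 900.0)],
)


def sales_by_category_and_month(connection):
    """Join the fact table to its dimensions and total the sales amount."""
    query = """
        SELECT p.category, d.year, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year, d.month
        ORDER BY p.category, d.year, d.month
    """
    return connection.execute(query).fetchall()


rows = sales_by_category_and_month(conn)
for row in rows:
    print(row)
```

A snowflake schema would go one step further and split a dimension such as dim_product into further normalized tables (for example a separate category table).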
Data marts
A data mart is a subset of a data warehouse that is designed to support a specific business function or
department within an organization. Unlike a data warehouse, which is a centralized repository of data, a data mart
is focused on a specific business unit or department and is designed to meet its specific reporting and analysis
needs. Data marts typically store data in a multidimensional format, which can be thought of as a “hypercube”
in which each dimension (for example time, geography, or product) is one axis of the cube. This format makes the
data well suited to analysis with OLAP tools, which can access the data in the data mart and present it in a way
that is easy to analyze and understand.
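The hypercube idea can be made concrete with a very small sketch in which each cell is addressed by one value per dimension. The dimension names, the sample figures, and the helper functions roll_up and slice_cube are assumptions made for illustration; real OLAP tools use far more efficient storage.

```python
from collections import defaultdict

# Each cell of the cube is addressed by one coordinate per dimension.
DIMENSIONS = ("year", "region", "product")
cube = {
    (2022, "North", "Laptop"): 120,
    (2022, "South", "Laptop"): 80,
    (2023, "North", "Laptop"): 150,
    (2023, "North", "Mouse"): 300,
    (2023, "South", "Mouse"): 200,
}


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    indexes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in indexes)] += value
    return dict(totals)


def slice_cube(cells, dimension, value):
    """Keep only the cells where the given dimension has the given value."""
    i = DIMENSIONS.index(dimension)
    return {coords: v for coords, v in cells.items() if coords[i] == value}


by_year = roll_up(cube, ["year"])
north_by_product = roll_up(slice_cube(cube, "region", "North"), ["product"])
print(by_year)
print(north_by_product)
```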
Data denormalization is the process of intentionally introducing redundancy into a database design by storing
duplicate copies of data, typically to reduce the number of joins and speed up read queries.
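A minimal sketch of denormalization, assuming two small in-memory tables (customers and orders, both invented for this example): the customer attributes are copied onto every order row, so a later query no longer needs a lookup.

```python
# Normalized form: customer attributes are stored once, orders reference them by key.
customers = {
    1: {"name": "Alice", "city": "Paris"},
    2: {"name": "Bob", "city": "Lyon"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 1, "amount": 40.0},
    {"order_id": 102, "customer_id": 2, "amount": 15.0},
]


def denormalize(order_rows, customer_lookup):
    """Copy the customer attributes onto each order row (deliberate redundancy)."""
    flat = []
    for order in order_rows:
        customer = customer_lookup[order["customer_id"]]
        flat.append(
            {**order, "customer_name": customer["name"], "customer_city": customer["city"]}
        )
    return flat


flat_orders = denormalize(orders, customers)
# The city filter now works on the flat rows without any join or lookup.
paris_total = sum(r["amount"] for r in flat_orders if r["customer_city"] == "Paris")
print(paris_total)
```

The price of the faster read is that "Alice" and "Paris" are now stored twice, so a change of city has to be applied to every copy.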
The move from OLTP (online transaction processing) to OLAP is necessary because OLAP systems provide a number of
benefits that OLTP systems do not. Some of these benefits include:
1. Fast query performance: OLAP systems are optimized for fast query performance, even on large volumes of
data. This enables users to analyze data quickly and make informed decisions.
2. Multidimensional analysis: OLAP systems enable users to analyze data from multiple dimensions. This
provides users with a more complete picture of their data and enables them to identify trends and patterns
that may not be apparent in a two-dimensional view of the data.
3. Complex calculations: OLAP systems enable users to perform complex calculations, such as averages,
percentages, or trends. This enables users to gain deeper insights into their data and make more informed
decisions.
EXTRACT :
The extraction process of ETL involves extracting data from its original environment, which could be a variety of
sources such as relational databases, flat files, and more. However, it is important to note that the extraction
process should not disrupt the normal operations of the production environment. Specific tools are required that
can access production databases and perform queries on heterogeneous databases.
Approaches :
1. Push: The logic for loading data into the staging area is located in the source system. The source system
pushes the data to the staging area whenever it has the opportunity to do so.
2. Pull: The staging system initiates the data extraction process by pulling data from the source system into
the staging area.
3. Push-Pull: This approach is a combination of push and pull methods. The source system prepares the data
to be sent and notifies the staging system when it is ready. The staging system then retrieves the data. If the
source system is busy, the staging system will request the data at a later time.
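The push-pull approach can be sketched as a small handshake between two objects. The classes SourceSystem and StagingArea, their methods, and the busy flag are hypothetical names chosen for this illustration; in practice the notification would travel over a message queue, a file drop, or a similar mechanism.

```python
class SourceSystem:
    """Hypothetical source that prepares a batch and lets staging fetch it."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._prepared = None
        self.busy = False

    def prepare_batch(self):
        """Push half: get the data ready; the return value stands in for the notification."""
        self._prepared = list(self._rows)
        return True

    def fetch_batch(self):
        """Pull half: hand over the prepared batch, or None if busy or nothing is ready."""
        if self.busy or self._prepared is None:
            return None
        batch, self._prepared = self._prepared, None
        return batch


class StagingArea:
    """Hypothetical staging side that retrieves data once it has been notified."""

    def __init__(self):
        self.rows = []
        self.pending = False

    def on_ready(self, source):
        """Try to retrieve the batch; remember to retry later if the source is busy."""
        batch = source.fetch_batch()
        if batch is None:
            self.pending = True
            return False
        self.rows.extend(batch)
        self.pending = False
        return True


source = SourceSystem([{"id": 1}, {"id": 2}])
staging = StagingArea()
source.prepare_batch()
source.busy = True
first_try = staging.on_ready(source)   # source is busy, so staging will retry later
source.busy = False
second_try = staging.on_ready(source)  # the retry succeeds
print(first_try, second_try, len(staging.rows))
```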
Types of Extractions :
1. Full extraction:
Full extraction involves extracting all of the data from the source system, which can be time-consuming and
resource-intensive.
Full extraction is often used when there is no existing data in the target system or when the source system
has undergone significant changes.
Full extraction can result in duplicate data if the same data is extracted multiple times, so care must be
taken to avoid this.
Full extraction can be scheduled to run during off-peak hours to minimize the impact on the source
system’s performance.
The first run of an ETL process is typically a full extraction of data from the source system.
2. Incremental extraction:
Incremental extraction involves extracting only the data that has changed since the last extraction.
Incremental extraction is faster and less resource-intensive than full extraction since it only extracts a
subset of the data.
Incremental extraction is often used for systems with large amounts of data or those with limited bandwidth
between source and target systems.
Incremental extraction requires a way to identify the changed data, typically by using a timestamp or a
sequence number.
Log-based change data capture (CDC) reads the source system's transaction logs to identify and extract only
changed data.
Triggers on the source system can capture data changes in real time.
Change tracking or change tables on the source system record changes, so that only those that have occurred
since the last extraction are extracted.
2b. Batch extraction:
Use date and time stamps on records to identify and extract only those that have been updated since the
last extraction.
Compare checksums or hash values on records to identify and extract only those that have changed.
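Two of the techniques above, timestamp-based and hash-based change detection, can be sketched as follows. The row layout (id, name, updated_at), the helper names, and the sample data are assumptions for this example; the timestamp comparison relies on all values being ISO 8601 strings in the same format, which sort in chronological order.

```python
import hashlib
import json


def extract_since(rows, last_extracted_at):
    """Timestamp-based: keep only rows updated after the previous extraction."""
    return [r for r in rows if r["updated_at"] > last_extracted_at]


def row_hash(row):
    """Stable hash of a row's content (keys sorted so the order does not matter)."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_by_hash(rows, previous_hashes):
    """Hash-based: keep rows that are new or whose hash differs from the stored one."""
    return [r for r in rows if previous_hashes.get(r["id"]) != row_hash(r)]


source_rows = [
    {"id": 1, "name": "Alice", "updated_at": "2024-01-01T10:00:00"},
    {"id": 2, "name": "Bob", "updated_at": "2024-01-03T09:30:00"},
]

# Only row 2 was updated after the last extraction time.
recent = extract_since(source_rows, "2024-01-02T00:00:00")

# Store the hashes from this run, then simulate a change the timestamp missed.
previous_hashes = {r["id"]: row_hash(r) for r in source_rows}
source_rows[0]["name"] = "Alicia"
changed = changed_by_hash(source_rows, previous_hashes)

print([r["id"] for r in recent], [r["id"] for r in changed])
```

The hash comparison catches changes even when the source does not maintain a reliable timestamp, at the cost of reading every source row on each run.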
TRANSFORM
Types of transformations :
1. Format revision
2. Field decoding
3. Pre-calculation of derived values
4. Complex field splitting
5. Merging multiple fields
6. Character set conversion
7. Unit of measure conversion
8. Date conversion
9. Pre-calculation of aggregations
10. Deduplication
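Several of these transformations can be shown on a single record. The sketch below is only an illustration: the field names, the GENDER_CODES table, and the input date format (day/month/year) are assumptions, and it expects full_name to contain at least a first and a last name.

```python
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table for field decoding
LB_TO_KG = 0.45359237


def transform(record):
    """Apply a few of the listed transformations to one source record."""
    first_name, last_name = record["full_name"].strip().split(" ", 1)  # field splitting
    order_date = datetime.strptime(record["order_date"], "%d/%m/%Y").date()
    return {
        "id": record["id"],
        "first_name": first_name,
        "last_name": last_name,
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),   # field decoding
        "weight_kg": round(record["weight_lb"] * LB_TO_KG, 2),     # unit of measure conversion
        "order_date": order_date.isoformat(),                      # date conversion
        "total": record["quantity"] * record["unit_price"],        # derived value
    }


def deduplicate(records, key):
    """Keep the first record seen for each key value."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


raw = [
    {"id": 1, "full_name": "Ada Lovelace", "gender": "F", "weight_lb": 10,
     "order_date": "05/03/2024", "quantity": 3, "unit_price": 2.5},
    {"id": 1, "full_name": "Ada Lovelace", "gender": "F", "weight_lb": 10,
     "order_date": "05/03/2024", "quantity": 3, "unit_price": 2.5},
]
clean = [transform(r) for r in deduplicate(raw, "id")]
print(clean)
```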
Slowly Changing Dimensions (SCDs) are dimensions in a data warehouse whose attributes change slowly over time. For
example, a customer’s address or marital status can change over time, but these changes happen infrequently.
Types of SCDs :
1. Type 1: Overwrites the old data with new data, without keeping a record of the previous value. This
method is appropriate when history is not important, and only the latest information is required.
2. Type 2: Adds a new row with the updated data, while keeping the old row as well. This method allows
keeping track of the history of the data, but it can create a large number of rows in the table.
3. Type 3: Adds a column to the table that holds the previous value of the changed attribute. This method only
keeps track of limited historical data (usually one prior value), but it is simpler and easier to manage than type 2.
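The three types can be compared on a customer who moves from Paris to Lyon. This is a sketch under stated assumptions: the dimension is held as a list of dictionaries, and the valid_from, valid_to, and previous_city columns are a common convention chosen here for illustration, not something the notes prescribe.

```python
from datetime import date


def apply_type1(rows, customer_id, new_city):
    """Type 1: overwrite in place; the previous value is lost."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def apply_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = change_date
    rows.append(
        {"customer_id": customer_id, "city": new_city,
         "valid_from": change_date, "valid_to": None}
    )


def apply_type3(rows, customer_id, new_city):
    """Type 3: keep only the previous value, in a dedicated column."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city


t1 = [{"customer_id": 7, "city": "Paris"}]
apply_type1(t1, 7, "Lyon")

t2 = [{"customer_id": 7, "city": "Paris", "valid_from": date(2020, 1, 1), "valid_to": None}]
apply_type2(t2, 7, "Lyon", date(2024, 6, 1))

t3 = [{"customer_id": 7, "city": "Paris", "previous_city": None}]
apply_type3(t3, 7, "Lyon")

print(t1, t2, t3, sep="\n")
```

After the change, the type 1 table has one row with no trace of Paris, the type 2 table has two rows (the closed Paris row and the current Lyon row), and the type 3 table has one row that remembers Paris only as the previous city.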
LOAD :
The Load stage is the third stage of the ETL process, where the transformed data is loaded into the data
warehouse. This process involves mapping the data to the target structure, applying any necessary data quality
checks, and writing the data into the target database.
Types :
1. Initial load: The first time the ETL process runs, it extracts and loads all data from the source systems into
the target data warehouse.
2. Incremental load: After the initial load, the ETL process runs periodically to extract only the new or
changed data from the source systems and loads it into the target data warehouse.
3. Complete reload: In some cases, it may be necessary to reload all data into the target data warehouse, for
example, when major changes have been made to the ETL process or the data model.
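The three load types can be sketched against an in-memory SQLite table. The table name dw_customer, its columns, and the sample rows are hypothetical; the incremental load uses SQLite's INSERT OR REPLACE as one simple way to insert new rows and overwrite existing ones with the same key.

```python
import sqlite3

INSERT_SQL = "INSERT INTO dw_customer (id, name) VALUES (?, ?)"


def initial_load(conn, rows):
    """Initial load: create the target table and insert every source row."""
    conn.execute("CREATE TABLE IF NOT EXISTS dw_customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(INSERT_SQL, rows)


def incremental_load(conn, changed_rows):
    """Incremental load: insert new rows and replace rows whose key already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO dw_customer (id, name) VALUES (?, ?)", changed_rows
    )


def complete_reload(conn, rows):
    """Complete reload: empty the target table, then load everything again."""
    conn.execute("DELETE FROM dw_customer")
    conn.executemany(INSERT_SQL, rows)


def snapshot(conn):
    return conn.execute("SELECT id, name FROM dw_customer ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
initial_load(conn, [(1, "Alice"), (2, "Bob")])
incremental_load(conn, [(2, "Robert"), (3, "Carol")])  # one changed row, one new row
after_incremental = snapshot(conn)
complete_reload(conn, [(1, "Alice"), (3, "Carol")])
after_reload = snapshot(conn)
print(after_incremental, after_reload)
```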
Data warehouse (DWH)
A data warehouse is a large, centralized repository of data that is specifically designed for reporting and analysis.
It is a system that is used to consolidate data from various sources, transform it into a consistent format, and
store it in a way that allows users to easily access and analyze it.
Data characteristics :
Thematic or subject-oriented, meaning it contains relevant data for a specific topic or theme.
Integrated, meaning it combines data from different sources, even if they are heterogeneous.
Historical, meaning it represents the activity of a company over a certain period of time, often several years.
Non-volatile, meaning that once data is loaded it is used for querying and is not modified or deleted by users.
Architectures :
A data mart is a subset of a larger data warehouse that is specific to a particular business unit or
department.
In this architecture, each department or business unit has its own independent data mart.
Data marts are typically easier and quicker to implement than a full data warehouse, but they can result in
data redundancy and inconsistencies if there is not proper coordination between them.
3. Hub-and-spoke architecture:
This architecture involves a single, centralized data warehouse that stores all enterprise-wide data.
Data is extracted from various source systems and loaded into the central warehouse.
This architecture provides a consistent and integrated view of enterprise data, but it can be complex and
costly to implement and maintain.