
SNJB’s KBJ College of Engineering Chandwad

Department of Computer Engineering


Audit Course Report on
410259: AC6-1 Business Intelligence

Submitted to Savitribai Phule Pune University, Pune
12th April 2021

For BE Computer Engineering Students


Submitted by,
Name: Ahire Nikita Kashinath
Roll No: 11
University Seat No:
Syllabus
The course aims at examining Business Intelligence (BI) as a broad
category of applications and technologies for gathering, storing,
analyzing, sharing and providing access to data to help enterprise users
make better managerial decisions.
Course Objectives:
 To understand the concept of Business Intelligence
 To know the details of Decision Support System
 To inculcate the concepts of Data Warehousing
 To understand the basics of design and management of BI systems

Course Outcome:
On completion of the course, learner will be able to–
 Apply the concepts of Business Intelligence in real world applications
 Explore and use data warehousing wherever necessary
 Design and manage practical BI systems

Course Contents:

1. Concepts with Mathematical Treatment: Introduction to data,
information and knowledge, Decision Support System, Theory of
Operational data and informational data, Introduction to Business
Intelligence, Determining BI Cycle, BI Environment and Architecture,
Identify BI opportunities, Benefits of BI. Role of Mathematical model in BI,
Factors Responsible for successful BI Project, Obstacle to Business
Intelligence in an Organization

2. Decision Making Concepts: Concepts of Decision Making, Techniques


of Decision Support System (DSS), Development of Decision Support
System (DSS), Applications of DSS, Role of Business Intelligence in DSS.

3. Data Warehouse: Introduction: Data warehouse modeling, data
warehouse design, data warehouse technology, Distributed data
warehouse, and materialized view

4. Data Pre-processing and Outliers: Data Analytics life cycle, Discovery,


Data preparation, Preprocessing requirements, data cleaning, data
integration, data reduction, data transformation, Data discretization, and
concept hierarchy generation, Model Planning, Model building,
Communicating Results and Findings, Operationalizing, Introduction to
OLAP. Real-world Applications, types of outliers, outlier challenges, Outlier
detection Methods, Proximity-Based Outlier analysis, Clustering Based
Outlier analysis.

5. Designing and Managing BI Systems: Determining infrastructure
requirements, planning for scalability and availability, managing and
maintenance of BI systems, managing BI operations or business continuity.
Chapter 1
Concepts with Mathematical Treatment

Data:
Data is a mere representation, without intrinsic meaning of its own. The main purpose of
data is to record activities or situations, to attempt to capture the true picture or real
event. Therefore, all data are historical, unless used for illustrative purposes, such as
forecasting.

Information:
Information is a message that contains relevant meaning, implication, or input for
decision and/or action. Information comes from both current (communication) and
historical (processed data or reconstructed picture) sources. In essence, the
purpose of information is to aid in making decisions and/or solving problems or
realizing an opportunity.

Knowledge:
Knowledge is the cognition or recognition (know-what), capacity to act (know-how),
and understanding (know-why) that resides or is contained within the mind or in the
brain. The purpose of knowledge is to better our lives. In the context of business,
the purpose of knowledge is to create or increase value for the enterprise and all its
stakeholders. In short, the ultimate purpose of knowledge is for value creation.

Decision Support System:


A decision support system (DSS) is an information system that supports business or
organizational decision-making activities. DSSs serve the management, operations and
planning levels of an organization (usually mid and higher management) and help
people make decisions about problems that may be rapidly changing and not easily
specified in advance — i.e. unstructured and semi-structured decision problems.
Decision support systems can be either fully computerized or human-powered, or a
combination of both.
While academics have perceived DSS as a tool to support decision making
processes, DSS users see DSS as a tool to facilitate organizational processes.
Some authors have extended the definition of DSS to include any system that
might support decision making, and some DSS include a decision-making software
component; Sprague (1980) defines a properly termed DSS as follows:

 DSS tends to be aimed at the less well structured, underspecified problem


that upper level managers typically face;
 DSS attempts to combine the use of models or analytic techniques with
traditional data access and retrieval functions;
 DSS specifically focuses on features which make them easy to use by non-
computer-proficient people in an interactive mode; and
 DSS emphasizes flexibility and adaptability to accommodate changes in the
environment and the decision making approach of the user.
DSSs include knowledge-based systems. A properly designed DSS is an interactive
software-based system intended to help decision makers compile useful information
from a combination of raw data, documents, and personal knowledge, or business
models to identify and solve problems and make decisions.
Typical information that a decision support application might gather and present
includes:
 inventories of information assets (including legacy and relational data sources,
cubes, data warehouses, and data marts),
 comparative sales figures between one period and the next,
 projected revenue figures based on product sales assumptions.
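As a small illustration of the last two items, the following R sketch compares sales between two periods and projects next year's revenue; all figures and the 5% growth assumption are made-up sample values, not taken from any real system.

# Hypothetical sales figures for two periods (sample data only)
sales <- data.frame(
  product   = c("A", "B", "C"),
  last_year = c(120000, 95000, 43000),
  this_year = c(131000, 90000, 55000)
)

# Comparative sales figures between one period and the next
sales$change_pct <- round(100 * (sales$this_year - sales$last_year) / sales$last_year, 1)

# Projected revenue figures based on a simple product sales assumption (5% growth)
growth_assumption <- 1.05
sales$projected_next_year <- round(sales$this_year * growth_assumption)

print(sales)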

Operational Data:
 An operational data store (or "ODS") is used for operational reporting and as a
source of data for the Enterprise Data Warehouse (EDW). It is a complementary
element to an EDW in a decision support landscape, and is used for operational
reporting, controls and decision making, as opposed to the EDW, which is used
for tactical and strategic decision support.
 An ODS is a database designed to integrate data from multiple sources for
additional operations on the data, for reporting, controls and operational
decision support. Unlike a production master data store, the data is not
passed back to operational systems. It may be passed for further operations
and to the data warehouse for reporting.
 An ODS should not be confused with an enterprise data hub (EDH). An
operational data store will take transactional data from one or more
production systems and loosely integrate it; in some respects it is still subject
oriented, integrated and time variant, but without the volatility constraints.
This integration
is mainly achieved through the use of EDW structures and content.
 An ODS is not an intrinsic part of an EDH solution, although an EDH may be
used to subsume some of the processing performed by an ODS and the EDW.
An EDH is a broker of data. An ODS is certainly not.
 Because the data originates from multiple sources, the integration often involves
cleaning, resolving redundancy and checking against business rules for integrity.
An ODS is usually designed to contain low-level or atomic (indivisible) data (such
as transactions and prices) with limited history that is captured "real time" or
"near real time" as opposed to the much greater volumes of data stored in the
data warehouse generally on a less-frequent basis.

Introduction of BI:
The purpose of Business Intelligence (BI) is to provide decision makers with the
information necessary to make informed decisions. The information is delivered by
the Business Intelligence system via reports. This chapter focuses on the architecture
and infrastructure needed to deliver the information. Architecture is a set of rules or
structures providing a framework for the overall design of a system or product (Poe
et al. 1998). The BI system includes the following parts:
 Interested parties and their respective information needs
 Input of data
 Storage of data
 Analysis of data
 Automatic and selective dissemination of information
A BI system includes the rules (architecture) providing a framework for the
organization of the technologies, platforms, databases, gateways, people and
processes. To implement an architecture the Business Intelligence architect must
implement an infrastructure. Technical infrastructures are the technologies,
platforms, databases, gateways, people and processes necessary to make the
architecture functional within the corporation (Poe et al. 1998).
In sum, decision makers need reports that deliver the information that allows them
to understand their organization and the world in order to make better decisions.
Given these functions of the BI system the most enduring definition focuses on the
system, tools, technology, process, and techniques that compose these four
elements that help decision makers understand their world. BI augments the ability
of decision makers to turn data into information by aiding in extracting data from
data sources, organizing the data based on established business knowledge and
then presenting the information in a manner that is organized in a way to be useful
to the decision maker. It merges technology with knowledge in order to provide
useful information to
management as quickly as possible.
In sum, a Business Intelligence system includes the rules (architecture) that outline
how to organize the parts of the system (infrastructure) to deliver the information
needed to thrive in a competitive market (business) or to provide the best service to
the people (government) (Poe et al. 1998). Regardless of the technology, which is
simply a set of tools, the core of a Business Intelligence system has not changed and will not change.

Determining BI Cycle:

Business Intelligence is not just a set of tools to analyze raw data to help make strategic and
operational decisions. It is a framework that offers guidance in understanding what to
look for in the volumes of disparate data. As a framework, BI is a continuous cycle of
analysis, insight, action and measurement.
Analyzing a business is based on what we know and feel to be important while
filtering out the aspects of the business not considered mission critical or
detrimental to the growth of the organization. Deciding what is important is based on
our understanding and assumptions of what is important to customers, suppliers,
competitors and employees. All of this knowledge is unique to a business and is an
incredible resource when formulating a BI strategy. However, having such granular
grassroots knowledge of the business can subconsciously limit the ability to see
patterns obvious to others. One of the benefits of BI reporting is performing ad hoc
reporting by drilling down layers of data and pivoting on the rows and columns. Such
flexibility opens up humans' inquisitive nature to ask more questions that wouldn't
necessarily be asked if such access to data wasn't available. Effective analysis
helps us understand the business better so as to challenge conventional wisdom and
assumptions as well as what is considered to be the right analysis.
Insight comes in many forms. There are operational insights, such as determining
the effect on production costs with the installation of new more energy efficient
machines
that have slightly lower production yields per unit of measure. There are strategic
insights analyzing, for example, new market opportunities by conducting research
on the barriers to entry. Insight is the intangible product of analysis developed from
asking questions that only humans can ask. Computers can be used for the
identification of patterns, but only humans can recognize what patterns are useful.
The issue with having insight is convincing others to believe or support the new
perspective in order for the insight to be useful. As in life, anything new or different
is slow to gain acceptance or credibility. Well organized business intelligence that
supports the insight by providing clear data, patterns, logic, presentation (i.e.
graphs, reports) and calculations is the driver that helps sell the new insight.
Once the analysis is done and the insight has been sold, the next process in the BI
cycle is performing the action or decision making. Well thought out decisions
backed up by good analysis and insight give confidence and courage to the
proposed action. Otherwise, decisions not supported by quality analytics are made
with overbearing safety measures or less dedication or commitment from the
stakeholders. In addition, quality business intelligence delivered quickly improves
the speed to action. Today’s organizations need to react more quickly, develop new
approaches faster, conduct more agile R&D and get products and services to
market faster than ever before. BI based decision making with faster access to
information and feedback provides more opportunity for quicker prototyping and
testing.

BI Environment and Architecture :

Business intelligence architecture refers to the infrastructure that organizations use


to define their data collection streams, information administration, and all the
technology that supports their business intelligence.
Creating a sustainable architecture depends on understanding the different components
that are involved with developing successful business intelligence tools. The process is
broadly divided into three areas: data collection, information management, and business
intelligence.
The first area refers to the different channels and methods of collecting data from
activities carried out within your organization. This includes understanding which data
different users need to meet their requirements, as well as a clear idea of the quality,
type, and currency of the data. This step is vital for adding value as the right data
produces the best BI insights.
The second major component is data management. This covers various aspects of
integrating data, scrubbing datasets, and fashioning the overall structures that will
house and administer data.
Finally, business intelligence is the part of an organization's architecture that analyzes
properly organized data sets to produce insights. This area involves using real-time
analytics, data visualizations, and other BI tools.

Benefits of BI:
 Faster reporting, analysis or planning

 More accurate reporting, analysis or planning

 Better business decisions

 Improved data quality

 Improved employee satisfaction

 Improved operational efficiency

 Improved customer satisfaction

 Increased competitive advantage

 Reduced costs

 Increased revenues

 Saved headcount
Factors Responsible for successful BI Project:
 Don't boil the ocean.

 Focus on ease of use.

 Performance is key.

 Pick your technology carefully.

 Understand where the cost is.

 Nominate advocates.

 Reconcile, reconcile, reconcile.

 Ensure executive involvement.

Obstacle to Business Intelligence in an Organization:


 Lack of BI Strategy

 Business Intelligence When You Don't Know How to Code

 Lack of Training & Execution

 Lack of BI Impact (Low utilization)

 Business Intelligence with Unstructured Data

 Installation and deployment


Chapter 2

Decision Making Concepts

Concepts of Decision Making:

Decision-making is the act of making a choice among available alternatives. There are
innumerable decisions that are taken by human beings in day-to-day life. In business
undertakings, decisions are taken at every step. It is also regarded as one of the
important functions of management. Managerial functions like planning, organizing,
staffing, directing, coordinating and controlling are carried out through decisions.
Decision making is possible when there are two or more alternatives to solve a single
problem or difficulty. If there is only one alternative then there is no question of decision
making. It is believed that management without decision making is like a man without a
backbone. Therefore, decision making is a problem-solving approach of choosing a
specific course of action among various alternatives.
"Decision-making is the selection, based on some criteria from two or more possible
alternatives. “- George R.Terry
"A decision can be defined as a course of action consciously chosen from available
alternatives for the purpose of the desired result" -J.L. Massie
In conclusion, we can say that decision making is the process of choosing a specific
course of action from various alternatives to solve the organizational problems or
difficulties.

Importance of Decision Making:


Decision making is considered the backbone of business management because
without taking the right decision at the right time, nothing can be performed. The further
importance of decision making can be discussed under the following points:

 Proper utilization of resources

 Selecting the best alternative

 Evaluation of the managerial performance

 Employees' motivation

 Indispensable element/ component

 Achievement of goal/ objectives

 Pervasive function

Steps of Decision Making Process:


For the rationality, reliability, and enforceability of decisions, managers should follow a
sequential set of steps. It is said that a decision is rational if appropriate means are
chosen to reach desired ends. In this regard, various management authorities have
recognized and described different steps in the process of decision-making. Ricky W.
Griffin has suggested six steps in the process of decision making. Accordingly, the
steps are:

 Identification of problem

 Analysis of problem

 Developing an alternative course of action

 Evaluating alternative courses of action

 Selecting the best alternative

 Implementation and follow-up of the decision

Development of Decision Support System (DSS):


Like other systems, DSS require a structured development approach. Such a
framework includes people, technology, and the development approach.
The Early Framework of Decision Support System consists of four phases:
 Intelligence — Searching for conditions that call for decision;

 Design — Developing and analyzing possible alternative actions of solution;

 Choice — Selecting a course of action among those;

 Implementation — Adopting the selected course of action in decision situation.


DSS technology levels (of hardware and software) may include:
 The actual application that will be used by the user. This is the part of the application
that allows the decision maker to make decisions in a particular problem area. The
user can act upon that particular problem.

 The DSS generator contains a hardware/software environment that allows people to easily
develop specific DSS applications. This level makes use of CASE tools or systems
such as Crystal.

 DSS tools include lower level hardware/software, such as special
languages, function libraries and linking modules.
An iterative developmental approach allows for the DSS to be changed and redesigned
at various intervals. Once the system is designed, it will need to be tested and revised
where necessary for the desired outcome.

Applications of DSS:
 Medical diagnosis

 Business and Management

 Agricultural Production
Chapter 3

Data Warehouse

Introduction: Data Warehouse Modeling:
Data warehousing is the process of constructing and using a data warehouse. A
data warehouse is constructed by integrating data from multiple heterogeneous
sources that support analytical reporting, structured and/or ad hoc queries, and
decision making. Data warehousing involves data cleaning, data integration, and
data consolidations.

A data warehouse is a technique for collecting and managing data from varied
sources to provide meaningful business insights. It is a blend of technologies and
components which allows the strategic use of data. A data warehouse is the electronic
storage of a large amount of information by a business.
A data warehouse is a relational database that aggregates structured data from across
an entire organization. It pulls together data from multiple sources — much of it is
typically online transaction processing (OLTP) data. The data warehouse selects,
organizes and aggregates data for efficient comparison and analysis. Data Warehousing
may be defined as a collection of corporate information and data derived from
operational systems and external data sources. A data warehouse is designed with the
purpose of inducing business decisions by allowing data consolidation, analysis, and
reporting at different aggregate levels. Data is populated into the DW by extraction,
transformation, and loading.
Data Warehousing incorporates data stores and conceptual, logical, and physical
models to support business goals and end-user information needs. Creating a DW
requires mapping data between sources and targets, then capturing the details of the
transformation in a metadata repository. The data warehouse provides a single,
comprehensive source of current and historical information.

How It Works:
Data Warehousing combines information collected from multiple sources into one
comprehensive database. To cite an example from the business world, a data
warehouse might incorporate customer information from a company's point-of-sale
systems (the cash registers), its website, its mailing lists, and its comment cards. Data
Warehousing may also include confidential information such as employee details, salary
information, etc.
Companies use this information to analyze their customers. Data warehousing is also
related to data mining, which means looking for meaningful patterns in huge data
volumes and devising newer strategies for higher sales and profits.

Introduction to Data Warehousing Types:


The three main types of Data Warehouses are:
 Enterprise Data Warehouse
 Operational Data Store
 Data Mart
Enterprise Data Warehouse: Enterprise Data Warehouse is a centralized warehouse,
which provides decision support service across the enterprise. It offers a unified
approach to organizing and representing data. It also provides the ability to classify data
according to the subject and give access according to those divisions.
Operational Data Store: An Operational Data Store, also called an ODS, is a data store
required when neither the Data warehouse nor OLTP systems support an organization's
reporting needs. In an ODS, data is refreshed in real time. Hence, it is
widely preferred for routine activities like storing records of employees.
Data Mart: A Data Mart is a subset of the data warehouse. It is specially designed for
specific segments such as sales or finance. In an independent data mart,
data can be collected directly from sources.
Data Warehouses and data marts are mostly built on dimensional data modeling
where fact tables relate to dimension tables. This is useful for users to access data
since a database can be visualized as a cube of several dimensions. A data
warehouse allows a user to slice the cube along each of its dimensions.
Data warehouse design:
Data Warehouse design approaches are very important aspect of building data
warehouse. Selection of the right data warehouse design could save a lot of time and project
cost.
There are two different Data Warehouse Design Approaches normally followed
when designing a Data Warehouse solution and based on the requirements of your
project you can choose which one suits your particular scenario. These
methodologies are a result of research from Bill Inmon and Ralph Kimball.

Bottom-up design:
In the bottom-up approach, data marts are first created to provide reporting and
analytical capabilities for specific business processes. These data marts can then
be integrated to create a comprehensive data warehouse. The data warehouse bus
architecture is primarily an implementation of "the bus", a collection of conformed
dimensions and conformed facts, which are dimensions that are shared (in a
specific way) between facts in two or more data marts. Ralph Kimball is a renowned
author on the subject of data warehousing. His data warehouse design approach is
called dimensional modelling or the Kimball methodology. This methodology follows
the bottom-up approach.
As per this method, data marts are first created to provide the reporting and
analytics capability for specific business processes; later, the enterprise data
warehouse is created from these data marts.
Basically, the Kimball model reverses the Inmon model, i.e. data marts are directly loaded
with the data from the source systems and then an ETL process is used to load the
Data Warehouse.
Below are the steps that are involved in the bottom-up approach:
 The data flow in the bottom-up approach starts with extraction of data from
the various source systems into the staging area, where it is processed and loaded
into the data marts that handle specific business processes.
 After the data marts are refreshed, the current data is once again extracted into the
staging area and transformations are applied to fit the data mart
structure. The data extracted from the data marts to the staging area is
aggregated, summarized and so on, loaded into the EDW, and then made
available to the end user for analysis, enabling critical business decisions.

Top-down design: The top-down approach is designed using a normalized enterprise


data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the
data warehouse. Dimensional data marts containing data needed for specific business
processes or specific departments are created from the data warehouse. “Bill Inmon” is
sometimes also referred to as the “father of data warehousing”; his design methodology
is based on a top-down approach. In the top-down approach, the data warehouse is
designed first and then data marts are built on top of the data warehouse.
Below are the steps that are involved in the top-down approach:
 Data is extracted from the various source systems. The extracts are loaded
and validated in the staging area. Validation is required to make sure the
extracted data is accurate and correct. You can use ETL tools or an ETL
approach to extract and push data to the data warehouse.
 Data is extracted from the data warehouse on a regular basis into the staging area. At
this step, various aggregation and summarization techniques are applied to the
extracted data, which is then loaded back into the data warehouse.
 Once the aggregation and summarization is completed, the various data marts
extract that data and apply some more transformations to shape the data into the
structure defined by the data marts.

Hybrid design: Data warehouses (DW) often resemble the hub and spokes
architecture. Legacy systems feeding the warehouse often include customer
relationship management and enterprise resource planning, generating large
amounts of data. To consolidate these various data models, and facilitate the
extract transform load process, data warehouses often make use of an operational
data store, the information from which is parsed into the actual DW. To reduce data
redundancy, larger systems often store the data in a normalized way. Data marts for
specific reports can then be built on top of the DW.
Distributed data warehouse:
Most organizations build and maintain a single
centralized data warehouse environment. This setup makes sense for many
reasons: the volume of data in the data warehouse is such that a single centralized
repository of data makes sense.

Data warehousing is one of the major research topics of applied-side database


investigators. Most of the work to date has focused on building large centralized
systems that are integrated repositories founded on pre-existing systems upon
which all corporate-wide data are based. Unfortunately, this approach is very
expensive and tends to ignore the advantages realized during the past decade in
the area of distribution and support for data localization in a geographically
dispersed corporate structure. Research in this area investigates building distributed data
warehouses with particular emphasis placed on distribution design for the data
warehouse environment. Such research contributes an architectural
model for a distributed data warehouse, the formal definition of the relational data
model for data warehouse, a methodology for distributed data warehouse design
along with two approaches to horizontally fragment the huge fact relation in the data
warehouse and an analytical technique that provides insight into which approach
presented is most suited to a particular distributed data warehouse environment.

Data Warehousing with Materialized Views:


An enterprise data warehouse contains historical detailed data about the organization.
Typically, data flows from one or more online transaction processing (OLTP) databases
into the data warehouse on a monthly, weekly, or daily basis. The data is usually
processed in a staging file before being added to the data warehouse. Data warehouses
typically range in size from tens of gigabytes to a few terabytes, usually with the vast
majority of the data stored in a few very large fact tables.
A data mart contains a subset of corporate data that is of value to a specific business
unit, department, or set of users. Typically, a data mart is derived from an enterprise
data warehouse.
One of the techniques employed in data warehouses to improve performance is the
creation of summaries, or aggregates. They are a special kind of aggregate view
which improves query execution times by precalculating expensive joins and
aggregation operations prior to execution, and storing the results in a table in the
database. For example, a table may be created which would contain the sum of
sales by region and by product.
Today, organizations using summaries spend a significant amount of time manually
creating summaries, identifying which ones to create, indexing the summaries,
updating them, and advising their users on which ones to use. The introduction of
summary management in the Oracle server changes the workload of the DBA
dramatically and means the end-user no longer has to be aware of which
summaries have been defined.
The DBA creates one or more materialized views, which are the equivalent of a
summary. The end-user queries the tables and views in the database and the query
rewrite mechanism in the Oracle server automatically rewrites the SQL query to use
the summary tables. This results in a significant improvement in response time for
returning results from the query and eliminates the need for the end-user or
database application to be aware of the summaries that exist within the data
warehouse.
Although summaries are usually accessed indirectly via the query rewrite
mechanism, an end-user or database application can construct queries which
directly access the summaries. However, serious consideration should be given to
whether users should be allowed to do this because, once the summaries are
directly referenced in queries, the DBA will not be free to drop and create
summaries without affecting applications.
The summaries or aggregates that are referred to in this book and in literature on
data warehousing are created in Oracle using a schema object called a materialized
view. Materialized views can be used to perform a number of roles, such as
improving query performance or providing replicated data, as described below.

Materialized Views for Data Warehouses:


In data warehouses, materialized views can be used to precompute and store
aggregated data such as sum of sales. Materialized views in these environments are
typically referred to as summaries since they store summarized data. They can also be
used to precompute joins with or without aggregations. So a materialized view is used
to eliminate overhead associated with expensive joins or aggregations for a large or
important class of queries.
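As a rough illustration of the idea (not Oracle's materialized view syntax itself), the R sketch below precomputes the sum of sales by region and product from a hypothetical fact table, so that later queries can read the small stored summary instead of scanning every transaction; all names and values are illustrative assumptions.

# Hypothetical fact table of individual sales transactions
sales_fact <- data.frame(
  region  = c("North", "North", "South", "South", "South"),
  product = c("P1", "P2", "P1", "P1", "P2"),
  amount  = c(100, 250, 75, 125, 300)
)

# Precompute and store the aggregate once, as a summary/materialized view would
sales_summary <- aggregate(amount ~ region + product, data = sales_fact, FUN = sum)

# Later queries read the small summary table instead of the large fact table
subset(sales_summary, region == "North")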
Chapter 4

Data Pre-processing and Outliers

Data Analytics Life Cycle:

Phase 1—Discovery: In Phase 1, the team learns the business domain, including
relevant history such as whether the organization or business unit has attempted
similar projects in the past from which they can learn. The team assesses the
resources available to support the project in terms of people, technology, time, and
data. Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.

Phase 2 — Data preparation: Phase 2 requires the presence of an analytic sandbox, in


which the team can work with data and perform analytics for the duration of the project.
The team needs to execute extract, load, and transform (ELT) or extract, transform and
load (ETL) to get data into the sandbox. The ELT and ETL are sometimes abbreviated
as ETLT. Data should be transformed in the ETLT process so the team can work with it
and analyze it. In this phase, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data

Phase 3 — Model planning: Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for the subsequent model
building phase. The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable models.

Phase 4 — Model building: In Phase 4, the team develops datasets for testing, training,
and production purposes. In addition, in this phase the team builds and executes
models based on the work done in the model planning phase. The team also considers
whether its existing tools will suffice for running the models, or if it will need a more
robust environment for executing models and workflows (for example, fast hardware and
parallel processing, if applicable).

Phase 5 — Communicate results: In Phase 5, the team, in collaboration with major


stakeholders, determines if the results of the project are a success or a failure based on
the criteria developed in Phase 1. The team should identify key findings, quantify the
business value, and develop a narrative to summarize and convey findings to
stakeholders.

Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code,
and technical documents. In addition, the team may run a pilot project to implement the
models in a production environment.

Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or
lacking in certain behaviors or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues. Data preprocessing
prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as customer relationship
management and rule-based applications (like neural networks).

Steps in Data Preprocessing:
1. Data Cleaning: Data cleaning, also called data cleansing or scrubbing,
fills in missing values, smooths noisy data, identifies or removes outliers, and
resolves inconsistencies. Data cleaning is required because source systems
contain “dirty data” that must be cleaned.

Steps in Data cleaning:


Parsing:
Parsing locates and identifies individual data elements in the source files and then
isolates these data elements in the target files.

Example includes parsing the first, middle and the last name.

Correcting:
Correct parsed individual data components using sophisticated data algorithms and
secondary data sources.

Example includes replacing a vanity address and adding a zip code.

Standardizing:
Standardizing applies conversion routines to transform data into its preferred and
consistent format using both standard and custom business rules.

Examples include adding a pre name, replacing a nickname.

Matching:
Searching and matching records within and across the parsed, corrected and
standardized data based on predefined business rules to eliminate duplications.

Examples include identifying similar names and addresses.

Consolidating:
Analyzing and identifying relationships between matched records
and consolidating/merging them into one representation.

Data cleansing:
It must deal with many types of possible errors.
These include missing data and incorrect data at one source.
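A minimal R sketch of handling these two error types on a made-up customer table; the column names and the rule that age must be positive are illustrative assumptions.

# Small sample table with one missing value and one incorrect value
customers <- data.frame(
  name = c("A. Sharma", "B. Patel", "C. Rao"),
  age  = c(34, NA, -5)          # NA = missing data, -5 = incorrect data
)

# Fill the missing value with the mean of the valid ages
valid_ages <- customers$age[!is.na(customers$age) & customers$age > 0]
customers$age[is.na(customers$age)] <- round(mean(valid_ages))

# Remove the incorrect value that violates the business rule (age must be positive)
customers <- customers[customers$age > 0, ]
print(customers)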

Data Staging:
Accumulates data from asynchronous sources.

At a predefined cutoff time, data in the staging file is transformed and loaded to the
warehouse.

There is usually no end user access to the staging file.

An operational data store may be used for data staging.

2. Data integration and Transformation:


Data integration: Combines data from multiple sources into a coherent data store e.g.
data warehouse. Sources may include multiple databases, data cubes or data files.

Issues in data integration:

Schema integration:

Integrate metadata from different sources.

Entity identification problem: identify real world entities from multiple data sources,
e.g., A.cust-id = B.cust#.

Detecting and resolving data value conflicts:

For the same real world entity, attribute values from different sources are different.
Possible reasons: different representations, different scales. Redundant data often
occur when integrating multiple databases:

The same attribute may have different names in different databases.


Data Transformation:

Transformation process deals with rectifying any inconsistency (if any).

One of the most common transformation issues is Attribute Naming Inconsistency. It is


common for the given data element to be referred to by different data names in different
databases. For example, Employee Name may be EMP_NAME in one database and
ENAME in the other.

Thus one set of data names is picked and used consistently in the data
warehouse.

Once all the data elements have the right names, they must be converted to common
formats.
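A minimal R sketch of this step, assuming two hypothetical source extracts that use the EMP_NAME and ENAME names from the example above (all other details are made up):

# Two source extracts that refer to the same element by different names
hr_db      <- data.frame(EMP_ID = c(1, 2), EMP_NAME = c("a. sharma", "b. patel"))
payroll_db <- data.frame(EMP_ID = c(1, 2), ENAME    = c("A. SHARMA", "B. PATEL"))

# Pick one data name (employee_name) and use it consistently
names(hr_db)[names(hr_db) == "EMP_NAME"]        <- "employee_name"
names(payroll_db)[names(payroll_db) == "ENAME"] <- "employee_name"

# Convert the values to a common format (upper case here)
hr_db$employee_name      <- toupper(hr_db$employee_name)
payroll_db$employee_name <- toupper(payroll_db$employee_name)

# The sources can now be integrated on the common attribute
merge(hr_db, payroll_db, by = c("EMP_ID", "employee_name"))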

3. Data Reduction:
Data reduction obtains a reduced representation of the data, smaller in volume, that
produces the same or similar analytical results.

Need for data reduction:

Reducing the number of attributes

Reducing the number of attribute values

Reducing the number of tuples
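A minimal R sketch of these three reduction targets on a hypothetical data frame; the column names, the three sales bands and the 10% sample size are illustrative choices, not fixed rules.

set.seed(1)
# Hypothetical table with a constant column that carries no information
df <- data.frame(
  id      = 1:1000,
  sales   = rnorm(1000, mean = 100, sd = 20),
  region  = sample(c("N", "S"), 1000, replace = TRUE),
  country = rep("India", 1000)
)

# Reduce the number of attributes: drop columns with a single distinct value
df <- df[, sapply(df, function(col) length(unique(col)) > 1)]

# Reduce the number of attribute values: discretize sales into a few bands
df$sales_band <- cut(df$sales, breaks = 3, labels = c("low", "medium", "high"))

# Reduce the number of tuples: keep a 10% random sample of the rows
df_sample <- df[sample(nrow(df), size = 0.1 * nrow(df)), ]
nrow(df_sample)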

Introduction to OLAP
OLAP allows business users to slice and dice data at will. Normally, data in an
organization is distributed across multiple data sources that are incompatible with each
other. A retail example: point-of-sale data and sales made via the call center or the
Web are stored in different locations and formats. It would be a time consuming process
for an executive to obtain OLAP reports such as: What are the most popular
products purchased by customers between the ages 15 to 30? Part of the OLAP
implementation process involves extracting data from the various data repositories and
making them compatible. Making data compatible involves ensuring that the meaning of the
data in one repository matches all other repositories. An example of incompatible
data: customer ages can be stored as birth date for purchases made over the web
and stored as age categories (i.e. between 15 and 30) for in-store sales.
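As a toy illustration of the question above, the R sketch below slices a hypothetical, already-integrated sales table on the customer-age dimension and then aggregates by product; a real OLAP tool would run this against a cube or warehouse rather than an in-memory data frame, and all values here are made up.

# Hypothetical sales records after the sources have been made compatible
sales <- data.frame(
  product      = c("Phone", "Laptop", "Phone", "Headset", "Laptop", "Phone"),
  customer_age = c(18, 45, 25, 22, 31, 29),
  quantity     = c(1, 1, 2, 1, 1, 1)
)

# Slice: keep only customers between the ages 15 and 30
young <- subset(sales, customer_age >= 15 & customer_age <= 30)

# Aggregate: most popular products within that slice
popularity <- aggregate(quantity ~ product, data = young, FUN = sum)
popularity[order(-popularity$quantity), ]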

It is not always necessary to create a data warehouse for OLAP analysis. Data stored
by operational systems, such as point-of-sale systems, is kept in databases called OLTPs.
OLTP (Online Transaction Processing) databases are not structurally different from
any other databases. The main, and only, difference is the way in which data is
stored.

Examples of OLTPs can include ERP, CRM, SCM, Point-of-Sale applications, Call
Center.

OLTPs are designed for optimal transaction speed. When a consumer makes a
purchase online, they expect the transaction to occur instantaneously. With a database
design (called data modeling) optimized for transactions, the record 'Consumer name,
Address, Telephone, Order Number, Order Name, Price, Payment Method' is created
quickly on the database and the results can be recalled by managers equally quickly if
needed.
Types Of Outliers:

Type 1: Global Outliers:
A data point is considered a global outlier if its value is far outside the entirety of the
data set in which it is found (similar to how “global variables” in a computer program can
be accessed by any function in the program).

Type 2: Contextual (Conditional) Outliers:
A data point is considered a contextual outlier if its value significantly deviates from
the rest of the data points in the same context. Note that this means the same
value may not be considered an outlier if it occurred in a different context. If we limit
our discussion to time series data, the “context” is almost always temporal,
because time series data are records of a specific quantity over time. It’s no surprise
then that contextual outliers are common in time series data.

Contextual Anomaly: Values are not outside the normal global range, but are abnormal
compared to the seasonal pattern.

Type 3: Collective Outliers:
A subset of data points within a data set is considered anomalous if those values as
a collection deviate significantly from the entire data set, but the values of the
individual data points are not themselves anomalous in either a contextual or global
sense. In time series data, one way this can manifest is as normal peaks and
valleys occurring outside of a time frame when that seasonal sequence is normal or
as a combination of time series that is in an outlier state as a group.

Collective Anomaly: In the example, two time series that were discovered to be related
to each other, are combined into a single anomaly. For each time series the individual
behavior does not deviate significantly from the normal range, but the combined
anomaly indicated a bigger issue.
Outlier detection Methods
Z-Score

The z-score or standard score of an observation is a metric that indicates how many
standard deviations a data point is from the sample's mean, assuming a Gaussian
distribution. This makes the z-score a parametric method. Very frequently, data points
are not described by a Gaussian distribution; this problem can sometimes be addressed by
applying transformations to the data, i.e. scaling it.

Some Python libraries like SciPy and scikit-learn have easy-to-use functions and
classes for an easy implementation, along with pandas and NumPy.

After making the appropriate transformations to the selected feature space of the
dataset, the z-score of any data point can be calculated with the following
expression:

z = (x - μ) / σ, where μ is the sample mean and σ is the sample standard deviation.

When computing the z-score for each sample in the data set, a threshold must be
specified. Some good rule-of-thumb thresholds can be 2.5, 3, 3.5 or more standard
deviations.
By tagging or removing the data points that lay beyond a given threshold we are
classifying data into outliers and not outliers.

The z-score is a simple yet powerful method to get rid of outliers in data if you are
dealing with parametric distributions in a low dimensional feature space. For
nonparametric problems, DBSCAN and Isolation Forests can be good solutions.
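A minimal R sketch of the z-score method on made-up one-dimensional data, using the rule-of-thumb threshold of 3 mentioned above:

set.seed(42)
# Thirty well-behaved observations plus one extreme value
x <- c(rnorm(30, mean = 50, sd = 5), 120)

# z-score: how many standard deviations each point lies from the sample mean
z <- (x - mean(x)) / sd(x)

# Tag points beyond the chosen threshold as outliers
threshold <- 3
x[abs(z) > threshold]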

Chapter 5

Designing and Managing BI Systems
1. End-User Experience

These are the core capabilities for all the end users of your application. Business
intelligence requirements in this category may include dashboards and reports as well
as the interactive and analytical functions users can perform.

Ideally, such self-service capabilities let users answer their own questions without
having to involve IT. That frees up IT to work on more strategic projects rather than
answering ad hoc requests. It also empowers users to be more self-sufficient.

During your evaluation, make sure the capabilities important to your project are
demonstrated and understand how you will deliver and iterate upon these
capabilities inside your application.
These common self-service capabilities may affect your business intelligence
requirements:

 User personas: Increase the adoption of analytics by providing a broad range of users
with a tailored experience that matches their needs and skills.
 Presentation and information delivery: These requirements affect how you present data in
visualizations, dashboards, and reports, as well as the compatibility of your BI solution
across different devices and formats.
 Interactivity and automation: Do users need to be able to interact with your
dashboards? Consider whether you need to personalize visualizations, let users kick off
workflows, or drill down into information.
 Analysis and authoring: Empowering your users to query their own data, create
visualizations and reports, and share their findings with colleagues can add value to
your analytics application.

2. Data Environment

The BI solutions you evaluate should be compatible with your current data environment,
while at the same time have enough flexibility to meet future demands as your data
architecture evolves. These are the diverse data requirements commonly evaluated by
application providers:

 Data sources: Make sure your primary data source is supported by your BI solution.
Also look for a vendor that supports generic connectors and has flexibility through APIs
or plug-ins.
 Data management: Consider what business intelligence requirements you have for
directly querying the data source for real-time reporting, caching data, and blending
data from multiple sources. Also ask yourself if your users need to transform or
enrich data for analysis.

3. Embeddability and Customization

Depending on your organization's need for embedding analytics into an application
or existing software, a primary consideration may be the level of integration with
your application's environment. How much emphasis do you place on customization
and integration capabilities compared to standard business intelligence
implementations? Do you want to offer a seamless user experience within the
context of an existing application? Have you struggled with adopting analytics in the
past?

One way to look at embeddability is to focus on driving adoption. The deeper you
integrate analytics into the fabric of the applications your users leverage every day,
the higher your user adoption will be. For instance, if you're a salesperson and you
spend much of your time using Salesforce, you don't want to go into applications
that are siloed from your daily work. Organizations need to infuse that analytic
workflow within users' daily activities.

Consider these requirements around embeddability and customization:

 Security: It should be easy to adopt the security from your application to the
analytics content. Scrutinize vendors on the flexibility of their security models,
implementation of single sign-on, and whether data needs to be synchronized or
replicated between applications.
 Multi-tenancy: With built-in multi-tenancy, you can create a report once and deploy for
multiple customers.
 User experience: Many analytic applications need to be white-labeled to match your
brand. Look for embedding APIs to ensure visualizations are rendered in the correct
context.
 Workflow: Create an efficient experience where users can immediately take action
on what they see in any visualization or report.
 Extensibility: Choose a BI solution that supports custom coding and third-party
charting libraries to extend your application further.
4. Development and Deployment

Since time-to-value is so critical to the success of the project, having a development


environment where you can create, style, embed, deploy, and iterate on analytics will
enable your team to deliver the functionality your business demands.
These common capabilities may be part of your business intelligence requirements:
 Development: A rich set of out-of-the-box functionalities is just the beginning. Also
look for sample applications and rapid development tools to make both small
functionality changes and mass changes that affect the entire application.
 Deployment: Quickly deploy and scale an implementation that is aligned with your
current technology stack. The best solution fits into your web architecture,
minimizing the need to deploy proprietary technology, and utilizes well-known
techniques to scale the implementation.

5. Licensing, Services, and Company Expertise

Choosing the right partner is not just about the technology. It's also about finding the
level of expertise you require for training, support, and services—as well as agreeing on
the business terms that ensure shared success — to get you to the finish line, and
beyond.

These factors may affect your business intelligence requirements:

 Licensing: Terms of the license can depend on a variety of factors, including number
of users/ customers, servers, usage, and whether you are embedding into a
commercial product. Be sure the terms make business sense over the short and
long runs.
 Services: Completing your project on time may require technical support from your
BI vendor, along with professional services to get you off the ground and ongoing
training.
 Customer success: Your BI vendor needs to be dedicated to your success. Look for
one that supports onboarding, gives you dedicated account management, and
provides plenty of documentation and support offerings.
 Expertise: The best analytics vendors have experience in your specific use case or
industry.

Planning for scalability and availability


Scalability is the capability to make a software application available to those who need it
and want it. Scalable applications can help increase user adoption across an
organization, and may even increase the time users spend engaged in self-service
activities.

Scalability makes insights accessible to everyone. Dashboards and reports are available
to any number of users, not just restricted to the data analysts or executives in an
organization. In fact, many companies are embracing the concept of the extended
enterprise − sharing insights with field operations, customers, and external partners in
order to promote a common understanding of how to drive business forward.
Source code of Whatsapp Analysis using R programming:

#Load required packages

library(ggplot2)

library(lubridate)

library(scales)

library(reshape2)

library(tm)

library(SnowballC)

library(wordcloud)

library(RColorBrewer)

library(stringr)

library(syuzhet)

library(dplyr)
#get the data from whatsapp chat

text <- readLines("chat.txt")

#let us create the corpus

docs <- Corpus(VectorSource(text))

#clean our chat data

trans <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

docs <- tm_map(docs, trans, "/")

docs <- tm_map(docs, trans, "@")

docs <- tm_map(docs, trans, "\\|")

docs <- tm_map(docs, content_transformer(tolower))

docs <- tm_map(docs, removeNumbers)

docs <- tm_map(docs, removeWords, stopwords("english"))

docs <- tm_map(docs, removeWords, c("sudharsan","friendName"))

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, stripWhitespace)

docs <- tm_map(docs, stemDocument)

#create the document term matrix

dtm <- TermDocumentMatrix(docs)

mat <- as.matrix(dtm)

v <- sort(rowSums(mat),decreasing=TRUE)
#Data frame
d <- data.frame(word = names(v),freq=v)

head(d, 10)

#generate the wordcloud

set.seed(1056)

wordcloud(words = d$word, freq = d$freq, min.freq = 1,

max.words=200, random.order=FALSE, rot.per=0.35,

colors=brewer.pal(8, "Dark2"))

#fetch sentiment words from texts

Sentiment <- get_nrc_sentiment(text)

head(Sentiment)

text <- cbind(text,Sentiment)

#count the sentiment words by category

TotalSentiment <- data.frame(colSums(text[,c(2:11)]))

names(TotalSentiment) <- "count"

TotalSentiment <- cbind("sentiment" = rownames(TotalSentiment), TotalSentiment)

rownames(TotalSentiment) <- NULL

#total sentiment score of all texts

ggplot(data = TotalSentiment, aes(x = sentiment, y = count)) +

geom_bar(aes(fill = sentiment), stat = "identity") +

theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")

Source Code of Twitter Analysis using R programming:

# Load the text mining package used below and read the file
library(tm)
apple <- read.csv(file.choose(), header = T)
str(apple)

# Build corpus

corpus <- iconv(apple$text, to = "utf-8-mac")


corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])

# Clean text
corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])

corpus <- tm_map(corpus, removePunctuation)


inspect(corpus[1:5])

corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])

cleanset <- tm_map(corpus, removeWords, stopwords('english'))


inspect(cleanset[1:5])

removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)


cleanset <- tm_map(cleanset, content_transformer(removeURL))
inspect(cleanset[1:5])

cleanset <- tm_map(cleanset, removeWords, c('aapl', 'apple'))

cleanset <- tm_map(cleanset, gsub,
                   pattern = 'stocks',
                   replacement = 'stock')

cleanset <- tm_map(cleanset, stripWhitespace)


inspect(cleanset[1:5])

# Term document matrix


tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]

# Bar plot
w <- rowSums(tdm)
w <- subset(w, w>=25)
barplot(w,
las = 2,
col = rainbow(50))

# Word cloud
library(wordcloud)
w <- sort(rowSums(tdm), decreasing = TRUE)
set.seed(222)
wordcloud(words = names(w),
          freq = w,
          max.words = 150,
          random.order = F,
          min.freq = 5,
          colors = brewer.pal(8, 'Dark2'),
          scale = c(5, 0.3),
          rot.per = 0.7)

library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)

letterCloud(w,
            word = "apple",
            size = 1)

# Sentiment analysis
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)

# Read file
apple <- read.csv(file.choose(), header = T)
tweets <- iconv(apple$text, to = 'utf-8-mac')

# Obtain sentiment scores


s <- get_nrc_sentiment(tweets)
head(s)
tweets[4]
get_nrc_sentiment('delay')

# Bar plot
barplot(colSums(s),
las = 2,
col = rainbow(10),
ylab = 'Count',
main = 'Sentiment Scores for Apple Tweets')
