410259: AC6-1 Business Intelligence: For BE Computer Engineering Students
Kashinath
Roll No: 11
Syllabus
The course aims at examining Business Intelligence (BI) as a broad
category of applications and technologies for gathering, storing,
analyzing, sharing and providing access to data to help enterprise users
make better managerial decisions.
Course Objectives:
To understand the concept of Business Intelligence
To know the details of Decision Support System
To inculcate the concepts of Data Warehousing
To understand the basics of design and management of BI systems
Course Outcome:
On completion of the course, learner will be able to–
Apply the concepts of Business Intelligence in real world
applications
Explore and use the data warehousing wherever necessary
Design and manage practical BI
Information:
Information is a message that contains relevant meaning, implication, or input for
decision and/or action. Information comes from both current (communication) and
historical (processed data or reconstructed picture) sources. In essence, the
purpose of information is to aid in making decisions and/or solving problems or
realizing an opportunity.
Knowledge:
Knowledge is the cognition or recognition (know-what), capacity to act (know-how),
and understanding (know-why) that resides or is contained within the mind or in the
brain. The purpose of knowledge is to better our lives. In the context of business,
the purpose of knowledge is to create or increase value for the enterprise and all its
stakeholders. In short, the ultimate purpose of knowledge is for value creation.
Operational Data:
An operational data store (or "ODS") is used for operational reporting and as a
source of data for the Enterprise Data Warehouse (EDW). It is a complementary
element to an EDW in a decision support landscape, and is used for operational
reporting, controls and decision making, as opposed to the EDW, which is used
for tactical and strategic decision support.
An ODS is a database designed to integrate data from multiple sources for
additional operations on the data, for reporting, controls and operational
decision support. Unlike a production master data store, the data is not
passed back to operational systems. It may be passed for further operations
and to the data warehouse for reporting.
An ODS should not be confused with an enterprise data hub (EDH). An
operational data store takes transactional data from one or more production
systems and loosely integrates it; in some respects it is still subject
oriented, integrated and time variant, but without the volatility constraints.
This integration is mainly achieved through the use of EDW structures and content.
An ODS is not an intrinsic part of an EDH solution, although an EDH may be
used to subsume some of the processing performed by an ODS and the EDW.
An EDH is a broker of data. An ODS is certainly not.
Because the data originates from multiple sources, the integration often involves
cleaning, resolving redundancy and checking against business rules for integrity.
An ODS is usually designed to contain low-level or atomic (indivisible) data (such
as transactions and prices) with limited history that is captured "real time" or
"near real time" as opposed to the much greater volumes of data stored in the
data warehouse generally on a less-frequent basis.
Introduction of BI:
The purpose of Business Intelligence (BI) is to provide decision makers with the
information necessary to make informed decisions. The information is delivered by
the Business Intelligence system via reports. This book focuses on the architecture
and infrastructure needed to deliver the information. Architecture is a set of rules or
structures providing a framework for the overall design of a system or product (Poe
et al. 1998). The BI system includes the following parts:
Interested parties and their respective information needs
Input of data
Storage of data
Analysis of data
Automatic and selective dissemination of information
A BI system includes the rules (architecture) providing a framework for the
organization of the technologies, platforms, databases, gateways, people and
processes. To implement an architecture the Business Intelligence architect must
implement an infrastructure. Technical infrastructures are the technologies,
platforms, databases, gateways, people and processes necessary to make the
architecture functional within the corporation (Poe et al. 1998).
In sum, decision makers need reports that deliver the information that allows them
to understand their organization and the world in order to make better decisions.
Given these functions of the BI system, the most enduring definition focuses on the
system, tools, technology, processes, and techniques that compose these elements
and help decision makers understand their world. BI augments the ability of decision
makers to turn data into information by aiding in extracting data from data sources,
organizing the data based on established business knowledge, and then presenting
the information in a way that is useful to the decision maker. It merges technology
with knowledge in order to provide useful information to management as quickly as
possible.
In sum, a Business Intelligence system includes the rules (architecture) that outline
how to organize the parts of the system (infrastructure) to deliver the information
needed to thrive in a competitive market (business) or to provide the best service to
the people (government) (Poe et al. 1998). Regardless of the technology, which is
simply a set of tools, the core of a Business Intelligence system has not and will not
change.
Determining BI Cycle:
Business intelligence is not just a set of tools to analyze raw data to help make
strategic and operational decisions. It is a framework that offers guidance in
understanding what to look for in the volumes of disparate data. As a framework, BI is
a continuous cycle of analysis, insight, action and measurement.
Analyzing a business is based on what we know and feel to be important, while
filtering out the aspects of the business not considered mission critical or
detrimental to the growth of the organization. Deciding what is important is based on
our understanding and assumptions of what is important to customers, suppliers,
competitors and employees. All of this knowledge is unique to a business and is an
incredible resource when formulating a BI strategy. However, having such granular
grassroots knowledge of the business can subconsciously limit the ability to see
patterns obvious to others. One of the benefits of BI reporting is performing ad hoc
reporting by drilling down through layers of data and pivoting on the rows and
columns. Such flexibility appeals to humans' inquisitive nature and prompts questions
that wouldn't necessarily be asked if such access to data weren't available. Effective
analysis helps us understand the business better, so as to challenge conventional
wisdom and assumptions as well as what is considered to be the right analysis.
Insight comes in many forms. There are operational insights, such as determining
the effect on production costs of installing new, more energy-efficient machines
that have slightly lower production yields per unit of measure. There are strategic
insights, such as analyzing new market opportunities by conducting research on the
barriers to entry. Insight is the intangible product of analysis, developed from
asking questions that only humans can ask. Computers can be used for the
identification of patterns, but only humans can recognize which patterns are useful.
The issue with having insight is convincing others to believe or support the new
perspective in order for the insight to be useful. As in life, anything new or different
is slow to gain acceptance or credibility. Well-organized business intelligence that
supports the insight by providing clear data, patterns, logic, presentation (i.e.
graphs, reports) and calculations is the driver that helps sell the new insight.
Once the analysis is done and the insight has been sold, the next step in the BI
cycle is taking action, i.e. decision making. Well thought out decisions backed by
good analysis and insight give confidence and courage to the proposed action.
Otherwise, decisions not supported by quality analytics are made with overbearing
safety measures or with less dedication and commitment from the stakeholders. In
addition, quality business intelligence delivered quickly improves the speed to
action. Today's organizations need to react more quickly, develop new approaches
faster, conduct more agile R&D and get products and services to market faster than
ever before. BI-based decision making with faster access to information and
feedback provides more opportunity for quicker prototyping and testing.
Benefits of BI:
Faster reporting, analysis or planning
Reduced costs
Increased revenues
Saved headcount
Factors Responsible for successful BI Project:
Don't boil the ocean.
Performance is key.
Nominate advocates.
Decision Making Concepts
Concepts of Decision Making:
Decision-making is the act of making a choice among available alternatives. There are
innumerable decisions that are taken by human beings in day-to-day life. In business
undertakings, decisions are taken at every step. It is also regarded as one of the
important functions of management. Managerial functions like planning, organizing,
staffing, directing, coordinating and controlling are carried out through decisions.
Decision making is possible when there are two or more alternatives to solve a single
problem or difficulty. If there is only one alternative, then there is no question of
decision making. It is said that management without decision making is like a man
without a backbone. Therefore, decision making is a problem-solving approach that
chooses a specific course of action among various alternatives.
"Decision-making is the selection, based on some criteria from two or more possible
alternatives. “- George R.Terry
"A decision can be defined as a course of action consciously chosen from available
alternatives for the purpose of the desired result" -J.L. Massie
In conclusion, we can say that decision making is the process of choosing a specific
course of action from various alternatives to solve the organizational problems or
difficulties.
Employees' motivation
Pervasive function
Identification of problem
Analysis of problem
Applications of DSS:
Medical diagnosis
Agricultural Production
Chapter 3
Data Warehouse
A data warehouse is a technique for collecting and managing data from varied
sources to provide meaningful business insights. It is a blend of technologies and
components which allows the strategic use of data. A data warehouse is the electronic
storage of a large amount of information by a business.
A data warehouse is a relational database that aggregates structured data from across
an entire organization. It pulls together data from multiple sources — much of it is
typically online transaction processing (OLTP) data. The data warehouse selects,
organizes and aggregates data for efficient comparison and analysis. Data Warehousing
may be defined as a collection of corporate information and data derived from
operational systems and external data sources. A data warehouse is designed with the
purpose of informing business decisions by allowing data consolidation, analysis, and
reporting at different aggregate levels. Data is populated into the DW by extraction,
transformation, and loading (ETL).
Data Warehousing incorporates data stores and conceptual, logical, and physical
models to support business goals and end-user information needs. Creating a DW
requires mapping data between sources and targets, then capturing the details of the
transformation in a metadata repository. The data warehouse provides a single,
comprehensive source of current and historical information.
How It Works:
Data warehousing combines information collected from multiple sources into one
comprehensive database. To cite an example from the business world, a data
warehouse might incorporate customer information from a company's point-of-sale
systems (the cash registers), its website, its mailing lists, and its comment cards. Data
warehousing may also include confidential information such as employee details,
salary information, etc.
Companies use this information to analyze their customers. Data warehousing is also
related to data mining, which means looking for meaningful data patterns in huge
volumes of data and devising newer strategies for higher sales and profits.
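As a minimal illustration of this consolidation, the sketch below (in R, with hypothetical
table and column names that are assumptions, not part of the syllabus) joins customer
records from three assumed source extracts into one customer-centric table:

# Hypothetical extracts from three source systems (illustrative data only)
pos     <- data.frame(cust_id = c(1, 2, 3), store_spend = c(120, 45, 300))
web     <- data.frame(cust_id = c(2, 3, 4), web_spend = c(60, 10, 80))
mailing <- data.frame(cust_id = c(1, 3), subscribed = c(TRUE, TRUE))

# Consolidate into a single customer-centric table (outer joins keep every customer)
dw_customer <- merge(merge(pos, web, all = TRUE), mailing, all = TRUE)
dw_customer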
Bottom-up design:
In the bottom-up approach, data marts are first created to provide reporting and
analytical capabilities for specific business processes. These data marts can then
be integrated to create a comprehensive data warehouse. The data warehouse bus
architecture is primarily an implementation of "the bus", a collection of conformed
dimensions and conformed facts, which are dimensions that are shared (in a
specific way) between facts in two or more data marts. Ralph Kimball is a renowned
author on the subject of data warehousing. His data warehouse design approach is
called dimensional modelling or the Kimball methodology. This methodology follows
the bottom-up approach.
As per this method, data marts are first created to provide reporting and analytics
capability for specific business processes; later, the enterprise data warehouse is
created from these data marts.
Basically, the Kimball model reverses the Inmon model: data marts are loaded directly
with data from the source systems, and an ETL process is then used to load the data
into the data warehouse.
Below are the steps involved in the bottom-up approach (a minimal sketch of this flow
follows the list):
The data flow in the bottom-up approach starts with the extraction of data from
various source systems into the staging area, where it is processed and loaded
into the data marts that handle specific business processes.
After the data marts are refreshed, the current data is once again extracted into
the staging area and transformations are applied to fit the data mart structure.
The data is then extracted from the data marts to the staging area, aggregated,
summarized and so on, loaded into the EDW, and made available to end users
for analysis, enabling critical business decisions.
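The following sketch illustrates this staging-to-data-mart-to-EDW flow in R using the
dplyr package; the sales data, column names and grouping are illustrative assumptions
only:

library(dplyr)

# Staging area: raw transactions extracted from a source system (hypothetical data)
stage_sales <- data.frame(
  order_id = 1:6,
  region   = c("East", "East", "West", "West", "East", "West"),
  product  = c("A", "B", "A", "A", "B", "B"),
  amount   = c(100, 150, 90, 120, 200, 60)
)

# Data mart for the sales business process: cleaned and reduced to the needed columns
sales_mart <- stage_sales %>%
  filter(!is.na(amount)) %>%
  select(region, product, amount)

# EDW load: aggregate and summarize the mart data for enterprise-wide analysis
edw_sales_summary <- sales_mart %>%
  group_by(region, product) %>%
  summarise(total_sales = sum(amount), .groups = "drop")

edw_sales_summary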
Hybrid design: Data warehouses (DW) often resemble the hub and spokes
architecture. Legacy systems feeding the warehouse often include customer
relationship management and enterprise resource planning, generating large
amounts of data. To consolidate these various data models, and facilitate the
extract transform load process, data warehouses often make use of an operational
data store, the information from which is parsed into the actual DW. To reduce data
redundancy, larger systems often store the data in a normalized way. Data marts for
specific reports can then be built on top of the DW.
Distributed data warehouse:
Most organizations build and maintain a single centralized data warehouse
environment. This setup makes sense for many reasons: the volume of data in the
data warehouse is such that a single centralized repository of data makes sense.
Phase 1—Discovery: In Phase 1, the team learns the business domain, including
relevant history such as whether the organization or business unit has attempted
similar projects in the past from which they can learn. The team assesses the
resources available to support the project in terms of people, technology, time, and
data. Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.
Phase 3 — Model planning: Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for the subsequent model
building phase. The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable models.
Phase 4 — Model building: In Phase 4, the team develops datasets for testing, training,
and production purposes. In addition, in this phase the team builds and executes
models based on the work done in the model planning phase. The team also considers
whether its existing tools will suffice for running the models, or if it will need a more
robust environment for executing models and workflows (for example, fast hardware and
parallel processing, if applicable).
Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code,
and technical documents. In addition, the team may run a pilot project to implement the
models in a production environment.
Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or
lacking in certain behaviors or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues. Data preprocessing
prepares raw data for further processing.
Parsing:
An example includes parsing the first, middle and last name.
Correcting:
Correct parsed individual data components using sophisticated data algorithms and
secondary data sources.
Standardizing:
Standardizing applies conversion routines to transform data into its preferred and
consistent format using both standard and custom business rules.
Matching:
Searching and matching records within and across the parsed, corrected and
standardized data based on predefined business rules to eliminate duplications.
Consolidating:
Analyzing and identifying relationships between matched records
and consolidating/merging them into one representation.
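A rough sketch of these cleansing steps in R (the sample names and the matching rule
are illustrative assumptions, not a production routine):

# Hypothetical raw customer names arriving from two source systems
raw <- c("john  smith", "JOHN SMITH", "Mary Jones")

# Parsing: split each full name into its individual components
parts <- strsplit(trimws(raw), "\\s+")

# Standardizing: convert every component to a consistent Title Case format
to_title <- function(p) paste(toupper(substring(p, 1, 1)),
                              tolower(substring(p, 2)), sep = "", collapse = " ")
std_names <- sapply(parts, to_title)

# Matching and consolidating: identical standardized names are merged into one record
consolidated <- unique(std_names)
consolidated   # expected: "John Smith" "Mary Jones"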
Data cleansing:
It must deal with many types of possible errors: these include missing data and
incorrect data at one source.
Data Staging:
Accumulates data from asynchronous sources.
At a predefined cutoff time, data in the staging file is transformed and loaded to the
warehouse.
Schema integration:
Entity identification problem: identify real-world entities from multiple data sources,
e.g. A.cust-id = B.cust#.
For the same real-world entity, attribute values from different sources may differ.
Possible reasons: different representations, different scales. Redundant data often
occur when multiple databases are integrated.
Thus one set of data names is picked and used consistently in the data warehouse.
Once all the data elements have the right names, they must be converted to common
formats.
3. Data Reduction:
Data reduction obtains a reduced representation of the data that is much smaller in
volume, yet produces the same or similar analytical results.
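A small sketch of this idea in R: a random 10% sample (an assumed reduction strategy
applied to synthetic data) is far smaller in volume yet yields summary statistics close to
those of the full data set.

set.seed(42)

# Full data set: one numeric measure over a million synthetic transactions
full_data <- data.frame(amount = rexp(1e6, rate = 0.01))

# Reduced representation: keep a 10% random sample of the rows
idx <- sample(nrow(full_data), size = 0.1 * nrow(full_data))
sample_data <- full_data[idx, , drop = FALSE]

# Similar analytical results from far less data
mean(full_data$amount)
mean(sample_data$amount)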
Introduction to OLAP
OLAP allows business users to slice and dice data at will. Normally, data in an
organization is distributed across multiple data sources that are incompatible with
each other. A retail example: point-of-sale data and sales made via the call center or
the Web are stored in different locations and formats. It would be a time-consuming
process for an executive to obtain OLAP reports such as: What are the most popular
products purchased by customers between the ages of 15 and 30? Part of the OLAP
implementation process involves extracting data from the various data repositories
and making them compatible. Making data compatible involves ensuring that the
meaning of the data in one repository matches all other repositories. An example of
incompatible data: customer ages can be stored as a birth date for purchases made
over the Web and stored as age categories (i.e. between 15 and 30) for in-store sales.
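For instance, a minimal sketch in R (with hypothetical data) showing how web
purchases that store a birth date can be converted to the same age categories used
for in-store sales before the two sources are combined:

# Hypothetical web purchases that store customer birth dates
web_sales <- data.frame(
  cust_id = 1:3,
  birth_date = as.Date(c("2004-05-01", "1990-09-15", "1975-01-20"))
)

# Derive an approximate age in whole years and map it to the in-store age categories
age <- floor(as.numeric(Sys.Date() - web_sales$birth_date) / 365.25)
web_sales$age_category <- cut(age,
                              breaks = c(0, 14, 30, 60, Inf),
                              labels = c("under 15", "15-30", "31-60", "over 60"))
web_sales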
It is not always necessary to create a data warehouse for OLAP analysis. Data stored
by operational systems, such as point-of-sale systems, is kept in types of databases
called OLTPs. OLTP (Online Transaction Processing) databases do not differ from
other databases from a structural perspective. The main, and only, difference is the
way in which the data is stored.
Examples of OLTPs include ERP, CRM, SCM, point-of-sale, and call center
applications.
OLTPs are designed for optimal transaction speed. When a consumer makes a
purchase online, they expect the transaction to occur instantaneously. With a database
design (called data modeling) optimized for transactions, the record 'Consumer Name,
Address, Telephone, Order Number, Order Name, Price, Payment Method' is created
quickly in the database, and the results can be recalled by managers equally quickly if
needed.
Types Of Outliers:
Type 1: Global Outliers:
A data point is considered a global outlier if its value is far outside the entirety of the
data set in which it is found (similar to how “global variables” in a computer program can
be accessed by any function in the program).
Global Anomaly: the value falls far outside the global range of the entire data set.
Type 2: Contextual (Conditional) Outliers:
A data point is considered a contextual outlier if its value significantly deviates from
the rest of the data points in the same context. Note that this means the same value
may not be considered an outlier if it occurred in a different context. If we limit our
discussion to time series data, the "context" is almost always temporal, because time
series data are records of a specific quantity over time. It's no surprise, then, that
contextual outliers are common in time series data.
Contextual Anomaly: Values are not outside the normal global range, but are abnormal
compared to the seasonal pattern.
Type 3: Collective Outliers:
A subset of data points within a data set is considered anomalous if those values as
a collection deviate significantly from the entire data set, but the values of the
individual data points are not themselves anomalous in either a contextual or global
sense. In time series data, one way this can manifest is as normal peaks and
valleys occurring outside of a time frame when that seasonal sequence is normal or
as a combination of time series that is in an outlier state as a group.
Collective Anomaly: In the example, two time series that were discovered to be related
to each other are combined into a single anomaly. For each time series the individual
behavior does not deviate significantly from the normal range, but the combined
anomaly indicates a bigger issue.
Outlier detection Methods
Z-Score
The z-score or standard score of an observation is a metric that indicates how many
standard deviations a data point is from the sample's mean, assuming a Gaussian
distribution. This makes the z-score a parametric method. Very frequently, data points
are not described by a Gaussian distribution; this problem can be solved by applying
transformations to the data, i.e. scaling it.
Some Python libraries like SciPy and scikit-learn have easy-to-use functions and
classes for an easy implementation, along with Pandas and NumPy.
After making the appropriate transformations to the selected feature space of the
dataset, the z-score of any data point can be calculated with the following
expression:
z = (x − μ) / σ
where x is the data point, μ is the sample mean and σ is the sample standard
deviation.
When computing the z-score for each sample in the data set, a threshold must be
specified. Some good rule-of-thumb thresholds are 2.5, 3, or 3.5 standard deviations.
By tagging or removing the data points that lie beyond a given threshold, we are
classifying the data into outliers and non-outliers.
The z-score is a simple yet powerful method to get rid of outliers in data if you are
dealing with parametric distributions in a low-dimensional feature space. For
nonparametric problems, DBSCAN and Isolation Forests can be good solutions.
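A minimal sketch of this procedure in R (matching the language of the source code at
the end of these notes); the synthetic data and the threshold of 3 are assumptions for
illustration:

set.seed(7)

# Roughly Gaussian synthetic data with two injected outliers
x <- c(rnorm(100, mean = 50, sd = 5), 95, 2)

# z-score: how many standard deviations each point lies from the sample mean
z <- (x - mean(x)) / sd(x)

# Tag points beyond the chosen rule-of-thumb threshold as outliers
threshold <- 3
outliers <- x[abs(z) > threshold]
outliers   # expected: the injected points 95 and 2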
Chapter 5
Designing and managing BI systems
1. End-User Experience
These are the core capabilities for all the end users of your application. Business
intelligence requirements in this category may include dashboards and reports as well
as the interactive and analytical functions users can perform.
Ideally, such self-service capabilities let users answer their own questions without
having to involve IT. That frees up IT to work on more strategic projects rather than
answering ad hoc requests. It also empowers users to be more self-sufficient.
During your evaluation, make sure the capabilities important to your project are
demonstrated and understand how you will deliver and iterate upon these
capabilities inside your application.
These common self-service capabilities may affect your business intelligence
requirements:
User personas: Increase the adoption of analytics by providing a broad range of users
with a tailored experience that matches their needs and skills.
Presentation and information delivery: These requirements affect how you present data
in visualizations, dashboards, and reports, as well as the compatibility of your BI
solution across different devices and formats.
Interactivity and automation: Do users need to be able to interact with your
dashboards? Consider whether you need to personalize visualizations, let users kick off
workflows, or drill down into information.
Analysis and authoring: Empowering your users to query their own data, create
visualizations and reports, and share their findings with colleagues can add value to
your analytics application.
2. Data Environment
The BI solutions you evaluate should be compatible with your current data environment,
while at the same time have enough flexibility to meet future demands as your data
architecture evolves. These are the diverse data requirements commonly evaluated by
application providers:
Data sources: Make sure your primary data source is supported by your BI solution.
Also look for a vendor that supports generic connectors and has flexibility through APIs
or plug-ins.
Data management: Consider what business intelligence requirements you have for
directly querying the data source for real-time reporting, caching data, and blending
data from multiple sources. Also ask yourself if your users need to transform or
enrich data for analysis.
3. Embeddability
One way to look at embeddability is to focus on driving adoption. The deeper you
integrate analytics into the fabric of the applications your users leverage every day,
the higher your user adoption will be. For instance, if you're a salesperson and you
spend much of your time using Salesforce, you don't want to go into applications
that are siloed from your daily work. Organizations need to infuse that analytic
workflow into users' daily activities.
Security: It should be easy to apply the security model of your application to the
analytics content. Scrutinize vendors on the flexibility of their security models,
implementation of single sign-on, and whether data needs to be synchronized or
replicated between applications.
Multi-tenancy: With built-in multi-tenancy, you can create a report once and deploy it
for multiple customers.
User experience: Many analytic applications need to be white-labeled to match your
brand. Look for embedding APIs to ensure visualizations are rendered in the correct
context.
Workflow: Create an efficient experience where users can immediately take action
on what they see in any visualization or report.
Extensibility: Choose a BI solution that supports custom coding and third-party
charting libraries to extend your application further.
4. Development and Deployment
Choosing the right partner is not just about the technology. It's also about finding the
level of expertise you require for training, support, and services, as well as agreeing on
the business terms that ensure shared success, to get you to the finish line and
beyond.
Licensing: Terms of the license can depend on a variety of factors, including number
of users/ customers, servers, usage, and whether you are embedding into a
commercial product. Be sure the terms make business sense over the short and
long runs.
Services: Completing your project on time may require technical support from your
BI vendor, along with professional services to get you off the ground and ongoing
training.
Customer success: Your BI vendor needs to be dedicated to your success. Look for
one that supports onboarding and gives you dedicated account management, along
with plenty of documentation and support offerings.
Expertise: The best analytics vendors have experience in your specific use case or
industry.
Scalability makes insights accessible to everyone. Dashboards and reports are available
to any number of users, not just restricted to the data analysts or executives in an
organization. In fact, many companies are embracing the concept of the extended
enterprise − sharing insights with field operations, customers, and external partners in
order to promote a common understanding of how to drive business forward.
Source code of WhatsApp analysis using R programming:
(The original listing was fragmentary; the version below is a reconstructed sketch. The
chat file name and the cleaning steps are assumptions.)

# Load the required libraries
library(ggplot2)
library(lubridate)
library(scales)
library(reshape2)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(syuzhet)
library(dplyr)

# Get the data from the exported WhatsApp chat
# (assumption: a plain-text export named "whatsapp_chat.txt"; the original code's
#  split of fields on "|" could not be fully recovered)
chat <- readLines("whatsapp_chat.txt")

# Build and clean a text corpus
corpus <- Corpus(VectorSource(chat))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term-document matrix and word frequencies
tdm <- TermDocumentMatrix(corpus)
mat <- as.matrix(tdm)
v <- sort(rowSums(mat), decreasing = TRUE)

# Data frame of words and their frequencies
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# Word cloud of the most frequent words
set.seed(1056)
wordcloud(words = d$word, freq = d$freq, min.freq = 5, max.words = 150,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Sentiment scores (NRC lexicon) and total sentiment plot
Sentiment <- get_nrc_sentiment(chat)
head(Sentiment)
totals <- data.frame(sentiment = names(colSums(Sentiment)), count = colSums(Sentiment))
ggplot(totals, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score")
# ---- Text and sentiment analysis of tweets about Apple (reconstructed sketch) ----
# Read file (assumption: a CSV of tweets with a 'text' column, chosen interactively)
apple <- read.csv(file.choose(), header = TRUE)
str(apple)
tweets <- iconv(apple$text, to = 'utf-8-mac')  # encoding from the original; use 'UTF-8' on non-macOS systems

# Build corpus
corpus <- Corpus(VectorSource(tweets))
inspect(corpus[1:5])

# Clean text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Term-document matrix
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)

# Bar plot of words appearing at least 25 times
w <- rowSums(tdm)
w <- subset(w, w >= 25)
barplot(w,
        las = 2,
        col = rainbow(50))

# Word cloud
library(wordcloud)
w <- sort(rowSums(tdm), decreasing = TRUE)
set.seed(222)
wordcloud(words = names(w), freq = w,
          max.words = 150, random.order = FALSE, min.freq = 5,
          colors = brewer.pal(8, 'Dark2'),
          scale = c(5, 0.3), rot.per = 0.7)

# Word clouds with the wordcloud2 package
library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)
letterCloud(w, word = "apple", size = 1)

# Sentiment analysis
library(syuzhet)
s <- get_nrc_sentiment(tweets)  # reconstructed: 's' was used but not defined in the original

# Bar plot of sentiment scores
barplot(colSums(s),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Sentiment Scores for Apple Tweets')