
BIG DATA AND BUSINESS ANALYTICS

Module: 1

INTRODUCTION TO ANALYTICS

1.1 Introduction – Business Analytics – Role of Analytics in Industry – Current Trends – Technologies & Domains involved in Analytics – Different Types of Analytics: Descriptive, Predictive and Prescriptive Analytics

1.2 Types of Data – Structured, Semi-structured and Unstructured Data. Scales of Measurement – Nominal, Ordinal, Interval and Ratio. Big Data Analytics. Framework for Data-driven Decision Making.

1.3 Descriptive, Predictive, and Prescriptive Analytics Techniques

1.4 Introduction to R and Excel

Introduction

Business analytics is a field of technology that is playing an increasingly important role in driving businesses. The term business analytics is used loosely these days to describe a wide range of data-driven analytics activities and processes.

Definition:

Business analytics is the continuous and iterative exploration of past business performance. We know that businesses capture data about past events,

activities, and transactions. This data is the basis of business analytics. Business analytics is used to gain insights into the past performance of a business: what happened, what improved, what declined, and what changed.

One of the goals of these analytics is to predict the future of the business: look at past trends and predict how the business will perform in the future. It helps decision makers make data-driven business decisions. Exploring data and finding patterns and reasons helps in understanding the behavior of customers and in adjusting business activities to improve business outcomes.

Role of Analytics in Industry

If you have been watching the analytics horizon over the past few years, a few buzzwords stand out, the most important of them being business analytics, business intelligence, data engineering, and data science. How are these terms different? How are they similar? Let us find out.

When you look at the various activities associated with data inside a business, they can be described as follows. It starts with getting data from data sources and building data pipelines. The data is then processed, transformed, and stored. The processed data is used to build dashboards and reports. This then becomes the basis for exploratory analytics, statistical modeling, and machine learning. These then translate into business recommendations and actions. So, which of these process elements are covered by each of these terms? Let us start with data engineering. Data engineering covers the acquisition, processing, transformation, and storage of data. This is the heavy-lifting work that gets the data ready for analytics.

Business intelligence is the basic analytics of data, which includes dashboards, reports, exploratory analytics, and the business recommendations derived from them. Business analytics covers all the activities of business intelligence. In addition, it includes advanced activities like statistical modeling, machine learning, and

delivering business actions based on them. Data science covers all the process activities. It is data engineering and business analytics combined into one.

Stages of Business Analytics

The first stage of business analytics is descriptive analytics. This is the most basic stage and has existed for decades. Here, we try to answer the question, what happened in the past? It relies on simple, ready-made analytics.

The second stage is exploratory data analytics. This stage answers the
question, what is going on? This is a deep dive into the data to understand behavior
and discover patterns.

The third stage is explanatory analytics. This stage answers the question, why did it happen? It is a targeted exploration and reason-finding effort to answer a specific business question.

The fourth stage is predictive analytics. This stage answers the question,
what will happen? It tries to predict the future based on the events that happened in
the past.

The fifth stage is prescriptive analytics. This stage recommends steps and
actions to take advantage of future predictions and to come up with business plans.
The final stage is experimental analytics. This stage provides guidance on how well the prescriptions will work based on expected environmental behavior.

Now that you know a bit about the stages, let's take a brief look at the
processes involved in business analytics.

Process of Business Analytics

An organization has multiple data sources. There are data files or databases that capture enterprise data. There is data captured from the internet, like keywords and trending items from social media. There is also data captured from mobile devices. All these data elements are first captured or acquired and then fed into a data transport system.

The data transport system is a data bus that can contain multiple technologies
to reliably transport data from the sources to the central data center. The data is
stored in data repositories inside the center. It then goes through an iterative process
of cleaning, filtering, and transformation to become ready for analytics purposes.

This is then stored again in the central repository. There are analytics products
that will run on top of the transformed data and provide users and analysts with
reports, dashboards, and exploratory capabilities. This data then will be the source
for machine learning.

Data scientists build machine learning models based on this data. The
findings of all analytics lead to prescriptions for the business. They are then taken up
by executives and converted into business actions and implemented for
improvements.

1. Descriptive analytics

Descriptive analytics tries to answer the question what happened? The goal of
this analytics is to present numerical and summarized facts about the performance of
the business in the past to help analysts understand the events that happened during
the past period.

It is the earliest form of analytics and has been around for many decades. Even the most primitive and manual reporting systems of the typewriter age had it. It is called reporting in many software applications.

Descriptive analytics summarizes data in many forms to understand how the


business performed in each period. It compares different segments of data, and also
different time periods to get a sense of the trends and performance. How does it
work? Typically, reports and descriptive analytics are predefined and pre-canned.
They are bundled into software products and applications. Customers simply
execute the report by providing a few parameters and view the reports in a monitor
dashboard.

The same applies to applications that are custom built by IT organizations.


The users typically give requirements up front, during the requirements study stage,
and the IT developers build the reports for them. There are capabilities to schedule
reports at regular intervals.

These reports then can be exported to various formats like CSV, Excel, et
cetera, and then distributed through emails or even hard copies. Typically, there are
no user driven views in descriptive analytics. All users get the same data. Let us now
look at some tools and techniques for descriptive analytics.

What are the tools and techniques that are used in descriptive
analytics?

The first common technique used in descriptive analytics is aggregation. This is the same as an "SQL Group By" or an "Excel Pivot". An example of aggregation is shown here, where the heart-health data is grouped by age group to show the total number of patients and their average cholesterol level for each age group. Aggregation is based on a segment, a time period, or both.

This example uses the segment age group to aggregate the data. An aggregation typically contains totals and subtotals for the profiling variables. The profiling variables in this table are total patients and average cholesterol.

It has totals for both. The list could also include percentage values to indicate what share a given segment accounts for in each profiling variable.
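To make this concrete, here is a minimal sketch in R (the language introduced later in this module) of the aggregation described above. The heart-health data frame, its column names, and its values are illustrative assumptions, not real data.

# Illustrative heart-health data (hypothetical values, for demonstration only)
patients <- data.frame(
  age_group   = c("20-40", "20-40", "40-60", "40-60", "60-80", "60-80"),
  cholesterol = c(180, 195, 210, 225, 240, 230)
)

# Aggregate by segment (age group): total patients and average cholesterol,
# the same idea as an SQL GROUP BY or an Excel pivot table
total_patients  <- aggregate(cholesterol ~ age_group, data = patients, FUN = length)
avg_cholesterol <- aggregate(cholesterol ~ age_group, data = patients, FUN = mean)

summary_table <- merge(total_patients, avg_cholesterol, by = "age_group")
names(summary_table) <- c("age_group", "total_patients", "avg_cholesterol")

# Percentage of all patients falling in each segment
summary_table$pct_of_total <- 100 * summary_table$total_patients /
                              sum(summary_table$total_patients)

print(summary_table)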
The next technique is time range comparisons. The goal of time range comparisons is to show how various segments performed during different time ranges.

It helps to see whether there is an improvement or decline in business performance, possibly due to a change in business initiatives, like new marketing campaigns, or business environment changes, like new competition. It also shows percentage changes. The example here shows how patients of different age groups performed in tests during the months of May and September.

This is a measurement taken before and after undergoing treatment. It shows how the patients' health improved by age group based on the treatment.

The next tool is pre-canned and pre-packaged reports. The techniques described earlier are used in these reports.

They are usually pre-canned for standard time periods like yesterday, this week, last week, et cetera. They can also be scheduled to run periodically and produce outputs, which can then be distributed to a set of viewers or viewed through a dashboard.
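Returning to the time range comparison technique described above, the following R sketch compares a hypothetical before (May) and after (September) measurement by age group and computes the percentage change. All values and column names are assumptions made only for illustration.

# Hypothetical average test scores by age group, before and after treatment
may <- data.frame(age_group = c("20-40", "40-60", "60-80"),
                  score_may = c(62, 55, 48))
september <- data.frame(age_group = c("20-40", "40-60", "60-80"),
                        score_sep = c(70, 63, 52))

comparison <- merge(may, september, by = "age_group")

# Percentage change between the two time ranges, per segment
comparison$pct_change <- 100 * (comparison$score_sep - comparison$score_may) /
                         comparison$score_may

print(comparison)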

2. Exploratory analytics

The goal of exploratory data analytics is to answer the question what is going
on? It is essentially a deep dive into the data in an ad-hoc, yet structured manner to
understand patterns, and confirm hypotheses.

The best analogy for exploratory analytics is a hound that picks up a scent and chases it. The analyst typically starts by picking up a scent, which is a trend or clue, and then chases it down through the data pile until they uncover something interesting and useful.

So, what is the purpose of exploratory analytics? First, it is about getting


familiar with the data itself. Look at different data attributes to understand different
values, segments, data ranges, et cetera. Then it is about deep diving into data in
order to understand a pattern or confirm a hypothesis.

This kind of deep dive may take a few minutes or extend to a few days,
depending on the hypothesis and the data size. Exploratory analytics is usually done
in an ad-hoc manner. It does not use standard reports and dashboards. Rather, the
analyst queries the data through SQL, spreadsheets, or other ad-hoc analytics tools.
They may even write programs to do the work.

Exploratory analytics is also need based. It is not regularly done like a weekly
or a daily activity. It is rather triggered by a business event or a question that needs
answers. This kind of analytics is usually done by analysts or statisticians in the
business. So, what is the process? How does it work?

First, it starts off with a problem or a question being asked by someone,


usually an executive. An analyst or a team of analysts then explore the data,

segment, and profile it to understand patterns related to the question and find
answers or root causes.

Then they share the results with the people who asked the question in the first place. This process is iterative. Answers to a question will trigger more questions that require additional rounds of this process until a satisfactory answer is reached.

Tools and techniques

The first and most popular tool is called segmentation and profiling. In this
technique, data is repeatedly grouped by different data columns, typically of text or
ID type. These are called segments. These are groups of similar entities we want to
divide and analyze like age group, gender, race, education, et cetera.

We take a summary number and then split it by individual segment values.


Then we have profiling variables. These are variables that represent facts or
measurements or metrics that we want to analyze for each segment.

They are usually Boolean or numeric variables that are aggregated in some
fashion, like sum, average, maximum, et cetera. In the example shown, age group is
the segment. We break down the grand set of patients by individual segment values,
namely 20 to 40, 40 to 60 and 60 to 80.

Then we profile these segment values. We profile for total patients and
average cholesterol. This table helps us understand how the profiling variables, total
patients and average cholesterol vary by individual group segments.

Then comes graphical tools, which help to plot data on a graph and then look
at trends. There are several such tools like pie charts, bar charts, histograms, et
cetera. Graphical tools help reveal patterns in data, especially when data sizes are
too large to look at in a table form. They are also a great tool set to present data and
findings to other interested folks.
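As a rough illustration, the base R graphics functions below produce the plot types mentioned above. The segment totals and the simulated cholesterol values are assumptions made only for this sketch.

# Made-up segment totals for plotting
totals <- c("20-40" = 120, "40-60" = 210, "60-80" = 95)

barplot(totals, main = "Total patients by age group",
        xlab = "Age group", ylab = "Total patients")    # bar chart

pie(totals, main = "Share of patients by age group")     # pie chart

# Histogram of a simulated cholesterol column
cholesterol <- rnorm(500, mean = 200, sd = 25)
hist(cholesterol, main = "Cholesterol distribution", xlab = "Cholesterol")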

Finally, there is statistics. There are several statistical techniques that help us gain an understanding of data patterns. There are descriptive or summary statistics that help in getting an overall picture of the data, like the mean, standard deviation, range, et cetera. Then there are distributions, like the normal and binomial distributions, that help in understanding data patterns. Fitting data to a specific distribution helps extrapolate patterns to future timelines. There are other tests and analysis of variance, which again help in understanding how well a data set conforms to a specific pattern.
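A minimal R sketch of these statistical techniques is shown below, assuming a simulated set of cholesterol readings. It computes summary statistics and then makes a rough check of how well the data fits a normal distribution.

# Simulated cholesterol readings (for illustration only)
set.seed(42)
cholesterol <- rnorm(200, mean = 200, sd = 25)

# Descriptive / summary statistics
mean(cholesterol)       # central tendency
sd(cholesterol)         # spread
range(cholesterol)      # minimum and maximum
summary(cholesterol)    # quartiles at a glance

# Rough check of how well the data matches a normal distribution
shapiro.test(cholesterol)
qqnorm(cholesterol); qqline(cholesterol)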

3. Explanatory analytics

Explanatory analytics deals with the question, why did it happen? The goal of


explanatory analytics is to identify reasons and root causes for business
results. Explanatory analytics is closely related to exploratory data analytics, and the two are commonly clubbed together as a single stage. 
The differences are very few, except that here, we focus on trying to find the
root cause, not just patterns. Explanatory analytics seeks to tell stories with data. It is
about following a methodical process of starting with a question, exploring through
data, and finding an answer. It seeks to present things to an audience. 
It includes tools and techniques that deal with data presentation. It is usually
done by analysts and managers. It requires the ability to communicate well to the
expected audience. This stage is usually a prelude to next actions taken in the
business based on business results and answers produced by analytics. 
Explanatory analytics tools and techniques

Let us now look at the various tools and techniques that are helpful in
explanatory analytics. We first look at drill downs. In drill downs, we take a specific
aggregation and then try to drill down on that data into further segments to discover
abnormal patterns.
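The following is a small R sketch of a drill down, assuming a hypothetical sales data set with region and product columns. We start from a top-level aggregate by region and then drill into one region by product to look for abnormal patterns.

# Hypothetical sales data with two levels of segmentation
sales <- data.frame(
  region  = c("North", "North", "North", "South", "South", "South"),
  product = c("A", "B", "C", "A", "B", "C"),
  revenue = c(100, 250, 80, 300, 120, 90)
)

# Top-level aggregation: revenue by region
aggregate(revenue ~ region, data = sales, FUN = sum)

# Drill down into one region to inspect its products
north <- subset(sales, region == "North")
aggregate(revenue ~ product, data = north, FUN = sum)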

Graphical tools are great presentation aids for presenting numeric information to a wide audience. There are special tools, like the fishbone diagram, used to present
causal analysis. A fishbone diagram breaks down a given effect into possible causes and shows how much these causes have influenced the result.

4. Predictive Analytics

Let us now start exploring the hot topic in analytics, predictive analytics.
What is predictive analytics?

It answers the question, what will happen. The goal of predictive analytics is
to identify the likelihood of future outcomes based on historical data, statistics, and
machine learning. It tries to predict the behavior of humans or systems based on
trends identified earlier. What does it cover?

First and foremost, predictive analytics deals with data-driven prediction, not
logic or intuition driven. Humans have predicted, for centuries, using intuition and
experience. But predictive analytics is allowing a machine, or an algorithm, or a
formula to do it based on data. Predictive analytics uses past historical data to
understand how various entities behave under business situations.

It usually requires large quantities of data to build reliable models of behavior


or performance. Predictive analytics uses automation and machine learning to digest
large quantities of data and build models. While building models is usually a
batch-mode activity, predicting performance, or behavior itself can be done in real
time. So how does it work?

Here is the process. First, data engineers collect data from various sources
and prepare them for analytics. Preparation includes cleaning, filtering,
transformation, and aggregation. Analysts then explore the data to identify trends
and performance of behavior indicators. Data scientists then use the data to build
machine-learning models that can be used to predict behavior. Then they test the

model to ensure accuracy of predictions. These models can then be deployed in
production. They can be used in real time or batch mode to predict future behavior.

Predictive Analytics tools and techniques

Let us investigate the tools and techniques used in machine learning for predictive
analytics.

We start off with the data preparation techniques used for predictive analytics. Data cleansing involves removing bad data or badly formatted data. Standardization involves converting data elements to a standard format, like a date format, name format, etc. Binning involves converting a continuous variable into ranges, like converting age into age ranges.

Indicator variables are created to convert text data to integer values, like converting male and female to one and two. Data imputation involves providing data for missing values. Centering and scaling involve adjusting values so that different data elements are on the same scale.

For example, salary is in the range of thousands while age is in the range of tens. After centering and scaling, they are all brought to a comparable scale. Additionally, techniques like term frequency-inverse document frequency (TF-IDF) are used to convert text data into numerical data for prediction purposes.
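Several of these preparation steps can be sketched in a few lines of base R. The data frame below, its column names, and the choice of mean imputation are all illustrative assumptions.

# Hypothetical raw data with a missing value and mixed scales
patients <- data.frame(
  age    = c(25, 47, 63, 38, 55),
  salary = c(42000, 85000, NA, 61000, 73000),
  sex    = c("male", "female", "female", "male", "female")
)

# Binning: convert the continuous age variable into age ranges
patients$age_group <- cut(patients$age, breaks = c(20, 40, 60, 80),
                          labels = c("20-40", "40-60", "60-80"))

# Indicator variable: encode text categories as integers
patients$sex_code <- as.integer(factor(patients$sex))   # here female = 1, male = 2

# Imputation: fill the missing salary with the column mean
patients$salary[is.na(patients$salary)] <- mean(patients$salary, na.rm = TRUE)

# Centering and scaling: bring age and salary onto a comparable scale
patients$age_scaled    <- as.numeric(scale(patients$age))
patients$salary_scaled <- as.numeric(scale(patients$salary))

print(patients)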

Machine learning types.

Next, we look at machine learning types. Machine learning is usually classified into two types: supervised learning and unsupervised learning.

In supervised learning, we are trying to predict a specific target variable. Say,


will a customer buy a product or not? The subtypes within supervised learning
include regression, classification, and recommendation.

Unsupervised learning deals with grouping data based on similarity in attribute values. The subtypes within unsupervised learning include clustering and association rules. Let us now explore a little more into supervised learning.

First, we have regression. In regression, we try to predict a continuous variable based on a regression formula. For example, predicting cholesterol level for a patient based on age, weight, and blood pressure.
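A minimal sketch of this regression example, using R's built-in lm() function on simulated patient data (the coefficients and noise below are assumptions made purely for illustration):

set.seed(1)

# Simulated training data (illustrative only)
train <- data.frame(
  age            = runif(100, 20, 80),
  weight         = runif(100, 50, 110),
  blood_pressure = runif(100, 90, 160)
)
train$cholesterol <- 100 + 0.8 * train$age + 0.5 * train$weight +
                     0.3 * train$blood_pressure + rnorm(100, sd = 10)

# Fit a linear regression model
model <- lm(cholesterol ~ age + weight + blood_pressure, data = train)
summary(model)

# Predict cholesterol for a new patient
new_patient <- data.frame(age = 45, weight = 82, blood_pressure = 130)
predict(model, newdata = new_patient)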

In classification, we are trying to predict a discrete variable or a class. For example, we try to make a binary prediction as to whether a prospect will turn into a customer, or predict a customer's class, say, platinum, gold, or silver, based on various attributes and historical data.
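As one possible illustration, the sketch below uses logistic regression (via R's glm()) for a binary will-the-prospect-convert prediction. The prospect attributes and the simulated data are assumptions; many other classification algorithms could be used instead.

set.seed(2)

# Simulated prospect data: did the prospect become a customer (1) or not (0)?
prospects <- data.frame(
  visits      = rpois(200, lambda = 3),
  email_opens = rpois(200, lambda = 5)
)
prospects$converted <- rbinom(200, size = 1,
                              prob = plogis(-3 + 0.5 * prospects$visits +
                                            0.3 * prospects$email_opens))

# Logistic regression: one simple classification algorithm
clf <- glm(converted ~ visits + email_opens, data = prospects, family = binomial)

# Predicted probability of conversion for a new prospect
new_prospect <- data.frame(visits = 4, email_opens = 6)
predict(clf, newdata = new_prospect, type = "response")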

There are several algorithms available for classification. Finally, we have


recommendation engines. Recommendation engines deal with user-item affinities. The goal of these engines is to recommend a new product to a user based on what other users like them bought or used.

Then, we have unsupervised learning algorithms.


First, there is clustering, which tries to group customers

based on similar attribute values. One example is to group patients into three groups based on similar medical attributes. Association rules are used to determine the affinity of one item to another item and then use that affinity to make business decisions.
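Here is a minimal k-means sketch in R that groups simulated patients into three clusters based on two medical attributes. The attributes and values are assumptions made only for illustration.

set.seed(3)

# Simulated patient measurements (illustrative only)
patients <- data.frame(
  cholesterol    = rnorm(90, mean = rep(c(180, 210, 240), each = 30), sd = 10),
  blood_pressure = rnorm(90, mean = rep(c(110, 130, 150), each = 30), sd = 8)
)

# Scale the attributes, then group the patients into three clusters
clusters <- kmeans(scale(patients), centers = 3)

table(clusters$cluster)    # how many patients fell into each group
clusters$centers           # cluster centres on the scaled attributes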

A very popular association rules algorithm is market basket


analysis, which deals with items that are frequently bought together.
Businesses use this affinity score to offer additional products.
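A full market basket analysis is usually done with a dedicated package (for example, the arules package), but the core idea of counting how often items are bought together can be sketched in base R, assuming a small made-up set of baskets:

# Hypothetical shopping baskets
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk", "butter"),
  c("bread", "milk")
)

items <- sort(unique(unlist(baskets)))

# Item-by-basket incidence matrix
incidence <- sapply(baskets, function(b) as.integer(items %in% b))
rownames(incidence) <- items

# Co-occurrence counts: how often each pair of items appears together
co_occurrence <- incidence %*% t(incidence)
print(co_occurrence)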

What are some of the best practices for predictive analytics?

First, when you try to use a model for actual business purposes, focus on the business gain or return on investment, not just model accuracy. Data scientists sometimes get obsessed with model performance and do not focus on business ROI. More data means better trends and better predictions, so focus on getting more data. Algorithms can only do so much with insufficient data.

Test model building with multiple algorithms and see which one fits best for
your use case based on accuracy and response times. Ensure that all relevant
variables that impact the outcome are considered during prediction. It is very
important to include relevant variables and eliminate irrelevant ones. Test for model
accuracy repeatedly with multiple subsets of data.

The accuracy should be stable across multiple data sets for the model to be
effective in the field. Predictive analytics teams should have the right composition of
talent including data engineering, statistics, machine learning, and business.

This will help in planning, building, and executing the right model, and the
business can focus on customers who will buy and spend more marketing dollars on
them. We then move on to the next stage in analytics, prescriptive analytics.

5. Prescriptive Analytics

The goal of prescriptive analytics is to identify ways and means to take
advantage of the findings and predictions provided by earlier stages of analytics,
namely exploratory, explanatory, and predictive analytics. Exploratory, explanatory,
and predictive analytics generate several findings, patterns, and predictions.

But not all of them can be used in the business field. Using these findings from analytics in business requires additional analysis to understand how they will work in the field and what benefits and risks they have. We need to consider things like the budget, time, and human resources that are required to implement the findings.

It is important to understand both budgetary costs and opportunity costs.


Environmental factors, and changes, also need to be considered, like economic
downturns, competition, and change in the demand for the product. Costs and
benefits need to be evaluated to make sure that business gains something at the end
of the exercise. How does it work?

First, results from various analytics projects and efforts are collected by the team. Out of these, the key findings that can be taken advantage of are extracted for further discussion, and based on these, strategies for the field are devised. For all the strategies and alternatives, costs and benefits are analysed.

Simulation is done, if required, to model how the future environment would behave against these strategies and to evaluate the business outcomes for these strategies. Finally, the team makes recommendations to management for the best course of action.

Tools and techniques used for prescriptive analytics

The first tool used in prescriptive analytics is linear programming.

Linear programming is a technique used to maximize an outcome given the


list of constraints. Outcome could be units sold, profit margin, etc. It assumes a linear
relationship between variables. It makes simplifying assumptions. Constraints used

in linear programming include budget, time, resource limits, et cetera. Here is an example of what linear programming looks like.

We are trying to find the values of x and y such that we get the maximum value for z. The constraints are that x should be between zero and one hundred, y should be between 50 and 150, and x should be greater than y. Here, z could be the total units sold, while x and y could be the number of individual products that can be sold based on inventory capacity.
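A minimal sketch of this example, assuming the lpSolve package is available and that the objective is z = x + y (total units sold):

# install.packages("lpSolve")   # assumed to be installed
library(lpSolve)

# Maximise z = x + y subject to:
#   x <= 100,  y >= 50,  y <= 150,  x >= y   (x, y >= 0 is implicit in lpSolve)
objective   <- c(1, 1)                       # coefficients of x and y in z
constraints <- rbind(c(1,  0),               # x      <= 100
                     c(0,  1),               # y      >= 50
                     c(0,  1),               # y      <= 150
                     c(1, -1))               # x - y  >= 0
directions  <- c("<=", ">=", "<=", ">=")
rhs         <- c(100, 50, 150, 0)

solution <- lp("max", objective, constraints, directions, rhs)
solution$solution    # optimal x and y
solution$objval      # maximum value of z

Under these constraints the solver should report x = 100 and y = 100, giving z = 200.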

The second technique used is decision analysis. Decision analysis consists of a set of procedures, methods, and tools that are used to analyze business decisions. This is usually structured brainstorming. The goal of decision analysis is to enumerate the available alternatives, including a no-action or status-quo alternative.

First, you evaluate the individual alternatives for benefits to the business,
both monetary and non-monetary. Next, you estimate costs for the business,
including budget, time, and resources. You then look at the outside world and
evaluate threats to the strategy. This includes environmental, political, economic, and
competition related threats that can impact the performance of the strategy. You also

try to look at new opportunities created by the strategies, like new customers,
upsells, et cetera.

One of the key items you want to model is uncertainty. Several statistical
techniques exist for this purpose. Decision analysis involves a lot of teamwork. It
requires members from different departments, like marketing, sales, IT, and
analytics to work together to come up with the overall analysis. This is then
provided to management as recommendations.

The final tool in prescriptive analytics is simulation. The goal of a


simulation exercise is to simulate a real business situation and measure outcomes.
Simulation is either done manually, semi-automated in a spreadsheet, or fully
automated through a computer program. Inputs to the simulation process can either be actual or can all be simulated through mathematical models. It is important to consider all environmental variables that might impact the outcome. Simulation is only as good as the variables considered and their proper modeling.

It is run for multiple scenarios and options, and the outcomes are measured
and compared. Controlled simulation is also done by modifying one input and
measuring its impact on the output. Next, we look at the use case to see how
prescriptive analytics can be used.
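As a rough illustration of the simulation idea described above, the R sketch below runs a simple Monte Carlo comparison of two hypothetical strategies under uncertain demand. The demand distribution, capacities, and margins are all assumptions made for the sketch.

set.seed(7)
n_runs <- 10000

# Uncertain future demand, modelled here (as an assumption) as a normal distribution
demand <- rnorm(n_runs, mean = 1000, sd = 200)

# Two hypothetical strategies with different capacities and unit margins
profit_a <- pmin(demand, 1200) * 10    # strategy A: capacity 1200, margin 10 per unit
profit_b <- pmin(demand, 900)  * 17    # strategy B: capacity 900, margin 17 per unit

# Compare expected outcomes and downside risk
mean(profit_a); mean(profit_b)
quantile(profit_a, 0.05); quantile(profit_b, 0.05)   # 5th percentile (bad-case outcome)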

What are some of the best practices for prescriptive analytics?

Analytics teams tend to skip prescriptive analytics and jump straight into
implementation.

Keep prescriptive analytics as a key step in your analytics program. Do not undermine it. Ensure that all relevant internal and environmental threats and constraints are considered while doing linear programming, decision analysis, and simulations.

If key elements are missed, the analysis will not be accurate or relevant. Choose the right mathematical models for simulation. Do test runs to make sure that the simulation works as desired.

Decision analysis requires contributions from multiple teams. Please


ensure that all relevant parties participate in the analysis. When decisions are put
into practice in the field, have a feedback and review mechanism to take the actual
results, compare them with projected results, and identify any improvements
required for the process or models.

R Programming Language – Introduction


R is an open-source programming language that is widely used as statistical software and a data analysis tool. R generally comes with a command-line interface. R is available across widely used platforms like Windows, Linux, and macOS, and it is regarded as a cutting-edge tool for data analysis.

It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The R programming language is an implementation of the S programming language, combined with lexical scoping semantics inspired by Scheme. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

Why R Programming Language?

● R programming is used as a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
● It's a platform-independent language. This means it can be applied to all operating systems.
● It's a free, open-source language. That means anyone can install it in any organization without purchasing a license.
● The R programming language is not only a statistics package but also allows us to integrate with other languages (C, C++). Thus, you can easily interact with many data sources and statistical packages.
● The R programming language has a vast community of users, and it's growing day by day.
● R is currently one of the most requested programming languages in the data science job market, which makes it one of the hottest trends nowadays.

Features of R Programming Language

Statistical Features of R:

Basic Statistics: The most common basic statistics terms are the mean, mode,
and median. These are all known as “Measures of Central Tendency.” So using the R
language we can measure central tendency very easily.
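For example, the mean and median are built into R. There is no built-in function for the statistical mode (R's mode() reports an object's storage mode instead), so a tiny helper, defined here only for illustration, is sketched below.

x <- c(4, 8, 6, 5, 8, 7, 8, 6)

mean(x)      # arithmetic mean
median(x)    # middle value

# Small helper (hypothetical, for illustration) for the statistical mode
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)   # most frequent value, here 8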

Static graphics: R is rich with facilities for creating and developing interesting
static graphics. R contains functionality for many plot types including graphic maps,
mosaic plots, biplots, and the list goes on.

Probability distributions: Probability distributions play a vital role in statistics


and by using R we can easily handle various types of probability distribution such as

Binomial Distribution, Normal Distribution, Chi-squared Distribution and many
more.
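A short sketch of R's built-in distribution functions (the d*, p*, q*, and r* families) for the distributions mentioned above:

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
dbinom(3, size = 10, prob = 0.5)

# Normal: probability that a N(200, 25) value falls below 180
pnorm(180, mean = 200, sd = 25)

# Chi-squared: 95th percentile with 4 degrees of freedom
qchisq(0.95, df = 4)

# Random samples are drawn with the r* variants, e.g. rnorm(), rbinom()
hist(rnorm(1000, mean = 200, sd = 25), main = "Simulated normal sample")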

Data analysis: It provides a large, coherent and integrated collection of tools for data
analysis.

Programming Features of R

Since R is syntactically similar to other widely used languages, it is easier to code and learn in R. Programs can be written in R in any of the widely used IDEs like RStudio, Rattle, Tinn-R, etc. After writing the program, save the file with the extension .r. To run the program, use the following command on the command line:

Rscript file_name.r

Advantages of R:  
● R is the most comprehensive statistical analysis package, as new technology and concepts often appear first in R.
● The R programming language is open source, so you can run R anywhere and at any time.
● The R programming language is suitable for GNU/Linux and Windows operating systems.
● R programming is cross-platform and runs on any operating system.
● In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R:  
● In the R programming language, the standard of some packages
is less than perfect.

● R commands give little consideration to memory management, so the R programming language may consume all available memory.
● In R, there is basically nobody to complain to if something doesn't work.
● The R programming language is much slower than other programming languages such as Python and MATLAB.
Applications of R:  
● We use R for data science. It gives us a broad variety of libraries related to statistics. It also provides an environment for statistical computing and design.
● R is used by many quantitative analysts as their programming tool. It helps in data importing and cleaning.
● R is a widely prevalent language, so many data analysts and research programmers use it. Hence, it is used as a fundamental tool in finance.
● Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro and many more are using R nowadays.
What is Microsoft Excel?

Microsoft Excel is a spreadsheet program used to record and analyze numerical and statistical data. Microsoft Excel provides multiple features to perform various operations like calculations, pivot tables, graph tools, macro programming, etc. It is compatible with multiple operating systems like Windows, macOS, Android, and iOS.

An Excel spreadsheet can be understood as a collection of columns and rows


that form a table. Alphabetical letters are usually assigned to columns, and numbers
are usually assigned to rows. The point where a column and a row meet is called a

cell. The address of a cell is given by the letter representing the column and the
number representing a row.
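To tie the two tools of this section together, Excel data can be read into R for further analysis. The sketch below assumes the readxl package is available and uses a hypothetical workbook name.

# install.packages("readxl")   # assumed to be installed
library(readxl)

# Hypothetical workbook: read the first sheet into an R data frame
sales <- read_excel("sales_data.xlsx", sheet = 1)

head(sales)      # first few rows
summary(sales)   # quick descriptive statistics on each column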

Refer : https://www.guru99.com/introduction-to-microsoft-excel.html

**********************************************************************************

