Module 1
INTRODUCTION TO ANALYTICS

Introduction

Definition:
Every business generates data from its operations, activities, and transactions. This data is the basis of business analytics. Business analytics is used to gain insights into the past performance of a business: what happened, what improved, what declined, and what changed.
One of the goals of these analytics is to predict the future of the business: look at past trends and project how the business will perform. It helps decision makers make data-driven business decisions. Exploring data and finding patterns and reasons helps in understanding the behavior of customers and in adjusting business activities to improve business outcomes.
If you have been watching the analytics horizon over the past few years, there
are a few buzz words that stand out, the most important of them being: Business
analytics, business intelligence, data engineering, and data science. How are these
terms different? How are they similar? Let us find out.
When you look at the various activities associated with data inside a business, they can be described as follows. It starts with getting data from data sources and building data pipelines. Then data is processed, transformed, and stored. The processed data is used to build dashboards and reports. This then becomes the basis for exploratory analytics, statistical modeling, and machine learning. These then translate into business recommendations and actions. So, which of these process elements are covered by each of these terms? Let us start with data engineering. Data engineering covers the acquisition, processing, transformation, and storing of data. This is the heavy-lifting work to get the data ready for analytics.
Business analytics covers the remaining activities: dashboards and reports, exploratory analytics, statistical modeling, machine learning, and delivering business actions out of these. Data science covers all the process activities. It is data engineering and business analytics combined into one.
Business analytics happens in stages. The first stage is descriptive analytics. This stage answers the question, what happened? It summarizes the past performance of the business.
The second stage is exploratory data analytics. This stage answers the
question, what is going on? This is a deep dive into the data to understand behavior
and discover patterns.
The third stage is explanatory analytics. This stage answers the question, why did it happen? This is a targeted exploration and reason-finding effort to answer a specific business question.
The fourth stage is predictive analytics. This stage answers the question,
what will happen? It tries to predict the future based on the events that happened in
the past.
The fifth stage is prescriptive analytics. This stage recommends steps and
actions to take advantage of future predictions and to come up with business plans.
The final stage is experimental analytics. This stage provides guidance on how well the prescriptions will work based on expected environmental behavior.
Now that you know a bit about the stages, let's take a brief look at the
processes involved in business analytics.
Process of Business Analytics
An organization has multiple data sources. There are data files or databases that capture enterprise data. There is data captured from the internet, like keywords and trending items from social media. There is also data captured from mobile devices. All these data elements are first captured or acquired and then fed into a data transport system.
The data transport system is a data bus that can contain multiple technologies
to reliably transport data from the sources to the central data center. The data is
stored in data repositories inside the center. It then goes through an iterative process
of cleaning, filtering, and transformation to become ready for analytics purposes.
This is then stored again in the central repository. There are analytics products
that will run on top of the transformed data and provide users and analysts with
reports, dashboards, and exploratory capabilities. This data then will be the source
for machine learning.
Data scientists build machine learning models based on this data. The
findings of all analytics lead to prescriptions for the business. They are then taken up
by executives and converted into business actions and implemented for
improvements.
1. Descriptive analytics
Descriptive analytics tries to answer the question, what happened? The goal of this analytics is to present numerical and summarized facts about the past performance of the business, to help analysts understand the events that happened during the past period.
It is the earliest form of analytics and has existed for many decades. Even the primitive, manual reporting systems of the typewriter age had it. It is called reporting in many software applications.
These reports can then be exported to various formats like CSV, Excel, et cetera, and distributed through email or even hard copies. Typically, there are no user-driven views in descriptive analytics; all users get the same data. Let us now look at some tools and techniques for descriptive analytics.
What are the tools and techniques that are used in descriptive
analytics?
This example uses the segment age group to aggregate the data. Such a table typically contains totals and subtotals for profiling variables. The profiling variables in this table are total patients and average cholesterol, and it has totals for both. The table could also include percentage values to indicate what percentage a given segment accounted for of the profiling variable.
The next technique is time range comparisons. The goal of time range comparisons is to show how various segments performed during different time ranges. These comparisons are usually pre-canned, using standard time periods like yesterday, this week, last week, et cetera. They can also be scheduled to run periodically and produce outputs, which can then be distributed to a set of viewers or shown through a dashboard.
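As a rough illustration, here is a minimal sketch of a time range comparison in R; the daily sales table and its column names are hypothetical, not from the text.

# Hypothetical daily sales data: 'date' and 'amount' are made-up columns
sales <- data.frame(
  date   = seq(as.Date("2024-01-01"), as.Date("2024-01-14"), by = "day"),
  amount = c(120, 90, 150, 130, 110, 95, 160, 140, 100, 170, 125, 105, 155, 135)
)

# Split the data into two standard time periods
this_week <- subset(sales, date >= as.Date("2024-01-08"))
last_week <- subset(sales, date <  as.Date("2024-01-08"))

# Compare the totals for the two periods
c(last_week = sum(last_week$amount), this_week = sum(this_week$amount))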
2. Exploratory analytics
The goal of exploratory data analytics is to answer the question, what is going on? It is essentially a deep dive into the data in an ad-hoc, yet structured, manner to understand patterns and confirm hypotheses.
The best analogy for exploratory analytics is a hound that picks up a scent and chases it. The analyst typically starts by picking up a scent, which is a trend or clue, and then chases it down through the data pile until they uncover something interesting and useful.
This kind of deep dive may take a few minutes or extend to a few days,
depending on the hypothesis and the data size. Exploratory analytics is usually done
in an ad-hoc manner. It does not use standard reports and dashboards. Rather, the
analyst queries the data through SQL, spreadsheets, or other ad-hoc analytics tools.
They may even write programs to do the work.
Exploratory analytics is also need-based. It is not done regularly, like a weekly or daily activity. Rather, it is triggered by a business event or a question that needs answers. This kind of analytics is usually done by analysts or statisticians in the business. So, what is the process? How does it work?
Analysts take the relevant data, segment, and profile it to understand patterns related to the question and find answers or root causes.
Then they share the results with the people who asked the question in the first place. This process is iterative. Answers to a question will trigger more questions, which require additional rounds of this process until a satisfactory answer is reached.
The first and most popular tool is called segmentation and profiling. In this technique, data is repeatedly grouped by different data columns, typically of text or ID type. These groups are called segments. Segments are groups of similar entities we want to divide and analyze, like age group, gender, race, education, et cetera.
The segments are then summarized using profiling variables. These are usually Boolean or numeric variables that are aggregated in some fashion, like sum, average, maximum, et cetera. In the example shown, age group is the segment. We break down the grand set of patients by individual segment values, namely 20 to 40, 40 to 60, and 60 to 80.
Then we profile these segment values. We profile for total patients and average cholesterol. This table helps us understand how the profiling variables, total patients and average cholesterol, vary by individual segment values.
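Here is a minimal sketch of this segmentation and profiling step in R; the ten patient records below are invented for illustration.

# Hypothetical patient data
patients <- data.frame(
  age         = c(25, 34, 47, 52, 61, 68, 73, 39, 55, 64),
  cholesterol = c(180, 195, 210, 225, 240, 230, 250, 200, 215, 235)
)

# Segment: bin the continuous age variable into the age groups from the text
patients$age_group <- cut(patients$age, breaks = c(20, 40, 60, 80),
                          labels = c("20-40", "40-60", "60-80"))

# Profile: total patients and average cholesterol per segment
aggregate(cholesterol ~ age_group, data = patients,
          FUN = function(x) c(total_patients = length(x),
                              avg_cholesterol = round(mean(x), 1)))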
Then come graphical tools, which help plot data on a graph and look at trends. There are several such tools, like pie charts, bar charts, histograms, et cetera. Graphical tools help reveal patterns in data, especially when data sizes are too large to view in table form. They are also a great tool set for presenting data and findings to other interested folks.
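A minimal sketch of these graphical tools in base R, reusing the hypothetical patient data from the previous sketch:

hist(patients$cholesterol, main = "Cholesterol distribution",
     xlab = "Cholesterol")                      # distribution of one variable
barplot(table(patients$age_group),
        main = "Patients per age group")        # counts by segment
pie(table(patients$age_group),
    main = "Age group share")                   # share of each segment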
Finally, there is statistics. Several statistical techniques help us gain an understanding of data patterns. There are descriptive or summary statistics that give an overall picture of the data, like mean, standard deviation, range, et cetera. Then there are distributions, like the normal and binomial distributions, that help in understanding data patterns. Fitting data to a specific distribution helps extrapolate patterns to future timelines. There are other tests, like analysis of variance, which again help in understanding how well a data set conforms to a specific pattern.
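A minimal sketch of these statistical techniques in R, again on the hypothetical patient data from the earlier sketch:

x <- patients$cholesterol

mean(x); sd(x); range(x)   # descriptive statistics: center, spread, range
summary(x)                 # five-number summary in one call

# Check how well the data conform to a normal distribution
shapiro.test(x)

# Analysis of variance: does cholesterol differ across age group segments?
summary(aov(cholesterol ~ age_group, data = patients))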
3. Explanatory analytics
Let us now look at the various tools and techniques that are helpful in explanatory analytics. We first look at drill downs. In a drill down, we take a specific aggregation and then drill down on that data into further segments to discover abnormal patterns.
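Here is a minimal sketch of a drill down in R; the orders table with region and city columns is hypothetical.

# Hypothetical orders data
orders <- data.frame(
  region = c("East", "East", "East", "West", "West", "West"),
  city   = c("Boston", "Boston", "NYC", "LA", "LA", "Seattle"),
  amount = c(100, 120, 300, 90, 80, 260)
)

# Top-level aggregation by region
aggregate(amount ~ region, data = orders, FUN = sum)

# Drill down into one region to look for abnormal patterns by city
aggregate(amount ~ city, data = subset(orders, region == "East"), FUN = sum)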
The next technique is the fishbone diagram, which is used for causal analysis. A fishbone diagram breaks down a given effect into possible causes and shows how much these causes have influenced the result.
4. Predictive analytics
Let us now start exploring the hot topic in analytics, predictive analytics.
What is predictive analytics?
It answers the question, what will happen? The goal of predictive analytics is to identify the likelihood of future outcomes based on historical data, statistics, and machine learning. It tries to predict the behavior of humans or systems based on trends identified earlier. What does it cover?
First and foremost, predictive analytics deals with data-driven prediction, not logic- or intuition-driven prediction. Humans have predicted for centuries using intuition and experience, but predictive analytics allows a machine, an algorithm, or a formula to do it based on data. Predictive analytics uses historical data to understand how various entities behave under business situations.
Here is the process. First, data engineers collect data from various sources and prepare it for analytics. Preparation includes cleaning, filtering, transformation, and aggregation. Analysts then explore the data to identify trends and behavior indicators. Data scientists then use the data to build machine-learning models that can be used to predict behavior. Then they test the
model to ensure accuracy of predictions. These models can then be deployed in
production. They can be used in real time or batch mode to predict future behavior.
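To make this flow concrete, here is a minimal sketch in R of building, testing, and using a predictive model; the churn data set is simulated, and every name and number in it is an assumption, not from the text.

set.seed(42)

# Simulated, hypothetical customer data
churn <- data.frame(
  tenure = runif(200, 1, 60),    # months as a customer
  spend  = runif(200, 10, 200)   # monthly spend
)
# Simulated outcome: short-tenure, low-spend customers churn more often
churn$churned <- rbinom(200, 1, plogis(2 - 0.05 * churn$tenure - 0.01 * churn$spend))

# Split into training and test sets
idx   <- sample(nrow(churn), 150)
train <- churn[idx, ]
test  <- churn[-idx, ]

# Build the model on training data
model <- glm(churned ~ tenure + spend, data = train, family = binomial)

# Test the model to check the accuracy of predictions before deployment
pred <- predict(model, newdata = test, type = "response") > 0.5
mean(pred == test$churned)   # proportion of correct predictions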
Let us investigate the tools and techniques used in machine learning for predictive
analytics.
We start off with the data preparation techniques used for predictive analytics. Data cleansing involves removing bad data or badly formatted data. Standardization involves converting data elements to a standard format, like a date format, name format, etc. Binning involves converting a continuous variable into ranges, like converting age into age ranges.
Indicator variables are created by converting text data to integer values, like converting male and female to one and two. Data imputation involves providing data for missing values. Centering and scaling involves adjusting values so that different data elements are on the same scale. For example, salary is in the range of thousands and age is in the range of tens; centering and scaling brings them onto a common scale. Additionally, techniques like term frequency-inverse document frequency (TF-IDF) are used to convert text data into numerical data for prediction purposes.
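A minimal sketch of several of these preparation techniques in R, on a small hypothetical data set:

# Hypothetical raw data with missing values and text columns
raw <- data.frame(
  age    = c(23, 45, NA, 61, 37),
  gender = c("male", "female", "female", "male", "female"),
  salary = c(42000, 58000, 51000, NA, 47000)
)

# Data imputation: fill missing values with the column mean
raw$age[is.na(raw$age)]       <- mean(raw$age, na.rm = TRUE)
raw$salary[is.na(raw$salary)] <- mean(raw$salary, na.rm = TRUE)

# Binning: convert the continuous age variable into ranges
raw$age_range <- cut(raw$age, breaks = c(20, 40, 60, 80))

# Indicator variables: convert text to integer codes
# (factor levels are alphabetical, so female = 1 and male = 2 here)
raw$gender_code <- as.integer(factor(raw$gender))

# Centering and scaling: bring age and salary onto the same scale
raw$age_scaled    <- as.numeric(scale(raw$age))
raw$salary_scaled <- as.numeric(scale(raw$salary))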
Next, we look at machine learning types. Machine learning is usually classified into two types: supervised learning and unsupervised learning. Supervised learning trains models on data where the outcome is already known, while unsupervised learning finds structure in data without known outcomes. Clustering is an unsupervised technique that groups entities based on similar attribute values. One example is to group patients into three groups based on similar medical attributes. Association rules are used to determine the affinity of one item to another item and then use that affinity to make business decisions.
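Here is a minimal sketch of the clustering example in R, grouping hypothetical patients into three groups based on similar medical attributes; the attribute values are made up.

set.seed(7)

# Hypothetical medical attributes for ten patients
patient_attrs <- data.frame(
  cholesterol    = c(180, 185, 240, 245, 210, 215, 250, 190, 220, 235),
  blood_pressure = c(110, 115, 150, 155, 130, 132, 160, 118, 135, 148)
)

# k-means clustering into three groups on scaled attributes
km <- kmeans(scale(patient_attrs), centers = 3)
km$cluster   # cluster assignment for each patient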
Here are some practical guidelines for predictive analytics. First, when you try to use a model for actual business purposes, focus on the business gain or return on investment, not just model accuracy. Data scientists sometimes get obsessed with model performance and lose focus on business ROI. More data means better trends and better predictions, so focus on getting more data. Algorithms can only do so much with insufficient data.
Test model building with multiple algorithms and see which one fits best for
your use case based on accuracy and response times. Ensure that all relevant
variables that impact the outcome are considered during prediction. It is very
important to include relevant variables and eliminate irrelevant ones. Test for model
accuracy repeatedly with multiple subsets of data.
The accuracy should be stable across multiple data sets for the model to be
effective in the field. Predictive analytics teams should have the right composition of
talent including data engineering, statistics, machine learning, and business.
This will help in planning, building, and executing the right model, so that the business can focus on customers who are likely to buy and spend its marketing dollars on them. We then move on to the next stage in analytics, prescriptive analytics.
5. Prescriptive Analytics
The goal of prescriptive analytics is to identify ways and means to take
advantage of the findings and predictions provided by earlier stages of analytics,
namely exploratory, explanatory, and predictive analytics. Exploratory, explanatory,
and predictive analytics generate several findings, patterns, and predictions.
But not all of them can be used in the business field. Using these findings from analytics in business requires additional analysis to understand how they will work in the field and what benefits and risks they carry. We need to consider things like the budget, time, and human resources that are required to implement the findings.
First, results from various analytics projects and efforts are collected by the team. Out of these, key findings that can be taken advantage of are extracted for further discussion, and based on these, strategies for the field are devised. For all the strategies and alternatives, costs and benefits are analysed.
The first technique used is linear programming, an optimization technique that finds the best outcome for an objective subject to a set of constraints. Typical constraints in linear programming include budget, time, resource limits, et cetera. Here is an example of what linear programming looks like.
We are trying to find the values of x and y such that we get the maximum value for z. The constraints are that x should be between zero and one hundred, y should be between 50 and 150, and x should be greater than y. Z could be the total units sold, while x and y could be the numbers of individual products that can be sold based on inventory capacity.
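Here is a minimal sketch of this linear program in R using the lpSolve package. The objective z = x + y is an assumption, since the text only says we maximize z, and the strict constraint x > y is relaxed to x >= y, because linear programming works with non-strict inequalities.

library(lpSolve)

objective <- c(1, 1)          # assumed objective: maximize z = x + y

constraints <- rbind(
  c(1,  0),                   # x <= 100
  c(0,  1),                   # y <= 150
  c(0,  1),                   # y >= 50
  c(1, -1)                    # x - y >= 0, i.e. x >= y
)
directions <- c("<=", "<=", ">=", ">=")
rhs        <- c(100, 150, 50, 0)

# Variables are non-negative by default, so 0 <= x is already enforced
result <- lp("max", objective, constraints, directions, rhs)
result$solution   # optimal values of x and y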
The second technique used is decision analysis. First, you evaluate the individual alternatives for benefits to the business,
both monetary and non-monetary. Next, you estimate costs for the business,
including budget, time, and resources. You then look at the outside world and
evaluate threats to the strategy. This includes environmental, political, economic, and
competition related threats that can impact the performance of the strategy. You also
try to look at new opportunities created by the strategies, like new customers,
upsells, et cetera.
One of the key items you want to model is uncertainty. Several statistical
techniques exist for this purpose. Decision analysis involves a lot of teamwork. It
requires members from different departments, like marketing, sales, IT, and
analytics to work together to come up with the overall analysis. This is then
provided to management as recommendations.
The third technique is simulation. A simulation is run for multiple scenarios and options, and the outcomes are measured and compared. Controlled simulation is also done by modifying one input and measuring its impact on the output.
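A minimal sketch of such a simulation in R; the demand, cost, and price numbers are hypothetical assumptions.

set.seed(123)
n_runs <- 10000

# Model uncertainty in the inputs: demand and unit cost vary per run
demand    <- rnorm(n_runs, mean = 1000, sd = 150)
unit_cost <- runif(n_runs, min = 4, max = 6)
price     <- 9

profit <- demand * (price - unit_cost)

# Measure and compare the outcomes across all runs
summary(profit)
quantile(profit, c(0.05, 0.95))   # likely range of outcomes

# Controlled simulation: modify one input and measure its impact
profit_high_price <- demand * (10 - unit_cost)
mean(profit_high_price) - mean(profit)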
Next, we look at the use case to see how prescriptive analytics can be used.
Analytics teams tend to skip prescriptive analytics and jump straight into
implementation.
If key elements get missed out, the analysis will not be accurate and relevant. Choose the right mathematical models for simulation, and do test runs to make sure that the simulation works as desired.
R Programming
● R is used as a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
● It’s a platform-independent language. This means it can run on all operating systems.
● It’s an open-source free language. That means anyone can install it in any
organization without purchasing a license.
● The R programming language is not only a statistics package but also allows us to integrate with other languages (C, C++). Thus, you can easily interact with many data sources and statistical packages.
● The R programming language has a vast community of users and it’s growing
day by day.
● R is currently one of the most requested programming languages in the Data Science job market, which makes it one of the hottest trends nowadays.
Statistical Features of R:
Basic statistics: The most common basic statistics are the mean, mode, and median. These are all known as “measures of central tendency.” Using the R language, we can measure central tendency very easily, as the sketch below shows.
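A minimal sketch of measuring central tendency in R; base R provides mean() and median(), while mode needs a small helper function, written here as an assumption since R has no built-in statistical mode.

x <- c(2, 4, 4, 5, 7, 9, 4, 6)

mean(x)     # arithmetic mean
median(x)   # middle value

# Helper for the statistical mode: the most frequent value
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)   # returns 4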
Static graphics: R is rich with facilities for creating and developing interesting
static graphics. R contains functionality for many plot types including graphic maps,
mosaic plots, biplots, and the list goes on.
Probability distributions: R provides functions for working with many probability distributions, including the Binomial Distribution, Normal Distribution, Chi-squared Distribution, and many more.
Data analysis: It provides a large, coherent and integrated collection of tools for data
analysis.
Programming Features of R
R scripts are saved in files with the .r extension and can be run from the command line, for example: Rscript file_name.r
Advantages of R:
● R is the most comprehensive statistical analysis package, as new technology and concepts often appear first in R.
● The R programming language is open source, so you can run R anywhere and at any time.
● The R programming language is suitable for GNU/Linux and Windows operating systems.
● R programming is cross-platform and runs on any operating system.
● In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R:
● In the R programming language, the standard of some packages is less than perfect.
● R gives little attention to memory management, so the R programming language may consume all available memory.
● In R, there is basically nobody to complain to if something doesn’t work.
● The R programming language is much slower than other programming languages such as Python and MATLAB.
Applications of R:
● We use R for Data Science. It gives us a broad variety of libraries related to statistics. It also provides an environment for statistical computing and design.
● R is used by many quantitative analysts as their programming tool, and it helps in data importing and cleaning.
● R is a prevalent language, so many data analysts and research programmers use it. Hence, it is used as a fundamental tool in finance.
● Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro, and many more are using R nowadays.
What is Microsoft Excel?
Microsoft Excel is a spreadsheet program that organizes data in a grid of rows and columns. The intersection of a row and a column is called a cell. The address of a cell is given by the letter representing the column and the number representing the row.
Refer : https://www.guru99.com/introduction-to-microsoft-excel.html
**********************************************************************************
Data Mining