Intro To DA, Data Sources and Representation


DATA ANALYTICS

UE21CS342AA2
UNIT-1
Lecture 1 : Introduction to DA , Data
Sources and Representations
Department of Computer Science and Engineering
Data Analytics
Unit 1
Lecture 1 : Introduction to Data Analytics , Data Sources and
Representations
DATA ANALYTICS
Evaluation Policy (tentative)
Component | Description | Weight
ISA-1 | CBT: 40 marks scaled to 20 (Unit 1 and Unit 2) | 20
ISA-2 | CBT: 40 marks scaled to 20 (Unit 3 and Unit 4) | 20
Experiential Learning: Worksheets (Lab + Assignment) | Worksheets/Assignments (1 per unit, with multiple parts) | 5
Experiential Learning: Hackathon | Hackathon (1 for the entire course): 2 marks for participation + 2 marks for working + 1 mark for finishing in the top 30% of the leaderboard (for those not in the top 30%, class participation (attendance + questions asked and answered) in the lecture hours + invited talks can count towards the 1 mark) | 5
Experiential Learning: Seminar/presentation | Problem statement + team: 2 marks; Literature review + EDA: 2 marks; Presentation: 2 marks; Q and A: 2 marks; Peer review participation + evaluation report: 2 marks; total of 10 marks scaled to 5 | 5
ESA (pen and paper) | 100 marks scaled to 50 | 50
DATA ANALYTICS
Evaluation Policy (tentative)

6 marks: hackathon
2 marks - timely submission that follows instructions
2 marks - working (includes a similarity score using the MOSS tool with a reasonable threshold)
(those who get a high similarity score can be given a chance through a viva-voce;
if they have understood what they submitted and can answer a few questions, they can
get these 2 marks)
2 marks - for being in the top 30% of the leaderboard
(for those who miss this, an additional assignment/task can be defined by each faculty
member for their class to allow students to earn those points; a case study presentation
has been included in the slides as an example only)
Hope this is fair and acceptable to everyone on the team.
About ISA/ESA - a reference list of questions for each Unit, from which exam questions will be
asked.
DATA ANALYTICS
What is data analytics?

⚫ The science of examining raw data to elicit patterns, develop insights, and
draw conclusions that support business decisions.

⚫ The need :
Business decisions are complex: there exist several alternative solutions,
complex interdependent factors, and little time available to decide.

⚫ Analysis vs analytics
⚫ Analysis – Examining and understanding past data.
⚫ Analytics – Analysis + forecasting (or predictive modeling).
DATA ANALYTICS
Pyramid of analytics applications and data-driven decision making
DATA ANALYTICS
What does it involve?
DATA ANALYTICS
Why is it important?

Data Analytics is used in all these application areas…

… and more!
DATA ANALYTICS
Few examples

• Banking : To reduce cheque clearance time and to determine loan
approval and interest rates.
• E-Commerce : To analyze buyer behavior to plan inventory and
recommend products.
• Retail stores : Shelf space allocation to drive profits up.
• OTT Platforms : To recommend content a user would like.
DATA ANALYTICS
Lifecycle
DATA ANALYTICS
Skills Required

And…
a secret ingredient

Intuition or
deductive reasoning
and domain knowledge
DATA ANALYTICS
Case Study

Indian online grocery store bigbasket.com

Problem context driving analytics : the “Did you forget?” feature

The ability to predict the items that a customer may have forgotten to
order can have a significant impact on the profits of online grocers
such as bigbasket.com.

The ability to ask the right questions is an important success criterion
for analytics projects.
DATA ANALYTICS
Case Study

Indian online grocery store bigbasket.com

Technology:
To find out whether a customer has forgotten to place an order
for an item.

Information technology is used for data capture, data storage,
data preparation, data analysis, data sharing and solution
deployment.

An important output of analytics is the automation of actionable
items derived from analytical models, which is usually achieved
using IT.
DATA ANALYTICS
Case Study
Indian online grocery store bigbasket.com

Data science is the most important component of analytics. It consists of statistical
and operations research techniques, machine learning and deep learning
algorithms.

The objective of the data science component of analytics is to identify the most
appropriate statistical model or machine learning algorithm, based on a
measure of accuracy.

Example: “Did you forget?” prediction is a classification problem in which customers
are classified into:
1. Forget
2. Not forget
DATA ANALYTICS
Data Sources
DATA ANALYTICS
Data Sources

• Lots of data is collected and warehoused every day
• Yahoo has petabytes of web data
• Facebook has billions of active users
• Purchases at department/grocery stores and e-commerce sites
• Amazon handles millions of visits per day
• Bank/credit card transactions
DATA ANALYTICS
How large is big (data)?
• 1 bit
• 1 byte = 8 bits
• 1 KB = 1024 bytes
• 1 MB = 1024 KB (kilobytes)
• 1 GB = 1024 MB (megabytes)
• 1 TB = 1024 GB (gigabytes) ≈ 10^12 bytes
• 1 PB = 1024 TB (terabytes) ≈ 10^15 bytes
20 PB = amount of data processed by Google per day!
• 1 EB = 1024 PB (petabytes)
• 1 ZB = 1024 EB (exabytes)
• 1 YB = 1024 ZB (zettabytes)

• What is a Domegemegrottebyte?
DATA ANALYTICS
Data Sources

Data collected and stored at enormous speeds:
• Remote sensors on a satellite - NASA EOSDIS archives petabytes
of earth science data per year
• Telescopes scanning the skies - sky survey data
• High-throughput biological data - gene expression data
• Scientific simulations - terabytes of data generated in a few hours
[Figure labels: fMRI data from brain, sky survey data, surface
temperature of Earth, gene expression data]
DATA ANALYTICS
What is Data?
• Collection of data objects and their attributes.
• An attribute is a property or characteristic of an object.
o Examples: eye color of a person, temperature, etc.
o An attribute is also known as a variable, field, characteristic,
dimension, or feature.
• A collection of attributes describes an object.
o An object is also known as a record, point, case, sample,
entity, or instance.

Objects and their attributes (example):
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes


DATA ANALYTICS
Attribute Values

• Attribute values are numbers or symbols assigned to an


attribute for a particular object

• Distinction between attributes and attribute values


• Same attribute can be mapped to different attribute
values
• Example: height can be measured in feet or meters

• Different attributes can be mapped to the same set of


values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
DATA ANALYTICS
Types of Attributes

• There are different types of attributes


• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height {tall, medium, short}
• Interval
• Examples: Calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: Temperature in Kelvin, length, counts, elapsed time (e.g., time to
run a race)
DATA ANALYTICS
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of
documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite
number of digits.
• Continuous attributes are typically represented as floating point variables.
DATA ANALYTICS
Data Representations
• Structured Data : Structured data means that the data is
described in a matrix form with labelled rows and columns.
• Unstructured Data : Any data that is not originally in matrix
form with rows and columns is unstructured data.
• Semi-structured Data : Semi-structured data (also known as
partially structured data) is a type of data that doesn’t
follow the tabular structure associated with relational
databases or other forms of data tables but does
contain tags and metadata to separate semantic
elements and establish hierarchies of records and
fields.
DATA ANALYTICS
Data Representations
⮚ Relational databases and spreadsheets. – Structured Data

⮚ Text and multimedia content. Photos and graphic images,


videos, streaming instrument data, webpages, PDF files,
PowerPoint presentations, emails, blog entries, wikis and word
processing documents. - Unstructured Data

⮚ XML documents and NoSQL databases. – Semi structured Data

⮚ For example, word processing software now can include


metadata showing the author's name and the date created,
with the bulk of the document just being unstructured text.
DATA ANALYTICS
Data Representations
■ Record
■ Relational records

■ Data matrix, e.g., numerical matrix, crosstabs

■ Document data: text documents: term-frequency

vector
■ Transaction data

■ Graph and network


■ World Wide Web

■ Social or information networks

■ Molecular Structures
DATA ANALYTICS
Data Representations

■ Ordered
■ Video data: sequence of images

■ Temporal data: time-series

■ Sequential Data: transaction

sequences
■ Genetic sequence data

■ Spatial, image and multimedia


■ Spatial data: maps

■ Image data

■ Video data
DATA ANALYTICS
Data Representations-Record Data
■ Data that consists of a collection of records, each of which
consists of a fixed set of attributes

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
DATA ANALYTICS
Data Representations-Data Matrix
■ If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a
distinct attribute

■ Such a data set can be represented by an m by n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
DATA ANALYTICS
Data Representations-Document Data
■ Each document becomes a `term' vector,
■ each term is a component (attribute) of the vector,

■ the value of each component is the number of times the

corresponding term occurs in the document.

Document | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
DATA ANALYTICS
Data Representations-Transaction data
■ A special type of record data, where
■ each record (transaction) involves a set of items.

■ For example, consider a grocery store. The set of
products purchased by a customer during one shopping
trip constitutes a transaction, while the individual
products that were purchased are the items.

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
DATA ANALYTICS
Data Representations
■ Graph Data
■ Examples: Generic graph and HTML Links

[Figure: a generic graph with numbered nodes]

<a>Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers</a>

■ Chemical Data
■ Benzene Molecule: C6H6
DATA ANALYTICS
Data Representations-Ordered Data
■ Sequences of transactions (items/events)
■ Spatio-temporal data
■ Genomic sequence data; an element of the sequence:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
DATA ANALYTICS
Data Representations- Data Warehouse
“A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
Sales volume as a function of product, month, and region
DATA ANALYTICS
Data Representations
DATA ANALYTICS
Typical OLAP Operations

• Roll up (drill-up): summarize data


• by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary
or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
DATA ANALYTICS
Typical OLAP Operations

Roll-up: this operation aggregates similar data attributes along the
same dimension. For example, if the data cube displays the daily
income of a customer, we can use a roll-up operation to find his
monthly income.
In our case we roll up from cities to countries as shown.
DATA ANALYTICS
Typical OLAP Operations

Drill-down: this operation is the reverse of the roll-up
operation. It allows us to take particular information and
then subdivide it further for finer-granularity analysis. It
zooms into more detail. For example, if India is an
attribute of a country column and we wish to see villages
in India, then the drill-down operation splits India into
states, districts, towns, cities and villages and then displays
the required information.
In our case we drill down from quarters to months.
DATA ANALYTICS
Typical OLAP Operations

• Slicing: this operation filters out the unnecessary portions.
Suppose in a particular dimension the user doesn’t need
everything for analysis, but rather a particular attribute.

• Dicing: this operation does a multidimensional cut: it not
only cuts one dimension but can also go to another
dimension and cut a certain range of it. As a result, it
looks more like a subcube cut out of the whole
cube (as depicted in the figure).
DATA ANALYTICS
Typical OLAP Operations

• Pivot: this operation is very important from a viewing


point of view. It basically transforms the data cube in
terms of view. It doesn’t change the data present in the
data cube. For example, if the user is comparing year
versus branch, using the pivot operation, the user can
change the viewpoint and now compare branch versus
item type.
DATA ANALYTICS
A Quick Glance on OLAP operations
DATA ANALYTICS
Test your understanding

• To what type of attribute does shoe size belong?


Interval

• Which OLAP operation are you likely to perform at the end of the financial
year?
Roll-Up

• The example here is a 3D cube having attributes like
branch (A, B, C, D), item type (home, entertainment, computer,
phone, security) and year (1997, 1998, 1999). If the user wants to
observe only “branch A” data, which OLAP operation must
be performed?
Slicing
DATA ANALYTICS
References

• Business Analytics by U. Dinesh Kumar, Wiley, 2nd Edition, 2022,
Chapters 1.1-1.7
• Data Mining : Concepts and Techniques by Han, Kamber and Pei,
The Morgan Kaufmann Series in Data Management Systems,
3rd Edition, Chapter 4.2.5
• https://www.geeksforgeeks.org/data-cube-or-olap-approach-in-
data-mining/
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 2 : The R programming
environment and descriptive statistics

Department of Computer Science and Engineering


Data Analytics
Unit 1
Lecture 2 : The R programming environment and descriptive statistics.
DATA ANALYTICS
What is R?

• Free, open-source interpreted language
Created by Ross Ihaka & Robert Gentleman in 1993.
• Most widely used data analysis software
Used by 2M+ data scientists, statisticians and analysts.
• Most powerful statistical programming language
Flexible, extensible and comprehensive for productivity.
• Creates beautiful and unique data visualizations
As seen in the New York Times, Twitter and Flowing Data.
• Thriving open-source community
At the leading edge of analytics research.
DATA ANALYTICS
Applications of R

• R allows you to collect data in real time, perform statistical and
predictive analysis, create visualizations and communicate
actionable results to stakeholders.

• It houses more than 9,100 packages of statistical functions.

• R’s expressive syntax allows you to quickly import, clean and
analyze data from various data sources.

• R also has charting capabilities, which means you can plot your
data and create interesting visualizations from any dataset.
DATA ANALYTICS
Applications of R

• R is used in predictive analytics and machine learning.

• Packages for ML tasks like linear and non-linear regression, decision


trees, linear and non-linear classification and many more.

• R has been used primarily in academics and research and is great for
exploratory data analysis.

• In recent years, enterprise usage has rapidly expanded. It is used by


statisticians, engineers, and scientists without computer programming
skills.

• It is popular in academia, finance, pharmaceuticals, media, and


marketing.
DATA ANALYTICS
Data Handling Capabilities

• R is great for data analysis because of its huge number of


packages, readily usable tests, and the advantage of using
formulas.

• It can handle basic data analysis without needing to install

packages.

• Big datasets require the use of packages such as data.table and

dplyr.
DATA ANALYTICS
Preference by Industry
DATA ANALYTICS
History of R

• The S language has been developed since the late 1970s by John
Chambers and colleagues at Bell Labs as a language for programming
with data.
• The S language combines ideas from a variety of sources (awk, Lisp,
APL) and provides an environment for quantitative computation and
visualization.
• It provides an explicit and consistent structure for manipulating,
statistically analyzing and visualizing data.
• S-Plus is a commercialization of the Bell Labs framework. It is “S” plus
“graphics”.
• R is a free implementation of a dialect of the S language.
DATA ANALYTICS
History of R

• R is an open-source statistical environment/platform developed by
Robert Gentleman and Ross Ihaka (University of Auckland) during
the 1990s.
• R is currently maintained by the R core development team, a hard-
working, international team of volunteer developers.

• The primary R system is available from the Comprehensive R Archive
Network, also known as CRAN.

• CRAN also hosts many add-on packages that can be used to extend
the functionality of R. Over 6,789 packages developed by users and
programmers around the world are available on CRAN.
DATA ANALYTICS
Getting Started

Finding out the latest version of R:


• To find out the latest version of R, you can look at
the CRAN (Comprehensive R Archive Network) website,
http://cran.r-project.org/.
DATA ANALYTICS
Getting Started

Installing R on Windows:
• To install R on your Windows computer, follow the below steps :
• Go to https://cran.rstudio.com/bin/windows/base/
• Alternative : http://ftp.heanet.ie/mirrors/cran.r-project.org
• Under “Download and Install R” , click on the “Windows” link.
• Download the required .exe file. The latest version of R is 4.3.1. Click on the
link which says “Download R 4.3.1 for Windows.”
• After downloading, double click on the executable to run it.
• Choose English as the language to install it.
DATA ANALYTICS
Download and install RStudio on Windows

Visit https://www.rstudio.com/products/rstudio/download/
and click on DOWNLOAD RSTUDIO DESKTOP

List of IDEs
■ RStudio
■ StatET for R (eclipse based)
■ R-Brain IDE (RIDE)
■ IntelliJ IDEA
■ R Tool for Visual Studio
DATA ANALYTICS
Setting your Working Directory

R getwd() Function
• The working directory is the directory where R finds all R files for
reading and writing.
• getwd() returns an absolute file path representing the
current working directory of the R process.
• getwd() Output: "C:/Users/*****/Documents"
R setwd() Function
• setwd(dir) is used to set the working directory to dir.
• setwd("e:/folder") Output: sets the working directory to "e:/folder"
DATA ANALYTICS
Setting your Working Directory
R dir() Function
• dir() lists all the files in a directory.
• dir() Output: lists all files in the current working directory
R ls() Function
• ls() is a function in R that lists all the objects in the working
environment.
• ls() Output: lists all objects in the workspace
• It can be used when you want to clean the environment before
running the code. The following command will remove all objects
from the R environment:
• rm(list = ls())
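A short R session tying these functions together; the paths shown are
illustrative, not real:

  getwd()                # e.g. "C:/Users/*****/Documents"
  setwd("e:/folder")     # change the working directory
  dir()                  # list files in the new working directory
  ls()                   # list objects in the workspace
  rm(list = ls())        # clear the workspace before a fresh run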
DATA ANALYTICS
Getting help with R

• To get general help, just type the command below:
• help.start()
• To access documentation for the standard lm (linear
model) function, for example, enter the command:
• help(lm) / help("lm") / ?lm / ?"lm"
• To see the list of pre-loaded datasets, type the function:
• data()
• A guide to get started on working with built-in datasets:


DATA ANALYTICS
Installing an R Package in RStudio

• The primary location for obtaining R packages is CRAN.
• Information about the packages available on CRAN can be obtained
with the available.packages() function:
• a <- available.packages()
• Packages can be installed with the install.packages() function in R:
• install.packages("ggplot2")
• Install multiple R packages at once with a single call to
install.packages() by placing the names of the packages in a
character vector:
• install.packages(c("caret", "ggplot2", "dplyr"))
DATA ANALYTICS
Installing an R Package in RStudio
DATA ANALYTICS
Loading R Packages

• Installing a package does not make it immediately available to you in
R; you must load the package. The library() function is used to load
packages into R:
• library(ggplot2)
• After loading a package, the functions exported by that package will
be attached to the top of the search() list (after the workspace):
• library(ggplot2)
• search()
• To save your workspace to a file, you may type save.image() or use
Save Workspace in the File menu.
• The default workspace file is called .RData
Data Analytics
Unit 1
Descriptive Statistics
DATA ANALYTICS
Revisiting types of data attributes
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names,
i.e., nominal attributes provide only enough information to distinguish
one object from another. (=, ≠) | zip codes, employee ID numbers, eye
color, sex: {male, female} | mode, entropy, contingency correlation,
χ2 test
Ordinal | The values of an ordinal attribute provide enough information
to order objects. (<, >) | hardness of minerals, {good, better, best},
grades, street numbers | median, percentiles, rank correlation, run
tests, sign tests
Interval | For interval attributes, the differences between values are
meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates,
temperature in Celsius or Fahrenheit | mean, standard deviation,
Pearson’s correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful.
(*, /) | temperature in Kelvin, monetary quantities, counts, age, mass,
length, electrical current | geometric mean, harmonic mean, percent
variation
DATA ANALYTICS
How is Interval different to Ratio?

Features | Interval scale | Ratio scale
Variable property | All variables measured in an interval scale can be
added, subtracted, and multiplied; you cannot calculate a ratio between
them. | A ratio scale has all the characteristics of an interval scale and,
in addition, ratios can be calculated; you can leverage numbers on the
scale against 0.
Absolute Point Zero | The zero point in an interval scale is arbitrary;
for example, temperature can be below 0 degrees Celsius, i.e.,
negative. | A ratio scale has an absolute zero, or character of origin;
height and weight cannot be zero or below zero (zero means
‘nothing’).
Calculation | Statistically, in an interval scale, the arithmetic mean is
calculated. | Statistically, in a ratio scale, the geometric or harmonic
mean is calculated.
Measurement | An interval scale can measure size and magnitude as
multiple factors of a defined unit. | A ratio scale can measure size and
magnitude as a factor of one defined unit in terms of another.
Example | A classic example of an interval scale is temperature in
Celsius: the difference between 50 and 60 degrees is 10 degrees, the
same as the difference between 70 and 80 degrees. | Classic examples
of a ratio scale are variables that possess an absolute zero, like age,
weight, height, or sales figures.

https://www.questionpro.com/blog/ratio-scale-vs-interval-
scale/
DATA ANALYTICS
Transformations on Data

Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were
reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value =
f(old_value), where f is a monotonic function | An attribute encompassing
the notion of good, better, best can be represented equally well by the
values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants |
The Fahrenheit and Celsius temperature scales differ in terms of where
their zero value is and the size of a unit (degree).
Ratio | new_value = a * old_value | Length can be measured in meters
or feet.

Source: Tan, Steinbach and Kumar, ‘Introduction to Data Mining’


DATA ANALYTICS
Test your understanding!

Classify the following data :


• Time of the day in terms of AM/PM
Qualitative, Nominal, Discrete
• Angles measured in degrees between 0 and 360
Quantitative, Ratio, Continuous
• Medals awarded at the Olympics
Qualitative, Ordinal, Discrete
• Brightness as measured by a light meter
Quantitative, Ratio, Continuous
• Brightness as measured by people’s judgements
Qualitative, Ordinal, Discrete
DATA ANALYTICS
Classification based on nature of data collection

• Cross-Sectional Data : Data collected on many variables of interest
at the same instance or duration of time. Example : Data on all
sitcoms released in 2022. The attributes can be budget, actors,
popularity, social media engagement and so on.

• Time-series Data : Data collected on a single variable of interest over
several time intervals, e.g., on a daily, weekly or monthly basis.
Example : Daily price of Bitcoin since its inception.

• Panel Data : Data collected on several variables (multiple


dimensions) over several time intervals. It is also known as
longitudinal data. Example : Data collected on GDP (Gross Domestic
Product) , Gini Index and rate of unemployment for several countries
over several years.
DATA ANALYTICS
Population and Sample

• Population : The set of all possible observations or records (also
called cases, subjects or data points) for a given context of the
problem.
• Sample : The subset taken from the population. In real-world
scenarios, an inference is made about the population based on
sample data.
DATA ANALYTICS
Summary statistics

• Exploratory data analysis (EDA) : A preliminary exploration of data


to better understand its characteristics.
Key motivations :
o Helps in selecting the right tool for preprocessing or analysis in
the further stages.
o Makes use of humans’ ability to recognize patterns, some of
which aren’t covered by data analysis tools.

• Summary statistics : Numbers that summarize properties of data. They
are a part of EDA, and most can be calculated in a single pass over
the data. Summarized properties include frequency, location and
spread of the data.
Example : location – mean, spread – standard deviation
DATA ANALYTICS
An illustration : Which group is smarter?

Consider the IQ scores of 13 students of two classes

Each individual may be different. If you try to understand a group by


remembering the qualities of each member, you become overwhelmed
and fail to understand the group.
DATA ANALYTICS
An illustration : Which group is smarter?

Class A – Average IQ Class B – Average IQ


110.54 110.23

With this data, we can conclude that both classes are equally smart, as
their average IQs are roughly the same.

The question is easily answered now thanks to a descriptive summary


statistic.
DATA ANALYTICS
Central Tendency

• Measures of central tendency are those used to describe the data


using a single value.
• The three most frequently used measures of central tendencies are:
o Mean
o Median
o Mode
• It helps in comparisons between different datasets.
• It helps in summarizing and comprehending the data.
DATA ANALYTICS
Mean

• Mean (or Average) Value
Mean is the arithmetic average value of the data and is one of the
most used measures of central tendency.
If n is the number of records in the sample and $x_i$ is the value of
the ith record, then the mean value is given by:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
If the data is captured as frequencies $f_i$ of distinct values $x_i$,
then use:
$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}$
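A quick R sketch of both forms of the mean; the numbers are made up
for illustration:

  x <- c(21, 34, 46, 46, 57)
  mean(x)                          # (21+34+46+46+57)/5 = 40.8
  # frequency form: distinct values and their frequencies
  vals  <- c(46, 57, 60)
  freqs <- c(3, 2, 1)
  weighted.mean(vals, w = freqs)   # sum(f_i * x_i) / sum(f_i)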
DATA ANALYTICS
Mean

One should be careful about taking decisions based
on the mean value of the data. There is a famous
joke in statistics which says that “if someone’s head
is in the freezer and their legs are in the oven, the
average body temperature would be fine, but the
person may not be alive”. Making decisions solely
based on the mean value is not advisable.
DATA ANALYTICS
Median
• Median (or Mid) Value
Median is the value that divides the data into two equal parts. That is,
the proportion of records below and above the median will be 50% each.

• Finding the median
o Arrange the data in increasing order.
o If the number of observations n is odd, the median is the observation
at position (n+1)/2.
o If n is even, the median is the average of the observations at
positions n/2 and (n+2)/2.

Median is not calculated using the entire dataset like the mean; we are
simply looking for the midpoint rather than using the values of the
entire data. However, it is more stable than the mean, as adding a new
observation doesn’t change the median significantly.
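A small R illustration of the median’s stability relative to the mean;
the values are assumed for illustration:

  x <- c(12, 15, 17, 20, 110)   # 110 is an extreme value
  mean(x)                       # 34.8, pulled up by the extreme value
  median(x)                     # 17, unaffected by the outlier's magnitude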
DATA ANALYTICS
Mode

• Mode is the most frequently occurring value in the dataset.

• It is the only measure of central tendency applicable to qualitative
(nominal) data, since the mean and median of nominal data are meaningless.

• For example, assume there is student data with students’ mode of
transport, namely car, bus, two-wheeler or metro. Mean and median are
meaningless for analyzing nominal data such as mode of transport.

• It is possible for a dataset to have no mode at all. This occurs when each
value of the dataset appears an equal number of times.
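Base R has no statistical mode function (the built-in mode() returns an
object’s storage mode), so a minimal sketch is shown below; the helper
name stat_mode and the sample data are our own:

  stat_mode <- function(x) {
    counts <- table(x)
    modes  <- names(counts)[counts == max(counts)]
    # if every value occurs equally often, the dataset has no mode
    if (length(modes) == length(counts)) return(NA)
    modes
  }

  transport <- c("bus", "car", "bus", "metro", "two wheeler", "bus")
  stat_mode(transport)   # "bus"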
DATA ANALYTICS
Test your understanding!

• Which central tendency (when applicable and exists) provides a value which
is present in the dataset?
Mode
• Which central tendency is not robust to outliers?
Mean
• When the dataset contains outliers, which measure of central tendency is
the most appropriate to use?
Median
• Which function in R is used to load a package?
library()
DATA ANALYTICS
References

• Business Analytics by U. Dinesh Kumar, Wiley, 2nd Edition, 2022,
Chapters 2.1-2.5
• Introduction to Data Mining, Tan, Steinbach, Kumar, 2nd Edition
• https://www.tutorialspoint.com/
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 3 : Descriptive Statistics -
2
Data Analytics
Unit 1
Lecture 3 : Descriptive Statistics - 2
DATA ANALYTICS
Percentile

• Percentile, denoted as Px, is the value of the data below which x
percent of the data lies. It is used to identify the position of an
observation in the data set.
• For example, P10 denotes the value below which 10% of the data lies.
• In the context of asset management and reliability, P10 life implies
the time by which 10% of the products fail.
• To find Px, the data must be arranged in ascending order.

Position corresponding to Px = x(n + 1)/100

where x is the percentile and n is the number of observations in the
data.
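In R, quantile() with type = 6 follows the x(n+1)/100 positional
convention used above; the data vector here is assumed for illustration:

  x <- c(33, 21, 45, 22, 60, 18, 52, 40, 27, 36)   # n = 10
  quantile(x, probs = c(0.10, 0.90), type = 6)
  # P10: position 0.10 * 11 = 1.1 -> 18 + 0.1 * (21 - 18) = 18.3
  # P90: position 0.90 * 11 = 9.9 -> 52 + 0.9 * (60 - 52) = 59.2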
DATA ANALYTICS
Decile and Quartile

⦁ Decile corresponds to special values of percentile that divide the


data into 10 equal parts. First decile contains first 10% of the data
and second decile contains first 20% of the data and so on.

⦁ Quartile divides the data into 4 equal parts. The first quartile (Q1)
contains first 25% of the data, Q2 contains 50% of the data and is
also the median. Quartile 3 (Q3) accounts for 75% of the data.
DATA ANALYTICS
Example

1. Calculate the mean, median, and mode of time between failures


of wire-cuts
2. The company would like to know by what time 10% (tenth percentile
or P10) and 90% (ninetieth percentile or P90) of the wire-cuts will fail.
3. Calculate the values of P25 and P75.
DATA ANALYTICS
Solution

1. Mean = 57.64, median = 56, and mode = 46

2. Note that the data in the table is arranged in increasing order down the
columns. The position of P10 = 10 × (51)/100 = 5.1. We can round off 5.1
to its nearest integer, 5. The corresponding value from the table is 21.
(10% of the observations have a value of less than or equal to 21; that
is, by 21 hours, 10% of the wire-cuts will fail.)

Instead of rounding the value obtained from the equation, we can use the
following approximation: the value at the 5th position is 21, so the value
at position 5.1 is approximated as 21 + (0.1 * (value at 6th position -
value at 5th position)) = 21 + (0.1 * 1) = 21.1
DATA ANALYTICS
Solution

Position of P90 = 90 × 51/100 = 45.9
The value at position 45 is 90 and at position 46 is 93. The value at
position 45.9 is 90 + (0.9 * 3) = 92.7. That is, 90% of the wire-cuts will
fail by 92.7 hours.

3. Position of P25 (1st Quartile or Q1) = 25 × 51/100 = 12.75
The value at the 12th position is 33, so
P25 = 33 + 0.75 * (value at 13th position - value at 12th position)
= 33 + 0.75 * 1 = 33.75

Position of P75 (3rd Quartile or Q3) = 75 × 51/100 = 38.25
The value at the 38th position is 86, so
P75 = 86 + 0.25 * (value at 39th position - value at 38th position)
= 86 + 0.25 * 0 = 86
DATA ANALYTICS
Measures of Variation

• One of the primary objectives of analytics is to understand the


variability in the data and the cause of such variability.
• Predictive analytics techniques such as regression attempt to
explain variation in the outcome variable (Y) using predictor
variables (X)
• Measures of variability are useful in identifying how close records
are to the mean value and outliers in the data.
• An important application of variability is in feature selection during
model building. If a variable has very low variability , it is unlikely
to have a statistically significant relationship with the outcome
variable.
• Variability in the data is measured by range, inter-quartile
distance (IQD), variance, standard deviation and the coefficient
of variation (CV).
DATA ANALYTICS
Range , IQD and Variance

• Range is the difference between the maximum and minimum values of
the data. It captures the data spread.
• Inter-quartile distance (IQD), also called inter-quartile range (IQR),
is a measure of the distance between Quartile 1 (Q1) and Quartile 3
(Q3): IQD = Q3 - Q1.
IQD is used in identifying outliers in the data. Values below
Q1 - 1.5 IQD and above Q3 + 1.5 IQD are classified as potential
outliers.
• Variance is a measure of variability in the data from the mean value.
Variance for a population of size N with mean $\mu$ is calculated using
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
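A short R sketch of these measures on an assumed vector; note that
var() and sd() in R use the sample (n - 1) divisor, so the population
variance is computed manually:

  x <- c(12, 15, 17, 20, 22, 25, 28, 30, 95)
  diff(range(x))                     # range = max - min
  IQR(x)                             # Q3 - Q1
  var(x); sd(x)                      # sample variance and standard deviation
  mean((x - mean(x))^2)              # population variance (divisor n)
  q <- quantile(x, c(0.25, 0.75))
  fences <- c(q[1] - 1.5 * IQR(x), q[2] + 1.5 * IQR(x))
  x[x < fences[1] | x > fences[2]]   # 95 is flagged as a potential outlier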
DATA ANALYTICS
Example

Amount spent per month by a segment of credit card users
of a bank has a mean value of 12000 and a standard
deviation of 2000. Calculate the proportion of customers
who are spending between 8000 and 16000.

• Solution:
8000 and 16000 are each two standard deviations away from the
mean (k = 2). By Chebyshev’s inequality, the proportion of records
within k standard deviations of the mean is at least
1 - 1/k^2 = 1 - 1/4 = 0.75.

That is, the proportion of customers spending between
8000 and 16000 is at least 0.75 (or 75%).
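The bound can be checked with a couple of lines of R; this computes the
Chebyshev bound directly rather than simulating data:

  mu <- 12000; sigma <- 2000
  k  <- (16000 - mu) / sigma   # 2 standard deviations
  1 - 1 / k^2                  # 0.75, the minimum proportion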
DATA ANALYTICS
Skewness

• The following formula is usually used for a sample with n
observations (Joanes and Gill, 1998):

$G_1 = g_1 \, \frac{\sqrt{n(n-1)}}{n-2}$

where $g_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^3 / n}{\sigma^3}$ is
the sample skewness.

• The value of $\frac{\sqrt{n(n-1)}}{n-2}$ converges to 1 as the value
of n increases.

• Skewness in finance is used to understand risk and return. For
example, negative skewness in data on stock returns would imply
that the returns could be much lower than the mean, which could
result in a loss; a positively skewed distribution of returns would
imply that returns much higher than the mean are possible.
DATA ANALYTICS
Symmetric vs Skewed Data

• Median, mean and mode of symmetric, positively skewed and


negatively skewed data
DATA ANALYTICS
Kurtosis

• Kurtosis is another measure of shape, aimed at the shape of
the tail, that is, whether the tail of the data distribution is heavy
or light.

• Kurtosis $= \frac{\sum_{i=1}^{n}(X_i - \bar{X})^4 / n}{\sigma^4}$

• A kurtosis value of less than 3 indicates a platykurtic distribution
and a value greater than 3 a leptokurtic distribution. A kurtosis
value of 3 matches the normal distribution and is called
mesokurtic.
DATA ANALYTICS
Leptokurtic, mesokurtic and platykurtic distributions
DATA ANALYTICS
Excess Kurtosis

• The excess kurtosis is a measure that captures the deviation from
the kurtosis of a normal distribution.

• Excess Kurtosis $= \frac{\sum_{i=1}^{n}(X_i - \bar{X})^4 / n}{\sigma^4} - 3$

• Excess kurtosis is a useful metric in the field of pathology.

• With high excess kurtosis, the event in question is prone to extreme
outcomes.
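A small R check of the formulas above: for a large normal sample,
kurtosis should be close to 3 and excess kurtosis close to 0:

  kurt <- function(x) {
    m <- mean(x)
    s <- sqrt(mean((x - m)^2))
    mean((x - m)^4) / s^4
  }
  set.seed(42)
  x <- rnorm(10000)
  kurt(x)        # approximately 3 (mesokurtic)
  kurt(x) - 3    # excess kurtosis, approximately 0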
DATA ANALYTICS
Test your understanding!

• The value of $\sum_{i=1}^{n}(x_i - \bar{x})$ (the sum of deviations from the mean) is
a) Zero for any sample
b) Zero for population but not necessarily for samples
c) Zero for both samples and population
d) Cannot say
Zero for both samples and population

• Kurtosis indicates how much data resides in the __________


of the distribution
Tail
DATA ANALYTICS
Test your understanding!

• Mean is greater than median in what kind of distribution?


Positively Skewed Distribution

• In a dataset with 60 observations, 3 parameters were


estimated. What is the degrees of freedom?
57

• Calculate Q3 for the data [10,50,30,20,10,20,70,30]
45
Solution:
Sorted data: 10, 10, 20, 20, 30, 30, 50, 70
Position of Q3 = 75 × (8+1)/100 = 6.75, i.e., the (6.75)th observation
Q3 = 30 + 0.75 × (50 - 30) = 30 + 0.75 × 20 = 45
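The same answer falls out of R’s quantile() with type = 6, which matches
the (n+1) positional convention used in this lecture:

  x <- c(10, 50, 30, 20, 10, 20, 70, 30)
  quantile(x, probs = 0.75, type = 6)   # 45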
DATA ANALYTICS
References

• Business Analytics by U. Dinesh Kumar, Wiley, 2nd Edition, 2022,
Chapters 2.6-2.8
DATA ANALYTICS
UE21CS342AA2
UNIT-1
Lecture 4 : Data Preprocessing - Cleaning
Data Analytics
Unit 1
Lecture 4 : Data Preprocessing - Cleaning
DATA ANALYTICS
Data Preprocessing
• Analysis of data can only be as good as the data
itself. Low-quality data will lead to low-quality
analysis.

• Real-world databases are highly susceptible to noisy,
missing and inconsistent data owing to their huge
size and multiple heterogeneous sources.

• Data preprocessing techniques, applied before
analysis, can substantially improve the overall quality
of the analysis and/or reduce the time required for
the actual analysis.
DATA ANALYTICS
Measures of Data Quality

1) Accuracy : Data must not contain errors or a lot of noise.


Example of inaccurate data : Date = 30/02/2002.
Reasons for inaccurate data :

• Data collection instruments may be faulty.


• Human errors occur during data entry.
• Disguised missing data : Users may purposefully submit incorrect data
values for mandatory fields when they don’t want to share their
personal information. Example : Choosing the default value of January
1st for date of birth.
DATA ANALYTICS
Measures of Data Quality
2) Completeness : Data must not lack attribute values. It must contain
attributes of interest and relevance to the problem at hand.
Reasons for incompleteness :
• Attributes of interest were not considered important at the time of entry.
• Data might not be recorded due to equipment malfunction resulting in
missing data.

3) Consistency : Should not contain any discrepancies in the data or the naming
convention of the attributes.
Examples of inconsistency:
• Age is recorded as 50 but Date of Birth = 03/04/2005.
• In the result column of students’ marks , few entries are in GPA format
and rest in percentage.
• Discrepancies can exist between duplicate records.
DATA ANALYTICS
Measures of Data Quality
4) Timeliness : The data must be updated in a timely fashion. For example,
for an analysis run on the first day of every month, the previous month’s
data must be up to date for accurate analysis.

5) Interpretability : The data must be easily understood. If the attributes of
the data aren’t easily understandable, the analysis is going to be hindered.

6) Believability : The data and its source must be trusted by the users. If the
data or the source caused problems in the past, current users will find it
hard to trust it.

NOTE : The quality of data is subjective and depends on the intended use of
the data. The data needs of each problem are different.
DATA ANALYTICS
Major Tasks in Data Preprocessing

• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
DATA ANALYTICS
Data Cleaning

Data cleaning entails :


• Filling in missing values
• Smoothening noisy data
• Identifying and removing outliers

If users believe the data is dirty, they are


unlikely to trust the outcome of the analysis.
DATA ANALYTICS
Missing Data

Data is not always complete. Missing data may be due to :
• Equipment malfunction
• Inconsistency with other recorded data, leading to its deletion
• Data not recorded due to a misunderstanding
• Certain data not being considered useful at the time of entry

Missing data may need to be inferred.


DATA ANALYTICS
Handling missing data
1. Ignore the tuple : Usually done when the class label is missing (for a
classification task). This is not effective when the percentage of missing
values per attribute varies considerably.
2. Fill in the missing value manually : Time consuming and infeasible for a
large data set.
3. Fill it with a global constant : Replace it with a global constant like the
word “Unknown”. The downside is that the model might learn patterns with
respect to the occurrence of the word “Unknown”.
4. Fill it with a central tendency : For symmetric data distributions, replace it
with the mean; for skewed data distributions, replace it with the median
(see the sketch after this list).
5. A smarter way is to use the attribute mean or median (based on the
distribution) for all samples belonging to the same class.
6. Use the most probable value : Use models like regression, decision trees or
inference-based Bayesian formalism to infer the missing value.
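A minimal base-R sketch of method 4 (central-tendency imputation); the
data frame and the income column are made up for illustration:

  df <- data.frame(income = c(52, 61, NA, 48, 300, NA, 55))
  # the 300 skews the distribution, so impute with the median, not the mean
  df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
  df$income   # NAs replaced by 55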
DATA ANALYTICS
Types of Missing Values
1. Missing Completely At Random (MCAR)
• The missing data is independent of the observed and unobserved data. In other
words, no systematic differences exist between records with missing data and
those with complete data.
• For example : A weighing scale running out of batteries. This is not dependent on
the person and the probability of this happening is equal to everyone.
• Assuming the data as MCAR is a strong and often unrealistic assumption as “true
randomness” is rare in the real world.
• MCAR data doesn’t add bias to the analysis.
• Ways to deal with it :
▪ Delete the records : If it is a small fraction of data
▪ Delete the attributes : If it is a small fraction of attributes
▪ Mean imputation
▪ Pairwise deletion : Compute the mean, variance and covariance using
whichever records have the relevant variables available.
DATA ANALYTICS
Types of Missing Values
2. Missing At Random (MAR)
• MAR assumes that the missing value can be predicted based on the other observed
data. The missingness is still random.
• Example : Employed people are less likely to answer all questions of a survey
when compared to unemployed people. Data is MAR if the likelihood of
completing the survey depends on employment status but not on the topic of
the survey.
• Almost always produces a bias in the analysis.
• MCAR implies MAR but the converse isn’t true.
• Ways to deal with it :
▪ Regression imputation : unbiased if it considers the factor which influences the
missingness.
▪ Last observation carried forward (LOCF) and Baseline observation carried
forward(BOCF) : Yields biased estimates. Must be used only if the underlying
assumptions are scientifically justifiable.
▪ Use of multiple imputation (packages mice and amelia in R; a sketch
follows below)
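A hedged sketch of multiple imputation with the mice package mentioned
above, using R’s built-in airquality dataset (which contains missing
values); the argument names follow common mice usage and should be
checked against the installed version:

  # install.packages("mice")
  library(mice)
  imp <- mice(airquality, m = 5, method = "pmm", seed = 1)   # 5 imputations
  completed <- complete(imp, 1)   # extract the first completed dataset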
DATA ANALYTICS
Types of Missing Values
3. Missing Not At Random (MNAR)
• The missingness of the data depends on the value of the data. The mechanism for
why the data is missing is known. Yet, the values can’t be effectively inferred.
• Examples :
▪ Censored data
▪ People belonging to certain income brackets might not wish to disclose their
assets.
▪ A weighing machine can only measure weights in a particular range.
• Ways to deal with this :
▪ One must model the missingness explicitly, jointly modelling the response and
missingness.
▪ Generally , the data is assumed to be MAR whenever feasible to avoid this
situation.

NOTE : There is no statistical way to determine which category your
missing data falls under.
DATA ANALYTICS
Types of Missing Values-A Quick Glance

Missing Completely at Random, MCAR, means there is no relationship between the


missingness of the data and any values, observed or missing. Those missing data points are a
random subset of the data. There is nothing systematic going on that makes some data more
likely to be missing than others.

Missing at Random, MAR, means there is a systematic relationship between the propensity
of missing values and the observed data, but not the missing data.
Whether an observation is missing has nothing to do with the missing values, but it does
have to do with the values of an individual’s observed variables. So, for example, if men are
more likely to tell you their weight than women, weight is MAR.

Missing Not at Random, MNAR, means there is a relationship between the propensity of a
value to be missing and its values.
DATA ANALYTICS
An Interesting Thought

• Imagine you are collecting some information from your classmates. For many
reasons, not everyone will answer every question of yours. And that is okay!
• Well, the next step is replacing missing values, right? We can use any one of
the methods we have discussed till now after some analysis of the data.
• But wait! Don’t you think the fact that they did not answer is itself
information which can be beneficial to our analysis?
• So the next time you build a model, before dealing with the missing values,
create an additional variable (preferably a binary variable) in which you
store whether the particular student answered or not.
• This may (or may not!) help you gain more insights about the population or
improve the analytics model you are building!

https://youtu.be/f9AQy7p0QEo
DATA ANALYTICS
Noisy data
Noise is a random error or variance in a measured variable.
Data smoothening techniques to combat noise :
• Binning
▪ Sort the data and partition it into bins (equal-width, equal-frequency, etc.)
▪ Smooth by bin means, bin medians, bin boundaries, etc.
▪ More on binning in further lectures; a small R sketch follows this list.
• Regression - Data can be smoothened by fitting it to a regression model.
• Clustering - Outliers can be detected with the help of clustering and can be
removed to smoothen the data.
• Combined computer and human inspection – The computer detects
suspicious values, which are then validated by a human. This is useful
when dealing with possible outliers.
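A small R sketch of binning-based smoothing: equal-width bins via cut(),
smoothing by bin means; the data vector is assumed for illustration:

  x    <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)
  bins <- cut(x, breaks = 3)    # three equal-width bins
  ave(x, bins, FUN = mean)      # replace each value by its bin mean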
DATA ANALYTICS
Outliers
Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set.

Case 1 : Outliers are noise that interferes with


data analysis.

Case 2 : Outliers are the main goal of our


analysis. Examples :
• Credit card fraud
• Spam detection
• Intrusion detection
DATA ANALYTICS
Data Cleaning as a Process
• Data discrepancy detection
• Refer to the metadata of data to gain knowledge regarding its properties.
• Perform summary statistics for all attributes and discover the distributions,
dependencies , outliers and so on.
• Look for inconsistent representation of data. For example , make sure all dates are
following the same format , for instance , DD-MM-YYYY.
• Check for field overloading – practice of coupling two or more data elements to a
single field. It ensures efficient memory utilization.
• Check for uniqueness rule, consecutive rule and null rule.
• Uniqueness rule : Each value of the given attribute must be unique.
• Consecutive rule : There can’t be any missing values between lowest and highest
value for that attribute. All values must be unique. Example - cheque number.
• Null rule : Specifies how to record a null value. For example , use 0 for numeric
attribute and ‘?’ for nominal attribute.
DATA ANALYTICS
Data Cleaning as a Process

• Data discrepancy detection


• Use commercial tools that can aid in this step.
• Data scrubbing tools : Use simple domain knowledge (example, knowledge of
postal zip code and spell-check) to detect errors and make corrections.
• Data auditing tools : Find discrepancies by analyzing the data to discover rules
and relationships, and detect data that violates the discovered rules. For
example, it employs statistical analysis to find correlations or clustering to
detect outliers.
DATA ANALYTICS
Data Cleaning as a Process

• Data transformation
• Some data inconsistencies can be corrected manually but most errors require
data transformations.
• Data migration tools allow transformations to be specified.
• ETL (Extraction/Transformation/Loading) tools allow users to specify
transformations through a graphical user interface.
• Data transformations may introduce more discrepancies.
• The 2-step process of discrepancy detection and data transformation occurs
iteratively until no further anomalies are found.
• New approaches to data cleaning emphasize increased interactivity. Potter’s
wheel is a publicly available data cleaning tool that integrates both the steps.
DATA ANALYTICS
Test your understanding!
• Which of these is not a method to deal with noisy data?
a) Binning
b) Regression
c) Principal Component Analysis
d) Clustering
Solution
c) Principal Component Analysis
• Outliers need to be removed in every dataset , regardless of the problem
statement.
Solution
False
• Mean imputation can be done for which type of missing data?
Solution
MCAR
DATA ANALYTICS
Test your understanding!

• The statement “Most of the people missing from work are the sickest
people” denotes what type of missingness?
MNAR

• Which type of missingness is called “non-ignorable”?


MNAR
Because the missing data mechanism itself has to be modelled as you deal with the
missing data. You have to include some model for why the data are missing and what
the likely values are.
DATA ANALYTICS
References

• Data Mining : Concepts and Techniques by Han, Kamber and Pei, The
Morgan Kaufmann Series in Data Management Systems, 3rd Edition,
Chapters 3.1-3.2
• http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/mi.html
• https://www.scribbr.com/statistics/missing-data/
• https://www.theanalysisfactor.com/missing-data-mechanism/
