Descriptive Analytics - Univariate and Bivariate


COE 102

Introductory
Big Data
College of Engineering

Chapter -2-
Descriptive Analytics Part 1

Our Mission: From Raw Data to a Dashboard
Learning Objectives

• Scale types
• Introduction to descriptive analytics
• Univariate descriptive analytics
• Visualization

Let’s Consider these Scenarios
• Can you study employees’ behavior in ALL government
organizations by surveying ALL the employees?

• Can you study the purchasing behavior of ALL teenagers around the
globe by surveying ALL teenagers?

• Is it feasible?

• What would be a better alternative?


Statistical Concepts
Population
• A set of similar instances, objects, or events that is of interest for some
question or experiment
• E.g. all students of my school, all nails produced by a machine
Sample
• A set of data collected and/or selected from a population by a defined
procedure
• E.g. a subset of the students of my school who answered a survey, a subset of
randomly selected nails produced by a machine

• Can you give more examples?
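
The population/sample distinction can be sketched in a few lines of Python. This is only an illustration: the population of 1,000 student IDs, the sample size of 50, and the seed are all made-up assumptions, and simple random sampling is just one possible "defined procedure".

```python
import random

# Hypothetical population: ID numbers of all 1,000 students in a school.
population = list(range(1, 1001))

# Defined procedure (assumed here): simple random sampling without replacement.
rng = random.Random(42)  # fixed seed so the draw is reproducible
sample = rng.sample(population, k=50)

print(f"Population size: {len(population)}")
print(f"Sample size:     {len(sample)}")
print(f"First few sampled IDs: {sorted(sample)[:5]}")
```

Surveying the 50 sampled students is feasible where surveying all 1,000 may not be.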


Descriptive Statistics
• But even samples can be too big to inspect value by value
• Descriptive statistics are methods and techniques that describe or
summarize samples to help humans understand them

Scale Types: Brainstorming
• Does your family name describe a quantity?

• Does your height describe a quantity?

• Does your gender describe a quantity?

• Can you order people’s heights?

• Can you order people’s family names?


Descriptive statistics
• Different scale types exist to describe data:
• Qualitative Scales
• Quantitative Scales

Scale Types
• What does ordinal mean?
• Qualitative scales
  • Nominal: categorizes data in a non-ordinal way (CAN’T BE ORDERED)
    • Operations: = and ≠
    • E.g. a friend’s name and gender (e.g. Eve is Female; Eve is not Male)
  • Ordinal: categorizes data in an ordinal way (CAN BE ORDERED)
    • Operations: =, ≠, <, >, ≤, and ≥
    • E.g. company, a rating of how good a friend’s company is (e.g. Bad < Good)
    • Let’s compare the company of Andrew and Marcus
Scale Types
• Quantitative scales
• Always numeric
• Relative (Interval): does not have an absolute zero
  • Operations: =, ≠, <, >, ≤, ≥, - and +
  • E.g. temperature in °C
• Absolute (Ratio): has an absolute zero
  • Operations: =, ≠, <, >, ≤, ≥, -, +, / and ×
  • E.g. weight and height

When the attribute “height” is zero, it means there is no height. The same is
true for weight. For temperature, however, 0 °C does not mean there is no
temperature. That is why ratios are meaningful for weight (we can say that
Bernhard weighs twice as much as Irene) but not for temperature in °C
(20 °C is not twice as hot as 10 °C).
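
A small numeric check makes the interval/ratio difference concrete. The sketch below is illustrative: the two temperatures are made-up values, while the weights of Bernhard (110 kg) and Irene (55 kg) are taken from the friends table used later in this deck. A ratio is only meaningful if it survives a change of unit.

```python
# Interval scale: temperature in degrees Celsius (no absolute zero).
t1_c, t2_c = 10.0, 20.0                      # made-up example temperatures
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15    # same temperatures in kelvin

diff_c = t2_c - t1_c     # differences are meaningful on an interval scale
diff_k = t2_k - t1_k     # and they are the same in kelvin
ratio_c = t2_c / t1_c    # 2.0, but this "twice as hot" is an artifact of the unit
ratio_k = t2_k / t1_k    # about 1.035 once an absolute zero is used

# Ratio scale: weight (has an absolute zero).
w_bernhard_kg, w_irene_kg = 110.0, 55.0
KG_TO_LB = 2.20462
ratio_kg = w_bernhard_kg / w_irene_kg
ratio_lb = (w_bernhard_kg * KG_TO_LB) / (w_irene_kg * KG_TO_LB)

print(f"Temperature difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (kelvin)")
print(f"Temperature ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (kelvin)")
print(f"Weight ratio:           {ratio_kg:.3f} (kg) vs {ratio_lb:.3f} (lb)")
```

The weight ratio stays at 2 in any unit, while the temperature ratio depends on where the scale places its zero.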
Changing Data Scale
• The more arithmetic operations a scale type supports, the more informative it is, because
each additional operation enables more kinds of processing and insight

• Look at the table: which scale type is the most informative?

Scale type            Operations
Nominal               = and ≠
Ordinal               =, ≠, <, >, ≤, and ≥
Relative (Interval)   =, ≠, <, >, ≤, ≥, - and +
Absolute (Ratio)      =, ≠, <, >, ≤, ≥, -, +, / and ×

Hence, in many cases we convert data of a certain scale type to another, more informative type
Descriptive Univariate Analysis
• What does univariate mean?
• In descriptive univariate analysis, three types of
information can be obtained:
1. Frequency tables
2. Visualization (plots)
3. Statistical measures

Frequency Tables

Descriptive Univariate Analysis: Frequencies
• A frequency is basically a counter
• Absolute frequency counts how many times a value appears.
• Relative frequency is the percentage of observations in which that value appears.
• The absolute cumulative frequency is the number of occurrences less than or
equal to a given value
• The relative cumulative frequency is the percentage of occurrences less than
or equal to a given value
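
The four kinds of frequency can be computed with Python's standard library alone. The sketch below is one possible way to build such a table; it uses the 14 height values (in cm) from the friends table that appears later in this deck, and the column layout is an arbitrary choice.

```python
from collections import Counter
from itertools import accumulate

# Heights (cm) of the 14 friends in the example data set.
heights = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]

n = len(heights)
counts = Counter(heights)                  # absolute frequency per distinct value
values = sorted(counts)                    # cumulative frequencies need ordered values
abs_freq = [counts[v] for v in values]
rel_freq = [c / n for c in abs_freq]       # share of observations with that value
abs_cum = list(accumulate(abs_freq))       # occurrences <= value
rel_cum = [c / n for c in abs_cum]         # share of occurrences <= value

print(f"{'Height':>6} {'Abs':>4} {'Rel %':>6} {'CumAbs':>7} {'CumRel %':>9}")
for v, a, r, ca, cr in zip(values, abs_freq, rel_freq, abs_cum, rel_cum):
    print(f"{v:>6} {a:>4} {100 * r:>6.1f} {ca:>7} {100 * cr:>9.1f}")
```

Cumulative frequencies only make sense when the values can be ordered, so they apply to ordinal and quantitative scales but not to nominal ones.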

Example 1 – Company

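• Relative frequency example: a value that appears 7 times in 14 observations has a
relative frequency of 7/14 = 50%
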
Example 2 – Height

Data Visualization

Descriptive Univariate Analysis: data visualization
• Pie chart: typically used for qualitative data
• Question: can you estimate the proportion of Bad values in the data by
looking at the pie chart?

Descriptive Univariate Analysis: data visualization
• Bar chart: typically used for qualitative scales.
• Sometimes it can be used with quantitative scales that have a limited number of
values.

Descriptive Univariate Analysis: data visualization
• Line chart: especially suited to data that involve the notion of time.
• It is used with quantitative scales with an equal lag (interval) between observations.
• It represents time series: values obtained over regular time intervals.

Max Temp   Day
21         1
25         2
30         3
20         4
21         5

Statistics

Descriptive Univariate Analysis: statistics
• A statistic is a descriptor
• It describes numerically a characteristic of the sample or the population
• There are two main groups of univariate statistics:
  • Location statistics
  • Dispersion statistics
• Location statistics:
  • Minimum: is the lowest value
  • Maximum: is the largest value
  • Mean: is the average value
  • Mode: is the most frequent value
  • The value that is larger than:
    • 25% of all values is the 1st quartile
    • 50% of all values is the median or 2nd quartile
    • 75% of all values is the 3rd quartile

Example
• Let us use as an example the attribute weight from our data set

[Figure: graphical representation of the statistics]

Location statistic        Weight (kg)
Min                       55.00
Max                       115.00
Mean or average           79.00
Mode                      75.00
1st quartile              65.75
2nd quartile or median    75.00
3rd quartile              87.50
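
The location statistics for weight can be checked with Python's `statistics` module, using the 14 weights from the friends table later in this deck. One assumption to flag: quartiles can be interpolated in several ways, and the `"exclusive"` method is chosen here only because it happens to reproduce the quartile values in the table; other tools may default to a different method and give slightly different quartiles.

```python
import statistics

# Weights (kg) of the 14 friends in the example data set.
weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

minimum = min(weights)
maximum = max(weights)
mean = statistics.mean(weights)
mode = statistics.mode(weights)    # most frequent value

# Quartiles: the three cut points that split the sorted data into four parts.
# method="exclusive" is an assumption that matches the table above.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="exclusive")

print(f"Min            {minimum:.2f}")
print(f"Max            {maximum:.2f}")
print(f"Mean           {mean:.2f}")
print(f"Mode           {mode:.2f}")
print(f"1st quartile   {q1:.2f}")
print(f"Median (2nd)   {q2:.2f}")
print(f"3rd quartile   {q3:.2f}")
```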
Descriptive Univariate Analysis: statistics
• Box-plots present the minimum, the 1st quartile, the median, the 3rd quartile
and the maximum, in this order, bottom-up or from left to right
• [Figure: box-plot of the attribute height]
• Mean (or average), median and mode are known as measures of central tendency,
because they return a central value from a set of values

Descriptive Univariate Analysis: statistics
• Box-plots can also be used to describe the symmetry/skewness of an attribute
• The median and the mode are more robust central tendency statistics than the
mean in the presence of extreme values or strongly skewed distributions
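
A quick experiment illustrates this robustness. The sketch below takes the 14 weights from the friends table later in this deck and appends one extreme value; the 300 kg entry is a made-up outlier (for instance a data-entry error), not part of the data set.

```python
import statistics

# Weights (kg) of the 14 friends in the example data set.
weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

# Hypothetical extreme value added for illustration only.
with_outlier = weights + [300]

mean_shift = statistics.mean(with_outlier) - statistics.mean(weights)
median_shift = statistics.median(with_outlier) - statistics.median(weights)

print(f"Mean moves by   {mean_shift:.2f} kg")
print(f"Median moves by {median_shift:.2f} kg")
```

A single extreme observation pulls the mean noticeably, while the median of this data does not move at all.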
Descriptive Univariate Analysis: statistics
• Dispersion statistics measure how far apart the different values are
• Dispersion statistics:
  • Amplitude (Range): is the difference between the maximum and the minimum values
  • Interquartile range: is the difference between the values of the 3rd and 1st quartiles
• Dispersion statistics (cont.):
  • Mean absolute deviation (MAD): is a measure of the mean absolute distance
    between the observations and the mean
    • Its math formula for the population is:
      $MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|$
    • Its math formula for a sample is:
      $MAD = \frac{1}{n-1}\sum_{i=1}^{n} |x_i - \bar{x}|$

Descriptive Univariate Analysis: statistics
• Dispersion statistics (cont.):
  • Standard deviation: is another measure of the typical distance between the
    observations and their mean
    • Its math formula for the population is:
      $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$
    • Its math formula for a sample is:
      $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
  • The square of the standard deviation is named variance
• Using again as example the weight attribute, dispersion statistics are as
shown in the table

Dispersion statistic        Weight (kg)
Amplitude                   60.00
Interquartile range         21.75
Mean absolute deviation     14.31
Standard deviation (s)      17.38
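
The dispersion statistics for weight can be recomputed from the 14 weights in the friends table later in this deck. The sketch below is a check under stated assumptions rather than a reference implementation: the sample formulas divide by n - 1, and the quartiles use the `"exclusive"` interpolation method, which is a choice that matches the tabulated interquartile range.

```python
import statistics

# Weights (kg) of the 14 friends in the example data set.
weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]

n = len(weights)
mean = statistics.mean(weights)

amplitude = max(weights) - min(weights)

# Quartile method is an assumption; other methods give slightly different values.
q1, _, q3 = statistics.quantiles(weights, n=4, method="exclusive")
iqr = q3 - q1

# Mean absolute deviation: population version divides by n, sample version by n - 1.
total_abs_dev = sum(abs(x - mean) for x in weights)
mad_population = total_abs_dev / n
mad_sample = total_abs_dev / (n - 1)

s = statistics.stdev(weights)          # sample standard deviation (n - 1)
sigma = statistics.pstdev(weights)     # population standard deviation (n)
variance = statistics.variance(weights)  # sample variance, equal to s squared

print(f"Amplitude                {amplitude:.2f}")
print(f"Interquartile range      {iqr:.2f}")
print(f"MAD (sample)             {mad_sample:.2f}")
print(f"Standard deviation (s)   {s:.2f}")
```

Treating the 14 friends as a sample gives the values in the table; treating them as the whole population (dividing by n) gives slightly smaller values for both the MAD and the standard deviation.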

Descriptive bivariate analysis
• When the two attributes of the pair are quantitative:
  • Several visualization techniques can show the distribution of points
    described by two quantitative attributes
  • One of these techniques is the scatter plot

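Descriptive bivariate analysis
• Pearson correlation
• Sample Pearson correlation, where $s_x$ and $s_y$ are the sample standard deviations:
  $r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$
• Its values always lie in the interval [-1, 1]
• If the points form:
  • an increasing line, the Pearson correlation coefficient will be 1
  • a decreasing line, its value will be -1
  • a horizontal line or a cloud without increasing or decreasing tendency, its
    value will be 0 or close to 0 (strictly, it is undefined for an exactly
    horizontal line, because one attribute then has zero spread)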
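
A minimal sketch of the sample Pearson correlation in plain Python follows. The function name `pearson` is an illustrative choice, and raising an error for a constant attribute is an implementation decision of this sketch (the denominator is zero in that case). The weights and heights are those of the 14 friends in the example table later in this deck.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the (n - 1) factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when an attribute is constant")
    return sxy / sqrt(sxx * syy)

weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]          # kg
heights = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]  # cm

r = pearson(weights, heights)
r_increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # points on an increasing line
r_decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # points on a decreasing line

print(f"weight vs height: {r:.2f}")
print(f"increasing line:  {r_increasing:.2f}")
print(f"decreasing line:  {r_decreasing:.2f}")
```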

Descriptive bivariate analysis
• Spearman's rank correlation: the Pearson correlation computed on the ranks of
the values rather than on the values themselves
• Spearman's correlation ranges in value from -1 to 1, with values near 1
indicating similar ranks for the two variables and values near -1 indicating
dissimilar (reversed) ranks for the two variables.
• 0 means that there is no association between ranks

Example
• Pearson correlation
• Spearman's rank correlation

Friend     Weight (kg)  Height (cm)  Ranked weight  Ranked height
Andrew     77           175          9.0            8.0
Bernhard   110          195          13.0           14.0
Carolina   70           172          5.0            5.5
Dennis     85           180          11.0           9.5
Eve        65           168          3.0            4.0
Fred       75           173          7.5            7.0
Gwyneth    75           180          7.5            9.5
Hayden     63           165          2.0            3.0
Irene      55           158          1.0            1.0
James      66           163          4.0            2.0
Kevin      95           190          12.0           12.0
Lea        72           172          6.0            5.5
Marcus     83           185          10.0           11.0
Nigel      115          192          14.0           13.0
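
Spearman's rank correlation for the weights and heights in this table can be sketched as "rank both attributes, then apply Pearson to the ranks". The helper names `average_ranks` and `pearson` are illustrative, and giving tied values the average of their ranks is the convention assumed here (it is what produces ranks such as 7.5 and 5.5).

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # average of the 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Sample Pearson correlation (assumes neither sequence is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Friends in table order: Andrew, Bernhard, Carolina, ..., Nigel.
weights = [77, 110, 70, 85, 65, 75, 75, 63, 55, 66, 95, 72, 83, 115]          # kg
heights = [175, 195, 172, 180, 168, 173, 180, 165, 158, 163, 190, 172, 185, 192]  # cm

rank_w = average_ranks(weights)
rank_h = average_ranks(heights)
rho = pearson(rank_w, rank_h)   # Spearman = Pearson applied to the ranks

print("Ranked weight:", rank_w)
print("Ranked height:", rank_h)
print(f"Spearman's rank correlation: {rho:.2f}")
```

Because it only uses the order of the values, Spearman's correlation is less affected by extreme values than Pearson's.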
Discussion

• Can the mean be used in ordinal scales?
• This is highly debatable, but there are examples of its use with numeric
ordinal scales such as the Likert scale
• The Likert scale uses an ordered scale, e.g., integers from 1 (highest
disagreement) to 5 (highest agreement)
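
To see why the question is debatable, compare the three central tendency statistics on a set of Likert answers. The ten responses below are made up for illustration.

```python
import statistics

# Hypothetical answers to one Likert item: 1 = highest disagreement, 5 = highest agreement.
responses = [1, 2, 2, 4, 5, 5, 5, 4, 2, 5]

mean = statistics.mean(responses)      # treats the gaps between categories as equal
median = statistics.median(responses)  # uses only the order of the categories
mode = statistics.mode(responses)      # uses only equality of categories

print(f"Mean:   {mean}")
print(f"Median: {median}")
print(f"Mode:   {mode}")
```

The mode and median rely only on operations that an ordinal scale supports, whereas the mean additionally assumes that the distance between 1 and 2 equals the distance between 4 and 5, and it can produce a value that is not one of the answer categories.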
Reading
• Textbook: Chapter 2
• Moreira, João, André Carlos Ponce de Leon Ferreira de Carvalho, and Tomáš Horváth. A
General Introduction to Data Analytics. Wiley, 2019. ISBN: 9781119296263.
