Chapter 1


Introduction to Statistics and Data Analysis

The Japanese succeeded in creating an atmosphere that allows the production of high-quality products. Much of this success has been attributed to the use of statistical methods and statistical thinking among management personnel.

Introduction
Statistics is a science that helps us make decisions and
draw conclusions in the presence of variability.

If the observed metrics were always the same and always on target, there would be no need for statistical methods (deterministic models/systems).

Variability means that successive observations of a system or phenomenon do not produce the exact same result.

Sources of Variability

What is Statistics
Statistics is the estimation of parameters (e.g.,
mean, median) and selection of distribution type
needed to quantify uncertainty.

The field of statistics deals with the collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes. In simple terms, statistics is the science of data.

Statistical Path

Data Collection → Data Presentation → Data Analysis → Decision Making

Data Collection
The data collected could be Qualitative (shape, color,
etc.) or Quantitative (length, weight, etc.)

The data collected could be Discrete (distinct and separate, e.g. number of defective items) or Continuous (can take any value within an interval, e.g. rod diameter).

Data Collection
Population: Collection of all individuals or individual items of a
particular type.

Sample: A subset of a population

Sample Size: The number of elements in the sample

Simple Random Sampling: A procedure for sampling from a population in which the selection of each sample unit is based on chance. It implies that any sample of a specified size has the same chance of being selected as any other sample of the same size.
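
Simple random sampling can be sketched in a few lines of Python. This is only an illustration: the population of item IDs, the helper name `simple_random_sample`, and the seed are assumptions made for the example, not part of the course material.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw a sample without replacement.

    random.Random.sample gives every subset of the requested size
    the same chance of being selected.
    """
    rng = random.Random(seed)  # seed only makes the example repeatable
    return rng.sample(list(population), sample_size)

# Hypothetical population: items labelled 1..100
population = list(range(1, 101))
sample = simple_random_sample(population, sample_size=10, seed=42)
print("Sample size:", len(sample))
print("Sample:", sample)
```

Leaving `seed=None` lets the generator pick its own starting state, which is what you would normally want outside of a reproducible demonstration.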

Data Collection
There are three basic methods of collecting data:

➢ Retrospective study (historical data)
➢ Observational study (data collected in the present by a passive observer)
➢ Designed experiment (data collected in response to process input changes)

Data Presentation
Data presentation is defined as the process of
using various graphical formats to visually
represent the relationship between two or more
data sets.

Data Analysis
Descriptive statistics are brief descriptive coefficients that summarize a given dataset (sample).

Descriptive statistics are broken down into measures of central tendency (location) and measures of variability (spread).

Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, range, and interquartile range (IQR).

Decision Making
Descriptive Statistics → information about the sample only
Descriptive Statistics + Probability concepts employment → Inferential Statistics → conclusions about the population → Decision Making

Concepts in probability form a major component that supplements statistical methods and helps us gauge the strength of the statistical inference. The discipline of probability, then, provides the transition between descriptive statistics and inferential methods.

Descriptive Statistics-Measures of Location


The sample mean is the numerical average and reflects the
central tendency of the sample (the centroid of the sample)

The trimmed mean is computed by trimming away a certain percentage of both the largest and the smallest values (useful when outliers are present).

The mode is the value that appears most often in a set of data.

The sample median reflects the central tendency of the
sample in such a way that it is uninfluenced (insensitive) by
extreme values or outliers.

Median is in between these two points



An outlier is a data point on a graph or in a set of results that is very much larger or smaller than the next nearest data point.

The trimmed mean is less sensitive to outliers than the sample mean but not as insensitive as the median. On the other hand, the trimmed mean makes use of more information than the sample median.
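
The four measures of location can be compared on a small sample. The sketch below uses made-up data containing one outlier (53); the helper `trimmed_mean` and its rounding-down rule for the number of trimmed values are assumptions for illustration.

```python
import statistics

def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the ordered values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # values trimmed per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

# Made-up sample; 53 is an outlier
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # average of the two middle values
mode = statistics.mode(data)      # most frequent value
tmean = trimmed_mean(data, 10)    # drops 2 and 53 before averaging

print("mean =", mean)
print("10% trimmed mean =", tmean)
print("median =", median)
print("mode =", mode)
```

For this sample the trimmed mean falls between the median and the mean, which matches the sensitivity ordering described above.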

Median and Quartiles


The Median is a measure of location
The Interquartile Range (IQR) is a measure of variability.

Descriptive Statistics-Measures of Variability


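Sample Range: the simplest measure (maximum value − minimum value)

Interquartile Range (IQR) = Q3 − Q1

Q1 = 25th percentile = $x_{(n+1)/4}$
Q2 = 50th percentile = $\tilde{x} = x_{2(n+1)/4} = x_{(n+1)/2}$
Q3 = 75th percentile = $x_{3(n+1)/4}$
Here the subscript is the position of the value in the ordered sample.
(The IQR is less sensitive to extreme values in the sample than the ordinary sample range.)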
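
The quartile positions (n+1)/4, 2(n+1)/4 and 3(n+1)/4 can be turned into a short calculation. This is a sketch under assumptions: the data are made up, and when a position is not a whole number the code interpolates linearly between the two neighboring ordered values, a common convention that the slides do not spell out.

```python
def value_at_position(ordered, pos):
    """Value at 1-based position `pos` in ordered data.

    Non-integer positions are handled by linear interpolation
    between the neighboring values (an assumed convention).
    """
    lower = int(pos)
    frac = pos - lower
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Q1, Q2, Q3 at positions (n+1)/4, 2(n+1)/4, 3(n+1)/4 of the ordered data."""
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

# Made-up sample with n = 11, so the positions are 3, 6 and 9
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
sample_range = max(data) - min(data)

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr, "Range =", sample_range)
```

Python's standard library function `statistics.quantiles(data, n=4, method="exclusive")` uses the same (n+1)-based positions and can serve as a cross-check.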

Sample Variance & Sample Standard Deviation


n − 1: the degrees of freedom

The unit of the standard deviation ($s$) is the same as the unit of the data, while the unit of the variance ($s^2$) is the square of the unit of the data (if x is in grams, s is in grams and $s^2$ is in grams²).
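
The role of the n − 1 divisor can be seen in a small calculation. The sketch below assumes the standard definition of the sample variance, the sum of squared deviations from the sample mean divided by n − 1; the weights in grams are made up for the example.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance; same unit as the data."""
    return math.sqrt(sample_variance(values))

# Hypothetical weights in grams
weights = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(weights)  # unit: grams squared
s = sample_std(weights)        # unit: grams

print(f"s^2 = {s2:.4f} grams^2")
print(f"s   = {s:.4f} grams")
```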

Example 1

Textbook Exercises

Calculate the mode, range, and IQR.

Data Presentation
The dot diagram is a useful data display for small samples up to about
20 observations. However, when the number of observations is
moderately large, other graphical displays may be more useful. In this
chapter we will discuss three different types of graphs which are:

➢ Stem and Leaf Plot

➢ Relative Frequency Distribution & Histogram

➢ Box Plot

Stem and Leaf Plot


A summary of a collection of data via a graphical display can provide insight regarding the system from which the data were taken.

The stem and leaf plot is a combined tabular and graphical display which is useful for studying the behavior of the distribution.

How to construct the stem and leaf plot?

➢ Divide each number (xi) into two parts: a stem, consisting of the
leading digits, and a leaf, consisting of the remaining digit. Choose
relatively few stems in comparison with the number of observations
(between 5 and 20 stems).
➢ List the stem values in a vertical column.
➢ Record the leaf for each observation beside its stem.
➢ Write the units for the stems and leaves on the display.
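
The construction steps above can be sketched in Python. The sketch assumes non-negative integer data, takes the last digit as the leaf and the leading digits as the stem, and uses made-up ratings; the function names are illustrative.

```python
def stem_and_leaf(values):
    """Map each stem (leading digits) to its ordered list of leaves (last digit)."""
    plot = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 83 -> stem 8, leaf 3
        plot.setdefault(stem, []).append(leaf)
    return plot

def render(plot):
    """List the stems in a vertical column with the leaves beside them."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):  # keep empty stems visible
        leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)

# Hypothetical ratings
ratings = [72, 75, 81, 83, 83, 88, 90, 94, 97]
plot = stem_and_leaf(ratings)

print(render(plot))
print("Key: 7 | 2 = 72")  # units for the stems and leaves
```

Because the leaves are stored in order, the minimum, maximum, median and quartiles can be read directly off the display.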

Stem and Leaf Plot-Example 1

Minimum=? Maximum=? Range=? Mode=? Mean=? s=? $s^2$=?
Median=? Q1=? Q3=? IQR=? (data should be ordered)

Stem and Leaf Plots-Exercise

➢ How many of the footballers had a rating less than 85?
➢ Calculate the minimum rating, the maximum rating, and the range of ratings.
➢ Calculate the mean and the mode of the footballers’ ratings.
➢ Calculate the median, 25th percentile, 75th percentile, and IQR of the data.
➢ Calculate the variance and the standard deviation of the data.
Relative Frequency Distribution & Histogram
A frequency distribution is a more compact summary of data than a stem-and-leaf
diagram. To construct a frequency distribution, we must divide the range of the data into
intervals, which are usually called class intervals, cells, or bins. If possible, the bins should
be of equal width in order to enhance the visual information in the frequency distribution.

The number of bins depends on the number of observations and the amount of scatter or
dispersion in the data. A frequency distribution that uses either too few or too many bins
will not be informative. We usually find that between 5 and 20 bins is satisfactory in most
cases.
Choose the number of bins approximately equal to the square root of the number of observations:
Number of bins ≈ $\sqrt{n}$

The histogram is a visual display of the frequency distribution.


How to construct the relative frequency distribution & histogram?
➢ Determine the number of bins (≈ $\sqrt{n}$).
➢ Determine the smallest and the largest data points, and the bin width.
➢ Calculate the class midpoint, frequency, and relative frequency (divide the class frequency by the total number of observations).
➢ Use the relative frequency distribution to draw a histogram or a relative frequency histogram.

Example:
no. of bins ≈ $\sqrt{80}$ ≈ 9
Lower limit = 70, Upper limit = 250
bin width = (upper limit − lower limit) / no. of bins = 180 / 9 = 20
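
The binning steps can be sketched as follows. The limits 70 and 250 and the 9 bins come from the example above, but the ten observations are made up (the original 80 observations are not reproduced here), and placing a value equal to the upper limit in the last bin is an assumed convention.

```python
import math

def suggested_bins(n):
    """Rule of thumb: number of bins close to the square root of n."""
    return round(math.sqrt(n))

def relative_frequency_table(values, lower, upper, num_bins):
    """Rows of (bin start, bin end, midpoint, frequency, relative frequency)."""
    n = len(values)
    width = (upper - lower) / num_bins
    counts = [0] * num_bins
    for x in values:
        if not lower <= x <= upper:
            raise ValueError(f"{x} lies outside [{lower}, {upper}]")
        # A value equal to the upper limit is placed in the last bin
        index = min(int((x - lower) // width), num_bins - 1)
        counts[index] += 1
    table = []
    for i, count in enumerate(counts):
        start = lower + i * width
        table.append((start, start + width, start + width / 2, count, count / n))
    return table

print("Bins suggested for n = 80:", suggested_bins(80))

# Made-up observations within the limits of the example
values = [70, 85, 95, 110, 130, 130, 150, 175, 249, 250]
table = relative_frequency_table(values, lower=70, upper=250, num_bins=9)

for start, end, midpoint, freq, rel in table:
    print(f"{start:5.0f}-{end:<5.0f} mid={midpoint:5.0f} freq={freq} rel={rel:.2f}")
```

Plotting the relative frequencies against the bins as adjacent bars gives the relative frequency histogram.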
Relative Frequency Distribution & Histogram
A distribution is said to be symmetric if it can be folded along a vertical
axis so that the two sides coincide. A distribution that lacks symmetry
with respect to a vertical axis is said to be skewed (asymmetric).

As a rule of thumb for unimodal distributions:
mode < median < mean if the distribution is skewed to the right
mode > median > mean if the distribution is skewed to the left
mode = median = mean if the distribution is symmetric

(a) is negative or left skewed (based on the location of the long tail), in which the median is greater than the mean: $\tilde{x} > \bar{x}$.

(b) is symmetric, in which the median equals the mean: $\tilde{x} = \bar{x}$.

(c) is positive or right skewed (based on the location of the long tail), in which the median is smaller than the mean: $\tilde{x} < \bar{x}$.
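
Comparing the median with the mean gives a quick numerical check of the direction of skew. The sketch below is a rule-of-thumb indicator only, not a formal skewness measure; the three samples and the function name `skew_direction` are made up for illustration.

```python
import statistics

def skew_direction(values, tol=1e-9):
    """Classify skew by comparing the sample mean with the sample median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean - median > tol:
        return "right skewed (positive)"  # long tail on the right
    if median - mean > tol:
        return "left skewed (negative)"   # long tail on the left
    return "symmetric"

# Made-up samples
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]
balanced = [1, 2, 3, 4, 5]

for name, sample in [("right_tail", right_tail),
                     ("left_tail", left_tail),
                     ("balanced", balanced)]:
    print(name, "->", skew_direction(sample))
```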
Box-and-Whisker Plot or Box Plot
The box plot (vertical or horizontal) is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of unusual observations or outliers.

Median (Q2)=? Q1=? Q3=? IQR=? Minimum=? Maximum=? Range=? Outliers=? Extreme Outliers=? Symmetry?
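
The quantities read off a box plot can also be computed directly. The sketch below assumes a common convention that is not stated in the text: points more than 1.5 IQR beyond the box edges are outliers, and points more than 3 IQR beyond them are extreme outliers. The data are made up, and the quartiles use the (n+1)-based positions via `statistics.quantiles`.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, IQR, range, and outliers under the assumed 1.5/3 IQR fences."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # outlier fences
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # extreme fences
    extreme = [x for x in ordered if x < outer_low or x > outer_high]
    outliers = [x for x in ordered
                if (x < inner_low or x > inner_high) and x not in extreme]
    return {
        "min": ordered[0], "max": ordered[-1], "range": ordered[-1] - ordered[0],
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "outliers": outliers, "extreme_outliers": extreme,
    }

# Made-up sample with n = 11
data = [10, 30, 32, 34, 35, 36, 37, 38, 40, 55, 90]
summary = box_plot_summary(data)

for key, value in summary.items():
    print(f"{key}: {value}")
```

Comparing the distances from the median to Q1 and to Q3, and the lengths of the two whiskers, gives a first indication of symmetry.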
Box-and-Whisker Plot or Box Plot


The box plots represent the % sensitivity at different window sizes. Find the following values for each of them:

Median (Q2)=? Q1=? Q3=? IQR=? Minimum=? Maximum=? Range=? Outliers=? Extreme Outliers=? Symmetry?

Box-and-Whisker Plot or Box Plot

Median (Q2)=? Q1=? Q3=? IQR=? Minimum=? Maximum=? Range=? Outliers=? Extreme Outliers=? Symmetry?

Box-and-Whisker Plot or Box Plot


Median (Q2)=? Q1=? Q3=? IQR=? Minimum=? Maximum=? Range=? Outliers=? Extreme Outliers=? Symmetry?

There is too much variability at plant 2.

Plants 2 and 3 need to raise their quality index performance.
Course Structure
