Data Visualization: Are Merely Labels, Codes or Mutually Exclusive Categories
• Categorical (non-numeric) data: data that are merely labels, codes or mutually
exclusive categories.
[Figure: bar chart of frequency by ice cream flavour (Vanilla, Strawberry, Butterscotch), with an accompanying chart of percentage shares; horizontal axis: Ice cream flavour, vertical axis: Frequency]
➢ Gender and education are categorical variables; the data given in Table 2 are further arranged
in Table 3 as a contingency table.
• Cross tabulation: the data can also be represented by a stacked or clustered chart, as
below:
Figure 3: Stacked chart for bivariate categorical data (UG and PG by gender: Male, Female)
Figure 4: Clustered chart for bivariate categorical data (UG and PG by gender: Male, Female)
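A contingency table of this kind can be built by counting each pair of categories. The sketch below is only an illustration: the records are hypothetical and are not the values of Table 2, and the helper name `contingency_rows` is an assumption.

```python
from collections import Counter

# Hypothetical (gender, education) records; these are NOT the values of Table 2.
records = [
    ("Male", "UG"), ("Male", "PG"), ("Female", "UG"),
    ("Female", "PG"), ("Male", "UG"), ("Female", "PG"),
]

# Count each (gender, education) pair to form the cells of the table.
counts = Counter(records)
genders = sorted({g for g, _ in records})
levels = sorted({e for _, e in records}, reverse=True)  # "UG" before "PG"

def contingency_rows():
    """Return one row per gender: [gender, count per level..., row total]."""
    rows = []
    for g in genders:
        cells = [counts[(g, e)] for e in levels]
        rows.append([g, *cells, sum(cells)])
    return rows

print(["Gender", *levels, "Total"])
for row in contingency_rows():
    print(row)
```

Each row of the result corresponds to one group of bars in a clustered chart, or one stacked bar in a stacked chart.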
Pareto Chart
➢ A Pareto chart is a type of chart that contains both bars and a line graph, where
individual values are represented in descending order by bars, and the cumulative total
is represented by the line.
1. A Pareto chart is used to prioritize the most frequent categories, so that effort is
directed where it gives the greatest overall improvement/impact.
2. Each bar usually represents a type of defect or problem. The height of the bar
represents any important unit of measure — often the frequency of occurrence or
cost.
3. The bars are presented in descending order (from tallest to shortest). Therefore,
you can see which defects are more frequent at a glance.
4. The line represents the cumulative percentage of defects.
Pareto chart for the defects in shirts

Types of defects   Frequency of defects   Cumulative frequency
Button defect      23                     23
Pocket defect      16                     39
Collar defect      10                     49
Cuff defect        7                      56
Sleeve defect      3                      59
Total              59

[Figure: Pareto chart for the defects in shirts; horizontal axis: Types of defects, vertical axis: Frequency; bars: Frequency of defects, line: Cumulative frequency]
Pareto chart for the defects in shirts (frequency and percentage)

Types of defect   Frequency of defects   Percentage   Cumulative %
Button defect     23                     38.98        38.98
Pocket defect     16                     27.12        66.10

[Figure: Pareto chart for the defects in shirts; vertical axis: Frequency/%; series: Frequency of defects, Percentage]
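The percentage and cumulative percentage columns of a Pareto table follow directly from the frequencies. The sketch below uses the shirt-defect counts from this section; the function name `pareto_table` is an assumption made for illustration.

```python
from itertools import accumulate

# Defect counts from the shirt example in this section.
defects = {
    "Button defect": 23,
    "Pocket defect": 16,
    "Collar defect": 10,
    "Cuff defect": 7,
    "Sleeve defect": 3,
}

def pareto_table(counts):
    """Return rows of (category, frequency, percentage, cumulative percentage),
    sorted by frequency from largest to smallest."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    cumulative = accumulate(freq for _, freq in ordered)
    return [
        (name, freq, 100 * freq / total, 100 * cum / total)
        for (name, freq), cum in zip(ordered, cumulative)
    ]

for name, freq, pct, cum_pct in pareto_table(defects):
    print(f"{name:<15}{freq:>4}{pct:>8.2f}{cum_pct:>8.2f}")
```

The bars of the chart plot the frequency column and the line plots the cumulative percentage column.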
Organizing numerical data: Ordered array
• An ordered array is a sequence of data in rank order, from the smallest value to the largest
value
• Shows the range (minimum to maximum)
• May help identify outliers (unusual observations)
• Shows which values appear more than once
• Divides the data into sections (e.g. for day students, 1/3rd of the data lie below 18 and
2/3rd of the data lie below 22)
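The points above can be sketched in a few lines. The ages below are hypothetical values chosen for illustration; they are not the day-student data referred to in the slide.

```python
from collections import Counter

# Hypothetical ages (illustrative values only, not the slide's data).
ages = [21, 18, 19, 17, 22, 18, 20, 32, 19, 23, 16, 25]

ordered = sorted(ages)                       # the ordered array
smallest, largest = ordered[0], ordered[-1]  # range: minimum to maximum
repeated = {v: n for v, n in Counter(ordered).items() if n > 1}

# Largest value in the first third and in the first two thirds of the array.
n = len(ordered)
third_cut = ordered[n // 3 - 1]
two_thirds_cut = ordered[2 * n // 3 - 1]

print(ordered)
print(smallest, largest, repeated, third_cut, two_thirds_cut)
```

In this made-up sample the value 32 stands apart from the rest at the end of the ordered array, which is the kind of observation an ordered array makes easy to spot.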
Stem and Leaf display
• A simple way to see how the data is distributed and where concentrations of data exist.
➢Method: separate each data value into leading digits (the stems) and trailing digits (the leaves).
➢A stem-and-leaf diagram organises data into groups called stems, so that the values within
each group, called leaves, branch out to the right on each row.
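The method can be sketched as below, assuming non-negative integers whose last digit is the leaf and whose remaining digits are the stem. The data values and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit).

    Assumes non-negative integers: stem = value // 10, leaf = value % 10.
    """
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(display)

# Hypothetical data (illustrative values only).
data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The longest row shows where the data are concentrated; here most values fall in the twenties.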
Frequency distribution
➢The frequency distribution is a summary table in which the data are arranged into numerically
ordered classes.
➢Care is needed in selecting the number of classes, determining a suitable class width, and
establishing class boundaries that do not overlap.
➢The number of classes depends on the number of values in the data: the more values, the
more classes. In general, a frequency distribution should have at least 5 but no more than 15
classes.
➢To determine the width of a class interval, divide the range (highest value - lowest value) by the
number of classes desired.
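As a small worked example of the width rule, the sketch below uses the lowest and highest salaries of Table 5 (1800 and 9600) and nine classes, as in Table 7. Rounding the raw width up to a convenient value is an assumption added here, not a rule stated in the slides.

```python
import math

def class_width(lowest, highest, num_classes):
    """Raw class width: the range divided by the desired number of classes."""
    return (highest - lowest) / num_classes

def rounded_width(width, step):
    """Round the raw width up to the next multiple of `step` (a convenience choice)."""
    return math.ceil(width / step) * step

raw = class_width(1800, 9600, 9)     # (9600 - 1800) / 9
print(round(raw, 2))                 # raw width
print(rounded_width(raw, 1000))      # convenient width
```

A width of 1000 starting at 1000 gives classes 1000-2000, 2000-3000 and so on, which cover every salary in the data.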
Consider 10 employees of an organisation (Table 4). This type of arrangement of data is
called an ungrouped data distribution. Note that each salary value has a frequency of
one. With so few data points there is no need to construct a frequency distribution. Now add
10 more employees to Table 4, as in Table 5.
Table 4: Data on salaries of 10 employees
Employee   Salary($)
1          1800
2          2800
3          3400
4          4200
5          4800
6          5400
7          6400
8          7800
9          8600
10         9600

Table 5: Data on salaries of 20 employees
Employee   Salary($)   Employee   Salary($)
1          1800        11         2800
2          2800        12         3400
3          3400        13         4200
4          4200        14         4800
5          4800        15         5400
6          5400        16         5800
7          6400        17         6400
8          7800        18         6400
9          8600        19         7800
10         9600        20         9600
➢The discrete frequency distribution of the salary data of 20 employees is given in Table 6.
Table 7 presents the grouped (continuous) frequency distribution of the same data.
Table 6: Discrete frequency distribution of salaries of employees
Salary   Frequency(f)
1800     1
2800     2
3400     2
4200     2
4800     2
5400     2
6400     3
7800     3
8600     1
9600     2

Table 7: Continuous frequency distribution
Salary(X)     Frequency(f)   Relative frequency   % Frequency(f)
1000-2000     1              0.05                 5
2000-3000     2              0.10                 10
3000-4000     2              0.10                 10
4000-5000     4              0.20                 20
5000-6000     2              0.10                 10
6000-7000     3              0.15                 15
7000-8000     3              0.15                 15
8000-9000     1              0.05                 5
9000-10000    2              0.10                 10
Total         𝛴𝑓 = 20        1.00                 100

Note that class intervals are mutually exclusive and the upper limit of a class is included in the
next interval. For example, if one more employee with salary $7000 joins, he/she will be included
in the 7000-8000 class but not in the 6000-7000 class.
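The inclusion rule and the relative and percentage frequency columns of Table 7 can be sketched as follows. The class frequencies are taken from Table 7; the function name `find_class` is an assumption made for illustration.

```python
# Class frequencies from Table 7, keyed by (lower limit, upper limit).
frequencies = {
    (1000, 2000): 1, (2000, 3000): 2, (3000, 4000): 2,
    (4000, 5000): 4, (5000, 6000): 2, (6000, 7000): 3,
    (7000, 8000): 3, (8000, 9000): 1, (9000, 10000): 2,
}

def find_class(salary, classes):
    """Return the class containing `salary`: the lower limit is inclusive and
    the upper limit is exclusive, so the classes are mutually exclusive."""
    for lower, upper in classes:
        if lower <= salary < upper:
            return (lower, upper)
    raise ValueError(f"{salary} is outside every class")

total = sum(frequencies.values())
relative = {cls: f / total for cls, f in frequencies.items()}
percent = {cls: 100 * f / total for cls, f in frequencies.items()}

print(find_class(7000, frequencies))   # the class a salary of 7000 falls in
print(relative[(4000, 5000)], percent[(4000, 5000)])
```

Because every salary matches exactly one class, the relative frequencies sum to 1 and the percentages sum to 100.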
• Class intervals are based on the range of the continuous variable under study. For example, in
Table 7, 2000-3000 is a class interval with lower limit (L/l) 2000 and upper limit
(U/u) 3000.
• The mid value of any class interval is calculated by, mid value(x)=(l+u)/2
For class interval 2000-3000, mid value=(2000+3000)/2=2500.
• Cumulative frequency(C.F) for a particular class interval is the sum of frequency of this class
and all previous classes.
C.F of a class=Frequency of the class + Sum of frequencies of previous classes
• C.F for the last class interval is always equal to the sum of all the frequencies in the
distribution

Table 8: Calculations for Mid value and C.F
Salary(X)    Frequency(f)   Mid value(x)   C.F
1000-2000    1              1500           1
2000-3000    2              2500           3
3000-4000    2              3500           5
4000-5000    4              4500           9
5000-6000    2              5500           11
Total        𝛴𝑓 = 11
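The mid value and C.F columns of Table 8 can be reproduced with a short sketch. The classes and frequencies are those of Table 8; the variable names are illustrative.

```python
from itertools import accumulate

# Classes (lower limit, upper limit) and frequencies from Table 8.
classes = [(1000, 2000), (2000, 3000), (3000, 4000), (4000, 5000), (5000, 6000)]
freqs = [1, 2, 2, 4, 2]

# Mid value x = (l + u) / 2 for each class.
mid_values = [(l + u) / 2 for l, u in classes]

# C.F of a class = its own frequency + frequencies of all previous classes.
cum_freqs = list(accumulate(freqs))

for (l, u), f, x, cf in zip(classes, freqs, mid_values, cum_freqs):
    print(f"{l}-{u}  f={f}  x={x:.0f}  C.F={cf}")
```

The last C.F equals the sum of all the frequencies, as stated above.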
Why use frequency distribution?
• It condenses raw data into a more useful form
• It allows a quick visual interpretation of the data
• It enables the determination of the major characteristics of the data set, including where the
data are concentrated/clustered.
➢Frequency distribution: some tips
• Different class boundaries may provide different pictures for the same data
• Choosing different class boundaries may show up shifts in concentrations
• As the size of the data set increases, the impact of alterations in the selection of class
boundaries is greatly reduced
• When comparing two or more groups with different sample sizes, one must use either a
relative frequency or percentage frequency distribution
Graphical Presentation: The Histogram
• A vertical bar chart of the data in a frequency distribution is called a histogram
• In a histogram there are no gaps between adjacent bars
• The class boundaries or midpoints are shown on the horizontal axis
• The vertical axis is either frequency, relative frequency or percentage frequency
• The height of the bar represents the frequency, relative frequency or percentage frequency
Salary(X)    Frequency(f)   Mid value(x)
1000-2000    5              1500
2000-3000    8              2500
3000-4000    3              3500
4000-5000    2              4500
5000-6000    6              5500
6000-7000    12             6500

[Figure: Histogram; horizontal axis: Salary (class midpoints 1500 to 6500), vertical axis: Frequency]
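A quick way to see the shape of this distribution without a plotting library is a text histogram. The sketch below draws the bars horizontally rather than vertically, which is a simplification for plain-text output; the function name `text_histogram` is an assumption.

```python
# Classes and frequencies from the histogram example above.
table = [
    ("1000-2000", 5), ("2000-3000", 8), ("3000-4000", 3),
    ("4000-5000", 2), ("5000-6000", 6), ("6000-7000", 12),
]

def text_histogram(rows, symbol="#"):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{label:<10}{symbol * freq} ({freq})" for label, freq in rows]

print("\n".join(text_histogram(table)))
```

The rows are printed in class order with nothing between them, mirroring the rule that adjacent bars of a histogram have no gaps.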
Graphical Presentation: The Frequency Polygon, Ogive
➢Polygon: if we join the midpoints of the tops of the bars of the histogram, we get the frequency polygon.
➢Ogive: plots the cumulative frequency distribution of a variable. It is also known as the
cumulative frequency curve.
[Figure: Frequency polygon and ogive (cumulative frequency, CF) for the salary data; horizontal axis: Salary of people]
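The points to plot for the two curves can be derived from the frequency table of the histogram example. In the sketch below, plotting the ogive against the upper class limit is a common convention and an assumption here; the slides do not state which x-value is used.

```python
from itertools import accumulate

# Classes and frequencies from the histogram example in this section.
classes = [(1000, 2000), (2000, 3000), (3000, 4000),
           (4000, 5000), (5000, 6000), (6000, 7000)]
freqs = [5, 8, 3, 2, 6, 12]

# Frequency polygon: one point per class at (class midpoint, frequency).
polygon_points = [((l + u) / 2, f) for (l, u), f in zip(classes, freqs)]

# Ogive: one point per class at (upper class limit, cumulative frequency).
ogive_points = [(u, cf) for (_, u), cf in zip(classes, accumulate(freqs))]

print(polygon_points)
print(ogive_points)
```

The ogive never decreases and ends at the total frequency, while the polygon rises and falls with the class frequencies.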
Scatter Plot Example
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: Scatter plot; horizontal axis: Volume, vertical axis: Cost]
Time Series Plot-an example
➢A time series plot is used to study patterns in the values of a numeric variable over
time
➢In a time series plot, the numeric variable is measured on the vertical axis and the time
period is measured on the horizontal axis.
Year   No of Franchises
2005   43
2006   54
2007   60
2008   73
2009   82
2010   95
2011   107
2012   99
2013   95
2014   104

[Figure: Time series plot; horizontal axis: Year, vertical axis: No of Franchises]
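One simple way to read the pattern in these values is to look at the change from each year to the next. The sketch below uses the franchise counts from the table above; summarising the series by year-on-year change and peak year is an illustrative choice, not a method stated in the slides.

```python
# Number of franchises per year, from the table above.
franchises = {
    2005: 43, 2006: 54, 2007: 60, 2008: 73, 2009: 82,
    2010: 95, 2011: 107, 2012: 99, 2013: 95, 2014: 104,
}

years = sorted(franchises)

# Change from each year to the next (positive = growth, negative = decline).
changes = {
    year: franchises[year] - franchises[prev]
    for prev, year in zip(years, years[1:])
}

# Year with the largest number of franchises.
peak_year = max(franchises, key=franchises.get)

print(changes)
print(peak_year)
```

The signs of the changes show steady growth up to the peak, a dip over the following two years, and a recovery in the final year.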
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments
• The scale on the vertical axis should begin at zero
• All axes should be properly labelled
• The graph should contain a title
• The simplest possible graph should be used for a given data set
Graphical Errors: Chart Junk
Graphical Errors: No Relative Basis
Graphical Errors: Compressing the vertical data
Graphical Errors: No zero point on vertical axis