Data Visualization: Are Merely Labels, Codes or Mutually Exclusive Categories


Data Visualization

• Categorical (non-numeric) data: data whose values are merely labels, codes or mutually
exclusive categories.

➢ Frequency distribution (tabular representation): lists the different categories or labels of a
categorical variable along with their corresponding frequencies.
➢ Bar chart: plots the categories or labels of the categorical data on the x-axis and the
corresponding frequencies on the y-axis.
➢ Pie chart: a circular chart that plots the relative percentage share of each category of
the categorical variable.
➢ Example (ice-cream flavour): suppose the following table lists the responses of 50
people asked for their favourite ice-cream flavour.

Chocolate Vanilla Strawberry Chocolate Vanilla
Vanilla Chocolate Vanilla Strawberry Strawberry
Strawberry Strawberry Butterscotch Vanilla Chocolate
Butterscotch Vanilla Chocolate Strawberry Chocolate
Vanilla Chocolate Butterscotch Vanilla Butterscotch
Strawberry Butterscotch Chocolate Butterscotch Chocolate
Vanilla Chocolate Butterscotch Vanilla Chocolate
Vanilla Strawberry Butterscotch Strawberry Strawberry
Butterscotch Chocolate Vanilla Strawberry Butterscotch
Strawberry Butterscotch Chocolate Butterscotch Chocolate
Table-1: Frequency Table for categorical data
Ice-cream flavour   Frequency
Chocolate           14
Vanilla             12
Strawberry          12
Butterscotch        12
Total               50
The percentage share of a flavour, say Vanilla, is calculated as follows:
% share for Vanilla = (12/50)*100 = 24%
Similarly, Butterscotch = 24%, Chocolate = 28% and Strawberry = 24%.
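The tally and percentage shares can be reproduced with a short script. This is a minimal sketch using Python's standard library; the variable names (`responses`, `counts`, `shares`) are illustrative and not part of the original slides.

```python
from collections import Counter

# The 50 responses from the example, typed in row by row.
responses = """
Chocolate Vanilla Strawberry Chocolate Vanilla
Vanilla Chocolate Vanilla Strawberry Strawberry
Strawberry Strawberry Butterscotch Vanilla Chocolate
Butterscotch Vanilla Chocolate Strawberry Chocolate
Vanilla Chocolate Butterscotch Vanilla Butterscotch
Strawberry Butterscotch Chocolate Butterscotch Chocolate
Vanilla Chocolate Butterscotch Vanilla Chocolate
Vanilla Strawberry Butterscotch Strawberry Strawberry
Butterscotch Chocolate Vanilla Strawberry Butterscotch
Strawberry Butterscotch Chocolate Butterscotch Chocolate
""".split()

counts = Counter(responses)      # frequency of each flavour
total = sum(counts.values())     # number of respondents
# Percentage share = (frequency / total) * 100
shares = {flavour: 100 * n / total for flavour, n in counts.items()}

for flavour, n in counts.most_common():
    print(f"{flavour:<13}{n:>3}{shares[flavour]:>7.0f}%")
```

The same `counts` mapping is what a bar chart would plot as bar heights, and `shares` is what a pie chart would plot as slice sizes.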
Figure 1: Bar Chart for categorical data (x-axis: Ice cream flavour; y-axis: Frequency)
Figure 2: Pie Chart for categorical data (Chocolate 28%, Vanilla 24%, Strawberry 24%, Butterscotch 24%)
➢ Cross tabulation: presents the joint frequencies of the values or categories of two (or more)
variables. Such a table is also called a contingency table.
Table-2: Data on gender and education of employee
Employee No.   Gender   Education
1              Male     Undergraduate
2              Female   Postgraduate
3              Male     Undergraduate
4              Female   Undergraduate
5              Male     Postgraduate
6              Female   Undergraduate
7              Male     Postgraduate
8              Female   Undergraduate
9              Male     Postgraduate
10             Female   Postgraduate

Table-3: Contingency table for data on gender and education
Education       Gender          Total
                Male   Female
Undergraduate   2      3        5
Postgraduate    3      2        5
Total           5      5        10

➢ Gender and education are categorical variables. The data in Table-2 are rearranged
in Table-3 as a contingency table.
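The rearrangement from Table-2 to Table-3 amounts to counting each (gender, education) pair. A minimal sketch, assuming the ten records are typed in as tuples; the names `employees`, `joint` and `col_totals` are illustrative only.

```python
from collections import Counter

# (gender, education) for employees 1 to 10, as listed in Table-2.
employees = [
    ("Male", "Undergraduate"), ("Female", "Postgraduate"),
    ("Male", "Undergraduate"), ("Female", "Undergraduate"),
    ("Male", "Postgraduate"), ("Female", "Undergraduate"),
    ("Male", "Postgraduate"), ("Female", "Undergraduate"),
    ("Male", "Postgraduate"), ("Female", "Postgraduate"),
]

joint = Counter(employees)  # joint frequency of each (gender, education) pair
genders = ["Male", "Female"]
levels = ["Undergraduate", "Postgraduate"]

# Print the contingency table with row and column totals.
print(f"{'Education':<15}{'Male':>6}{'Female':>8}{'Total':>7}")
for level in levels:
    row = [joint[(g, level)] for g in genders]
    print(f"{level:<15}{row[0]:>6}{row[1]:>8}{sum(row):>7}")
col_totals = [sum(joint[(g, level)] for level in levels) for g in genders]
print(f"{'Total':<15}{col_totals[0]:>6}{col_totals[1]:>8}{sum(col_totals):>7}")
```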
• Cross-tabulated data can further be represented by a stacked or clustered chart, as
below:

Figure 3: Stacked chart for bivariate categorical data (UG and PG counts stacked for Male and Female)
Figure 4: Clustered chart for bivariate categorical data (UG and PG counts side by side for Male and Female)
Pareto Chart
➢ A Pareto chart is a type of chart that contains both bars and a line graph, where
individual values are represented in descending order by bars, and the cumulative total
is represented by the line.
1. A Pareto chart is used to prioritize the most frequent categories, so that effort goes where
it yields the greatest overall improvement/impact.
2. Each bar usually represents a type of defect or problem. The height of the bar
represents any important unit of measure, often the frequency of occurrence or
cost.
3. The bars are presented in descending order (from tallest to shortest). Therefore,
you can see which defects are more frequent at a glance.
4. The line represents the cumulative percentage of defects.
Pareto Chart for the defects in shirts (frequencies)
Types of defect   Frequency of defects   Cumulative frequency
Button defect     23                     23
Pocket defect     16                     39
Collar defect     10                     49
Cuff defect       7                      56
Sleeve defect     3                      59
Total             59
(Figure: Pareto chart; x-axis: Types of defects; y-axis: Frequency; bars: Frequency of defects; line: Cumulative frequency)
Pareto Chart for the defects in shirts (percentages)
Types of defect   Frequency of defects   Percentage   Cumulative %
Button defect     23                     38.98        38.98
Pocket defect     16                     27.12        66.10
Collar defect     10                     16.95        83.05
Cuff defect       7                      11.86        94.92
Sleeve defect     3                      5.08         100.00
Total             59                     100
(Figure: Pareto chart; x-axis: Types of defects; y-axis: Frequency/%; bars: Frequency of defects and Percentage; line: Cumulative %)
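The columns of the Pareto tables follow from sorting the defect counts in descending order and accumulating them. A minimal sketch, with the defect counts deliberately entered out of order to show the sorting step; the variable names are illustrative.

```python
from itertools import accumulate

# Defect counts for the shirts example, in arbitrary order.
defects = {
    "Collar defect": 10,
    "Button defect": 23,
    "Sleeve defect": 3,
    "Pocket defect": 16,
    "Cuff defect": 7,
}

# Bars: categories sorted from most to least frequent.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)
total = sum(defects.values())

# Line: running total of the sorted frequencies, also as a percentage.
cum_freq = list(accumulate(n for _, n in ordered))
cum_pct = [round(100 * c / total, 2) for c in cum_freq]

for (name, n), cf, cp in zip(ordered, cum_freq, cum_pct):
    print(f"{name:<15}{n:>4}{cf:>5}{cp:>8.2f}%")
```

Reading down the cumulative column shows that the first two defect types already account for about two thirds of all defects.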
Numeric/quantitative data: frequency tables and plots
Organizing numerical data: Ordered array
• An ordered array is a sequence of data in rank order, from the smallest value to the largest
value
• Shows the range (minimum to maximum)
• May help identify outliers (unusual observations)
• Shows which values appear more than once
• Lets the data be divided into sections (e.g. day students: 1/3 of the data below 18, 2/3 of the data below 22,
etc.)
Stem and Leaf display
• A simple way to see how the data are distributed and where concentrations of data exist.
➢ Method: separate each value in the data series into leading digits (the stems) and trailing digits (the leaves).
➢ A stem-and-leaf diagram organises the data into groups called stems, so that the values within
each group, called leaves, branch out to the right on each row.
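The method can be sketched in a few lines. The slides give no data set for this display, so the `ages` values below are made up purely for illustration, and the function name `stem_and_leaf` is hypothetical. The sketch assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem); the units digits are the leaves."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in order on each row
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)

# Made-up ages, used only to illustrate the display.
ages = [21, 38, 24, 41, 27, 30, 24, 26, 32, 27]

display = stem_and_leaf(ages)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

The longest row shows where the data are concentrated, here in the twenties.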
Frequency distribution
➢ The frequency distribution is a summary table in which the data are arranged into numerically
ordered classes.
➢ Care is needed in selecting the number of classes, determining a suitable class width,
and establishing the boundaries of each class so that the classes do not overlap.
➢ The number of classes depends on the number of values in the data. The more values there are, the
more classes are needed. In general, a frequency distribution should have at least 5 but no more than 15
classes.
➢ To determine the width of a class interval, divide the range (highest value - lowest value) by the
number of classes desired.
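As a worked illustration of the width rule, the sketch below uses the lowest and highest salaries from this section's salary example (1800 and 9600). The choice of 9 classes and the rounding up to a multiple of 1000 are assumptions made here for readability, not rules stated in the slides.

```python
import math

lowest, highest = 1800, 9600   # extreme salaries in the salary example
desired_classes = 9            # assumption: any count between 5 and 15 is acceptable

data_range = highest - lowest              # range = highest value - lowest value
raw_width = data_range / desired_classes   # width suggested by the rule

# Assumption: round the width up to a convenient multiple so that the
# class limits are easy to read.
step = 1000
width = math.ceil(raw_width / step) * step

print(f"range = {data_range}, raw width = {raw_width:.2f}, chosen width = {width}")
```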
Consider the salaries of 10 employees of an organisation (Table 4). This type of arrangement of data is
called an ungrouped data distribution. Note that each salary value has a frequency of
one. With so few data points there is no need to construct a frequency distribution. Now add 10 more
employees to Table 4, as in Table 5.
Table 4: Data on salaries of 10 employees
Employee   Salary($)
1          1800
2          2800
3          3400
4          4200
5          4800
6          5400
7          6400
8          7800
9          8600
10         9600

Table 5: Data on salaries of 20 employees
Employee   Salary($)   Employee   Salary($)
1          1800        11         2800
2          2800        12         3400
3          3400        13         4200
4          4200        14         4800
5          4800        15         5400
6          5400        16         5800
7          6400        17         6400
8          7800        18         6400
9          8600        19         7800
10         9600        20         9600
➢ The discrete frequency distribution of the salary data of 20 employees is given in Table 6. In
Table 7, the grouped frequency distribution of the salary data is presented.

Table 6: Discrete frequency distribution of salaries of employees
Salary   Frequency(f)
1800     1
2800     2
3400     2
4200     2
4800     2
5400     2
6400     3
7800     3
8600     1
9600     2

Table 7: Continuous frequency distribution
Salary(X)    Frequency(f)   Relative frequency   % Frequency
1000-2000    1              0.05                 5
2000-3000    2              0.10                 10
3000-4000    2              0.10                 10
4000-5000    4              0.20                 20
5000-6000    2              0.10                 10
6000-7000    3              0.15                 15
7000-8000    3              0.15                 15
8000-9000    1              0.05                 5
9000-10000   2              0.10                 10
Total        Σf = 20        1.00                 100

Note that the class intervals are mutually exclusive and the upper limit of a class is included in the next
interval. For example, if one more employee with salary $7000 joins, he/she will be included in the 7000-8000
class but not in the 6000-7000 class.
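The step from the discrete distribution (Table 6) to the grouped one (Table 7) can be sketched as follows. The helper `class_of` is a hypothetical name introduced here; it treats the lower limit of each class as inclusive and the upper limit as exclusive, which is the convention described in the note on Table 7.

```python
# Discrete distribution from Table 6: salary -> frequency.
discrete = {1800: 1, 2800: 2, 3400: 2, 4200: 2, 4800: 2,
            5400: 2, 6400: 3, 7800: 3, 8600: 1, 9600: 2}

def class_of(salary, start=1000, width=1000):
    """Return the (lower, upper) class containing salary.

    The lower limit is inclusive and the upper limit exclusive,
    so a salary of 7000 falls in 7000-8000, not 6000-7000.
    """
    lower = start + ((salary - start) // width) * width
    return (lower, lower + width)

# Add each salary's frequency to the class it belongs to.
grouped = {}
for salary, f in sorted(discrete.items()):
    key = class_of(salary)
    grouped[key] = grouped.get(key, 0) + f

n = sum(grouped.values())
relative = {key: f / n for key, f in grouped.items()}   # relative frequency
percent = {key: 100 * f / n for key, f in grouped.items()}  # % frequency

for (low, high), f in grouped.items():
    print(f"{low}-{high:<6}{f:>3}{relative[(low, high)]:>7.2f}{percent[(low, high)]:>6.0f}")
```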
• Class interval: class intervals are based on the range of the continuous variable under study. For example, in
Table 7, 2000-3000 is a class interval with lower limit (L/l) 2000 and upper limit
(U/u) 3000.
• The mid value of any class interval is calculated as mid value (x) = (l + u)/2.
For class interval 2000-3000, mid value = (2000 + 3000)/2 = 2500.
• The cumulative frequency (C.F) of a particular class interval is the sum of the frequency of that class
and of all previous classes.
C.F of a class = Frequency of the class + Sum of frequencies of previous classes
• The C.F of the last class interval is always equal to the sum of all the frequencies in the
distribution.

Table 8: Calculations for Mid value and C.F
Salary(X)    Frequency(f)   Mid value(x)   C.F
1000-2000    1              1500           1
2000-3000    2              2500           3
3000-4000    2              3500           5
4000-5000    4              4500           9
5000-6000    2              5500           11
Σf = 11
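The mid value and C.F columns of Table 8 can be checked with a short calculation. A minimal sketch; the list names are illustrative.

```python
from itertools import accumulate

# Class limits (l, u) and frequencies from Table 8.
classes = [(1000, 2000), (2000, 3000), (3000, 4000), (4000, 5000), (5000, 6000)]
freqs = [1, 2, 2, 4, 2]

# Mid value x = (l + u) / 2
mid_values = [(l + u) / 2 for l, u in classes]

# C.F of a class = its own frequency + frequencies of all previous classes
cum_freqs = list(accumulate(freqs))

for (l, u), f, x, cf in zip(classes, freqs, mid_values, cum_freqs):
    print(f"{l}-{u}  f={f}  x={x:.0f}  C.F={cf}")
```

The last C.F value equals the sum of all the frequencies, as stated in the final bullet.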
Why use frequency distribution?
• It condenses raw data into a more useful form
• It allows a quick visual interpretation of the data
• It enables the determination of the major characteristics of the data set, including where the
data are concentrated/clustered.
➢ Frequency distribution: some tips
• Different class boundaries may provide different pictures for the same data
• Choosing different class boundaries may show up shifts in concentrations
• As the size of the data set increases, the impact of alterations in the selection of class
boundaries is greatly reduced
• When comparing two or more groups with different sample sizes, one must use either a
relative frequency or percentage frequency distribution
Graphical Presentation: The Histogram
• A vertical bar chart of the data in a frequency distribution is called a histogram
• In a histogram there is no gap between adjacent bars
• The class boundaries or midpoints are shown on the horizontal axis
• The vertical axis shows either frequency, relative frequency or percentage frequency
• The height of each bar represents the frequency, relative frequency or percentage frequency
Salary(X)    Frequency(f)   Mid value(x)
1000-2000    5              1500
2000-3000    8              2500
3000-4000    3              3500
4000-5000    2              4500
5000-6000    6              5500
6000-7000    12             6500
(Figure: Histogram; horizontal axis: Salary, labelled by class midpoints 1500 to 6500; vertical axis: Frequency)
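A rough text-only rendering of this histogram is sketched below. It is an illustration rather than a substitute for a plotting library: the bars are drawn sideways (one row per class), whereas the histogram described above uses vertical bars. Bar length is the class frequency, and the percentage frequency is shown alongside.

```python
# Classes and frequencies from the histogram example.
classes = ["1000-2000", "2000-3000", "3000-4000",
           "4000-5000", "5000-6000", "6000-7000"]
freqs = [5, 8, 3, 2, 6, 12]
n = sum(freqs)

# One row per class with no blank rows in between, mirroring the
# absence of gaps between adjacent histogram bars.
rows = [
    f"{label} | {'#' * f} ({f}, {100 * f / n:.1f}%)"
    for label, f in zip(classes, freqs)
]
print("\n".join(rows))
```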
Graphical Presentation: The Frequency Polygon, Ogive
➢ Polygon: if we join the midpoints of the tops of the bars of the histogram, we get the frequency polygon.
➢ Ogive: plots the cumulative frequency distribution of a variable. It is also known as the
cumulative frequency curve.

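(Figure: Frequency polygon; horizontal axis: Salary, class midpoints 1500 to 6500; vertical axis: Frequency; plotted values 5, 8, 3, 2, 6, 12)
(Figure: Ogive; horizontal axis: Salary, class midpoints 1500 to 6500; vertical axis: CF; plotted values 5, 13, 16, 18, 24, 36)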
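The points plotted in the two figures can be derived from the histogram data. A minimal sketch, assuming (as the axis labels of the figures suggest) that both curves are drawn against the class midpoints; the names `polygon_points` and `ogive_points` are illustrative.

```python
from itertools import accumulate

# Class midpoints and frequencies from the histogram example.
mid_values = [1500, 2500, 3500, 4500, 5500, 6500]
freqs = [5, 8, 3, 2, 6, 12]

# Frequency polygon: join the points (midpoint, frequency).
polygon_points = list(zip(mid_values, freqs))

# Ogive: join the points (midpoint, cumulative frequency).
ogive_points = list(zip(mid_values, accumulate(freqs)))

print("polygon:", polygon_points)
print("ogive:  ", ogive_points)
```

Because cumulative frequencies can only stay level or increase, the ogive never slopes downward, unlike the polygon.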
Graphical Presentation: Scatter diagram
• The histogram and the ogive plot univariate data. A scatter diagram plots data on two metric
variables. A trend line can also be fitted to the scatter plot. A trend line is a line that
provides an approximation of the pattern of the relationship between the two metric
variables. Thus a scatter plot is used to examine a possible relationship between two
numerical variables.
• One variable is measured on the horizontal axis and the other variable is measured on
the vertical axis.
(Figure: scatter plot "AvAge"; horizontal axis: Salary of people; vertical axis: Avg age)
Scatter Plot Example
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200
(Figure: scatter plot; horizontal axis: Volume; vertical axis: Cost)
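One common way to fit a trend line to such a scatter plot is ordinary least squares. The slides do not say which method was used, so the sketch below is an assumption: it fits cost = intercept + slope * volume to the volume and cost data of this example.

```python
# Volume per day (x) and cost per day (y) from the scatter plot example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]

n = len(volume)
mean_x = sum(volume) / n
mean_y = sum(cost) / n

# Least-squares slope = S_xy / S_xx; the line passes through (mean_x, mean_y).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(volume, cost))
s_xx = sum((x - mean_x) ** 2 for x in volume)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"cost = {intercept:.2f} + {slope:.2f} * volume")
```

The positive slope (about 1.92) matches the upward pattern in the data: cost per day rises as volume per day rises.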
Time Series Plot: an example
➢ A time series plot is used to study patterns in the values of a numeric variable over
time.
➢ In a time series plot, the numeric variable is measured on the vertical axis and the time
period on the horizontal axis.
Year   No. of Franchises
2005   43
2006   54
2007   60
2008   73
2009   82
2010   95
2011   107
2012   99
2013   95
2014   104
(Figure: time series plot; horizontal axis: Year; vertical axis: No. of Franchises)
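The pattern that the time series plot reveals can also be read from the year-to-year changes. A minimal sketch using the franchise data of this example; the variable names are illustrative.

```python
# Number of franchises per year, from the time series example.
franchises = {2005: 43, 2006: 54, 2007: 60, 2008: 73, 2009: 82,
              2010: 95, 2011: 107, 2012: 99, 2013: 95, 2014: 104}

years = sorted(franchises)

# Change from the previous year; a negative value marks a decline.
changes = {y: franchises[y] - franchises[y - 1] for y in years[1:]}

peak_year = max(franchises, key=franchises.get)
declines = [y for y, d in changes.items() if d < 0]

for y in years[1:]:
    print(f"{y}: {franchises[y]:>4} ({changes[y]:+d})")
print("peak year:", peak_year, "| years with a decline:", declines)
```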
Principles of an Excellent Graph
• The graph should not distort the data
• The graph should not contain unnecessary adornments
• The scale on the vertical axis should begin at zero
• All axes should be properly labelled
• The graph should contain a title
• The simplest possible graph should be used for a given data set
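The effect of a vertical axis that does not begin at zero can be quantified by comparing drawn bar heights. The sketch below uses the Chocolate (14) and Vanilla (12) frequencies from the ice-cream example; the baseline of 11 is a hypothetical axis start chosen for illustration, and `apparent_ratio` is a name introduced here.

```python
def apparent_ratio(a, b, baseline):
    """Ratio of the drawn bar heights when the vertical axis starts at baseline."""
    return (a - baseline) / (b - baseline)

chocolate, vanilla = 14, 12

# Axis starting at zero: the bars are drawn in their true proportion.
true_ratio = apparent_ratio(chocolate, vanilla, baseline=0)

# Hypothetical axis starting at 11: the same bars look very different.
truncated_ratio = apparent_ratio(chocolate, vanilla, baseline=11)

print(f"true ratio: {true_ratio:.2f}, ratio with truncated axis: {truncated_ratio:.2f}")
```

With the truncated axis the Chocolate bar appears three times as tall as the Vanilla bar, although its frequency is only about 17% higher.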
Graphical Errors
• Chart junk
• No relative basis
• Compressing the vertical data
• No zero point on the vertical axis
