Confirmatory Data Analysis (CFA)


Module 5

Examine the Data

A. What is Exploratory Data Analysis?


Data examination uses exploratory data analysis (EDA), which explores and reduces the data using
descriptive statistics and some preliminary graphical displays of data. Data analysis and interpretation uses
confirmatory data analysis (CFA), an analytical process guided by classical statistical inference in its use of
significance testing and confidence to determine whether variables are independent; it is required when attempting to
prove causation. Exploratory data analysis is the first step in the search for evidence, without which confirmatory
analysis has nothing to evaluate.

[Figure: Role of EDA in the Research Process. Flowchart labels: Measurement Instrument; Preliminary Analysis
Planning; Refine Hypothesis; Data Visualization; Collect and Enter Data; Exploratory Data Analysis; Descriptive
Statistics of Variables; Edit Data; Recode Variables; Cross-tabulate Variables; Prepare Data Displays.]

The guide for data examination is the preliminary analysis plan, the foundation for the development of the
measurement instrument. During EDA, the researcher has the flexibility to respond to the patterns revealed in the
preliminary summaries of the data. This flexibility is an important attribute of the process. Because it doesn’t follow a
rigid structure, EDA is free to take many paths in unravelling the mysteries in the data. While numerical summaries
may start the process, visual representations and graphical techniques offer major contributions. When numerical
summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on
flawed assumptions. For these reasons, exploratory data analysis should begin with visual inspection. After that, it is
not only possible but also desirable to cycle between exploratory and confirmatory approaches.

Frequency Tables
Several techniques are essential to any data examination.
A frequency table is a simple device for arraying data. It arrays data by assigned response code values, from
lowest to highest value, with columns for count, percent, valid percent, and cumulative percent.
Example
A Frequency Table (Minimum Age for Social Networking)

Value Label        Value  Frequency  Percent  Valid Percent  Cumulative Percent
21 years old         1        60        6           6                 6
18 years old min     2       180       18          18                24
16 years old min     3       330       33          33                57
13 years old min     4       280       28          28                85
10 years old min     5        50        5           5                90
Any age              6        60        6           6                96
No opinion           7        40        4           4               100
Total                      1,000      100         100

This example frequency table for a nominal variable describes the perceived desirable minimum age to be permitted to
own a social networking account.
The values and percentages are more readily understood in a graphic format. When the variable of interest is measured
at an interval-ratio level and has many potential values, other techniques are more appropriate.
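
The columns of the frequency table can be derived from the counts alone. The following is a minimal Python sketch using the frequencies from the example; the function name `frequency_table` and the `missing_codes` parameter are illustrative assumptions, and valid percent is assumed to mean the percentage among non-missing responses (there are none missing here, so it equals percent).

```python
# Frequencies by response code, copied from the example table.
counts = {1: 60, 2: 180, 3: 330, 4: 280, 5: 50, 6: 60, 7: 40}

def frequency_table(counts, missing_codes=()):
    """Return rows of (value, frequency, percent, valid percent, cumulative percent)."""
    total = sum(counts.values())
    # Assumption: "valid" means responses whose code is not flagged as missing.
    valid_total = sum(n for code, n in counts.items() if code not in missing_codes)
    rows = []
    cumulative = 0.0
    for code in sorted(counts):  # lowest to highest code value
        n = counts[code]
        percent = 100 * n / total
        if code in missing_codes:
            rows.append((code, n, percent, None, None))
            continue
        valid = 100 * n / valid_total
        cumulative += valid
        rows.append((code, n, percent, valid, cumulative))
    return rows

rows = frequency_table(counts)
for code, n, pct, valid, cum in rows:
    print(f"{code:>5} {n:>9} {pct:>7.0f} {valid:>13.0f} {cum:>18.0f}")
```

The cumulative column is a running sum of the valid percentages, which is why it reaches 100 on the last row.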

Histograms
The histogram is a conventional solution for the display of interval-ratio data. Histograms are used when it is
possible to group the variable's values into intervals. A histogram is constructed with one bar per interval: the
number of observations in the interval determines the height of the bar, and every bar has the same width, so each
observation occupies an equal amount of area within the graph.
Example

A histogram of the exam results is shown above. Each interval for the variable of interest, points, is shown on
the horizontal axis; the frequency of students, or number of observations in each interval, is on the vertical axis. The
value at the start of each interval is noted at the left of the bar on the horizontal axis. The height of the bar corresponds
to the frequency, or number of observations, in the interval above which it is erected. This histogram was constructed
with intervals 10 points wide.
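
The grouping step behind a histogram can be sketched in a few lines of Python. The scores below are hypothetical, invented for illustration (they are not the data behind the figure), and the names `bin_counts` and `WIDTH` are assumptions of this sketch.

```python
from collections import Counter

WIDTH = 10  # interval width, matching the 10-point intervals described in the text

# Hypothetical exam scores, invented for illustration; not the data behind the figure.
scores = [42, 55, 58, 61, 63, 67, 68, 70, 71, 74,
          75, 77, 78, 79, 82, 84, 85, 88, 91, 96]

def bin_counts(values, width=WIDTH):
    """Count observations per interval [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    # Include empty intervals so gaps in the distribution stay visible.
    return {start: counts.get(start, 0)
            for start in range(min(counts), max(counts) + width, width)}

histogram = bin_counts(scores)
for start, n in histogram.items():
    # One asterisk per observation, so each observation takes equal area.
    print(f"{start:>3}-{start + WIDTH - 1:<3} | {'*' * n}")
```

Printing one asterisk per observation gives a sideways text histogram in which bar length plays the role of bar height.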

Stem-and-leaf Displays
The stem-and-leaf display is a technique that is closely related to the histogram. It shares some of the histogram's
features but offers several unique advantages. It is easy to construct by hand for small samples or may be produced by
computer programs. In contrast to histograms, which lose information by grouping data values into intervals, the
stem-and-leaf display presents actual data values that can be inspected directly, without the use of enclosed bars as the
representation medium. This feature reveals the distribution of values within each interval and preserves their rank
order for finding the median, quartiles, and other summary statistics. It also eases linking a specific observation back to
the data file and to the participant that produced it.
Visualization is the second advantage of stem-and-leaf displays. The range of values is apparent at a glance, and both
shape and spread impressions are immediate. Patterns in the data, such as gaps where no values exist, areas where
values are clustered, or outlying values that differ from the main body of data, are easily observed.

How to develop a stem-and-leaf display?


1. Arrange the first digit(s) of each data item to the left of a vertical line.
2. Place the last digit of each item to the right of the vertical line.
Note that any digit to the right of the decimal point is ignored. The last digit of each item is placed on the
horizontal row corresponding to its first digit(s).
Each line or row is a stem, and each piece of information on the stem is a leaf.
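
The two steps above can be sketched in Python as follows. The data values are hypothetical, invented for illustration, and the function name `stem_and_leaf` is an assumption of this sketch; it also assumes non-negative values.

```python
from collections import defaultdict

# Hypothetical values, invented for illustration.
data = [54.2, 55, 58, 61, 61, 67, 72, 75.9, 80, 104, 118]

def stem_and_leaf(values):
    """Map each stem (leading digits) to its leaves (last digit), in rank order."""
    stems = defaultdict(list)
    for v in sorted(values):
        whole = int(v)  # digits to the right of the decimal point are ignored
        stems[whole // 10].append(whole % 10)
    return dict(stems)

display = stem_and_leaf(data)
for stem in range(min(display), max(display) + 1):
    # Stems with no leaves are still printed, so gaps remain visible.
    leaves = "".join(str(d) for d in display.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Because the values are sorted before they are split, the leaves on each stem keep their rank order, which is what makes the median and quartiles easy to read off the display.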
Example

When rotated, a stem-and-leaf display takes on the properties of a histogram.


Pareto Diagrams
The Pareto diagram is a bar chart whose percentages sum to 100 percent. The data are derived from a multiple-choice,
single-response scale; a multiple-choice, multiple-response scale; or frequency counts of words from content analysis.
The participants' answers are sorted in decreasing importance, with bar height in descending order from left to right.
Example
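
The ordering behind a Pareto diagram can be sketched in Python. The categories and counts below are hypothetical, invented for illustration, and the function name `pareto_table` is an assumption of this sketch.

```python
# Hypothetical response counts, invented for illustration.
responses = {"Price": 45, "Service": 80, "Selection": 25, "Location": 50}

def pareto_table(counts):
    """Return (category, percent, cumulative percent) rows, largest category first."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # Sort by count, descending, so the bars fall from left to right.
    for category, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        percent = 100 * n / total
        cumulative += percent
        rows.append((category, percent, cumulative))
    return rows

pareto = pareto_table(responses)
for category, percent, cumulative in pareto:
    bar = "#" * round(percent / 2)
    print(f"{category:<10} {bar:<25} {percent:5.1f}% (cum. {cumulative:5.1f}%)")
```

The cumulative column ends at 100 percent, reflecting the defining property that the bars together account for all responses.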
Boxplots
The boxplot, or box-and-whisker plot, is another technique used frequently in exploratory data analysis.
A boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the distribution's
location, spread, shape, tail length, and outliers. Boxplots are extensions of the five-number summary of the
distribution. This summary consists of the median, the upper and lower quartiles, and the largest and smallest
observations. The median and quartiles are used because they are particularly resistant statistics. Resistant statistics
are unaffected by outliers and change only slightly in response to the replacement of small portions of the data set.

Example
Assume we are examining the following data set:
[5, 6, 6, 7, 7, 7, 8, 8, 9], mean = 7, standard deviation = 1.22
median = 7, lower quartile = 6, upper quartile = 8
If we replace the 9 with 90:
[5, 6, 6, 7, 7, 7, 8, 8, 90], mean = 16, standard deviation = 27.77
median = 7, lower quartile = 6, upper quartile = 8

Changing only one of nine values has disturbed the location and spread summaries to the point where they no longer
represent the other eight values. Both the mean and the standard deviation are considered non-resistant statistics;
they are susceptible to the effects of extreme values in the tails of the distribution and do not represent typical values
well under conditions of asymmetry. The standard deviation is particularly problematic because it is computed from
the squared deviations from the mean. In contrast, the median and the quartiles are highly resistant to change.
These characteristics of resistance are incorporated into the construction of boxplots.
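
The comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the helper name `summarize` is an assumption, the standard deviation is the sample (n - 1) version, and the quartiles use the module's default method; other quartile conventions can give slightly different values on other data sets.

```python
import statistics

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
altered = [5, 6, 6, 7, 7, 7, 8, 8, 90]  # the 9 replaced with 90

def summarize(values):
    """Return non-resistant (mean, stdev) and resistant (median, quartiles) summaries."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "stdev": round(statistics.stdev(values), 2),  # sample standard deviation
        "median": median,
        "q1": q1,
        "q3": q3,
    }

before = summarize(original)
after = summarize(altered)
print(before)
print(after)
```

Only the mean and standard deviation move when the 9 becomes 90; the median and both quartiles are identical in the two summaries.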
Boxplots may be constructed easily by hand or by computer programs. The basic ingredients of the boxplot include:
1. The rectangular box (encompasses 50% of the data values).
2. A center line (marks the median and goes through the width of the box).
3. The edges of the box, called hinges.
4. The "whiskers" (extend from the right and left hinges to the largest and smallest values that lie within
1.5 times the interquartile range (IQR) from either edge of the box).
When you are examining data, it is important to separate legitimate outliers from errors in measurement, editing,
coding, and data entry. Outliers, data points that lie more than 1.5 times the interquartile range beyond either edge
of the box, reflect unusual cases and are an important source of information for the study. They are displayed or given
special statistical treatment, or other portions of the data set are sometimes shielded from their effects. Extreme
outliers, however, can be data entry errors; these values should be corrected during editing.
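
The 1.5 x IQR rule can be sketched in Python using the altered data set from the earlier example. The function name `boxplot_stats` is an assumption of this sketch, and the hinges are approximated by the quartiles from `statistics.quantiles`; plotting programs may compute hinges slightly differently.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Return hinges, whisker ends, and outliers under the k x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # quartiles stand in for hinges
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "hinges": (q1, q3),
        "median": median,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme values within the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

stats = boxplot_stats([5, 6, 6, 7, 7, 7, 8, 8, 90])
print(stats)
```

Here the value 90 is flagged as an outlier, and the upper whisker stops at 8, the largest value within 1.5 times the IQR of the box. Whether 90 is a legitimate unusual case or a data entry error still has to be decided by inspecting the source record.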

The components of a box plot and their relationships.

Boxplots are an excellent diagnostic tool, especially when graphed on the same scale. Below is a summary of several
comparisons made when analysing boxplots.
The upper two plots are both symmetric, but one is larger than the other. Larger box widths are sometimes used
when the second variable, from the same measurement scale, comes from a larger sample size. The box width should
be proportional to the square root of the sample size, but not all plotting programs account for this. Right- and
left-skewed distributions and those with reduced spread are also presented clearly in the plot comparison. Finally, groups
may be compared by means of multiple plots. One variation, in which a notch at the median marks off a confidence
interval to test the equality of group medians, takes us a step closer to hypothesis testing. Here the box
returns to full width at the upper and lower confidence limits. When the intervals do not overlap, we can be
confident, at a specified confidence level, that the medians of the two populations are different.

Example
Boxplot Comparison of Customer Services

This example uses multiple boxplots to compare five sectors of a certain company's customer services. The overall
impression is one of several potential problems for the analyst: unequal variances, skewness, and extreme outliers.
Note the similarities of the profiles of finance and retailing in contrast to the high-tech and insurance sectors.

B. Cross-Tabulation
Your preliminary analysis plan contains multiple dummy tables designed to find patterns in your data. In EDA,
cross-tabulation is a first step for identifying relationships between variables. Cross-tabulation is a
technique for comparing data from two or more variables that results in a table. Cross-tabulation is used with
classification variables and the study's target variables (operationalized measurement questions). These tables have
rows and columns that correspond to the code values of each variable's categories. Each cell contains a count of
the cases of the joint classification and also the row, column, and total percentages. The number of row cells and
column cells is often used to designate the size of the table, as in a 2 x 2 table. Row and column totals, called
marginals, appear at the bottom and right "margins" of the table. They show separately the counts and percentages
of the rows and columns.
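
The cell contents and marginals described above can be sketched in Python. The records below are illustrative counts invented for this sketch, not values taken from the SPSS output, and the function name `crosstab` is an assumption.

```python
from collections import Counter

# Hypothetical (gender, overseas assignment) records, invented for illustration.
records = (
    [("Male", "Yes")] * 22 + [("Male", "No")] * 40
    + [("Female", "Yes")] * 8 + [("Female", "No")] * 30
)

def crosstab(pairs):
    """Return cell statistics plus row and column marginals for a two-way table."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)  # right-hand margin
    col_totals = Counter(col for _, col in pairs)  # bottom margin
    total = len(pairs)
    table = {}
    for (row, col), n in cells.items():
        table[(row, col)] = {
            "count": n,
            "row_pct": 100 * n / row_totals[row],
            "col_pct": 100 * n / col_totals[col],
            "total_pct": 100 * n / total,
        }
    return table, row_totals, col_totals, total

table, row_totals, col_totals, total = crosstab(records)
for (row, col), cell in sorted(table.items()):
    print(f"{row:<6} {col:<3} n={cell['count']:>3} "
          f"row%={cell['row_pct']:5.1f} col%={cell['col_pct']:5.1f} total%={cell['total_pct']:5.1f}")
print("Row marginals:", dict(row_totals))
print("Column marginals:", dict(col_totals))
```

Each cell carries three percentages because the same count can be read against its row total, its column total, or the grand total; which one matters depends on the comparison being made.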

Example
SPSS Cross-Tabulation of Gender by Overseas Assignment Opportunity

Throughout Module 5, we have exploited the visual techniques of exploratory data analysis to look beyond numerical
summaries and gain insight into the patterns of the data. You have also learned the use of resistant statistics to
protect the study from the effects of extreme scores and occasional errors.

Check your Understanding

Answer the following.


1. You study the attrition of entering college freshmen (those students who enter college as freshmen but don’t
stay to graduate). You find the following relationships among attrition, aid and distance of home from
college.
a. What graphical display would you choose to understand the data? Why?
b. What is your interpretation of the data below?
                            Home Near:       Home Far:
             Aid            Receiving Aid    Receiving Aid
             Yes    No      Yes    No        Yes    No
             %      %       %      %         %      %
Drop out     25     20      5      15        30     40
Stay         75     80      95     85        70     60

2. Construct a stem-and-leaf display of the average annual purchases of Target's top 50 customers.

54 55 55 56 56 57 58 58 58 58
59 61 62 64 66 66 67 69 69 70
72 72 73 75 76 77 78 80 82 82
86 102 104 110 111 118 123 131 140 146
153 163 166 183 206 218 218 220 221 222
