Confirmatory Data Analysis (CFA)
Confirmatory data analysis (CFA) is an analytical process guided by classical statistical inference in its use of
significance testing and confidence to determine whether variables are independent; it is required when attempting to prove
causation. Exploratory data analysis (EDA) is the first step in the search for evidence, without which confirmatory analysis
has nothing to evaluate.
[Figure: flow diagram of the data exploration process, with boxes labelled Preliminary Analysis Planning, Measurement
Instrument, Collect and Enter Data, Data Visualization, and Refine Hypothesis]
The guide for data examination is the preliminary analysis plan, the foundation for the development of the
measurement instrument. During EDA, the researcher has the flexibility to respond to the patterns revealed in the
preliminary summaries of the data. This flexibility is an important attribute of the process. Because it doesn’t follow a
rigid structure, EDA is free to take many paths in unravelling the mysteries in the data. While numerical summaries
may start the process, visual representations and graphical techniques offer major contributions. When numerical
summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be based on
flawed assumptions. For these reasons, exploratory data analysis should begin with visual inspection. After that, it is
not only possible but also desirable to cycle between exploratory and confirmatory approaches.
Frequency Tables
Several techniques are essential to any data examination.
A frequency table is a simple device for arraying data. It arrays data by assigned response code value, from
lowest to highest, with columns for count, percent, valid percent and cumulative percent.
Example
A Frequency Table (Minimum Age for Social Networking)
Value Label         Value   Frequency   Percent   Valid Percent   Cumulative Percent
21 years old            1          60         6               6                    6
18 years old min        2         180        18              18                   24
16 years old min        3         330        33              33                   57
13 years old min        4         280        28              28                   85
10 years old min        5          50         5               5                   90
Any age                 6          60         6               6                   96
No opinion              7          40         4               4                  100
Total                           1,000       100             100
This example frequency table for a nominal variable describes the perceived desirable minimum age to be permitted to own a
social networking account.
The values and percentages are more readily understood in graphic format. When the variable of interest is measured
at an interval-ratio level and has many potential values, other techniques are more appropriate.
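As a minimal sketch of how such a table can be built, the snippet below rebuilds the raw responses from the counts in the table above and tallies them by code value. The function name frequency_table and the row layout are illustrative assumptions, not part of the original material. Because this example has no missing responses, the valid percent would equal the percent, so it is not computed separately.

```python
from collections import Counter

def frequency_table(codes, labels):
    """Return rows of (label, value, count, percent, cumulative percent)."""
    counts = Counter(codes)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for value in sorted(counts):              # lowest to highest code value
        percent = 100 * counts[value] / total
        cumulative += percent
        rows.append((labels[value], value, counts[value], percent, cumulative))
    return rows

# Counts and labels copied from the table above.
table_counts = {1: 60, 2: 180, 3: 330, 4: 280, 5: 50, 6: 60, 7: 40}
labels = {
    1: "21 years old", 2: "18 years old min", 3: "16 years old min",
    4: "13 years old min", 5: "10 years old min", 6: "Any age", 7: "No opinion",
}

# Expand the counts back into one response code per participant.
responses = [code for code, n in table_counts.items() for _ in range(n)]

rows = frequency_table(responses, labels)
for label, value, count, percent, cumulative in rows:
    print(f"{label:<18}{value:>3}{count:>7}{percent:>7.0f}{cumulative:>7.0f}")
```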
Histograms
The histogram is a conventional solution for the display of interval-ratio data. Histograms are used when it is
possible to group the variable’s values into intervals. Histograms are constructed with bars that represent each
interval, where the number of observations in the interval determines the height of the bar and every bar has the same
width, so that each observation occupies an equal amount of area within the graph.
Example
A histogram of exam results is shown above. Each interval of the variable of interest, points, is shown on
the horizontal axis; the frequency, or number of students, in each interval is on the vertical axis. The
starting value of each interval is noted at the left of its bar on the horizontal axis. The height of the bar corresponds
to the frequency of observations in the interval above which it is erected. This histogram was constructed
with intervals 10 points wide.
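The grouping step behind a histogram can be sketched as follows. The exam scores are hypothetical values invented for illustration (the original data are not given), and histogram_counts is an assumed helper name. Each score is assigned to the interval that starts at the nearest lower multiple of the interval width, and the counts become the bar heights.

```python
from collections import Counter

def histogram_counts(values, width):
    """Count observations per interval [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    first, last = min(counts), max(counts)
    # Keep empty intervals so gaps in the distribution remain visible.
    return {start: counts.get(start, 0) for start in range(first, last + width, width)}

# Hypothetical exam scores (whole points), for illustration only.
scores = [42, 55, 58, 61, 64, 67, 68, 70, 72, 73, 75, 78, 79, 81, 84, 85, 88, 91, 95]

bins = histogram_counts(scores, 10)
for start, count in bins.items():
    # One asterisk per observation stands in for the height of the bar.
    print(f"{start:>3} | {'*' * count}")
```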
Stem-and-leaf Displays
The stem-and-leaf display is a technique that is closely related to the histogram. It shares some of the histogram’s
features but offers several unique advantages. It is easy to construct by hand for small samples or may be produced by
computer programs. In contrast to histograms, which lose information by grouping data values into intervals, the
stem-and-leaf display presents actual data values that can be inspected directly, without the use of enclosed bars as the
representation medium. This feature reveals the distribution of values within each interval and preserves their rank
order for finding the median, quartiles, and other summary statistics. It also eases linking a specific observation back to
the data file and to the participant who produced it.
Visualization is the second advantage of stem-and-leaf displays. The range of values is apparent at a glance, and both
shape and spread impressions are immediate. Patterns in the data (such as gaps where no values exist, areas where
values are clustered, or outlying values that differ from the main body of data) are easily observed.
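A stem-and-leaf display can be sketched in a few lines. The data below are hypothetical two-digit values chosen to show a gap and an outlying value, and stem_and_leaf is an assumed helper name. The tens digit is used as the stem and the units digit as the leaf, which is one common convention and not the only one.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit)."""
    display = defaultdict(list)
    for v in sorted(values):                  # sorting preserves rank order
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    # Include empty stems so gaps where no values exist stay visible.
    return {s: display.get(s, []) for s in range(min(display), max(display) + 1)}

# Hypothetical whole-number observations, for illustration only.
data = [23, 12, 36, 21, 59, 15, 28, 34, 23, 37]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every original value can be read back from the display, which is what makes it possible to trace an observation to the data file.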
Boxplots
Example
Assume we are examining the following data set:
[5, 6, 6, 7, 7, 7, 8, 8, 9], mean = 7, standard deviation = 1.22
median = 7, lower quartile = 6, upper quartile = 8
If we replace the 9 with 90:
[5, 6, 6, 7, 7, 7, 8, 8, 90], mean = 16, standard deviation = 27.77
median = 7, lower quartile = 6, upper quartile = 8
Changing only one of nine values has disturbed the location and spread summaries to the point where they no longer
represent the other eight values. Both the mean and the standard deviation are considered non-resistant statistics;
they are susceptible to the effects of extreme values in the tails of the distribution and do not represent typical values
well under conditions of asymmetry. The standard deviation is particularly problematic because it is computed from
the squared deviations from the mean. In contrast, the median and the quartiles are highly resistant to
change. These characteristics of resistance are incorporated into the construction of boxplots.
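The comparison above can be checked with Python’s standard statistics module. This is a sketch under two assumptions: the standard deviation is the sample form (dividing by n − 1), and the quartiles use the module’s default "exclusive" method, which is one of several quartile conventions. The summarize helper is an illustrative name.

```python
from statistics import mean, median, quantiles, stdev

def summarize(data):
    """Location and spread summaries for one data set."""
    q1, _, q3 = quantiles(data, n=4)          # default "exclusive" method
    return {
        "mean": mean(data),
        "sd": round(stdev(data), 2),          # sample standard deviation (n - 1)
        "median": median(data),
        "q1": q1,
        "q3": q3,
    }

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
altered = [5, 6, 6, 7, 7, 7, 8, 8, 90]        # the 9 replaced by 90

before = summarize(original)
after = summarize(altered)
for name in before:
    print(f"{name:<7}{before[name]:>8}{after[name]:>8}")
```

The mean and standard deviation change sharply between the two columns, while the median and quartiles are identical.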
Boxplots may be constructed easily by hand or by computer programs. The basic ingredients of the boxplot include:
1. The rectangular plot (encompasses 50% of the data values).
2. A center line (marks the median and goes through the width of the box).
3. The edges of the box, called hinges.
4. The “whiskers” (extend from the right and left hinges to the largest and smallest values that lie within 1.5 times
the interquartile range (IQR) of either edge of the box).
When you are examining data, it is important to separate legitimate outliers from errors in measurement, editing,
coding, and data entry. Outliers, data points that lie more than 1.5 times the IQR beyond either hinge, reflect unusual
cases and are an important source of information for the study. They are displayed or given special statistical
treatment, or other portions of the data set are sometimes shielded from their effects. Extreme outliers,
however, can be data entry errors; these values should be corrected during editing.
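The boxplot ingredients and the 1.5 × IQR rule can be sketched as below, using the altered data set from the earlier example. Two assumptions are made here: the hinges are approximated by the quartiles from the statistics module’s default method, and boxplot_summary and its parameter k are illustrative names. Values beyond the fences are flagged for inspection, not removed.

```python
from statistics import median, quantiles

def boxplot_summary(data, k=1.5):
    """Hinges, median, IQR, whisker ends and outliers for one variable."""
    values = sorted(data)
    q1, _, q3 = quantiles(values, n=4)        # quartiles stand in for the hinges
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "hinges": (q1, q3),
        "median": median(values),
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme values inside the fences
        "outliers": outliers,
    }

summary = boxplot_summary([5, 6, 6, 7, 7, 7, 8, 8, 90])
for name, value in summary.items():
    print(f"{name:<9}{value}")
```

Whether a flagged value is a legitimate unusual case or a data entry error still has to be decided by checking it against the data file.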
Boxplots are an excellent diagnostic tool, especially when graphed on the same scale. Below is a summary of several
comparisons to make when analysing boxplots.
The upper two plots are both symmetric, but one is larger than the other. Larger box widths are sometimes used
when the second variable, from the same measurement scale, comes from a larger sample. The box width should
be proportional to the square root of the sample size, but not all plotting programs account for this. Right- and
left-skewed distributions and those with reduced spread are also presented clearly in the plot comparison. Finally, groups
may be compared by means of multiple plots. One variation, in which a notch at the median marks off a confidence
interval to test the equality of group medians, takes us a step closer to hypothesis testing. Here the sides of the box
return to full width at the upper and lower confidence limits. When the intervals do not overlap, we can be
confident, at a specified confidence level, that the medians of the two populations are different.
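The notch comparison can be sketched as follows. The text does not give a formula for the notch, so this example assumes a widely used rule of thumb, median ± 1.57 × IQR / √n, which gives roughly a 95% interval for comparing two medians. The two groups are hypothetical, and notch_interval and notches_overlap are illustrative names.

```python
from math import sqrt
from statistics import median, quantiles

def notch_interval(data, factor=1.57):
    """Approximate confidence limits around the median (assumed rule of thumb)."""
    q1, _, q3 = quantiles(data, n=4)
    half_width = factor * (q3 - q1) / sqrt(len(data))
    centre = median(data)
    return centre - half_width, centre + half_width

def notches_overlap(a, b):
    """True when the two notch intervals share any values."""
    low_a, high_a = notch_interval(a)
    low_b, high_b = notch_interval(b)
    return low_a <= high_b and low_b <= high_a

# Hypothetical ratings for two groups, for illustration only.
group_a = [5, 6, 6, 7, 7, 7, 8, 8, 9]
group_b = [15, 16, 16, 17, 17, 17, 18, 18, 19]

print(notch_interval(group_a))
print(notch_interval(group_b))
print("overlap:", notches_overlap(group_a, group_b))
```

Non-overlapping notches suggest that the group medians differ; this is an informal check and not a substitute for a formal test.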
Example
Boxplot Comparison of Customer Services
B. Cross-Tabulation
Your preliminary analysis plan contains multiple dummy tables designed to find patterns in your data. In EDA,
Example
SPSS Cross-Tabulation of Gender by Overseas Assignment Opportunity
2. Construct a stem-and-leaf display of the average annual purchases of Target’s top 50 customers:
54 55 55 56 56 57 58 58 58 58
59 61 62 64 66 66 67 69 69 70
72 72 73 75 76 77 78 80 82 82
86 102 104 110 111 118 123 131 140 146
153 163 166 183 206 218 218 220 221 222