Class 11: Data Exploration
Business Intelligence
Master of Informatic Technology
Univariate Analysis
◦ Graphical analysis of categorical attributes
◦ Graphical analysis of numerical attributes
◦ Measures of central tendency for numerical attributes
◦ Measures of dispersion for numerical attributes
◦ Measures of relative location for numerical attributes
◦ Identification of outliers for numerical attributes
◦ Measures of heterogeneity for categorical attributes
◦ Analysis of the empirical density
◦ Summary statistics
Bivariate Analysis
◦ Graphical analysis
◦ Measures of correlation for numerical attributes
◦ Contingency tables for categorical attributes
Multivariate Analysis
◦ Graphical analysis
◦ Measures of correlation for numerical attributes
“The primary purpose of exploratory data
analysis is to highlight the relevant features
of each attribute contained in a dataset, using
graphical methods and calculating summary
statistics, and to identify the intensity of the
underlying relationships among the
attributes”
Exploratory data analysis includes three main phases: univariate analysis,
bivariate analysis and multivariate analysis.
Let V = {v_1, v_2, ..., v_H} denote the set of H distinct values that are taken
by the categorical attribute a, and let ℋ = {1, 2, ..., H} be the set of their indices.
Taking the dataset shown in Example 5.2 and Table 5.2, for the
attribute area we have H = 4 and V = {1, 2, 3, 4}.
The most natural representation for the graphical
analysis of a categorical attribute is a vertical bar
chart, which indicates along the vertical axis or
ordinate the empirical frequencies.
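It is also possible to calculate the relative empirical frequency, or
empirical density, of each value: the number of observations taking that
value divided by the total number m of observations.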
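As a minimal sketch, the empirical frequencies and the relative empirical frequencies of a categorical attribute can be computed as follows. The function name `empirical_frequencies`, the symbols e_h and f_h in the comments, and the sample values for the attribute area are illustrative assumptions, not data from Table 5.2.

```python
from collections import Counter


def empirical_frequencies(values):
    """Return (counts, relative frequencies) for a categorical attribute."""
    m = len(values)                       # number of observations
    counts = Counter(values)              # e_h: observations equal to value v_h
    density = {v: c / m for v, c in counts.items()}  # f_h = e_h / m
    return dict(counts), density


# Hypothetical sample of the attribute "area" with H = 4 distinct values.
area = [1, 2, 2, 3, 1, 4, 2, 1, 3, 2]
counts, density = empirical_frequencies(area)
```

The counts are the bar heights of a vertical bar chart. The relative frequencies sum to 1.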
For discrete numerical attributes assuming a finite and
limited number of values, it is possible to resort to a bar chart
representation, just as in the case of categorical attributes.
In the presence of continuous attributes, or of discrete attributes that
might assume infinitely many distinct values, this type of
representation cannot be used, as it would require an infinite
number of vertical bars.
We must therefore subdivide the horizontal axis
corresponding to the values assumed by the attribute, into a
finite and moderate number of intervals, usually of equal
width, which in practice are considered as distinct classes
(discretization).
Procedure 7.1 – Histogram for the empirical density
The number R of classes loosely depends on the number m of observations in
the sample and on the uniformity of the data. Usually the goal is to obtain
between 5 and 20 classes, making sure that in each class the frequency is
higher than 5.
The total range and the width l_r of each class are then defined. Usually the total
range, given by the difference between the highest value and the lowest value
of the attribute, is divided by the number of classes, so as to obtain intervals of
equal width.
The boundaries of each class are properly assigned so as to keep the classes
disjoint, making sure that no value falls simultaneously into contiguous classes.
For example, each interval should be closed on the left and open on the right,
except for the last one which should also be closed on the right.
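Finally, the number of observations in each interval is counted and the
corresponding rectangle is assigned a height equal to the empirical density p_r,
defined as p_r = e_r / (m l_r), where e_r is the number of observations falling
in class r, so that the areas of the rectangles sum to 1.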
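The steps of Procedure 7.1 can be sketched in Python as below. This is an illustration under stated assumptions: classes have equal width, each class is closed on the left and open on the right except the last, and the density of a class is taken to be its count divided by m times the class width (a common normalization that makes the rectangle areas sum to 1). The function name `histogram_density` and the sample data are hypothetical.

```python
def histogram_density(values, R):
    """Equal-width histogram: return (class edges, counts, densities)."""
    m = len(values)
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the total range is zero")
    width = (hi - lo) / R                 # class width l_r (same for every class)
    counts = [0] * R
    for x in values:
        r = int((x - lo) / width)         # left-closed, right-open classes
        r = min(r, R - 1)                 # last class is also closed on the right
        counts[r] += 1
    # Assumed normalization: height = count / (m * width).
    density = [c / (m * width) for c in counts]
    edges = [lo + r * width for r in range(R + 1)]
    return edges, counts, density


# Hypothetical observations of a continuous attribute.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 6.0, 8.0, 9.0]
edges, counts, density = histogram_density(data, R=4)
```

With so few observations the guideline of more than 5 observations per class is not met; the sample is kept small only to make the class assignment easy to follow by hand.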
Mean
Median
Mode
Midrange
Geometric mean
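The five measures listed above can be computed with the Python standard library, as in this minimal sketch. The function name `central_tendency` and the sample values are illustrative assumptions. The midrange is taken as the average of the minimum and the maximum, and the geometric mean requires strictly positive values.

```python
import statistics


def central_tendency(values):
    """Return common measures of central tendency for a numerical attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),       # list of all most frequent values
        "midrange": (min(values) + max(values)) / 2,
        "geometric_mean": statistics.geometric_mean(values),  # positive values only
    }


# Hypothetical observations.
sample = [2, 4, 4, 8, 16]
summary = central_tendency(sample)
```

`multimode` returns a list because a sample can have more than one most frequent value.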
The location measures of the previous subsection gave
an indication of the central part of the observed values
of a numerical attribute.
However, it is necessary to define other indicators that
describe the dispersion of the data, representing the
level of variability expressed by the observations with
respect to central values.
Range: The simplest measure of dispersion is the range,
which is defined as the difference between the
maximum and the minimum of the observations:
range = x_max − x_min.
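A short sketch of the range in Python, with hypothetical values. The second call shows how a single extreme observation changes the result, since the range depends only on the two most extreme values.

```python
def value_range(values):
    """Range of a numerical attribute: maximum minus minimum."""
    return max(values) - min(values)


# Hypothetical observations, without and with one extreme value.
base = [3, 7, 5, 9, 4]
range_base = value_range(base)
range_with_extreme = value_range(base + [40])
```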
We will denote by a_j = (x_{1j}, x_{2j}, ..., x_{mj}) and a_k = (x_{1k}, x_{2k}, ..., x_{mk}) the
vectors composed of m observations corresponding to the two attributes
considered.
Scatter plots: A scatter plot is definitely the
most intuitive graphical representation of the
relationship between two numerical
attributes.
Loess plots: Loess plots are based on scatter plots and can
therefore be applied in turn to pairs of numerical attributes.
Starting from a scatter plot, it is possible to add a trend curve to
express the functional relationship between the attribute ak and
the attribute aj . The trend curve can be obtained using local
regression techniques. This explains the term loess, which
stands for local regression.
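To make the idea of local regression concrete, the sketch below fits a weighted straight line around each point and evaluates it there. It is a simplified illustration, not a full loess implementation: the tricube weights, the neighbourhood fraction `frac`, the slight inflation of the bandwidth and the function name `loess_point` are all assumptions, and the robustness iterations used by standard loess are omitted.

```python
def loess_point(x0, xs, ys, frac=0.5):
    """Fit a weighted line to the neighbours of x0 and return its value at x0."""
    m = len(xs)
    k = max(2, int(round(frac * m)))            # number of neighbours considered
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest point, slightly inflated so that
    # at least the nearest point always keeps a positive weight.
    h = max(dists[k - 1], 1e-12) * 1.001
    # Tricube weights: near 1 close to x0, falling to 0 at distance h.
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0       # weighted least-squares slope
    return my + slope * (x0 - mx)


# Hypothetical attributes: a_k on the horizontal axis, a_j on the vertical axis.
ak = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
aj = [2 * x + 1 for x in ak]                    # exactly linear, for checking
trend = [loess_point(x, ak, aj) for x in ak]
```

On exactly linear data the trend curve reproduces the line, which is a convenient sanity check. On real data a smaller `frac` follows local structure more closely, while a larger one gives a smoother curve.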
Level curves: Level curves are a further development
of scatter plots and can only be used for numerical
attributes. They highlight the value of a third
numerical attribute az as the attributes aj and ak
placed on the axes of the plot vary.
Quantile–quantile plots (QQ plots) are used to compare the
distributions of the same attribute for two different
characteristics of the population or for samples extracted from
two different populations. The analysis is applicable to numerical
attributes and is carried out by comparing the quantiles of the
two series of observations. A QQ plot can be obtained
even when one of the series is more populated than the other.
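The points of a QQ plot can be sketched by computing the same quantiles on both series and pairing them, which works even when the series have different sizes. The function name `qq_points`, the choice of quartiles (`n=4`), the "inclusive" interpolation method and the sample values are assumptions made for illustration.

```python
import statistics


def qq_points(series_a, series_b, n=4):
    """Pair the n-1 cut points (quantiles) of two series of observations."""
    qa = statistics.quantiles(series_a, n=n, method="inclusive")
    qb = statistics.quantiles(series_b, n=n, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical series with different numbers of observations.
series_a = [1, 2, 3, 4, 5]
series_b = [10, 20, 30, 40, 50, 60, 70, 80, 90]
points = qq_points(series_a, series_b)
```

If the plotted points lie close to a straight line, the two distributions have a similar shape; here the second series is a rescaled and shifted version of the first, so the points fall exactly on a line.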
Box plots
Time series