Class 11: Data Exploration


Gustavo Cáceres Castellanos


Business Intelligence
Master of Informatic Technology
 Univariate Analysis
◦ Graphical analysis of categorical attributes
◦ Graphical analysis of numerical attributes
◦ Measures of central tendency for numerical attributes
◦ Measures of dispersion for numerical attributes
◦ Measures of relative location for numerical attributes
◦ Identification of outliers for numerical attributes
◦ Measures of heterogeneity for categorical attributes
◦ Analysis of the empirical density
◦ Summary statistics
 Bivariate Analysis
◦ Graphical analysis
◦ Measures of correlation for numerical attributes
◦ Contingency tables for categorical attributes
 Multivariate Analysis
◦ Graphical analysis
◦ Measures of correlation for numerical attributes
“The primary purpose of exploratory data
analysis is to highlight the relevant features
of each attribute contained in a dataset, using
graphical methods and calculating summary
statistics, and to identify the intensity of the
underlying relationships among the
attributes”
Exploratory data analysis includes three main phases:

 univariate analysis, in which the properties of each
single attribute of a dataset are investigated;

 bivariate analysis, in which pairs of attributes are
considered, to measure the intensity of the
relationship existing between them (for supervised
learning models, it is of particular interest to analyze
the relationships between the explanatory attributes
and the target variable);

 multivariate analysis, in which the relationships
holding within a subset of attributes are investigated.
 Univariate analysis is used to study the behavior of each
attribute, considered as an entity independent of the other
variables of the dataset.

 It is of interest to assess the tendency of the values of a
given attribute to arrange themselves around a specific
central value, to measure the propensity of the variable to
assume a more or less wide range of values (dispersion) and
to extract information on the underlying probability
distribution.
 Univariate analysis helps to verify the validity of
statistical hypotheses regarding the distribution of the
variables being examined, before proceeding with the
subsequent investigation.

 It intuitively draws conclusions concerning the
information content that each attribute may
provide.

 It plays a key role in pointing out anomalies and
non-standard values, that is, in identifying the
outliers.
 Suppose that a given dataset D contains m
observations, and denote by a_j the generic attribute
being analyzed.
 The set of m observations
will be denoted by:

 For the purpose of a graphical representation, it is
necessary to make a distinction between categorical
and numerical attributes.
 A categorical attribute may be graphically analyzed by resorting to
various representations for the empirical distribution of the
observations – that is, the relative frequencies with which the
different values occur. Denote by

V = {v_1, v_2, . . . , v_H}

the set of H distinct values that are taken by the categorical attribute
a, and let ℋ = {1, 2, . . . , H} be the corresponding set of indices.

 Taking the dataset shown in Example 5.2 and Table 5.2, for the
attribute area we have H = 4 and V = {1, 2, 3, 4}.
 The most natural representation for the graphical
analysis of a categorical attribute is a vertical bar
chart, which indicates along the vertical axis (the
ordinate) the empirical frequencies, that is, the number
of observations taking each value.
 It is also possible to calculate the relative empirical frequency, or
empirical density, by dividing each empirical frequency by the
number m of observations.
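The counts behind such a bar chart can be sketched in a few lines of Python. This is only an illustration: the sample values below are hypothetical (they are not the data of Example 5.2), and the function name `empirical_frequencies` is an assumption introduced here.

```python
from collections import Counter

def empirical_frequencies(values):
    """Return {value: (count, relative frequency)} for a categorical attribute."""
    m = len(values)                      # number of observations
    counts = Counter(values)             # empirical frequency of each value
    return {v: (e, e / m) for v, e in sorted(counts.items())}

# Hypothetical observations of a categorical attribute with H = 4 distinct values.
area = [1, 3, 2, 1, 4, 1, 2, 3, 1, 2]
freq = empirical_frequencies(area)

for value, (e_h, f_h) in freq.items():
    # A row of '#' characters stands in for the bar of the chart.
    print(f"{value}: {'#' * e_h}  count={e_h}  relative={f_h:.2f}")
```

By construction the relative frequencies add up to 1, which is a quick sanity check on any bar chart built from them.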
 For discrete numerical attributes assuming a finite and
limited number of values, it is possible to resort to a bar chart
representation, just as in the case of categorical attributes.
 In the presence of continuous attributes, or of discrete attributes that
might assume infinitely many distinct values, this type of
representation cannot be used, as it would require an infinite
number of vertical bars.
 We must therefore subdivide the horizontal axis
corresponding to the values assumed by the attribute, into a
finite and moderate number of intervals, usually of equal
width, which in practice are considered as distinct classes
(discretization).
Procedure 7.1 – Histogram for the empirical density
 The number R of classes loosely depends on the number m of observations in
the sample and on the uniformity of the data. Usually the goal is to obtain
between 5 and 20 classes, making sure that in each class the frequency is
higher than 5.
 The total range and the width l_r of each class are then defined. Usually the total
range, given by the difference between the highest value and the lowest value
of the attribute, is divided by the number of classes, so as to obtain intervals of
equal width.
 The boundaries of each class are properly assigned so as to keep the classes
disjoint, making sure that no value falls simultaneously into contiguous classes.
For example, each interval should be closed on the left and open on the right,
except for the last one which should also be closed on the right.
 Finally, the number of observations in each interval is counted and the
corresponding rectangle is assigned a height equal to the empirical density p_r,
defined as the number of observations falling in the r-th class divided by m · l_r.
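The steps of the procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes equal-width classes, assumes the density of a class is its count divided by m times the class width (so that the rectangle areas add up to 1), and uses a hypothetical toy sample that is too small to satisfy the guideline of more than 5 observations per class.

```python
def histogram_density(values, num_classes):
    """Equal-width histogram; returns a list of (left, right, count, density)."""
    m = len(values)
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes      # class width, the same for every class
    counts = [0] * num_classes
    for x in values:
        # Classes are closed on the left and open on the right ...
        r = int((x - lo) / width)
        # ... except the last one, which also contains the maximum value.
        r = min(r, num_classes - 1)
        counts[r] += 1
    return [
        (lo + r * width, lo + (r + 1) * width, c, c / (m * width))
        for r, c in enumerate(counts)
    ]

# Hypothetical numerical attribute, split into R = 4 classes.
sample = [2, 3, 5, 7, 8, 8, 9, 12, 15, 22]
hist = histogram_density(sample, 4)
for left, right, count, density in hist:
    print(f"[{left:5.1f}, {right:5.1f})  count={count}  density={density:.3f}")
```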
 Mean
 Median
 Mode
 Midrange
 Geometric mean
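As a quick illustration of the five measures listed above, the sketch below uses Python's standard `statistics` module on a small hypothetical sample. The helper `midrange` is defined here because the module does not provide one, and the geometric mean requires strictly positive values.

```python
import statistics

def midrange(values):
    """Midpoint between the smallest and the largest observation."""
    return (min(values) + max(values)) / 2

# Hypothetical sample; the value 10 pulls the mean above the median.
x = [2, 4, 4, 5, 10]

print("mean          ", statistics.mean(x))
print("median        ", statistics.median(x))
print("mode          ", statistics.mode(x))
print("midrange      ", midrange(x))
print("geometric mean", statistics.geometric_mean(x))
```

The midrange depends only on the two extreme values, so it is the measure most affected by the large observation in this sample.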
 The location measures of the previous subsection gave
an indication of the central part of the observed values
of a numerical attribute.
 However, it is necessary to define other indicators that
describe the dispersion of the data, representing the
level of variability expressed by the observations with
respect to central values.
 Range: The simplest measure of dispersion is the range,
which is defined as the difference between the
maximum and the minimum of the observations.

 Mean absolute deviation: The deviation, or spread, of a value is defined
as the signed difference between that value and the sample arithmetic mean.

 We can express a measure of dispersion of the observations around
their sample mean through the average of the absolute values of the
spreads, called mean absolute deviation (MAD).
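A minimal sketch of both measures, assuming the MAD is the average of the absolute spreads around the sample mean; the function names and the sample are hypothetical.

```python
def value_range(values):
    """Difference between the maximum and the minimum observation."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """Average absolute spread of the observations around their mean."""
    m = len(values)
    mean = sum(values) / m
    spreads = [x - mean for x in values]   # signed spreads; they sum to zero
    return sum(abs(s) for s in spreads) / m

x = [2, 4, 4, 5, 10]
print(value_range(x))              # 8
print(mean_absolute_deviation(x))  # 2.0
```

The signed spreads always cancel out, which is why the absolute value (or, for the variance, the square) is taken before averaging.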
 Variance
 Normal distribution
 Arbitrary distribution
 Coefficient of variation
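The sketch below illustrates these indicators under several assumptions that the slides do not state: the sample variance uses the m − 1 denominator, the coefficient of variation is the plain ratio of standard deviation to mean (some texts multiply it by 100), and the "arbitrary distribution" case is read as Chebyshev's inequality, which guarantees that at least 1 − 1/k² of the values lie within k standard deviations of the mean.

```python
import math

def sample_variance(values):
    """Sample variance with the m - 1 denominator (an assumption)."""
    m = len(values)
    mean = sum(values) / m
    return sum((x - mean) ** 2 for x in values) / (m - 1)

def coefficient_of_variation(values):
    """Standard deviation relative to the mean (mean must be non-zero)."""
    mean = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / mean

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values within k standard deviations of the mean."""
    m = len(values)
    mean = sum(values) / m
    std = math.sqrt(sample_variance(values))
    return sum(1 for x in values if abs(x - mean) <= k * std) / m

x = [2, 4, 4, 5, 10]
print(sample_variance(x), coefficient_of_variation(x))
print(fraction_within(x, 2), ">=", chebyshev_lower_bound(2))
```

For data that are approximately normal, the observed fractions are usually much higher than the Chebyshev bound (roughly 68%, 95% and 99.7% within one, two and three standard deviations).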
 Measures of relative location for a numerical attribute are
used to examine the localization of a value with respect to
other values in the sample.
 Quantiles
 Measures of central tendency based on quantiles
◦ Mid-mean.
◦ Trimmed mean.
◦ Winsorized mean.
◦ z-index
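The quantile-based measures above can be sketched with the standard `statistics` module. Several conventions are assumed here rather than taken from the slides: quartiles use the module's default "exclusive" interpolation, the mid-mean averages the observations between the first and third quartiles, trimming and winsorizing act on the k smallest and k largest values, and the z-index is the distance from the mean in units of sample standard deviation.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest observations."""
    s = sorted(values)
    return statistics.mean(s[k:len(s) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k extreme values on each side
    by the nearest value that is kept."""
    s = sorted(values)
    core = s[k:len(s) - k]
    return statistics.mean([core[0]] * k + core + [core[-1]] * k)

def z_index(values):
    """Standardized distance of every observation from the sample mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical sample with one very large observation.
x = [1, 3, 4, 5, 6, 7, 8, 9, 10, 47]
q1, q2, q3 = statistics.quantiles(x, n=4)
mid_mean = statistics.mean([v for v in x if q1 <= v <= q3])
z = z_index(x)

print("quartiles      ", q1, q2, q3)
print("mean           ", statistics.mean(x))
print("mid-mean       ", mid_mean)
print("trimmed mean   ", trimmed_mean(x, 1))
print("winsorized mean", winsorized_mean(x, 1))
print("largest z-index", max(z))
```

In this sample the arithmetic mean is 10, while the three robust means all equal 6.5, which shows how strongly a single extreme value can shift the ordinary mean.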
 Box plot
 Gini Index
 Entropy index
 Asymmetry of the density curve
 Kurtosis of the density curve
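Of the indicators listed above, the two heterogeneity measures for categorical attributes are easy to sketch. The definitions used here are the common ones and are an assumption with respect to the slides: the Gini index is 1 minus the sum of the squared relative frequencies, and the entropy index is the negative sum of f · log2(f) over the relative frequencies f.

```python
import math
from collections import Counter

def relative_frequencies(values):
    """Relative frequency of each distinct value of a categorical attribute."""
    m = len(values)
    return [c / m for c in Counter(values).values()]

def gini_index(values):
    """0 when all observations share one value; grows with heterogeneity."""
    return 1 - sum(f ** 2 for f in relative_frequencies(values))

def entropy_index(values):
    """Entropy in bits; 0 when all observations share one value."""
    return -sum(f * math.log2(f) for f in relative_frequencies(values))

homogeneous = ["a", "a", "a", "a"]      # hypothetical samples
balanced = ["a", "b", "a", "b"]

print(gini_index(homogeneous), entropy_index(homogeneous))
print(gini_index(balanced), entropy_index(balanced))
```

Both indices reach their maximum when the H values are equally frequent, namely (H − 1)/H for the Gini index and log2(H) for the entropy, so dividing by these maxima gives normalized versions between 0 and 1.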
 It is appropriate to explore the relationships existing between pairs of
attributes through bivariate analysis. We will denote by a_j and a_k a
generic pair of attributes to be analyzed.

 It is useful to distinguish three cases that may occur within bivariate
analysis:
◦ both attributes are numerical;
◦ one attribute is numerical and the other is categorical;
◦ both attributes are categorical.

 In supervised learning problems it is usually important to investigate the
relationship between the target attribute and each explanatory attribute.
Hence, it frequently happens that one of the attributes of the pair {a_j, a_k}
represents the target of a supervised learning problem, which is
categorical for classification and numerical for regression.

 We will denote by a_j = (x_1j, x_2j, . . . , x_mj) and a_k = (x_1k, x_2k, . . . , x_mk) the
vectors composed of m observations corresponding to the two attributes
considered.
 Scatter plots: A scatter plot is definitely the
most intuitive graphical representation of the
relationship between two numerical
attributes.
 Loess plots: Loess plots are based on scatter plots and can
therefore be applied in turn to pairs of numerical attributes.
Starting from a scatter plot, it is possible to add a trend curve to
express the functional relationship between the attribute a_k and
the attribute a_j. The trend curve can be obtained using local
regression techniques. This explains the term loess, which
stands for local regression.
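To make the idea of local regression concrete, here is a simplified sketch: at every point a straight line is fitted by weighted least squares, using tricube weights that fade out with distance. It is an illustration under stated assumptions (a nearest-neighbour bandwidth controlled by the hypothetical parameter `frac`, no robustness iterations), not the exact algorithm of any particular statistical package.

```python
def loess(xs, ys, frac=0.5):
    """Local linear regression; returns the fitted trend value at each x."""
    m = len(xs)
    k = max(2, int(round(frac * m)))     # neighbours used in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest observation.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights: 1 at x0, decreasing to 0 at distance h.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        # Weighted least squares for a line through the weighted centroid.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical data lying exactly on a line: the trend curve reproduces it.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
trend = loess(xs, ys)
print(trend)
```

A smaller `frac` follows the data more closely, while a larger one gives a smoother curve.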
 Level curves: Level curves are a further development
of scatter plots and can only be used for numerical
attributes. They highlight the value of a third
numerical attribute a_z as the attributes a_j and a_k
placed on the axes of the plot vary.
 Quantile–quantile plots (QQ plots) are used to compare the
distributions of the same attribute for two different
characteristics of the population or for samples extracted from
two different populations. The analysis is applicable to numerical
attributes and is carried out by comparing the quantiles of the
two series of observations. A QQ plot can be obtained
even when one of the series is more populated than the other.
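The quantile comparison behind a QQ plot can be sketched as follows. Computing the same set of quantiles for both series is what makes series of different length comparable. The function name `qq_points`, the choice of deciles and the "inclusive" interpolation method are assumptions made for this illustration.

```python
import statistics

def qq_points(series_a, series_b, n=10):
    """Pairs of matching quantiles of two series (lengths may differ)."""
    qa = statistics.quantiles(series_a, n=n, method="inclusive")
    qb = statistics.quantiles(series_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical series: b is a linear transformation of a,
# so the QQ points fall on the straight line y = 2x + 1.
a = list(range(1, 21))
b = [2 * v + 1 for v in a]
points = qq_points(a, b)
for qa, qb in points:
    print(f"{qa:6.2f}  {qb:6.2f}")
```

Points lying close to a straight line suggest that the two distributions have the same shape, possibly up to a shift and a change of scale.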
 Box plots: box plots for the attribute Pmob in Example 5.2 at distinct values of the target churner.
 Time series: daily closing prices of 4 financial indices (1991–1998).
 Covariance
 Correlation

Examples of scatter plots and their linear
correlation coefficients
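Both measures can be sketched in a few lines. The m − 1 denominator used for the sample covariance is an assumption (texts differ on this constant); it cancels out in the linear correlation coefficient, which always lies between −1 and 1.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two numerical attributes (m - 1 denominator)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (m - 1)

def correlation(xs, ys):
    """Linear (Pearson) correlation coefficient."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical observations of two attributes.
a_j = [1, 2, 3, 4, 5]
a_k = [2, 4, 6, 8, 10]

print(covariance(a_j, a_k))         # 5.0
print(correlation(a_j, a_k))        # perfectly increasing linear relationship
print(correlation(a_j, a_k[::-1]))  # perfectly decreasing linear relationship
```

A coefficient close to 0 only rules out a linear relationship: two attributes can be strongly related in a non-linear way and still show a correlation near zero.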
 When dealing with a pair of categorical attributes a_j and a_k, let
V = {v_1, v_2, . . . , v_J}, U = {u_1, u_2, . . . , u_K}
denote the sets of distinct values respectively assumed by each of
them.
 A contingency table is defined as a matrix T whose generic
element t_rs indicates the frequency with which the pair of values
{x_ij = v_r} and {x_ik = u_s} appears in the records of the dataset D.
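A contingency table is simply a count of value pairs, as the following sketch shows. The two columns are hypothetical data and the function name is an assumption; rows follow the sorted values of the first attribute and columns those of the second.

```python
from collections import Counter

def contingency_table(col_j, col_k):
    """Return (row values, column values, matrix of joint frequencies)."""
    rows = sorted(set(col_j))            # distinct values of the first attribute
    cols = sorted(set(col_k))            # distinct values of the second attribute
    counts = Counter(zip(col_j, col_k))  # frequency of each pair of values
    table = [[counts[(v, u)] for u in cols] for v in rows]
    return rows, cols, table

# Hypothetical records: a categorical attribute and a binary target.
area = [1, 1, 2, 2, 2, 3]
churner = [0, 1, 0, 0, 1, 1]

rows, cols, table = contingency_table(area, churner)
print("     ", cols)
for v, row in zip(rows, table):
    print(f"{v:>4} ", row)
```

Row totals and column totals give back the empirical frequencies of each attribute taken alone, and the grand total equals the number m of records.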
 The purpose of multivariate analysis is to
extend the concepts introduced for the
bivariate case in order to assess the
relationships existing among multiple
attributes in a dataset.
 Scatter plot matrix: Since scatter plots show in
an intuitive way the relationships between
pairs of numerical attributes, in the case of
multivariate analysis it is natural to consider
matrices of plots evaluated for every pair of
numerical variables. In this way, it is possible
to visualize the nature and intensity of the
pairwise relationships in a single chart.
 Star plots belong to the broader class of
icon-based charts. They show in an intuitive
way the differences among values of the
attributes for the records of a dataset. To be
effective, they should be applied to a limited
number of observations, say no more than a
few dozen, and the comparison should be
based on a small number of attributes.
 Spider web charts are grids where the main
rays correspond to the attributes analyzed.
For every record the position is calculated on
each ray, based on the value of the
corresponding attribute. Finally, the points so
obtained on the rays for each record are
sequentially connected to each other, thus
creating a circuit for every record in the
dataset.
 For multivariate analysis of numerical
attributes, covariance and correlation
matrices are calculated among all pairs of
attributes.
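The two matrices can be sketched by applying the pairwise measures to every pair of attributes. The data are hypothetical, each inner list holds the observations of one attribute, and the m − 1 denominator in the covariance is an assumption.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two attributes (m - 1 denominator)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (m - 1)

def covariance_matrix(columns):
    """Square matrix whose entry (j, k) is the covariance of attributes j and k."""
    return [[covariance(a, b) for b in columns] for a in columns]

def correlation_matrix(columns):
    """Covariance matrix rescaled so that every diagonal entry equals 1."""
    cov = covariance_matrix(columns)
    n = len(columns)
    return [
        [cov[j][k] / math.sqrt(cov[j][j] * cov[k][k]) for k in range(n)]
        for j in range(n)
    ]

# Three hypothetical numerical attributes observed on four records.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
V = covariance_matrix(columns)
R = correlation_matrix(columns)
for row in R:
    print([round(r, 2) for r in row])
```

Both matrices are symmetric, so only the entries above (or below) the diagonal need to be inspected.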