Multivariate Statistics


Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and
analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics
concerns understanding the different aims and background of each of the different forms of multivariate
analysis, and how they relate to each other. The practical application of multivariate statistics to a particular
problem may involve several types of univariate and multivariate analyses in order to understand the
relationships between variables and their relevance to the problem being studied.

In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both

how these can be used to represent the distributions of observed data;
how they can be used as part of statistical inference, particularly where several
different quantities are of interest to the same analysis.

Certain types of problems involving multivariate data, for example simple linear regression and multiple
regression, are not usually considered to be special cases of multivariate statistics because the analysis is
dealt with by considering the (univariate) conditional distribution of a single outcome variable given the
other variables.

Multivariate analysis
Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to
address the situations where multiple measurements are made on each experimental unit and the relations
among these measurements and their structures are important.[1] A modern, overlapping categorization of
MVA includes:[1]

Normal and general multivariate models and distribution theory
The study and measurement of relationships
Probability computations of multidimensional regions
The exploration of data structures and patterns

Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the
effects of variables for a hierarchical "system-of-systems". Often, studies that wish to use multivariate
analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of
surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take
the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA
studies: while a Monte Carlo simulation across the design space is difficult with physics-based codes, it
becomes trivial when evaluating surrogate models, which often take the form of response-surface
equations.
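
The surrogate-model idea can be sketched as follows. This is a minimal illustration, not a recipe from the cited sources: the function expensive_model is a hypothetical stand-in for a slow physics-based code, and the second-order polynomial is only one possible form of response-surface equation. The surrogate is fitted to a small number of model runs and then used for a Monte Carlo sweep of the design space.

```python
import numpy as np

def expensive_model(x1, x2):
    """Hypothetical stand-in for a slow physics-based code (assumption)."""
    return np.sin(x1) + 0.5 * x2**2 + 0.3 * x1 * x2

def quadratic_features(x1, x2):
    """Design matrix of a full second-order response surface in two inputs."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(0)

# Step 1: run the "expensive" model at a small number of design points.
x1_train = rng.uniform(-1.0, 1.0, size=50)
x2_train = rng.uniform(-1.0, 1.0, size=50)
y_train = expensive_model(x1_train, x2_train)

# Step 2: fit the response-surface coefficients by ordinary least squares.
coef, *_ = np.linalg.lstsq(
    quadratic_features(x1_train, x2_train), y_train, rcond=None
)

def surrogate(x1, x2):
    """Evaluate the fitted response-surface equation."""
    return quadratic_features(x1, x2) @ coef

# Step 3: Monte Carlo over the design space using only the cheap surrogate.
x1_mc = rng.uniform(-1.0, 1.0, size=100_000)
x2_mc = rng.uniform(-1.0, 1.0, size=100_000)
y_mc = surrogate(x1_mc, x2_mc)
print("surrogate mean:", y_mc.mean(), "std:", y_mc.std())
```

In practice the accuracy of the surrogate should be checked against held-out runs of the original code before the Monte Carlo results are trusted.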

Types of analysis

There are many different models, each with its own type of analysis:

1. Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover cases
where there is more than one dependent variable to be analyzed simultaneously; see also
Multivariate analysis of covariance (MANCOVA).
2. Multivariate regression attempts to determine a formula that can describe how elements in a
vector of variables respond simultaneously to changes in others. For linear relations,
regression analyses here are based on forms of the general linear model. Some suggest
that multivariate regression is distinct from multivariable regression; however, that distinction
is debated and is not applied consistently across scientific fields.[2]
3. Principal components analysis (PCA) creates a new set of orthogonal variables that contain
the same information as the original set. It rotates the axes of variation to give a new set of
orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
4. Factor analysis is similar to PCA but allows the user to extract a specified number of
synthetic variables, fewer than the original set, leaving the remaining unexplained variation
as error. The extracted variables are known as latent variables or factors; each one may be
supposed to account for covariation in a group of observed variables.
5. Canonical correlation analysis finds linear relationships between two sets of variables; it is the
generalised (i.e. canonical) version of bivariate[3] correlation.
6. Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user
to derive a specified number of synthetic variables from one set of (independent) variables
that explain as much variance as possible in another (dependent) set. It is a multivariate
analogue of regression.
7. Correspondence analysis (CA), or reciprocal averaging, finds (like PCA) a set of synthetic
variables that summarise the original set. The underlying model assumes chi-squared
dissimilarities among records (cases).
8. Canonical (or "constrained") correspondence analysis (CCA) summarises the joint
variation in two sets of variables (like redundancy analysis); it combines correspondence
analysis and multivariate regression analysis. The underlying model assumes chi-squared
dissimilarities among records (cases).
9. Multidimensional scaling comprises various algorithms to determine a set of synthetic
variables that best represent the pairwise distances between records. The original method is
principal coordinates analysis (PCoA; based on PCA).
10. Discriminant analysis, or canonical variate analysis, attempts to establish whether a set of
variables can be used to distinguish between two or more groups of cases.
11. Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally
distributed data to allow for classification of new observations.
12. Clustering systems assign objects into groups (called clusters) so that objects (cases) from
the same cluster are more similar to each other than objects from different clusters.
13. Recursive partitioning creates a decision tree that attempts to correctly classify members of
the population based on a dichotomous dependent variable.
14. Artificial neural networks extend regression and clustering methods to non-linear
multivariate models.
15. Statistical graphics such as tours, parallel coordinate plots, and scatterplot matrices can be used
to explore multivariate data.
16. Simultaneous equations models involve more than one regression equation, with different
dependent variables, estimated together.
17. Vector autoregression involves simultaneous regressions of various time series variables on
their own and each other's lagged values.
18. Principal response curves analysis (PRC) is a method based on RDA that allows the user to
focus on treatment effects over time by correcting for changes in control treatments over
time.[4]
19. Iconography of correlations consists of replacing a correlation matrix with a diagram in which the
“remarkable” correlations are represented by a solid line (positive correlation) or a dotted
line (negative correlation).
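
As an illustration of one entry in the list above, the rotation performed by principal components analysis can be sketched with a singular value decomposition of the centred data matrix. This is a minimal sketch assuming rows are records and columns are variables; the function name pca and the synthetic data are illustrative choices, not part of any particular package.

```python
import numpy as np

def pca(X):
    """Return (scores, components, explained_variance_ratio).

    Rows of X are records, columns are variables. The rows of `components`
    are the new orthogonal axes, ordered by decreasing explained variance.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = s**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = Xc @ Vt.T                           # data expressed on the new axes
    return scores, Vt, ratio

# Synthetic data: three observed variables driven by one latent variable plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

scores, components, ratio = pca(X)
print("proportion of variation per component:", ratio)
```

Because all components are retained, the scores carry the same information as the original variables; keeping only the leading components gives a lower-dimensional summary.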

Dealing with incomplete data

It is very common that in an experimentally acquired set of data the values of some components of a given
data point are missing. Rather than discarding the whole data point, it is common to "fill in" values for the
missing components, a process called "imputation".[5]
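
A minimal sketch of the simplest form of imputation, replacing each missing component with the mean of the observed values of that variable, is shown below. It assumes missing values are encoded as NaN, and the function name impute_column_means is illustrative. Mean imputation understates the variability of the data, so it serves here only to show the mechanics.

```python
import numpy as np

def impute_column_means(X):
    """Fill NaN entries with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)        # per-variable mean, ignoring NaN
    missing = np.isnan(X)
    filled = X.copy()
    # Column index of every missing cell, in the same row-major order
    # that boolean indexing uses on the left-hand side.
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])
    return filled

data = np.array([[1.0, 2.0, np.nan],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 9.0]])
completed = impute_column_means(data)
print(completed)
```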

Important probability distributions


There is a set of probability distributions used in multivariate analyses that play a similar role to the
corresponding set of distributions that are used in univariate analysis when the normal distribution is
appropriate to a dataset. These multivariate distributions are:

Multivariate normal distribution
Wishart distribution
Multivariate Student-t distribution.

The Inverse-Wishart distribution is important in Bayesian inference, for example in Bayesian multivariate
linear regression. Additionally, Hotelling's T-squared distribution is a multivariate distribution, generalising
Student's t-distribution, that is used in multivariate hypothesis testing.
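
As a hedged illustration of how Hotelling's T-squared is used, the sketch below computes the one-sample statistic for testing whether a mean vector equals a hypothesised value mu0. It assumes n records on p variables with n > p and a multivariate normal population; under those assumptions the scaled statistic (n - p) / (p (n - 1)) · T² follows an F distribution with p and n - p degrees of freedom. The function name is illustrative.

```python
import numpy as np

def hotelling_t2_one_sample(X, mu0):
    """One-sample Hotelling's T-squared statistic for H0: mean vector == mu0.

    Returns (t2, f_stat, (df1, df2)), where f_stat is the T-squared statistic
    rescaled so that it can be compared with an F(df1, df2) distribution.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    diff = X.mean(axis=0) - np.asarray(mu0, dtype=float)
    # Unbiased sample covariance (divides by n - 1); atleast_2d handles p == 1.
    S = np.atleast_2d(np.cov(X, rowvar=False))
    t2 = n * diff @ np.linalg.solve(S, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat, (p, n - p)

rng = np.random.default_rng(2)
sample = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(30, 3))
t2, f_stat, dof = hotelling_t2_one_sample(sample, mu0=[0.0, 0.0, 0.0])
print(f"T^2 = {t2:.3f}, F = {f_stat:.3f}, degrees of freedom = {dof}")
```

With a single variable (p = 1) the statistic reduces to the square of the usual one-sample Student's t statistic, which is the sense in which it generalises the univariate case.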

History
Anderson's 1958 textbook, An Introduction to Multivariate Statistical Analysis,[6] educated a generation of
theorists and applied statisticians; Anderson's book emphasizes hypothesis testing via likelihood ratio tests
and the properties of power functions: admissibility, unbiasedness and monotonicity.[7][8]

MVA was once confined largely to statistical theory because of the size and complexity of the underlying
data sets and the high computational cost. With the dramatic growth of computational power, MVA now plays an
increasingly important role in data analysis and has wide application in omics fields.

Applications
Multivariate hypothesis testing
Dimensionality reduction
Latent structure discovery
Clustering
Multivariate regression analysis
Classification and discrimination analysis
Variable selection
Multidimensional analysis
Multidimensional scaling
Data mining

Software and tools


There are an enormous number of software packages and other tools for multivariate analysis, including:

JMP (statistical software)
MiniTab
Calc
PSPP
R[9]
SAS (software)
SciPy for Python
SPSS
Stata
STATISTICA
The Unscrambler
WarpPLS
SmartPLS
MATLAB
Eviews
NCSS (statistical software) includes multivariate analysis.
The Unscrambler® X (http://www.camo.com/rt/Products/Unscrambler/unscrambler.html) is a
multivariate analysis tool.
SIMCA (https://umetrics.com/products/simca)
DataPandit (Free SaaS applications by Let's Excel Analytics Solutions (http://letsexcel.in))

See also
Estimation of covariance matrices
Important publications in multivariate analysis
Multivariate testing in marketing
Structured data analysis (statistics)
Structural equation modeling
RV coefficient
Bivariate analysis
Design of experiments (DoE)
Dimensional analysis
Exploratory data analysis
OLS
Partial least squares regression
Pattern recognition
Principal component analysis (PCA)
Regression analysis
Soft independent modelling of class analogies (SIMCA)
Statistical interference
Univariate analysis

References
1. Olkin, I.; Sampson, A. R. (2001-01-01), "Multivariate Analysis: Overview"
(http://www.sciencedirect.com/science/article/pii/B0080430767004721), in Smelser, Neil J.; Baltes, Paul B.
(eds.), International Encyclopedia of the Social & Behavioral Sciences, Pergamon,
pp. 10240–10247, ISBN 9780080430768, retrieved 2019-09-02
2. Hidalgo, B; Goodman, M (2013). "Multivariate or multivariable regression?"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518362). Am J Public Health. 103 (1): 39–40.
doi:10.2105/AJPH.2012.300897 (https://doi.org/10.2105%2FAJPH.2012.300897).
PMC 3518362 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3518362). PMID 23153131
(https://pubmed.ncbi.nlm.nih.gov/23153131).
3. Unsophisticated analysts of bivariate Gaussian problems may find useful a crude but
accurate method (http://www.dioi.org/sta.htm#sdsx) of accurately gauging probability by
simply taking the sum S of the N residuals' squares, subtracting the sum Sm at minimum,
dividing this difference by Sm, multiplying the result by (N - 2) and taking the inverse anti-ln
of half that product.
4. ter Braak, Cajo J.F. & Šmilauer, Petr (2012). Canoco reference manual and user's guide:
software for ordination (version 5.0), p292. Microcomputer Power, Ithaca, NY.
5. J.L. Schafer (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC.
ISBN 978-1-4398-2186-2.
6. T.W. Anderson (1958) An Introduction to Multivariate Statistical Analysis, New York: Wiley
ISBN 0471026409; 2e (1984) ISBN 0471889873; 3e (2003) ISBN 0471360910
7. Sen, Pranab Kumar; Anderson, T. W.; Arnold, S. F.; Eaton, M. L.; Giri, N. C.; Gnanadesikan,
R.; Kendall, M. G.; Kshirsagar, A. M.; et al. (June 1986). "Review: Contemporary Textbooks
on Multivariate Statistical Analysis: A Panoramic Appraisal and Critique". Journal of the
American Statistical Association. 81 (394): 560–564. doi:10.2307/2289251
(https://doi.org/10.2307%2F2289251). ISSN 0162-1459 (https://www.worldcat.org/issn/0162-1459).
JSTOR 2289251 (https://www.jstor.org/stable/2289251). (Pages 560–561)
8. Schervish, Mark J. (November 1987). "A Review of Multivariate Analysis"
(https://doi.org/10.1214%2Fss%2F1177013111). Statistical Science. 2 (4): 396–413.
doi:10.1214/ss/1177013111 (https://doi.org/10.1214%2Fss%2F1177013111).
ISSN 0883-4237 (https://www.worldcat.org/issn/0883-4237).
JSTOR 2245530 (https://www.jstor.org/stable/2245530).
9. CRAN (https://cran.r-project.org/web/views/Multivariate.html) has details on the packages
available for multivariate data analysis

Further reading
Johnson, Richard A.; Wichern, Dean W. (2007). Applied Multivariate Statistical Analysis
(Sixth ed.). Prentice Hall. ISBN 978-0-13-187715-3.
KV Mardia; JT Kent; JM Bibby (1979). Multivariate Analysis. Academic Press.
ISBN 0-12-471252-5.
A. Sen, M. Srivastava, Regression Analysis: Theory, Methods, and Applications,
Springer-Verlag, Berlin, 2011 (4th printing).
Cook, Swayne (2007). Interactive Graphics for Data Analysis
(http://ggobi.org/book/index.html).
Malakooti, B. (2013). Operations and Production Systems with Multiple Objectives. John
Wiley & Sons.
T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1958.
KV Mardia; JT Kent & JM Bibby (1979). Multivariate Analysis. Academic Press.
ISBN 978-0124712522. (M.A. level "likelihood" approach)
Feinstein, A. R. (1996) Multivariable Analysis. New Haven, CT: Yale University Press.
Hair, J. F. Jr. (1995) Multivariate Data Analysis with Readings, 4th ed. Prentice-Hall.
Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. CRC Press. (Advanced)
Sharma, S. (1996) Applied Multivariate Techniques. Wiley. (Informal, applied)
Izenman, Alan J. (2008). Modern Multivariate Statistical Techniques: Regression,
Classification, and Manifold Learning. Springer Texts in Statistics. New York:
Springer-Verlag. ISBN 9780387781884.
"Handbook of Applied Multivariate Statistics and Mathematical Modeling | ScienceDirect".
Retrieved 2019-09-03.

External links
Statnotes: Topics in Multivariate Analysis, by G. David Garson
(https://archive.today/2012.12.14-194305/http://www2.chass.ncsu.edu/garson/pa765/statnote.htm)
Mike Palmer: The Ordination Web Page (http://ordination.okstate.edu/)
InsightsNow: Makers of ReportsNow, ProfilesNow, and KnowledgeNow
(http://www.insightsnow.com)

Retrieved from "https://en.wikipedia.org/w/index.php?title=Multivariate_statistics&oldid=1161680534"
