Statistica Applicata Vol. 20, n. 3-4, 2008

USING CATEGORICAL VARIABLES IN PLS PATH MODELING TO BUILD SYSTEM OF COMPOSITE INDICATORS¹

Laura Trinchera
Dipartimento di Studi sullo Sviluppo Economico,
Università degli Studi di Macerata,
Piazza Oberdan, 3 – Macerata (Italy)
[email protected]
Giorgio Russolillo, Carlo N. Lauro
Dipartimento di Matematica e Statistica,
Università degli Studi di Napoli “Federico II”, Via Cintia
– Complesso Monte S. Angelo – Napoli (Italy)
[email protected]

Abstract
Nowadays there is a pre-eminent need to measure very complex phenomena like
poverty, progress, well-being, etc. As is well known, the main feature of a composite
indicator is that it summarizes complex and multidimensional issues. Thanks to its features,
Structural Equation Modeling seems to be a useful tool for building systems of composite
indicators. Among the several methods that have been developed to estimate Structural
Equation Models we focus on the PLS Path Modeling approach (PLS-PM), because of the
key role that estimation of the latent variables (i.e. the composite indicators) plays in the
estimation process. In this paper we provide a suite of statistical methodologies for
handling categorical indicators with respect to the role they have in a system of composite
indicators. A categorical variable can play an active or a moderating role. An active
categorical variable directly participates in the construction of the model. In other words,
it is a categorical indicator impacting on a composite indicator jointly with other manifest
variables. A moderating categorical variable, instead, is a variable that does not play a
direct role in the construction of the system of composite indicators but affects the relations,

1 This paper was financially supported by the MURST grant “Multivariate statistical models for
the ex-ante and the ex-post analysis of regulatory impact”, coordinated by C.N. Lauro (2006)
310 Trinchera L., Russolillo G., Lauro C.N.

in terms of strength and/or direction, among them. In this work we investigate both the use
of categorical variables as indicators and as moderating variables. In particular, we
propose a new approach in order to take into account active categorical indicators.
This new approach provides a quantification of the categorical indicators in such a
way that the weight of each quantified indicator is coherent with the explicative ability of
the corresponding categorical indicator.
Keywords: Structural Equation Models, Systems of Composite Indicators, Categorical
Variables

1. INTRODUCTION
Nowadays there is a pre-eminent need to measure very complex phenomena
like poverty, progress, well-being, etc. Since they are complex and latent con-
cepts, it is not possible to measure them directly and they must be summarized by
several indicators. The challenge of constructing a global measure of well-being
or progress by using composite indicators is a much discussed theme. In particular,
two aspects have been investigated in the literature: i) the identification of the
key indicators to be used; ii) the ways in which these indicators can be brought together
to make a coherent system of information. How to choose these indicators
is the task of psychologists, sociologists and economists, while it is the role of
statisticians to provide operational tools to aggregate indicators in order to build
composite indicators.
Structural Equation Models (SEMs) [Bollen, 1989; Kaplan 2000], and specif-
ically the PLS approach to SEMs (PLS Path Modeling, PLS-PM) [Wold, 1982;
Tenenhaus et al., 2005], can be used to compute systems of composite indicators
(see section 2). In this paper we provide a suite of statistical methodologies for
handling categorical indicators with respect to the role they have in a system of
composite indicators (see section 3). In particular, sub-section 3.1 deals with the
treatment of moderating categorical variables, while in sub-section 3.2 a new way
of handling categorical indicators in PLS-PM is proposed. To conclude, an application
involving data taken from a paper by Russet [1964] is presented (see section 4).

2. USING STRUCTURAL EQUATION MODELS TO BUILD SYSTEM OF COMPOSITE INDICATORS

According to Saisana et al. [2002], a composite indicator (CI) is a mathematical
combination of single indicators that represent different dimensions of a concept
whose description is the objective of the analysis. Thus, the main feature of
a composite indicator is that it summarizes complex and multidimensional issues.
In this regard, the Multidimensional Data Analysis (MDA) approach seems to be
the most natural tool to compute composite indicators. In the MDA approach the
computation of the weights is not subjective, but it is based on the statistical re-
lations among elementary indicators. If all the indicators refer to a single latent
concept, classical MDA techniques, like Factor Analysis (FA) [Gorsuch, 1983] or
Principal Component Analysis (PCA) [Hotelling, 1933], can be used. However,
often the several indicators used in the construction of a composite indicator ex-
press different aspects of a complex phenomenon, and so they can be conceptually
split into several blocks of indicators. Each block can be summarized by a single
composite indicator, which is considered causative with respect to a second-order
composite indicator. We will refer to this kind of index, which is a synthesis of
composite indicators, as a complex indicator. We can build complex indicators
using Multiple Factor Analysis (MFA) models [Escofier et al., 1994]. However,
MFA assumes a causal relation only between composite indicators and the com-
plex indicator. Structural Equation Models can be used in order to build more flex-
ible systems of composite indicators where CIs are connected by causal relations.
SEMs include a number of statistical methodologies that allow us to estimate the
causal relationships, defined according to a theoretical model, linking two or more
latent complex concepts (i.e. the composite indicators), each measured through a
number of observable indicators. The basic idea is that the complexity inside a
system can be studied by taking into account the whole set of causal relationships
among latent concepts, called Latent Variables (LVs), each measured by several
observed indicators usually defined as Manifest Variables (MVs). The conceptual
model behind the relations among latent and manifest variables is drawn as a Path
diagram in which ellipses or circles represent the latent variables and rectangles or
squares refer to the manifest variables. Moreover, arrows show causations among
the variables (either latent or manifest), and the direction of the arrow defines the
direction of the relation, i.e. variables receiving the arrow are to be considered as
endogenous variables in the specific relationship (see figure 1). Each SEM model
involves two levels of relationships: the first one takes into account the relations
between the MVs and the corresponding LV (measurement model), the second
considers the causal relations among the LVs (structural model). Thus, the LVs
can be seen not only as composite indicators, due to their relations with the corre-
sponding indicators, but also as complex indicators, due to their causal relations
with other composite indicators.

Fig. 1: Simple representation of a Structural Equation Model with reflective indicators.

One of the most important advantages in using SEMs is that they provide
two kinds of weights: one measuring the impact of each indicator on the corre-
sponding composite indicator, the other measuring relations among the composite
indicators in the system. These two levels of weights help us to understand
the importance of each indicator in building composite indicators, as well as
the main drivers in computing complex indicators. In other words, using
SEMs to build complex indicators leads to the construction of a system of
weights and relations that allows us to understand the different aspects composing
the complex indicators.
Several methods have been developed to estimate SEM parameters, among
them the PLS Path Modeling approach (PLS-PM) [Wold, 1982]. PLS-PM is a so-
called component-based estimation method, because of the key role that is played
by the estimation of the LVs in the model. The main aim of component-based
methods, in fact, is to provide an estimate ($\xi$) of the LVs in such a way that the
estimates of the LVs are the most correlated with one another (according to the
path diagram structure) and the most representative of each corresponding block
of MVs. This is of great importance in building systems of composite indicators.


As a matter of fact, according to the PLS-PM approach, each composite indicator
is obtained in order to be the most representative of its corresponding indicators
and the most correlated with the other linked composite indicators. Moreover, in
a PLS-PM framework, two different ways are available to model the aggregation
of the indicators in order to build the complex indicators: the formative and
the reflective schemes. In a reflective scheme the set of indicators is assumed to
measure a unique underlying concept. Each indicator reflects the corresponding
composite indicator and plays the role of an endogenous variable in the block-specific
measurement model. In the formative scheme, each indicator, or each sub-block
of indicators, represents different dimensions of the underlying concept. So,
the composite indicator is obtained as a linear combination of the corresponding
indicators, and each indicator is an exogenous variable in the measurement model.
In the following sub-section a brief review of the PLS-PM algorithm is provided
(see sub-section 2.1). For a complete review of the PLS approach to SEMs refer to
Tenenhaus et al. [2005].

2.1 A BRIEF REVIEW OF THE PLS-PM ALGORITHM
PLS Path Modeling aims to estimate the relationships among Q blocks of MVs
(or indicators), which are an expression of unobservable constructs. Specifically,
PLS-PM estimates, through a system of interdependent equations based on sim-
ple and multiple regressions, the network of relations among the MVs and their
corresponding LV, and among the LVs inside the model. Formally, let us assume
that P quantitative indicators are observed on N units (n = 1, . . . , N). The resulting
data $x_{npq}$ are collected in a partitioned table of centered data $X$:

$$X = [X_1, \dots, X_q, \dots, X_Q],$$

where $x_{pq}$ ($p = 1, \dots, P_q$, with $\sum_{q=1}^{Q} P_q = P$) is a generic indicator belonging to the
q-th block $X_q$. In SEMs we assume that there is a latent variable ($\xi_q$) associated
with each block.
In PLS-PM an iterative procedure allows us to estimate the LV scores, the
weights $w_{pq}$ to be associated with each MV in order to obtain the LV scores, and the
so-called path coefficients linking the LVs. The estimation of the SEM parameters
and of the LV scores is obtained by alternating the outer and
the inner estimations until convergence. Without loss of generality, the indicators
are centered in such a way that they have mean equal to zero. The procedure starts
by choosing arbitrary weights $w_{pq}$. Then, in the external estimation, each LV is
estimated as a linear combination of its own indicators:

$$\nu_q \propto \sum_{p=1}^{P_q} w_{pq} x_{pq} = X_q w_q \qquad (1)$$

where $\nu_q$ is the standardized outer estimation of the q-th LV $\xi_q$ and the symbol $\propto$
means that the left side of the equation corresponds to the standardized right side.
In the internal estimation, each LV is estimated by considering its links with the
other adjacent LVs:

$$\vartheta_q \propto \sum_{q'=1}^{Q} e_{qq'} \nu_{q'} \qquad (2)$$

where $\vartheta_q$ is the standardized inner estimation of the q-th LV $\xi_q$ and the inner
weights ($e_{qq'}$) are equal (in a centroid scheme) to the signs of the correlations
between the q-th LV estimate $\nu_q$ and the $\nu_{q'}$s connected with $\nu_q$, and to zero
for LVs that are not connected. Inner weights can be
obtained following other schemes rather than the centroid one. For a review refer
to Tenenhaus et al. [2005].
Once a first inner estimation of the LVs is obtained, the algorithm continues
by updating the weights $w_{pq}$. Two different ways are available to update the outer
weights, usually related to the two different kinds of measurement model. In a
reflective scheme, each outer weight $w_{pq}$ is the regression coefficient in the simple
regression of the p-th indicator of the q-th block ($x_{pq}$) on the inner estimate of the
q-th LV $\vartheta_q$. Since the inner estimate of the q-th latent variable $\vartheta_q$ is standardized,
the generic outer weight $w_{pq}$ is obtained as the covariance between each indicator
and the corresponding inner estimate of the LV:

$$w_{pq} = \mathrm{cov}(x_{pq}, \vartheta_q) \qquad (3)$$

In the formative scheme, the vector $w_q$ of the weights $w_{pq}$ associated with the indicators
of the q-th block is the regression coefficient vector in the multiple regression
of the inner estimate of the q-th LV $\vartheta_q$ on its centered indicators $X_q$:

$$w_q = (X_q' X_q)^{-1} X_q' \vartheta_q \qquad (4)$$

The algorithm is iterated until convergence. After convergence, structural (or path)
coefficients are estimated through simple/multiple OLS regressions among the
estimated LV scores.
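
The iteration described by equations (1) to (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`standardize`, `pls_pm`, `path_coefficients`) are hypothetical, the indicators are standardized rather than only centered, every LV is assumed to be linked to at least one other LV, and the loop stops when the outer weights no longer change. The simulated data are purely illustrative.

```python
import numpy as np

def standardize(v):
    """Center and scale to unit (population) variance, column by column."""
    v = np.asarray(v, dtype=float)
    v = v - v.mean(axis=0)
    return v / v.std(axis=0)

def pls_pm(blocks, adjacency, modes=None, max_iter=300, tol=1e-8):
    """Sketch of PLS-PM with the centroid inner scheme.

    blocks    : list of (N, P_q) arrays of quantitative indicators
    adjacency : symmetric (Q, Q) 0/1 array, 1 where two LVs are linked
    modes     : "A" (reflective, eq. 3) or "B" (formative, eq. 4) per block
    """
    X = [standardize(b) for b in blocks]          # assumption: standardized MVs
    n, Q = X[0].shape[0], len(X)
    modes = modes or ["A"] * Q
    w = [np.ones(Xq.shape[1]) for Xq in X]        # arbitrary starting weights
    for _ in range(max_iter):
        # outer estimation, eq. (1)
        nu = np.column_stack([standardize(Xq @ wq) for Xq, wq in zip(X, w)])
        # centroid inner weights: sign of the correlation between linked LVs
        e = np.sign(np.corrcoef(nu, rowvar=False)) * adjacency
        # inner estimation, eq. (2)
        theta = standardize(nu @ e.T)
        # outer weight update, eq. (3) or eq. (4)
        w_new = []
        for q, Xq in enumerate(X):
            if modes[q] == "A":
                w_new.append(Xq.T @ theta[:, q] / n)
            else:
                w_new.append(np.linalg.solve(Xq.T @ Xq, Xq.T @ theta[:, q]))
        change = max(np.abs(a - b).max() for a, b in zip(w_new, w))
        w = w_new
        if change < tol:                          # assumed stopping rule
            break
    scores = np.column_stack([standardize(Xq @ wq) for Xq, wq in zip(X, w)])
    return w, scores

def path_coefficients(scores, target, predictors):
    """OLS regression of one LV score on the scores of its predictor LVs."""
    coef, *_ = np.linalg.lstsq(scores[:, predictors], scores[:, target], rcond=None)
    return coef

# Illustrative data: two exogenous LVs acting in opposite directions on a third.
rng = np.random.default_rng(0)
N = 500
xi1, xi2 = rng.normal(size=(2, N))
xi3 = 0.3 * xi1 - 0.7 * xi2 + 0.5 * rng.normal(size=N)

def make_block(xi, p):
    return xi[:, None] + 0.5 * rng.normal(size=(N, p))

blocks = [make_block(xi1, 3), make_block(xi2, 2), make_block(xi3, 3)]
adjacency = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
weights, scores = pls_pm(blocks, adjacency)
beta = path_coefficients(scores, target=2, predictors=[0, 1])
```

Because the LV scores are standardized, the regression in `path_coefficients` needs no intercept. Passing `modes=["B", "A", "A"]` would switch the first block to the formative update of equation (4).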

3. THE ROLE OF CATEGORICAL INDICATORS IN BUILDING SYSTEM OF COMPOSITE INDICATORS VIA PLS-PM APPROACH

PLS-PM is a technique devised to handle quantitative variables. However,
in practice categorical indicators could be used to measure complex concepts as
well. In particular, a categorical variable can play two different roles in PLS-PM:
an active role and a moderating role. An active categorical variable directly
participates in the construction of the system of composite indicators. In other
words, it is a categorical indicator impacting on a composite indicator jointly with
other indicators. A moderating categorical variable, instead, is a variable that does
not play a direct role in the construction of the system of composite indicators (see
figure 2). However, this variable influences the relationships, in terms of strength
and/or direction, between an exogenous and an endogenous variable [Baron et
al., 1986]. The so-called moderating effect can be seen as the effect obtained by
considering several groups of units each defined by a category of the moderating
categorical variable. In this section we investigate the use of categorical variables
both as active indicators and as moderating variables. First, we investigate the
case of moderating categorical variables (see sub-section 3.1). Then, a modified
version of the PLS-PM algorithm able to handle both categorical and quantitative
indicators is proposed (see sub-section 3.2).

Fig. 2: Moderating Variable in a simple SEM.

3.1 USING CATEGORICAL VARIABLES AS MODERATING INDICATORS IN A SYSTEM OF COMPOSITE INDICATORS

Different approaches have been proposed in the literature to model moderating categorical
variables. In particular, we propose to distinguish between manifest moderating
categorical variables and latent moderating categorical variables. Manifest
moderating categorical variables are usually modeled by adding a so-called
interaction term as an additional LV in the model [Kenny et al., 1984]. A latent
moderating categorical variable, instead, is usually considered as an LV defining
latent classes of statistical units.

Manifest moderating categorical variables

Manifest moderating categorical variables define classes of units showing different
behaviors. The effects of these variables can be seen as the effect obtained
by considering several groups of units, each defined by a category of the manifest
moderating categorical variable. In composite indicator frameworks, it is
of great importance to take into account this kind of variable when computing
a system of composite indicators. For instance, gender-specific indexes, such as
the GDI (Gender-related Development Index) of the United Nations Development
Programme, involve taking into account the same variables for females and males. In
other words, the gender variable plays the role of a manifest moderating categorical
variable.
A first attempt to take into account moderating variables in PLS-PM by in-
cluding interaction effects was made by Chin et al. [2003]. In a simple model,
with only one exogenous variable and one endogenous variable, the interaction
term is obtained as the product of the indicators linked to the exogenous latent
variable and the moderating variable (see figure 3). In such a model, it does not
matter which variable is moderating and which is exogenous. Moreover, problems
arise in the interpretation of the product term. Since then, other proposals have
been presented for modeling moderating effects in a PLS-PM framework, such
as those of Henseler et al. [to appear] and of Tenenhaus et al. [to appear]. In
particular, Henseler et al. [to appear] have proposed using a two-step procedure to
include product terms. In the first step, they suggest performing PLS-PM by con-
sidering both the exogenous variable and the moderating variable as independent
LVs in the model. Once the LV scores are estimated, the product term is computed
as the elementwise product of the exogenous LV scores and the moderating LV
score. A multiple linear regression between the endogenous LV scores, the moderating
LV score, and the product term LV scores is then performed. A scheme of
the procedure proposed by Henseler et al. [to appear] is shown in figure 4.

Fig. 3: Creating an interaction term in a simple SEM by product.
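
The second stage of the two-step procedure can be sketched as follows. This is only an illustration under assumptions: the LV scores are taken as given (they would come from a first-stage PLS-PM run), the exogenous LV scores are included as a predictor alongside the moderator and the product term, and the function name `two_stage_interaction` and the simulated scores are hypothetical.

```python
import numpy as np

def two_stage_interaction(xi_exo, xi_mod, xi_endo):
    """Second stage: regress the endogenous LV scores on the exogenous scores,
    the moderating scores and their element-wise product.

    All inputs are 1-D arrays of LV scores estimated in a first-stage PLS-PM run.
    """
    product = xi_exo * xi_mod                          # interaction (product) term
    design = np.column_stack([xi_exo, xi_mod, product])
    design = design - design.mean(axis=0)              # centering absorbs the intercept
    response = xi_endo - xi_endo.mean()
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return dict(zip(["exogenous", "moderator", "interaction"], coef))

# Illustrative scores with a known moderating effect of 0.4.
rng = np.random.default_rng(0)
n = 2000
xi_exo = rng.normal(size=n)
xi_mod = rng.normal(size=n)
xi_endo = (0.5 * xi_exo + 0.2 * xi_mod + 0.4 * xi_exo * xi_mod
           + 0.5 * rng.normal(size=n))
effects = two_stage_interaction(xi_exo, xi_mod, xi_endo)
```

A coefficient on the product term that is clearly different from zero indicates that the strength of the relation between the exogenous and the endogenous LV changes with the moderator.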
Groups of units showing different behaviors are not always known a priori.
In other words, heterogeneity in the models can only rarely be captured by a well-known
manifest variable playing the role of a moderating categorical variable. In
this case, it is interesting to look for a latent moderating categorical variable which
splits units in classes showing different behaviors. This approach ensures a double
result: it checks if there is actually heterogeneity in the data (i.e. if all the units
are not well described by a unique model), and it provides classes of units that are
homogeneous with respect to the model parameters.

Latent moderating categorical variables

Methods to detect latent moderating categorical variables are becoming more and
more popular in statistics. In modeling the real world, it is reasonable to expect
that different classes showing heterogeneous behaviors may exist in the observed
set of units. This is true also in composite indicator frameworks. As a matter of
fact, in developing a system of composite indicators, it is reasonable to suppose
that different models, i.e. different systems of weights, should be applied in order
to take into account differences among units. Furthermore, in these frameworks
too, it is of great importance to obtain clusters of units that are homogeneous with
regard to the weights to be applied in computing the composite indicators. Several
clustering techniques have been developed in SEM and in PLS-PM to look for
latent classes. Among them, some techniques allow us to obtain latent classes by
taking into account the causal structure of the model: the so-called response-based
clustering techniques [Trinchera, 2007]. When information concerning the causal
relationships among variables is available (as it is in the theoretical causal network
of relationships defining an SEM), classes should be looked for while taking into
account this relevant piece of information. That is why response-based methods
are to be preferred to classic clustering techniques, such as cluster analysis. As
a matter of fact, in response-based clustering methods, the obtained classes are
homogeneous with respect to the postulated model, i.e. with respect to the weights
used to compute composite and complex indicators. This approach to clustering
is in contrast to the traditional clustering approaches, where classes are defined
according to information which is not related to the existing model but depends
on external criteria.

Fig. 4: Henseler and Fassot procedure to model an interaction effect in a simple SEM.
Response-based clustering techniques allow us to obtain local models, i.e.
class-specific models. Each local model is characterized by class-specific param-
eters. In other words, these methods assume that in the observed data-set several
groups of units exist, each characterized by different models. Two main methods
exist to obtain response-based clusters in PLS-PM: the Finite Mixture PLS (FIMIX-PLS),
proposed by Hahn et al. [2002], and the REsponse Based Unit Segmentation in PLS
Path Model (REBUS-PLS) [Trinchera, 2007; Esposito Vinzi et al., 2008].
FIMIX-PLS is an extension of the Finite Mixture Model [McLachlan et al.,
2000] to PLS-PM. Being based on the EM algorithm, it requires a normal distribution,
at least for the LV scores. This is not in line with the features of PLS-PM,
which is a distribution-free technique.
In order to overcome the main limits of FIMIX-PLS a new method for la-
tent class detection in PLS-PM has recently been developed: the REBUS-PLS
[Trinchera, 2007; Esposito Vinzi et al., 2008]. Unlike FIMIX-PLS and accord-
ing to PLS-PM features, REBUS-PLS does not require distributional hypotheses.
Moreover, REBUS-PLS has been developed so as to detect heterogeneity both in
the structural and the measurement models. The idea is that if latent classes ex-
ist, units belonging to the same latent class will have similar local models, i.e. a
similar performance with regard to the global model. Moreover, if a unit is cor-
rectly assigned to a latent class, its performance in the local model computed for
that class will be better than the performance obtained by the same unit in all the
other local models. For these reasons the units are assigned to the latent classes
according to a closeness measure (CM), that takes into account the residuals of a
unit with respect to each local model. The chosen CM is defined in order to obtain
local models that are better fitted than the global model for both the measurement
and the structural models. For more details refer to the original REBUS-PLS
papers [Trinchera, 2007; Esposito Vinzi et al., 2008].
3.2 USING CATEGORICAL INDICATORS AS MANIFEST VARIABLES IN A PLS PATH MODEL
Until now, composite indicators have been obtained only as a mathematical combination
of single (quantitative) indicators [Saisana et al., 2002]. However, taking
categorical indicators into account as well when building composite indicators is very
appealing. For instance, when computing complex and composite indicators, it
could be interesting to consider demographic variables, such as religion or gender,
and/or categorical variables defining states, such as type of government. It is for
this reason that here we propose a modified version of the PLS-PM algorithm able
to consider also categorical indicators.


The most common approach for introducing a categorical indicator $x^*_{pq}$ with
$L_{pq}$ modalities as an MV in PLS-PM is to replace it with the corresponding
dummy matrix $\tilde{X}_{pq}$. Each element $\tilde{x}_{il}$ of $\tilde{X}_{pq}$ is equal to one if the i-th observation
shows the $l_{pq}$-th modality ($l_{pq} = 1, \dots, L_{pq}$); otherwise it is zero. However,
this approach has an important drawback: it measures the impact of each modality
of the categorical indicator on the composite indicator. As a consequence, the
global influence of the categorical indicator is not directly measured. Moreover,
the indirect weight of the categorical indicator in the construction of a CI increases
as the number of its modalities increases. To overcome these drawbacks,
quantification-based techniques have recently been proposed in the literature. These
algorithms assign a numeric value to each category in order to obtain quantified
indicators that can be handled as if they were quantitative.
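
The dummy coding described above can be illustrated with a short NumPy sketch. The helper name `dummy_matrix` and the example values are hypothetical; the point is only that a categorical indicator with L modalities enters the block as L separate columns.

```python
import numpy as np

def dummy_matrix(categories):
    """Return the modalities and the N x L indicator (dummy) matrix."""
    levels, codes = np.unique(np.asarray(categories), return_inverse=True)
    codes = codes.reshape(-1)
    dummies = np.zeros((codes.size, levels.size))
    dummies[np.arange(codes.size), codes] = 1.0    # exactly one 1 per row
    return levels, dummies

# Hypothetical categorical indicator with three modalities.
regime = ["stable democracy", "dictatorship", "unstable democracy",
          "stable democracy", "dictatorship"]
levels, dummies = dummy_matrix(regime)
# Each modality becomes its own column, so the block gains L columns
# (and L separate weights) instead of a single indicator.
```

With this coding the model returns one weight per column of `dummies`, which is why the global influence of the categorical indicator is not measured directly.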
The Partial Maximum Likelihood (PML) algorithm [Jakobowicz et al., 2007] is
an adapted version of PLS-PM aimed at generalizing the PLS-PM approach when
the indicators are of different types. The authors of PML advise using it because
this technique seems to significantly improve the quality of the model when
a number of indicators are nominal and/or boolean. However, PML provides the
impact of each category, while the global impact of each categorical indicator is
not provided by the algorithm but is indirectly calculated a posteriori. Furthermore,
it is not specified how these impacts can be interpreted. As a matter of fact,
the PML algorithm can be seen as an optimal scaling procedure without a well
specified optimality criterion.
Here, we propose a modified PLS-PM algorithm for the optimal scaling of
categorical indicators. We extend the PLS-CAP approach [Russolillo, 2008], used
for handling categorical predictors in PLS Regression (see Tenenhaus [1998] for
a review), in order to obtain a suitable quantification of reflective categorical indicators
in PLS-PM. The PLS-CAP algorithm can be naturally extended to PLS-PM
because PLS Regression is a particular PLS Path Model with two reflective blocks
[Tenenhaus et al., 2005]. This approach makes it possible to handle categorical
variables as if they were measured on an interval scale. The aim is to provide an
optimal scaling of the categorical indicators in such a way that their weights in
the construction of the LV can be interpreted as functions of the LV variance ex-
plained by the categories. This quantification criterion ensures that the role of the
quantified indicators is coherent with the explicative ability of the corresponding
categorical indicator. In order to obtain quantifications with such properties, we
propose a modified PLS-PM algorithm that estimates at the same time both the
model parameters and the scaling parameters of the categorical indicators.


In the modified PLS-PM algorithm the computation of the LVs starts with an
arbitrary choice of their inner estimates $\vartheta_1, \dots, \vartheta_Q$. Afterwards, the iterative procedure
begins with a new first step, which is added to each cycle of the iteration. It
is a quantification step, in which each categorical indicator is transformed into a
quantitative one; this new quantified indicator $x_{pq}$ is obtained as the normalized
orthogonal projection of $\vartheta_q$ on the space spanned by the columns of $\tilde{X}_{pq}$. From a
computational point of view,

$$x_{pq} \propto \tilde{X}_{pq} (\tilde{X}_{pq}' \tilde{X}_{pq})^{-1} \tilde{X}_{pq}' \vartheta_q \qquad (5)$$
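
As a hedged illustration of equation (5), the sketch below projects an inner estimate on the columns of a dummy matrix and normalizes the result. The names (`dummy_matrix`, `quantify`) and the toy values of `theta` are assumptions made for the example. Since the columns of the dummy matrix are mutually orthogonal, the projection simply replaces each observation by the mean of the inner estimate within its category.

```python
import numpy as np

def dummy_matrix(categories):
    """Return category codes and the N x L indicator matrix."""
    _, codes = np.unique(np.asarray(categories), return_inverse=True)
    codes = codes.reshape(-1)
    dummies = np.zeros((codes.size, codes.max() + 1))
    dummies[np.arange(codes.size), codes] = 1.0
    return codes, dummies

def quantify(dummies, theta):
    """Normalized orthogonal projection of theta on the dummy columns, eq. (5)."""
    projection = dummies @ np.linalg.solve(dummies.T @ dummies, dummies.T @ theta)
    projection = projection - projection.mean()
    return projection / projection.std()   # undefined if all category means coincide

# Toy example: a three-category indicator and a hypothetical inner estimate.
regime = np.array(["stable", "unstable", "dictatorship", "stable",
                   "dictatorship", "unstable", "stable", "dictatorship"])
theta = np.array([-1.2, 0.1, 1.0, -0.8, 1.4, 0.3, -1.0, 0.2])
theta = (theta - theta.mean()) / theta.std()

codes, dummies = dummy_matrix(regime)
x_quantified = quantify(dummies, theta)

# The raw projection equals the within-category means of theta.
raw = dummies @ np.linalg.solve(dummies.T @ dummies, dummies.T @ theta)
category_means = np.array([theta[codes == k].mean() for k in range(dummies.shape[1])])
```

The quantified indicator therefore takes one numeric value per category, and those values are ordered like the category means of the inner estimate.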

In the second step, outer weights are updated following the reflective scheme,
according to equation 3, and in the third step the outer estimations of the LVs are calculated
according to equation 1. The iterative cycle continues with the update of the
inner weights (step 4) and it is closed by the inner estimation of each LV according
to equation 2 (step 5). Once new outer estimates are computed, the cycle restarts
with the quantification step and it is iterated until the convergence between the
inner and the outer estimations is reached. After convergence, structural (or path)
coefficients are estimated in the same way as in the classic PLS-PM algorithm.
A schematic description of the algorithm is given below.
This procedure yields as output both scaling and model parameters. It ensures
that quantified indicators show suitable properties in terms of optimality
and interpretability. The scaling parameters maximize the correlation between
the quantified indicator and the inner estimate of its corresponding LV, and as a
consequence its weight in the construction of the LV in a reflective scheme (see
equation 3). Moreover, the weight of each quantified indicator can also be expressed
in terms of that part of the variability of $\vartheta_q$ explained by the modalities of
$x^*_{pq}$. In particular, it is possible to show the following equivalence:

$$\rho_{x_{pq}, \vartheta_q} = \eta_{x^*_{pq}, \vartheta_q} \qquad (6)$$

Hence, the weight of a quantified indicator $x_{pq}$ reflects the predictive capability
of the categories of $x^*_{pq}$ with respect to $\vartheta_q$, as measured by the square root
of the correlation ratio.
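
The equivalence in equation (6) can be checked numerically. The sketch below is an illustration on simulated values, with hypothetical names (`quantified_correlation`, `correlation_ratio`): it quantifies a categorical indicator by the within-category means of a standardized inner estimate and compares the resulting correlation with the square root of the correlation ratio (between-category sum of squares over total sum of squares).

```python
import numpy as np

def quantified_correlation(codes, theta):
    """Correlation between theta and the indicator quantified by category means."""
    means = np.array([theta[codes == k].mean() for k in range(codes.max() + 1)])
    x = means[codes]                       # projection of theta on the dummy columns
    x = (x - x.mean()) / x.std()           # normalized quantified indicator
    return np.corrcoef(x, theta)[0, 1]

def correlation_ratio(codes, theta):
    """Square root of the between-category share of the variance of theta."""
    grand_mean = theta.mean()
    ss_between = sum((codes == k).sum() * (theta[codes == k].mean() - grand_mean) ** 2
                     for k in range(codes.max() + 1))
    ss_total = ((theta - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

# Simulated three-category indicator related to a standardized inner estimate.
rng = np.random.default_rng(1)
codes = rng.integers(0, 3, size=200)
theta = rng.normal(size=200) + 0.8 * codes
theta = (theta - theta.mean()) / theta.std()

rho = quantified_correlation(codes, theta)
eta = correlation_ratio(codes, theta)
```

The two quantities agree because the correlation between a vector and its orthogonal projection equals the ratio of their norms, which here is the square root of the explained share of the variance.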

Algorithm 1 PLS-CAP algorithm extended to PLS Path Modeling

Input: $X^*_1, \dots, X^*_q, \dots, X^*_Q$ blocks of mixed (quantitative and categorical) MVs;
Output: $X_q$, $\beta_j$, $w_q$, $\xi_q$;
for all q = 1, . . . , Q do
    initialize $\vartheta_q$
end for
repeat
    for all q = 1, . . . , Q do
        for all p = 1, . . . , $P_q$ do
            if $x^*_{pq}$ is categorical then
                Step 1: $x_{pq} \propto \tilde{X}_{pq} (\tilde{X}_{pq}' \tilde{X}_{pq})^{-1} \tilde{X}_{pq}' \vartheta_q$    - quantification step -
            else
                Step 1: $x_{pq} \propto x^*_{pq}$
            end if
        end for
        Step 2: $w_{pq} = \mathrm{cor}(x_{pq}, \vartheta_q)$    - update of $w_q$ -
        Step 3: $\nu_q \propto \sum_{p=1}^{P_q} w_{pq} x_{pq} = X_q w_q$    - outer estimation -
        Step 4: $e_{qq'} = \mathrm{sign}(\mathrm{cor}(\nu_q, \nu_{q'}))$    - update of $e_{qq'}$ (centroid scheme) -
        Step 5: $\vartheta_q \propto \sum_{q'=1}^{Q} e_{qq'} \nu_{q'}$    - inner estimation -
    end for
until $\vartheta_q = \nu_q$
for all q = 1, . . . , Q do
    $\xi_q \propto X_q w_q$    - computation of LV scores -
end for
for all j = 1, . . . , J do
    $\beta_j = (\Xi_q' \Xi_q)^{-1} \Xi_q' \xi_j$    - computation of path coefficients -
end for
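
A compact NumPy sketch of the iterative part of Algorithm 1 is given below. It is an illustration under several assumptions rather than the authors' code: the names (`standardize`, `quantify`, `pls_cap_pm`) are hypothetical, string-valued arrays are treated as categorical, each block is assumed to contain at least one quantitative MV (used for the initialization), all outer estimates are computed before the inner step, and the loop stops when the inner estimates stabilize between two iterations. Path coefficients would then follow by OLS on the returned scores, as in the classic algorithm.

```python
import numpy as np

def standardize(v):
    v = np.asarray(v, dtype=float)
    v = v - v.mean(axis=0)
    return v / v.std(axis=0)

def is_numeric(x):
    return np.issubdtype(x.dtype, np.number)

def quantify(categories, theta_q):
    """Step 1 for a categorical MV: within-category means of theta_q, standardized
    (the normalized projection of theta_q on the dummy columns)."""
    _, codes = np.unique(categories, return_inverse=True)
    codes = codes.reshape(-1)
    means = np.array([theta_q[codes == k].mean() for k in range(codes.max() + 1)])
    return standardize(means[codes])

def pls_cap_pm(blocks, adjacency, max_iter=500, tol=1e-8):
    """blocks[q] is a list of 1-D arrays; string arrays are treated as categorical."""
    cols = [[np.asarray(x) for x in block] for block in blocks]
    # Assumed initialization: standardized sum of each block's quantitative MVs.
    theta = np.column_stack([
        standardize(sum(standardize(x) for x in block if is_numeric(x)))
        for block in cols])
    for _ in range(max_iter):
        X, w, nu = [], [], []
        for q, block in enumerate(cols):
            Xq = np.column_stack([
                standardize(x) if is_numeric(x) else quantify(x, theta[:, q])
                for x in block])                          # Step 1
            wq = Xq.T @ theta[:, q] / Xq.shape[0]         # Step 2: cor(x_pq, theta_q)
            X.append(Xq)
            w.append(wq)
            nu.append(standardize(Xq @ wq))               # Step 3
        nu = np.column_stack(nu)
        e = np.sign(np.corrcoef(nu, rowvar=False)) * adjacency   # Step 4
        theta_new = standardize(nu @ e.T)                        # Step 5
        converged = np.abs(theta_new - theta).max() < tol        # assumed stopping rule
        theta = theta_new
        if converged:
            break
    return X, w, nu   # quantified blocks, outer weights, LV scores

# Illustrative data: the third block mixes two quantitative MVs and one categorical MV.
rng = np.random.default_rng(0)
N = 400
xi1, xi2 = rng.normal(size=(2, N))
xi3 = 0.3 * xi1 - 0.7 * xi2 + 0.5 * rng.normal(size=N)

def noisy(xi):
    return xi + 0.5 * rng.normal(size=N)

regime = np.where(xi3 < -0.5, "stable", np.where(xi3 < 0.5, "unstable", "dictatorship"))
blocks = [[noisy(xi1), noisy(xi1)], [noisy(xi2), noisy(xi2)],
          [noisy(xi3), noisy(xi3), regime]]
adjacency = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
X, w, scores = pls_cap_pm(blocks, adjacency)
```

In this sketch the categorical indicator receives a single weight, `w[2][2]`, which by equation (6) lies between 0 and 1 and can be compared directly with the weights of the quantitative indicators in the same block.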

4. AN APPLICATION TO THE RUSSET DATA

The data for this example are taken from a paper by Russet [1964]. The
basic hypothesis in Russet’s paper is that economic inequality leads to political
instability. In particular, in the Russet model political instability is a function of
inequality of land distribution and of industrial development. Three indicators
are used to measure inequality of land distribution. The indicator “gini” is Gini’s
index of concentration, which measures the deviation of the Lorenz curve from the
line of equality. The indicator “farm” is the percentage of farmers that own half of
the land, starting with the smallest ones. Thus if “farm” is 90%, then 10% of the
farmers own the other half of the land. The third indicator is “rent”, which is the percentage
of farm households that rent all their land. Two indicators are used to measure
industrial development: the indicator “gnpr” is the gross national product per capita
(in U.S. dollars) in 1955, and the indicator “labo” is the percentage of labor force
employed in agriculture. Political instability is measured by four indicators. The
indicator “inst” is a function of the number of the chiefs of the executive and of
the number of years of independence of the country during the period 1946-1961.
This index ranges between 0 (very stable) and 17 (very unstable). The indicator
“ecks” is Eckstein’s index, which measures the number of violent internal war
incidents during the same period. The indicator “death” is the number of people
killed as a result of violent manifestations during the period 1950-1962. The
indicator “demo” classifies countries in three groups: stable democracy, unstable
democracy and dictatorship.
This data-set was already analyzed in Gifi [1990] and in Tenenhaus [1998].
In particular, Tenenhaus [1998] modeled the Russet data-set in a PLS-PM framework
(see figure 5). He partitioned the Russet data-set into three reflective blocks.
The first block, consisting of the indicators “gini”, “farm” and “rent”, measures
the composite indicator “Agricultural Inequality”. The second one, formed by the
indicators “gnpr” and “labo”, measures the composite indicator “Industrial Development”.
The third block, composed of the indicators “inst”, “ecks”, “death”
and “demo”, expresses the composite indicator “Political Instability”. Tenenhaus
performed a PLS-PM analysis on the model defined in figure 5 by using the centroid
option for inner weight estimation and handling all the blocks as reflective.
Moreover, since Gifi's analysis suggested a high degree of non-linearity in the data,
Tenenhaus approximated Gifi's results by means of monotone functional transformations.
Fig. 5: Russet data as modeled by Tenenhaus.

Tenenhaus’ results are represented in figure 6. The relations between
Agricultural Inequality, Industrial Development and Political Instability can be
investigated through the path coefficients shown in the figure: the two latent
predictors affect the response in opposite directions. However, Political
Instability depends far more on Industrial Development than on Agricultural
Inequality. The higher the Industrial Development is, the lower the Political
Instability is.
As one can expect, the indicators “gini”, “farm” and “l_rent” are positively
correlated with the CI ξ1, which measures the Agricultural Inequality. The CI
Industrial Development is positively affected by the gross national product
(indicator “l_gnpr”) and negatively affected by the percentage of agricultural workers
(indicator “l_labo”). All of the indicators of the block representing Political
Instability have a positive impact on the CI ξ3, except for the binary indicator “d-stb”, which
indicates the countries with a stable democratic regime.
Fig. 6: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: model parameter
estimates.

Tab. 1: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: outer model results.
LV    MV        Outer weights   Stand. load.   Comm.    Red.
ξ1    gini          0.460          0.977       0.955
      farm          0.516          0.986       0.972
      l_rent        0.081          0.516       0.266
ξ2    l_gnpr        0.511          0.950       0.903
      l_labo       -0.538         -0.955       0.912
ξ3    e_inst        0.104          0.352       0.124    0.077
      l_ecks        0.270          0.816       0.665    0.414
      l_death       0.302          0.794       0.630    0.392
      d-stb        -0.336         -0.866       0.749    0.466
      d-inst        0.037          0.094       0.009    0.006
      dict          0.285          0.733       0.537    0.334

It is not clear whether the overall weight of the indicator “demo”, expressed by
the three dummy indicators “d-stb”, “d-inst” and “dict”, is high or low. While the
weights of “d-stb” and “dict” are large, the weight of “d-inst” is almost zero (see
table 1). As a matter of fact, the weight of the binary variable “d-inst” is so small
precisely because there is a strong relation between the categorical indicator “demo”
and the CI Political Instability. The category “d-stb” is mainly associated
with the observations having the lowest values of ξ3, the category “dict” is
mainly associated with the observations having the highest values of the CI, and the
category “d-inst” is mainly associated with the observations in the central part
of the political instability score distribution. Hence, there is a strong relation between
ξ3 and all of the binary variables representing the categories of the indicator “demo”.
Unfortunately, while the relations between ξ3 and the binary variables “dict” and “d-stb”
are roughly monotone (and so can be approximated well by a linear function),
the binary variable “d-inst” is linked to ξ3 by a non-monotonic relation (see figure
7). As a consequence, the importance of this variable is underestimated in the model.
[Figure 7: three panels, one per category of “demo”; vertical axis: binary values; horizontal axis: political instability scores.]

Fig. 7: Raw values of binary variables corresponding to categories of indicator “demo” plotted
versus the CI Political Instability values.
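
The effect described above can be reproduced on a toy example. The sketch below uses made-up scores (not the Russet data) in which three regimes occupy the low, central and high parts of a score distribution; the names `scores`, `regime` and `pearson` are illustrative assumptions. The dummy of the central category is linked to the scores by a non-monotonic relation, so its linear correlation is close to zero even though the regime as a whole separates the scores almost perfectly.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical instability scores (NOT the Russet data): the three regimes
# sit in the low, central and high part of the distribution respectively.
scores = [-1.2, -1.0, -0.8, -0.2, 0.0, 0.2, 0.8, 1.0, 1.2]
regime = ["stb"] * 3 + ["inst"] * 3 + ["dict"] * 3

# Binary (dummy) coding of each category and its correlation with the scores.
corr = {}
for cat in ("stb", "inst", "dict"):
    dummy = [1.0 if r == cat else 0.0 for r in regime]
    corr[cat] = pearson(dummy, scores)

for cat, r in corr.items():
    print(f"{cat:>4}: r = {r:+.3f}")
```

In this toy setting the dummies of the two extreme categories are strongly correlated with the scores (with opposite signs), while the dummy of the central category is essentially uncorrelated with them, which mirrors the pattern of the “d-stb”, “dict” and “d-inst” weights.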

The quality of Tenenhaus’ model can be assessed from table 2. As regards
the inner model, a good part of the variability of the latent response ξ3
(“Political Instability”) is explained by the two latent predictors, with an R2 value of
0.622. With respect to the quality of the outer model, the mean Communalities
of the exogenous blocks (i.e. for each block, the average of the squared correlations
between each indicator and the corresponding CI) are satisfactory. However, the CI
“Political Instability” explains only 45.2% of the variability of its own indicators. This is
due to the presence of the binary variables among the indicators related to the CI
“Political Instability”.
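
As a check on how these quality indices relate to each other, the following sketch recomputes them for the Political Instability block from the standardized loadings in table 1 and the R2 in table 2. It assumes the usual PLS-PM conventions: the communality of an indicator is its squared standardized loading, and its redundancy is the communality multiplied by the R2 of the corresponding CI. The variable names are illustrative, and small discrepancies with the tables come from rounding of the published loadings.

```python
# Standardized loadings of the Political Instability block, copied from table 1.
loadings = {
    "e_inst": 0.352,
    "l_ecks": 0.816,
    "l_death": 0.794,
    "d-stb": -0.866,
    "d-inst": 0.094,
    "dict": 0.733,
}
r2 = 0.622  # R2 of the CI Political Instability, from table 2

# Communality: squared correlation between the indicator and its CI.
communality = {mv: load ** 2 for mv, load in loadings.items()}

# Redundancy (assumed definition): communality times the R2 of the CI.
redundancy = {mv: c * r2 for mv, c in communality.items()}

mean_comm = sum(communality.values()) / len(communality)
mean_red = sum(redundancy.values()) / len(redundancy)

for mv in loadings:
    print(f"{mv:>8}: comm = {communality[mv]:.3f}, red = {redundancy[mv]:.3f}")
print(f"mean communality = {mean_comm:.3f}, mean redundancy = {mean_red:.3f}")
```

The recomputed means agree with the values 0.452 and 0.282 of table 2 to within rounding error, and the near-zero communality of “d-inst” is what pulls the block average down.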
In order to overcome the drawbacks of binary coding, we use the PLS-CAP
algorithm, suitably modified to run PLS-PM analyses (see section 3). In
this analysis, the categorical indicator “demo” is quantified directly instead of
being coded through dummy variables. The new
model is represented in figure 8. Now the CI Political Instability is expressed by
just four indicators: “e_inst”, “l_ecks”, “l_death” and “demo”. The quality of this
model is summarized in table 3.
Fig. 8: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): model
parameter estimates.

Tab. 2: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: model assessment.
LV    R2       Mean Comm.   Mean Red.
ξ1               0.731
ξ2               0.907
ξ3    0.622      0.452        0.282

Tab. 3: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): model
assessment.
LV    R2       Mean Comm.   Mean Red.
ξ1               0.737
ξ2               0.908
ξ3    0.589      0.572        0.337

Tab. 4: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): outer
model results.
LV    MV        Outer weights   Stand. load.   Comm.    Red.
ξ1    gini          0.455          0.973       0.947
      farm          0.502          0.984       0.968
      l_rent        0.117          0.543       0.294
ξ2    l_gnpr        0.514          0.951       0.904
      l_labo       -0.536         -0.955       0.911
ξ3    e_inst        0.127          0.375       0.140    0.083
      l_ecks        0.329          0.853       0.728    0.429
      l_death       0.370          0.826       0.682    0.402
      demo          0.427          0.859       0.739    0.435

Compared with the previous one, this model predicts the latent response less
well (R2 falls from 0.622 to 0.589), while it gains in explanatory power for the
indicators underlying the concept of Political Instability (the mean Communality of
the block rises from 0.452 to 0.572). The mean Communalities
of the other two blocks remain about the same. Overall, the global model fit
improves, as GoF passes from 0.617 to 0.643.
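
The two GoF values can be approximately recovered from tables 2 and 3. The sketch below assumes the GoF definition of Tenenhaus et al. [2005], i.e. the square root of the product of the average communality (averaged over all manifest variables, so block means are weighted by block size) and the average R2 of the endogenous CIs. The function name `gof` and its arguments are illustrative; because the inputs are rounded table values, the results match the reported figures only to about the third decimal.

```python
from math import sqrt

def gof(block_sizes, mean_communalities, r2_values):
    """GoF index: sqrt(average communality * average R2).

    The average communality is taken over all manifest variables, so each
    block mean is weighted by the number of indicators in the block.
    """
    n_mvs = sum(block_sizes)
    avg_comm = sum(p * c for p, c in zip(block_sizes, mean_communalities)) / n_mvs
    avg_r2 = sum(r2_values) / len(r2_values)
    return sqrt(avg_comm * avg_r2)

# Model with "demo" coded through three dummies (table 2):
# blocks of 3, 2 and 6 indicators, one endogenous CI.
gof_dummy = gof([3, 2, 6], [0.731, 0.907, 0.452], [0.622])

# Model with "demo" quantified as a single indicator (table 3):
# blocks of 3, 2 and 4 indicators, one endogenous CI.
gof_quant = gof([3, 2, 4], [0.737, 0.908, 0.572], [0.589])

print(f"GoF, dummy coding:   {gof_dummy:.3f}")
print(f"GoF, quantification: {gof_quant:.3f}")
```

The decomposition makes the trade-off explicit: the loss in R2 is more than offset by the higher average communality once the poorly fitting dummy indicators are replaced by a single quantified indicator.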
This analysis makes it clear that the indicator “demo” is the most important
in the construction of the CI Political Instability: it has the largest outer weight
in its block (see table 4). According to these
results we can conclude that the categories of the indicator “demo” discriminate
strongly among the Political Instability scores. In fact, the weight of
an MV quantified at a nominal scaling level reflects the variability of the
corresponding CI explained by the categories of the original categorical indicator (see
equation 6).
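
The link between a nominal quantification and the variability explained by the categories can be illustrated with a small sketch. It assumes, purely for illustration, that each category is quantified by the mean of the CI scores of the units in that category (a common nominal-scaling choice, not necessarily the exact quantification step of the algorithm in section 3), and it uses made-up scores rather than the Russet data. Under this assumption the squared correlation between the quantified variable and the scores equals the correlation ratio, i.e. the share of the score variance explained by the categories.

```python
from math import sqrt

# Hypothetical CI scores and regime labels (NOT the Russet data).
scores = [-1.2, -1.0, -0.8, -0.2, 0.0, 0.2, 0.8, 1.0, 1.2]
regime = ["stb"] * 3 + ["inst"] * 3 + ["dict"] * 3

# Nominal quantification (assumed): each category gets the mean score of its units.
groups = {}
for cat, s in zip(regime, scores):
    groups.setdefault(cat, []).append(s)
cat_mean = {cat: sum(v) / len(v) for cat, v in groups.items()}
quantified = [cat_mean[cat] for cat in regime]

# Correlation ratio: between-category sum of squares over total sum of squares.
grand_mean = sum(scores) / len(scores)
ss_total = sum((s - grand_mean) ** 2 for s in scores)
ss_between = sum(len(v) * (cat_mean[cat] - grand_mean) ** 2
                 for cat, v in groups.items())
eta2 = ss_between / ss_total

# Pearson correlation between the quantified variable and the scores.
mq = sum(quantified) / len(quantified)
sxy = sum((q - mq) * (s - grand_mean) for q, s in zip(quantified, scores))
sxx = sum((q - mq) ** 2 for q in quantified)
r = sxy / sqrt(sxx * ss_total)

print(f"eta^2 = {eta2:.4f}, r^2 = {r ** 2:.4f}")
```

A single quantified variable therefore carries the whole discriminating power of the categories, including that of a central category whose dummy would be almost uncorrelated with the scores.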

5. CONCLUSION AND PERSPECTIVES

SEMs, and PLS Path Models in particular, are very useful tools for computing
composite and complex indicators. However, such models take into account only
quantitative indicators. Until now, composite indicators have been obtained only
as a mathematical combination of (quantitative) indicators [Saisana et al., 2002].
Nevertheless, including categorical indicators in the construction of composite
indicators is an appealing prospect. For instance, when computing complex and composite
indicators, it could be interesting to take into account demographic variables, such
as religion or gender, and/or categorical variables defining states, such as type of
government. All these variables can play different roles: they can play the role of
a manifest moderating categorical variable (such as the variable gender in
computing the GDI index); they can define latent classes of units showing different
systems of weights; or they can be used as categorical indicators in computing the
composite indicators. In this work we have reviewed a suite of statistical
methodologies for handling categorical indicators with respect to the role they play in a
system of composite indicators. In particular, a new algorithm to handle
categorical indicators in PLS-PM has been proposed.
As far as future developments are concerned, the identification of a
compromise model needs to be further investigated. In particular, a two-step strategy
(combining the REBUS-PLS algorithm and the proposed version of the PLS-PM
algorithm) can be implemented in order to obtain a compromise model that accounts for
the unobserved heterogeneity detected in PLS-PM. Moreover, in composite
indicator frameworks it is of great importance to include a priori information on the
weights defining composite indicators. Further research is needed on how to apply
constraints on the weights so as to take this information into account in PLS-PM.

REFERENCES
BARON, R.M., KENNY, D.A. (1986), The Moderator-Mediator Variable Distinction in Social
Psychological Research: Conceptual, Strategic, and Statistical Considerations, Journal of
Personality and Social Psychology, 51 (6), 1173-1182.
BOLLEN, K. A. (1989), Structural equations with latent variables, Wiley, New York.
CHIN, W. (2003), A permutation procedure for multi-group comparison of PLS models, in PLS and
related methods - Proceedings of the International Symposium PLS’03, M. Vilares et al. (eds),
DECISIA, 33-43.
ESCOFIER, B., PAGÈS, J. (1994), Multiple factor analysis (AFMULT package), Computational
Statistics and Data Analysis, 18, 121-140.
ESPOSITO VINZI, V., TRINCHERA, L., SQUILLACCIOTTI, S., TENENHAUS, M. (2008),
REBUS-PLS: A response - based procedure for detecting unit segments in PLS path modeling,
Applied Stochastic Models in Business and Industry (ASMBI), 24 (5), 439-458.
GIFI, A. (1990), Nonlinear Multivariate Analysis, Chichester, UK: Wiley.
GORSUCH, R. L. (1983), Factor Analysis, Lawrence Erlbaum (2nd edition), Mahwah, New Jersey.
HAHN, C., JOHNSON, M., HERRMANN, A., HUBER, F. (2002), Capturing Customer Heterogeneity
using a Finite Mixture PLS Approach, Schmalenbach Business Review, 54, 243-269.
HENSELER, J., FASSOTT, G. (2009), Testing moderating effects in PLS path models: An illustration
of available procedures, in Handbook of Partial Least Squares - Concepts, Methods and
Applications, V. Esposito Vinzi et al. (eds), Springer, Berlin, Heidelberg, New York.
HOTELLING, H. (1933), Analysis of a complex of statistical variables into principal components,
Journal of Educational Psychology, 24.
KAPLAN, D. (2000), Structural Equation Modeling: Foundations and Extensions, Sage Publications
Inc., Thousand Oaks, California.
KENNY, D., JUDD, C. (1984), Estimating the nonlinear and interactive effects of latent variables,
Psychological Bulletin, 96, 201-210.
McLACHLAN, G.J., PEEL, D. (2000), Finite Mixture Models, John Wiley & Sons, Inc., New York,
Chichester, Weinheim, Brisbane, Singapore, Toronto.
RUSSET, B. M. (1964), Inequality and instability, World Politics 21, 422-454.
RUSSOLILLO, G. (2008), A proposal for handling categorical predictors in PLS regression
framework. in First joint meeting of the Société Francophone de Classification and the
Classification and Data Analysis Group of the Italian Statistical Society. Book of short papers,
Edizioni Scientifiche Italiane, 401.
SAISANA, M., TARANTOLA, S. (2002), State-of-the-art Report on Current Methodologies and
Practices for Composite Indicator Development, EUR 20408 EN, European Commission-JRC:
Italy.
TENENHAUS, M. (1998), La Régression PLS: théorie et pratique, Technip, Paris.
TENENHAUS, M., ESPOSITO VINZI V., CHATELIN, Y.M., LAURO C. (2005), PLS path
modeling, Computational Statistics and Data Analysis, 48, 159-205.
TENENHAUS, M., MAUGER, E., GUINOT, C. (2009), Use of ULS-SEM and PLS-SEM to measure
a group effect in a regression model relating two blocks of binary variables, in Handbook of
Partial Least Squares - Concepts, Methods and Applications, V. Esposito Vinzi et al. (eds),
Springer, Berlin, Heidelberg, New York.
TRINCHERA, L. (2007), Unobserved Heterogeneity in Structural Equation Models: a new approach
in latent class detection in PLS Path Modeling, PhD thesis, DMS, University of Naples.
WOLD, H. (1982), Soft modeling: the basic design and some extensions, in Systems under Indirect
Observation, Part 2, Jöreskog K.G., Wold H. (eds). North-Holland, Amsterdam, 1-54.

THE USE OF CATEGORICAL VARIABLES FOR BUILDING SYSTEMS OF
COMPOSITE INDICATORS VIA PLS PATH MODELING
Summary
In recent decades there has been a growing need to measure very complex
phenomena such as poverty, progress and well-being. As is well known, the main
feature of a composite indicator is that it can summarize complex and multidimensional
phenomena in a single index. Structural Equation Models can prove to be a useful
tool for building systems of composite indicators. In this work we propose using
PLS Path Modeling (PLS-PM) to analyze systems of composite indicators, because of the
key role that the concept of latent variable (i.e. composite indicator) plays in the
estimation process of this approach. Within Structural Equation Models, a categorical
variable can play an active or a moderating role. A moderating categorical variable is a
variable that does not play a direct role in the construction of the system of composite
indicators, but affects, in terms of strength and/or direction, the relations among
the composite indicators. An active categorical variable, instead, directly participates
in the construction of the composite indicator. In other words, it acts as a simple
indicator in the construction of the latent construct representing the composite indicator.
This work discusses the most recent contributions to the treatment of
moderating and/or active categorical variables in PLS-PM. Moreover, a new approach is
proposed for treating categorical indicators as active categorical variables. This new
approach provides a quantification of the categorical indicators such that the weight of
each quantified indicator is coherent with the explanatory power of the corresponding
categorical indicator.
