Laura Trinchera
Dipartimento di Studi sullo Sviluppo Economico,
Università degli Studi di Macerata,
Piazza Oberdan, 3 – Macerata (Italy)
Giorgio Russolillo, Carlo N. Lauro
Dipartimento di Matematica e Statistica,
Università degli Studi di Napoli “Federico II”, Via Cintia
– Complesso Monte S. Angelo – Napoli (Italy)
Nowadays there is a pre-eminent need to measure very complex phenomena like
poverty, progress, well-being, etc. As is well known, the main feature of a composite
indicator is that it summarizes complex and multidimensional issues. Thanks to its features,
Structural Equation Modeling seems to be a useful tool for building systems of composite
indicators. Among the several methods that have been developed to estimate Structural
Equation Models we focus on the PLS Path Modeling approach (PLS-PM), because of the
key role that estimation of the latent variables (i.e. the composite indicators) plays in the
estimation process. In this paper we provide a suite of statistical methodologies for
handling categorical indicators with respect to the role they have in a system of composite
indicators. A categorical variable can play an active or a moderating role. An active
categorical variable directly participates in the construction of the model. In other words,
it is a categorical indicator impacting on a composite indicator jointly with other manifest
variables. A moderating categorical variable, instead, is a variable that does not play a
direct role in the construction of the system of composite indicators but affects the relations
among them in terms of strength and/or direction. In this work we investigate both the use
of categorical variables as indicators and as moderating variables. In particular, we
propose a new approach in order to take into account active categorical indicators.
This new approach provides a quantification of the categorical indicators in such a
way that the weight of each quantified indicator is coherent with the explanatory power of
the corresponding categorical indicator.
1 This paper was financially supported by the MURST grant “Multivariate statistical models for
the ex-ante and the ex-post analysis of regulatory impact”, coordinated by C.N. Lauro (2006).
Keywords: Structural Equation Models, Systems of Composite Indicators, Categorical
Nowadays there is a pre-eminent need to measure very complex phenomena
like poverty, progress, well-being, etc. Since they are complex and latent con-
cepts, it is not possible to measure them directly and they must be summarized by
several indicators. The challenge of constructing a global measure of well-being
or progress by using composite indicators is a much discussed theme. In particular,
two aspects have been investigated in the literature: i) the identification of the
key indicators to be used; ii) the ways in which these indicators can be brought together
to make a coherent system of information. How to choose these indicators
is the task of psychologists, sociologists and economists, while it is the role of
statisticians to provide operational tools to aggregate indicators in order to build
composite indicators.
Structural Equation Models (SEMs) [Bollen, 1989; Kaplan, 2000], and specifically
the PLS approach to SEMs (PLS Path Modeling, PLS-PM) [Wold, 1982;
Tenenhaus et al., 2005], can be used to compute systems of composite indicators
(see section 2). In this paper we provide a suite of statistical methodologies for
handling categorical indicators with respect to the role they have in a system of
composite indicators (see section 3). In particular, sub-section 3.1 deals with the
treatment of moderating categorical variables, while in sub-section 3.2 a new way
of handling categorical indicators in PLS-PM is proposed. To conclude, an application
involving data taken from a paper by Russet [1964] will be presented.
One of the most important advantages in using SEMs is that they provide
two kinds of weights: one measuring the impact of each indicator on the corre-
sponding composite indicator, the other measuring relations among the composite
indicators in the system. These two levels of weights help us to understand
the importance of each indicator in building composite indicators, as well as
the main drivers in computing complex indicators. In other words, using
SEMs to build complex indicators leads to the construction of a system of
weights and relations that allows us to understand the different aspects composing
the complex indicators.
Several methods have been developed to estimate SEM parameters, among
them the PLS Path Modeling approach (PLS-PM) [Wold, 1982]. PLS-PM is a so-called
component-based estimation method, because of the key role that is played
by the estimation of the latent variables (LVs) in the model. The main aim of component-based
methods, in fact, is to provide an estimate ($\hat{\xi}$) of the LVs in such a way that the
estimates of the LVs are the most correlated with one another (according to the
path diagram structure) and the most representative of each corresponding block
of manifest variables (MVs).
2.1 A brief review of the PLS-PM algorithm
PLS Path Modeling aims to estimate the relationships among Q blocks of MVs
(or indicators), which are an expression of unobservable constructs. Specifically,
PLS-PM estimates, through a system of interdependent equations based on simple
and multiple regressions, the network of relations among the MVs and their
corresponding LV, and among the LVs inside the model. Formally, let us assume
that P quantitative indicators are observed on N units (n = 1, ..., N). The resulting
data $x_{npq}$ are collected in a partitioned table of centered data $X$:
$$X = [X_1, \dots, X_q, \dots, X_Q],$$
where $x_{pq}$ ($p = 1, \dots, P_q$, with $\sum_{q=1}^{Q} P_q = P$) is a generic indicator belonging to the
q-th block $X_q$. In SEMs we assume that there is a latent variable ($\xi_q$) associated
with each block.
In PLS-PM an iterative procedure allows us to estimate the LV scores, the
weights $w_{pq}$ to be associated with each MV in order to obtain the LV scores, and the
so-called path coefficients linking the LVs. The estimation of the SEM parameters
and of the LV scores is obtained through the alternate iteration of the outer and
the inner estimations, until convergence. Without loss of generality, the indicators
are assumed to be centered, i.e. to have mean equal to zero. The procedure starts
by choosing arbitrary weights $w_{pq}$. Then, in the outer estimation, each LV is
estimated as a linear combination of its own MVs:
$$\nu_q \propto \sum_{p=1}^{P_q} w_{pq} x_{pq} = X_q w_q \qquad (1)$$
where $\nu_q$ is the standardized outer estimation of the q-th LV $\xi_q$ and the symbol $\propto$
means that the left side of the equation corresponds to the standardized right side.
In the inner estimation, each LV is estimated by considering its links with the
other $Q'$ adjacent LVs:
$$\vartheta_q \propto \sum_{q'=1}^{Q'} e_{qq'} \nu_{q'} \qquad (2)$$
where $\vartheta_q$ is the standardized inner estimation of the q-th LV $\xi_q$ and the inner
weights ($e_{qq'}$) are equal (in a centroid scheme) to the signs of the correlations
between the q-th LV $\nu_q$ and the $\nu_{q'}$s connected with $\nu_q$. Inner weights can be
obtained following schemes other than the centroid one. For a review refer
to Tenenhaus et al. [2005].
Once a first inner estimation of the LVs is obtained, the algorithm continues
by updating the weights $w_{pq}$. Two different ways are available to update the outer
weights, usually related to the two different kinds of measurement model. In a
reflective scheme, each outer weight $w_{pq}$ is the regression coefficient in the simple
regression of the p-th indicator of the q-th block ($x_{pq}$) on the inner estimate of the
q-th LV $\vartheta_q$. Since the inner estimate of the q-th latent variable $\vartheta_q$ is standardized,
the generic outer weight $w_{pq}$ is obtained as the covariance between each indicator
and the corresponding inner estimate of the LV:
$$w_{pq} = \mathrm{cov}(x_{pq}, \vartheta_q) \qquad (3)$$
In the formative scheme, the vector $w_q$ of the weights $w_{pq}$ associated with the indicators
of the q-th block is the regression coefficient vector in the multiple regression
of the inner estimate of the q-th LV $\vartheta_q$ on its centered indicators $X_q$:
$$w_q = (X_q^{\top} X_q)^{-1} X_q^{\top} \vartheta_q \qquad (4)$$
The algorithm is iterated until convergence. After convergence, structural (or path)
coefficients are estimated through simple/multiple OLS regressions among the
estimated LV scores.
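The iteration described above (outer estimation, centroid inner estimation, reflective weight update, final OLS regression) can be illustrated with a minimal Python sketch. It is not a reference implementation: the function name `pls_pm_mode_a`, the stopping rule on the change in the weights, the starting weights equal to one, and the synthetic two-block data are all assumptions made for illustration, and every LV is assumed to be linked to at least one other LV.

```python
import numpy as np

def standardize(v):
    """Center a vector and scale it to unit variance."""
    v = v - v.mean()
    return v / v.std()

def pls_pm_mode_a(blocks, adjacency, tol=1e-8, max_iter=500):
    """Hypothetical helper: reflective PLS-PM with the centroid scheme.

    blocks    : list of (N x P_q) arrays of centered indicators
    adjacency : (Q x Q) symmetric 0/1 array, 1 where two LVs are linked
    """
    n = blocks[0].shape[0]
    weights = [np.ones(X.shape[1]) for X in blocks]  # arbitrary starting weights
    for _ in range(max_iter):
        # Outer estimation: standardized weighted sum of the MVs of each block
        nu = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
        # Centroid inner weights: signs of the correlations between linked LVs
        e = np.sign(np.corrcoef(nu, rowvar=False)) * adjacency
        # Inner estimation: standardized signed sum of the adjacent LVs
        theta = np.column_stack([standardize(nu @ e[q]) for q in range(len(blocks))])
        # Reflective weight update: covariance of each MV with the inner estimate
        new_weights = [X.T @ theta[:, q] / n for q, X in enumerate(blocks)]
        change = max(np.abs(nw - w).max() for nw, w in zip(new_weights, weights))
        weights = new_weights
        if change < tol:  # assumed stopping rule
            break
    scores = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
    return weights, scores

# Synthetic data (illustrative only): two LVs, three and two indicators
rng = np.random.default_rng(0)
n = 200
xi1 = rng.normal(size=n)
xi2 = 0.8 * xi1 + 0.6 * rng.normal(size=n)
X1 = np.column_stack([xi1 + 0.5 * rng.normal(size=n) for _ in range(3)])
X2 = np.column_stack([xi2 + 0.5 * rng.normal(size=n) for _ in range(2)])
blocks = [X - X.mean(axis=0) for X in (X1, X2)]
adjacency = np.array([[0, 1], [1, 0]])

weights, scores = pls_pm_mode_a(blocks, adjacency)
# Path coefficient: simple OLS regression of the LV 2 scores on the LV 1 scores
path_12 = np.linalg.lstsq(scores[:, [0]], scores[:, 1], rcond=None)[0][0]
print(weights, path_12)
```

Because the scores are standardized, the single path coefficient in this two-LV example coincides with the correlation between the two estimated LV scores.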
erating LV score, and the product term LV scores is then performed. A scheme of
the procedure proposed by Henseler et al. [to appear] is shown in figure 4.
Groups of units showing different behaviors are not always known a priori.
In other words, heterogeneity in the models can very rarely be captured by a well-known
manifest variable playing the role of moderating categorical variable. In
this case, it is interesting to look for a latent moderating categorical variable which
splits units into classes showing different behaviors. This approach ensures a double
result: it checks whether there is actually heterogeneity in the data (i.e. whether the units
are not all well described by a single model), and it provides classes of units that are
homogeneous with respect to the model parameters.
Fig. 4: Henseler and Fassot procedure to model an interaction effect in a simple SEM.
Methods to detect latent moderating categorical variables are becoming more and
more popular in statistics. In modeling the real world, it is reasonable to expect
that different classes showing heterogeneous behaviors may exist in the observed
set of units. This is true also in composite indicator frameworks. As a matter of
fact, in developing a system of composite indicators, it is reasonable to suppose
that different models, i.e. different systems of weights, should be applied in order
to take into account differences among units. Furthermore, in these frameworks
also, it is of great importance to obtain clusters of units that are homogeneous with
regard to the weights to be applied in computing the composite indicators. Several
clustering techniques have been developed in SEM and in PLS-PM to look for
latent classes. Among them, some techniques allow us to obtain latent classes by
taking into account the causal structure of the model: the so-called response-based
clustering techniques [Trinchera, 2007]. When information concerning the causal
relationships among variables is available (as it is in the theoretical causal network
of relationships defining an SEM), classes should be looked for while taking into
account this relevant piece of information. That is why response-based methods
have to be preferred to classic clustering techniques, such as cluster analysis. As
a matter of fact, in response-based clustering methods, the obtained classes are
homogeneous with respect to the postulated model, i.e. with respect to the weights
used to compute composite and complex indicators. This approach to clustering
is in contrast to the traditional clustering approaches, where classes are defined
according to information which is not related to the existing model but depends
on external criteria.
Response-based clustering techniques allow us to obtain local models, i.e.
class-specific models. Each local model is characterized by class-specific param-
eters. In other words, these methods assume that in the observed data-set several
groups of units exist, each characterized by a different model. Two main methods
exist to obtain response-based clusters in PLS-PM: Finite Mixture PLS (FIMIX-PLS),
proposed by Hahn et al. [2002], and REsponse Based Unit Segmentation in PLS
Path Model (REBUS-PLS) [Trinchera, 2007; Esposito Vinzi et al., 2008].
FIMIX-PLS is an extension of the Finite Mixture Model [McLachlan et al.,
2000] to PLS-PM. Being based on the EM algorithm, it requires a normal distribution,
at least for the LV scores. This is not in line with the features of PLS-PM,
which is a distribution-free technique.
In order to overcome the main limits of FIMIX-PLS a new method for la-
tent class detection in PLS-PM has recently been developed: the REBUS-PLS
[Trinchera, 2007; Esposito Vinzi et al., 2008]. Unlike FIMIX-PLS and accord-
ing to PLS-PM features, REBUS-PLS does not require distributional hypotheses.
Moreover, REBUS-PLS has been developed so as to detect heterogeneity both in
the structural and the measurement models. The idea is that if latent classes ex-
ist, units belonging to the same latent class will have similar local models, i.e. a
similar performance with regard to the global model. Moreover, if a unit is cor-
rectly assigned to a latent class, its performance in the local model computed for
that class will be better than the performance obtained by the same unit in all the
other local models. For these reasons the units are assigned to the latent classes
according to a closeness measure (CM), that takes into account the residuals of a
unit with respect to each local model. The chosen CM is defined in order to obtain
local models that are better fitted than the global model for both the measurement
and the structural models. For more details refer to the original REBUS-PLS
papers [Trinchera, 2007; Esposito Vinzi et al., 2008].
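The general logic of response-based assignment (fit local models, then move each unit to the class whose local model fits it best) can be illustrated with a deliberately simplified Python sketch. It is not REBUS-PLS: the closeness measure is replaced by the squared residual of a single regression through the origin, and the function name `reassign_units`, the starting partition and the synthetic data are assumptions introduced only for illustration.

```python
import numpy as np

def fit_slope(x, y):
    """OLS slope of y on x for a regression through the origin."""
    return float(x @ y / (x @ x))

def reassign_units(x, y, labels, n_classes=2, max_iter=50):
    """Hypothetical helper: alternate local fits and unit reassignment."""
    for _ in range(max_iter):
        # One local (class-specific) model per class
        slopes = np.array([fit_slope(x[labels == k], y[labels == k])
                           for k in range(n_classes)])
        # Squared residual of every unit with respect to every local model
        # (a stand-in for a closeness measure, NOT the REBUS-PLS CM itself)
        residuals = (y[:, None] - x[:, None] * slopes[None, :]) ** 2
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # stable partition reached
        labels = new_labels
    return labels, slopes

# Synthetic data (illustrative only): two classes with opposite relations
rng = np.random.default_rng(1)
n = 300
true_class = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
y = np.where(true_class == 0, 1.0, -1.0) * x + 0.3 * rng.normal(size=n)

# Imperfect starting partition: about 20% of the true labels are flipped
flip = rng.random(n) < 0.2
start = np.where(flip, 1 - true_class, true_class)

labels, slopes = reassign_units(x, y, start)
agreement = (labels == true_class).mean()
print(slopes, agreement)
```

A single global regression on these data would have a slope close to zero and would describe neither class, which is the situation that class-specific local models are meant to address.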
3.2 Using categorical indicators as manifest variables in a PLS Path Model
Until now, composite indicators have been obtained only as a mathematical combination
of single (quantitative) indicators [Saisana et al., 2002]. However, it is also
appealing to take categorical indicators into account when building composite indicators.
For instance, when computing complex and composite indicators, it
could be interesting to consider demographic variables, such as religion or gender,
and/or categorical variables defining states, such as type of government. It is for
this reason that here we propose a modified version of the PLS-PM algorithm able
$$\hat{x}_{pq} \propto \tilde{X}_{pq} (\tilde{X}_{pq}^{\top} \tilde{X}_{pq})^{-1} \tilde{X}_{pq}^{\top} \vartheta_q \qquad (5)$$
In the second step, outer weights are updated following the reflective scheme
(equation 3), and in the third step the outer estimations of the LVs are calculated
according to equation 1. The iterative cycle continues with the update of the
inner weights (step 4) and it is closed by the inner estimation of each LV according
to equation 2 (step 5). Once new outer estimates are computed, the cycle restarts
with the quantification step and it is iterated until convergence between the
inner and the outer estimations is reached. After convergence, structural (or path)
coefficients are estimated in the same way as in the classic PLS-PM algorithm.
A schematic description of the algorithm is given below.
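As a minimal Python sketch of the five-step cycle described above (and not the authors' own schematic), the code below quantifies each categorical indicator by projecting the inner estimate onto its dummy matrix, then applies the reflective weight update, the outer estimation, the centroid inner weights and the inner estimation. The function names `quantify` and `pls_pm_quantified`, the starting inner estimates (standardized sums of the numeric MVs), the stopping rule on the change in the inner estimates, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def standardize(v):
    """Center a vector and scale it to unit variance."""
    v = v - v.mean()
    return v / v.std()

def quantify(categories, theta):
    """Step 1: project theta onto the dummy matrix of a categorical indicator.

    Each category receives the mean of theta within it; the result is standardized.
    """
    levels = np.unique(categories)
    dummies = (categories[:, None] == levels[None, :]).astype(float)
    coef, *_ = np.linalg.lstsq(dummies, theta, rcond=None)
    return standardize(dummies @ coef)

def pls_pm_quantified(numeric, categorical, adjacency, tol=1e-8, max_iter=500):
    """Hypothetical helper. numeric: list of (N x P_q) centered arrays;
    categorical: list (one entry per block) of lists of label arrays."""
    n, Q = numeric[0].shape[0], len(numeric)
    # Assumed arbitrary start: standardized sum of the numeric MVs of each block
    theta = np.column_stack([standardize(X.sum(axis=1)) for X in numeric])
    for _ in range(max_iter):
        # Step 1: quantify the categorical indicators and complete each block
        blocks = [np.column_stack([X] + [quantify(c, theta[:, q]) for c in categorical[q]])
                  for q, X in enumerate(numeric)]
        # Step 2: reflective outer weights, cov(MV, inner estimate)
        weights = [X.T @ theta[:, q] / n for q, X in enumerate(blocks)]
        # Step 3: outer estimation of each LV
        nu = np.column_stack([standardize(X @ w) for X, w in zip(blocks, weights)])
        # Step 4: centroid inner weights
        e = np.sign(np.corrcoef(nu, rowvar=False)) * adjacency
        # Step 5: inner estimation of each LV
        new_theta = np.column_stack([standardize(nu @ e[q]) for q in range(Q)])
        change = np.abs(new_theta - theta).max()
        theta = new_theta
        if change < tol:  # assumed stopping rule
            break
    return weights, blocks, nu

# Synthetic data (illustrative only)
rng = np.random.default_rng(2)
n = 300
xi1 = rng.normal(size=n)
xi2 = 0.8 * xi1 + 0.6 * rng.normal(size=n)
X1 = np.column_stack([xi1 + 0.5 * rng.normal(size=n) for _ in range(3)])
X2 = np.column_stack([xi2 + 0.5 * rng.normal(size=n) for _ in range(2)])
cuts = np.digitize(xi2, [-0.5, 0.5])        # 0 = low, 1 = central, 2 = high
labels = np.array(["b", "c", "a"])[cuts]    # label order deliberately not monotone
numeric = [X - X.mean(axis=0) for X in (X1, X2)]
categorical = [[], [labels]]
adjacency = np.array([[0, 1], [1, 0]])

weights, blocks, nu = pls_pm_quantified(numeric, categorical, adjacency)
print(weights[1])  # the last entry is the weight of the quantified indicator
```

In this sketch the quantification does not depend on how the categories are labelled: the scale values are driven only by the inner estimate, so the arbitrary label order "b", "c", "a" has no influence on the result.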
This procedure yields as output both scaling and model parameters. It ensures
that quantified indicators show suitable properties in terms of optimality
and interpretability. The scaling parameters maximize the correlation between
the quantified indicator and the inner estimate of its corresponding LV, and as a
consequence its weight in the construction of the LV in a reflective scheme (see
equation 3). Moreover, the weight of each quantified indicator can be expressed
also in terms of that part of the variability of $\vartheta_q$ explained by the modalities of
$\tilde{x}_{pq}$. In particular, it is possible to show the following equivalence:
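A short derivation consistent with this statement can be sketched as follows. It is offered under stated assumptions and is not a transcription of the authors' own equation: block and indicator subscripts are dropped, $\tilde{X}$ denotes the N x K dummy matrix of the categorical indicator, $\vartheta$ is standardized (mean 0, variance 1), $n_k$ is the number of units in category k, $\bar{\vartheta}_k$ is the mean of $\vartheta$ in category k, and $\eta$ is the correlation ratio.

```latex
% Projector onto the space spanned by the dummy variables, and the
% standardized quantified indicator (its mean is zero because P1 = 1).
\[
P = \tilde{X}(\tilde{X}^{\top}\tilde{X})^{-1}\tilde{X}^{\top}, \qquad
\hat{x} = \frac{P\vartheta}{\sqrt{\vartheta^{\top}P\vartheta / N}}
\]
% Reflective weight: covariance between the quantified indicator and \vartheta.
\[
w = \operatorname{cov}(\hat{x}, \vartheta)
  = \frac{1}{N}\,\frac{\vartheta^{\top}P\vartheta}{\sqrt{\vartheta^{\top}P\vartheta / N}}
  = \sqrt{\frac{\vartheta^{\top}P\vartheta}{N}}
  = \sqrt{\frac{\sum_{k=1}^{K} n_k \bar{\vartheta}_k^{2}}{\sum_{n=1}^{N} \vartheta_n^{2}}}
  = \eta_{\vartheta \mid \tilde{x}}
\]
```

Under these assumptions the weight equals the square root of the share of the variance of $\vartheta$ that lies between the categories, which is one way of reading the statement that the weight reflects the variability explained by the modalities.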
The data for this example are taken from a paper by Russet [1964]. The
basic hypothesis in Russet’s paper is that economic inequality leads to political
instability. In particular, in the Russet model political instability is a function of
inequality of land distribution and of industrial development. Three indicators
are used to measure inequality of land distribution. The indicator “gini” is Gini’s
index of concentration, which measures the deviation of the Lorenz curve from the
line of equality. The indicator “farm” is the percentage of farmers that own half of
the land, starting with the smallest farms. Thus if “farm” is 90%, then 10% of the
farmers own half of the land. The third indicator is “rent”, which is the percentage
of farm households that rent all their land. Two indicators are used to measure
industrial development: the indicator “gnpr” is the gross national product per capita
(in U.S. dollars) in 1955, and the indicator “labo” is the percentage of the labor force
employed in agriculture. Political instability is measured by four indicators. The
indicator “inst” is a function of the number of the chiefs of the executive and of
the number of years of independence of the country during the period 1946-1961.
This index ranges between 0 (very stable) and 17 (very unstable). The indicator
“ecks” is Eckstein’s index, which measures the number of violent internal war
incidents during the same period. The indicator “death” is the number of people
killed as a result of violent manifestations during the period 1950-1962. The
indicator “demo” classifies countries into three groups: stable democracy, unstable
democracy and dictatorship.
This data-set was already analyzed in Gifi [1990] and in Tenenhaus [1998].
In particular, Tenenhaus [1998] modeled the Russet data-set in a PLS-PM framework
(see figure 5). He partitioned the Russet data-set into three reflective blocks.
The first block, consisting of the indicators “gini”, “farm” and “rent”, measures
the composite indicator “Agricultural Inequality”. The second one, formed by the
indicators “gnpr” and “labo”, measures the composite indicator “Industrial Development”.
The third block, composed of the indicators “inst”, “ecks”, “death”
and “demo”, expresses the composite indicator “Political Instability”. Tenenhaus
performed a PLS-PM analysis on the model defined in figure 5 by using the centroid
option for inner weight estimation and handling all the blocks as reflective.
Moreover, since Gifi analysis suggested a high degree of non-linearity of data,
Tenenhaus approximated Gifi results by means of monotone functional transfor-
Tenenhaus results are represented in figure 6. It is possible to investigate the
Fig. 5: Russet
Fig. 6: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: model parameter
Tab. 1: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: outer model results.
LV   MV        Outer weight   Stand. loading   Communality   Redundancy
ξ1   gini          0.460          0.977           0.955
     farm          0.516          0.986           0.972
     l_rent        0.081          0.516           0.266
ξ2   l_gnpr        0.511          0.950           0.903
     l_labo       -0.538         -0.955           0.912
ξ3   e_inst        0.104          0.352           0.124         0.077
     l_ecks        0.270          0.816           0.665         0.414
     l_death       0.302          0.794           0.630         0.392
     d-stb        -0.336         -0.866           0.749         0.466
     d-inst        0.037          0.094           0.009         0.006
     dict          0.285          0.733           0.537         0.334
table 1). As a matter of fact, the weight of the binary variable “d-inst” is so small
precisely because there is a strong relation between the categorical indicator “demo”
and the composite indicator (CI) Political Instability. In fact, the category “d-stb” is mainly associated
with observations sharing the lowest values of $\xi_3$, while the category “dict” is
mainly associated with observations sharing the highest values of the CI and the
category “d-inst” is mainly associated with observations sharing the central values
of the political instability score distribution. Hence, there is a strong relation between
$\xi_3$ and all of the binary variables representing the categories of the indicator “demo”.
Unfortunately, while the relations between the binary variables “dict” and “d-stb” and $\xi_3$
are nearly monotone (and so they can be easily approximated by a linear function),
the binary variable “d-inst” is linked to $\xi_3$ by a non-monotonic relation (see figure
7). As a consequence, the importance of this variable is underestimated in the model.
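The effect described here can be reproduced on synthetic data. The Python sketch below does not use the Russet data: it assumes an idealized case in which three categories are cut directly from a normally distributed score, and it compares the linear correlation of each 0/1 dummy with the score to the correlation ratio of the categorical variable as a whole. All names and cut points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
score = rng.normal(size=n)                   # synthetic stand-in for a CI score
category = np.digitize(score, [-0.5, 0.5])   # 0 = low, 1 = central, 2 = high

# Linear correlation of each binary (dummy) variable with the score
dummy_corr = [np.corrcoef((category == k).astype(float), score)[0, 1]
              for k in range(3)]

# Correlation ratio: share of the score variance explained by the three
# categories taken together, on the correlation scale
group_means = np.array([score[category == k].mean() for k in range(3)])
shares = np.array([(category == k).mean() for k in range(3)])
between = np.sum(shares * (group_means - score.mean()) ** 2)
eta = np.sqrt(between / score.var())

print(dummy_corr, eta)
```

In this idealized setting the dummies of the two extreme categories are strongly correlated with the score, the dummy of the central category is almost uncorrelated with it, and the categorical variable as a whole is nevertheless highly informative about the score.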
[Figure panels: three plots, each with the axis label “Binary values”]
Fig. 8: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): model
parameter estimates.
Tab. 2: PLS-PM analysis of Russet data as performed by Tenenhaus [1998]: model assessment.
LV    R2      Mean Comm.   Mean Red.
ξ1            0.731
ξ2            0.907
ξ3    0.622   0.452        0.282
Tab. 3: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): model
assessment.
LV    R2      Mean Comm.   Mean Red.
ξ1            0.737
ξ2            0.908
ξ3    0.589   0.572        0.337
Tab. 4: PLS-PM analysis of Russet data (the indicator “demo” is properly quantified): outer
model results.
LV   MV        Outer weight   Stand. loading   Communality   Redundancy
ξ1   gini          0.455          0.973           0.947
     farm          0.502          0.984           0.968
     l_rent        0.117          0.543           0.294
ξ2   l_gnpr        0.514          0.951           0.904
     l_labo       -0.536         -0.955           0.911
ξ3   e_inst        0.127          0.375           0.140         0.083
     l_ecks        0.329          0.853           0.728         0.429
     l_death       0.370          0.826           0.682         0.402
     demo          0.427          0.859           0.739         0.435
With respect to the previous one, this model shows a worse predictive capability
for the latent response (the R2 of Political Instability falls from 0.622 to 0.589),
while it gains in the explanatory capability of the indicators underlying the concept
of Political Instability (the mean communality of the block rises from 0.452 to 0.572).
The mean communalities of the other two blocks remain about the same. However, the
global model fit improves, as the GoF passes from 0.617 to 0.643.
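The two GoF values can be approximately recomputed from the rounded figures in the tables. The formula is not stated in this section, so the Python sketch below assumes the usual definition from Tenenhaus et al. [2005]: the square root of the product of the average communality (averaged over all MVs, i.e. weighted by block size) and the average R2 of the endogenous LVs. The function name `gof` is introduced here for illustration only.

```python
import math

def gof(block_sizes, mean_communalities, r_squared):
    """GoF as sqrt(average communality x average R2).

    The average communality is weighted by the number of MVs in each block;
    r_squared lists the R2 values of the endogenous LVs only.
    """
    total_mvs = sum(block_sizes)
    avg_comm = sum(p * c for p, c in zip(block_sizes, mean_communalities)) / total_mvs
    avg_r2 = sum(r_squared) / len(r_squared)
    return math.sqrt(avg_comm * avg_r2)

# Model with "demo" coded as three binary variables (Tab. 1 and Tab. 2)
gof_dummy = gof([3, 2, 6], [0.731, 0.907, 0.452], [0.622])
# Model with "demo" quantified (Tab. 3 and Tab. 4)
gof_quantified = gof([3, 2, 4], [0.737, 0.908, 0.572], [0.589])

print(gof_dummy, gof_quantified)
```

With the rounded table entries as inputs, both results fall within rounding error of the reported values. The improvement comes from the higher average communality, which more than offsets the lower R2.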
This analysis makes it clear that the indicator “demo” is the most important
in the construction of the CI Political Instability (see table 4). According to these
results we can conclude that the categories of the indicator “demo” are highly
discriminating with respect to the Political Instability scores. In fact, the weight of
an MV quantified at a nominal scaling level reflects the variability of the corresponding
CI explained by the categories of the original categorical indicator (see
equation 6).
SEMs, and mainly PLS Path Models, are very useful tools to compute com-
posite and complex indicators. However, such models take into account only
quantitative indicators. Until now, composite indicators have been obtained only
as a mathematical combination of (quantitative) indicators [Saisana et al., 2002].
Nevertheless, it is also appealing to consider categorical indicators when building
composite indicators. For instance, when computing complex and composite
indicators, it could be interesting to take into account demographic variables, such
as religion or gender, and/or categorical variables defining states, such as type of
government. All these variables can play different roles: they can play the role of
a manifest moderating categorical variable (such as the variable gender in com-
puting the GDI index); they can define latent classes of units showing different
systems of weights; or they can be used as categorical indicators in computing the
composite indicators. In this work we have reviewed a suite of statistical method-
ologies for handling categorical indicators with respect to the role they play in a
system of composite indicators. In particular, a new algorithm to handle categori-
cal indicators in PLS-PM has been proposed.
As far as future developments are concerned, the identification of a compromise
model needs to be further investigated. In particular, a two-step strategy
(combining the REBUS-PLS algorithm and the proposed version of the PLS-PM algorithm)
can be implemented in order to obtain a compromise model that accounts for
the unobserved heterogeneity detected in PLS-PM. Moreover, in composite indicator
frameworks it is of great importance to include a priori information on the
weights defining composite indicators. Further research is needed in order to apply
constraints on the weights so as to take into account this information in PLS-PM.
