Article
6339/22-JDS1069
October 2023 Statistical Data Science
Abstract
Multiclass probability estimation is the problem of estimating conditional probabilities of a data
point belonging to a class given its covariate information. It has broad applications in statistical
analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has
been developed to estimate class probabilities through ensemble learning for K-class problems
(Wu et al., 2010; Wang et al., 2019), where K is the number of classes. The estimators are robust
and achieve high accuracy for probability estimation, but their learning is implemented through
pairwise coupling, which demands polynomial time in K. In this paper, we propose two new
learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve
wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline
learning has optimal computational complexity in the sense that it is linear in K. Though not
the most efficient in computation, the OVA is found to have the best estimation accuracy among
all the procedures under comparison. The resulting estimators are distribution-free and shown
to be consistent. We further conduct extensive numerical experiments to demonstrate their finite
sample performance.
Keywords linear time algorithm; multiclass classification; non-parametric; probability
estimation; scalability; support vector machines
1 Introduction
In machine learning and pattern recognition, the goal of multiclass classification is to assign
a data point to one of K classes based on its input, where K ⩾ 3. Specifically, we observe
the pair (X, Y) from an unknown distribution P(x, y), where X ∈ ℝ^p is the input vector and
Y ∈ A = {1, 2, . . . , K} indicates class membership, and the task is to learn a discriminant rule
φ : ℝ^p → A for making predictions on future data. Define the conditional class probabilities as
p_j(x) = P(Y = j | X = x) for j ∈ {1, . . . , K}. For any classifier φ, its prediction performance can
be evaluated by its risk R(φ) = E_{(X,Y)∼P}[I(Y ≠ φ(X))] = P(Y ≠ φ(X)), also called the expected
prediction error (EPE). The optimal rule minimizing the risk is the Bayes classifier, given by:

$$\phi_B(x) = \operatorname*{arg\,max}_{j \in \{1, \ldots, K\}} p_j(x).$$
For classification purposes, although it is sufficient to learn the argmax rule only, the estimation of class probabilities is usually desired, as they can quantify the uncertainty about the
© 2023 The Author(s). Published by the School of Statistics and the Center for Applied Statistics, Renmin
University of China. Open access article under the CC BY license.
Received June 3, 2022; Accepted September 25, 2022
We show that the baseline learning method enjoys the optimal complexity, with computational
time linear in the number of classes K. Section 6 discusses the issue of hyperparameter tuning.
Sections 7 and 8 are devoted to numerical studies, with extensive analyses and comparisons on simulated and
real-world data. Section 9 presents concluding remarks and discussions.
$$\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i f(x_i)\big) + \lambda J(f), \qquad (2)$$
where L(yf (x)) = [1 − yf (x)]+ = max{0, 1 − yf (x)} is the hinge loss, F is some functional
space, the regularization term J (f ) controls model complexity, and the tuning parameter λ > 0
balances the bias–variance tradeoff (Hastie et al., 2009). For linear SVMs, f(x) = β_0 + β^T x.
For kernel SVMs, f admits the kernel representation f(x) = d + Σ_{i=1}^n c_i K(x_i, x),
according to the representer theorem (Kimeldorf and Wahba, 1971), where K(·, ·) is a Mercer
kernel, F = H_K is the reproducing kernel Hilbert space (RKHS, Wahba, 1990) induced by K(·, ·),
and J(f) = ‖f‖²_{H_K} = Σ_{i=1}^n Σ_{j=1}^n c_i c_j K(x_i, x_j). Then the optimization problem (2) amounts to:
$$\min_{d, c_1, \ldots, c_n} \; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i f(x_i)\big) + \lambda \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j K(x_i, x_j), \quad \text{where } f(x) = d + \sum_{i=1}^{n} c_i K(x_i, x). \qquad (3)$$
It is known that the minimizer of the expected hinge loss E_{(X,Y)∼P(X,Y)}[1 − Yf(X)]_+ has the
same sign as the Bayes decision boundary sign[p_{+1}(x) − 1/2] (Lin, 2002); therefore, binary SVMs
directly target the optimal Bayes classifier without estimating p_{+1}(x).
Despite their success in many real-world classification problems, standard SVMs cannot
estimate the conditional class probabilities p_{+1}(x) and p_{−1}(x). To overcome this limitation,
binary weighted SVMs (wSVMs, Wang et al., 2008) were proposed to learn a series of classifiers by
minimizing a weighted hinge loss function and to construct probabilities by aggregating multiple
decision rules. In particular, by assigning the weight π (0 ⩽ π ⩽ 1) to data points from class
−1 and the weight 1 − π to those from class +1, the wSVM solves the following optimization
problem:
$$\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\Big[(1 - \pi) \sum_{y_i = 1} L\big(y_i f(x_i)\big) + \pi \sum_{y_i = -1} L\big(y_i f(x_i)\big)\Big] + \lambda J(f). \qquad (4)$$
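To make the role of the weights in (4) concrete, the following R sketch (an illustration only, not the implementation used in the paper; the function name and arguments are ours) evaluates the regularized weighted hinge objective for a kernel expansion f(x) = d + Σ_i c_i K(x_i, x), as in (3):

```r
# Illustrative sketch: regularized weighted hinge objective of (3)-(4).
# K_mat: n x n kernel matrix; y: labels coded +1/-1; pi_w: weight for class -1.
weighted_hinge_objective <- function(c_vec, d, K_mat, y, pi_w, lambda) {
  f <- d + as.vector(K_mat %*% c_vec)               # f(x_i) = d + sum_j c_j K(x_j, x_i)
  hinge <- pmax(0, 1 - y * f)                       # hinge loss [1 - y f(x)]_+
  w <- ifelse(y == -1, pi_w, 1 - pi_w)              # pi for class -1, 1 - pi for class +1
  mean(w * hinge) + lambda * sum(c_vec * (K_mat %*% c_vec))   # + lambda * ||f||^2_HK
}
```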
For any π, the minimizer of the expected weighted hinge loss E_{(X,Y)∼P(X,Y)}{W(Y)L[Yf(X)]} has
the same sign as sign[p_{+1}(X) − π] (Wang et al., 2008). To estimate the class probabilities, we
first obtain multiple classifiers f̂_{π_2}, . . . , f̂_{π_M} by solving (4) with a series of values 0 = π_1 < π_2 <
· · · < π_M < π_{M+1} = 1. Given any point x, the value f̂_π(x) is non-increasing in π, so there exists
a unique m* ∈ {1, . . . , M} such that π_{m*} < p_{+1}(x) < π_{m*+1}. Consequently, we can construct a
consistent probability estimator as p̂_{+1}(x) = ½(π_{m*} + π_{m*+1}). The numerical precision of
p̂_{+1}(x) is determined by M, which controls the fineness of the grid of π values.
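As a simple illustration of this construction (a sketch with hypothetical inputs: a vector of fitted decision values f̂_{π_m}(x) over the weight grid), the estimate is the midpoint of the interval where the sign of f̂_π(x) flips:

```r
# Sketch: recover p_hat_{+1}(x) from the signs of f_hat_{pi_m}(x) over the pi grid.
estimate_prob_from_grid <- function(decision_values, pi_grid) {
  # decision_values: f_hat_{pi_m}(x) for m = 1, ..., M+1 (non-increasing in pi;
  # by convention positive at pi_1 = 0 and negative at pi_{M+1} = 1)
  m_star <- max(which(decision_values > 0))   # last weight with a positive sign
  0.5 * (pi_grid[m_star] + pi_grid[m_star + 1])
}

pi_grid <- seq(0, 1, by = 0.2)
estimate_prob_from_grid(c(2.1, 1.3, 0.4, -0.2, -1.5, -3.0), pi_grid)
# signs (+, +, +, -, -, -) bracket p_{+1}(x) between 0.4 and 0.6, giving 0.5
```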
Class prediction is done by either the argmax rule, i.e., ŷ = arg max_{1⩽j⩽K} p̂_j(x), or the max
voting algorithm (Kallas et al., 2012; Tomar and Agarwal, 2015; Ding et al., 2019). Wang et al.
(2019) showed that the wSVMs can outperform benchmark methods such as kernel multi-category
logistic regression (Zhu and Hastie, 2005), random forest, and classification trees.
j = 1, . . . , K. Assuming q̂_{k*|(k*,k*)} = 1, the final class probability estimators are then given by:

$$\hat{p}^{\,bs}_{j}(x) = \frac{\hat{q}_{j|(j,k^*)}(x)\big/\hat{q}_{k^*|(j,k^*)}(x)}{\sum_{l=1}^{K} \hat{q}_{l|(l,k^*)}(x)\big/\hat{q}_{k^*|(l,k^*)}(x)}, \qquad j = 1, \ldots, K. \qquad (6)$$
The choice of k* is critical for optimizing the finite-sample performance of the baseline learning. To ensure
numerical stability, we need to choose k* such that the denominators q̂_{k*|(j,k*)} of the ratios in (6)
are kept away from zero simultaneously for all classes j with j ≠ k*. Towards this, we propose
the following two methods of choosing k* and then evaluate their finite sample performance in
numerical studies.
Choice of k* (Method 2) Alternatively, we propose choosing k* to be a class that is not too far
from any other class, in order to prevent q̂_{k*|(j,k*)} in (6) from being too close to zero for any
j ≠ k*. Towards this, we look for the class that sits at the “middle” location of all
the classes. For each class j, we compute its within-class compactness D_cp(j) and its between-class
distances D_bc(j, j′) (Islam et al., 2016) to every other class j′, and construct an “aggregated” distance D_agg(j)
to measure its overall distance to all the remaining classes. The class k* is chosen based on the
median of the D_agg(j)'s.
For each class j, we compute its within-class compactness D_cp(j) using S_j as follows: (1)
for each point x_t ∈ S_j, 1 ⩽ t ⩽ |S_j|, calculate the sum of its Euclidean distances to all other
data points in class j, i.e., d_t = Σ_{x_s ∈ S_j \ {x_t}} ‖x_s − x_t‖_2; (2) identify the median point x̃_m, corresponding to the median of {d_1, . . . , d_{|S_j|}}; (3) calculate the maximum distance of the points
in class j to the median point, i.e., d_m(j) = max_{x_s ∈ S_j} ‖x_s − x̃_m‖_2; (4) define D_cp(j) = d_m(j). For
each pair (j, j′), 1 ⩽ j ≠ j′ ⩽ K, their between-class distance D_bc(j, j′) is the minimum
distance between the boundary points of the two classes. Specifically, it is the minimum
Euclidean distance between data points from class j and those from class j′, i.e.,
D_bc(j, j′) = min_{x_s ∈ S_j, x_t ∈ S_{j′}} ‖x_s − x_t‖_2. Finally, the aggregated class distance D_agg(j) is defined as

$$D_{agg}(j) = \frac{1}{K}\,\frac{\sum_{j' :\, 1 \leqslant j' \neq j \leqslant K} D_{bc}(j, j')}{D_{cp}(j)}. \qquad (7)$$

Define k* = arg median({D_agg(j) | 1 ⩽ j ⩽ K}). This wSVMs learner is referred to as B2-SVM.
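The following R sketch (function names are illustrative and not part of the MPEwSVMs package) spells out the Method 2 selection of k*, assuming a numeric data matrix X and class labels y:

```r
select_baseline_class <- function(X, y) {
  classes <- sort(unique(y))
  K <- length(classes)

  # Within-class compactness D_cp(j): max distance to the class median point
  D_cp <- sapply(classes, function(j) {
    Xj <- X[y == j, , drop = FALSE]
    d_sum <- rowSums(as.matrix(dist(Xj)))                  # d_t in step (1)
    x_med <- Xj[which.min(abs(d_sum - median(d_sum))), ]   # median point in step (2)
    max(sqrt(rowSums(sweep(Xj, 2, x_med)^2)))              # d_m(j) in step (3)
  })

  # Between-class distance D_bc(j, j'): minimum cross-class Euclidean distance
  D_bc <- matrix(0, K, K)
  for (j in 1:(K - 1)) for (jp in (j + 1):K) {
    nj <- sum(y == classes[j])
    cross <- as.matrix(dist(rbind(X[y == classes[j], , drop = FALSE],
                                  X[y == classes[jp], , drop = FALSE])))
    D_bc[j, jp] <- D_bc[jp, j] <- min(cross[1:nj, -(1:nj)])
  }

  # Aggregated distance (7); k* is the class whose D_agg sits at the median
  D_agg <- rowSums(D_bc) / (K * D_cp)
  classes[which.min(abs(D_agg - median(D_agg)))]
}
```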
Step 2: For each j ≠ k*, define a function R_{j,k*}(y) = 1 if y = j and R_{j,k*}(y) = −1 if y = k*. Fit the
kernel binary wSVMs for the class pair (j, k*) by solving the optimization problem:
$$\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\Big[(1 - \pi_m) \sum_{y_i = j} L\big(R_{j,k^*}(y_i) f(x_i)\big) + \pi_m \sum_{y_i = k^*} L\big(R_{j,k^*}(y_i) f(x_i)\big)\Big] + \lambda J(f) \qquad (8)$$
over a series 0 < π_2 < · · · < π_M < 1. For each m = 2, . . . , M, denote the fitted
classifier from (8) as f̂_{j,k*,π_m}(·). Furthermore, we define π_1 = 0 and π_{M+1} = 1, and assume
f̂_{j,k*,π_m}(·) > 0 if m = 1 and f̂_{j,k*,π_m}(·) < 0 if m = M + 1.
Step 3: For each class pair (j, k ∗ ), calculate the pairwise conditional probability estimator as:
$$\hat{q}_{j|(j,k^*)}(x) = \frac{1}{2}\Big[\operatorname*{arg\,min}_{\pi_m}\{\hat{f}_{j,k^*,\pi_m}(x) < 0\} + \operatorname*{arg\,max}_{\pi_m}\{\hat{f}_{j,k^*,\pi_m}(x) > 0\}\Big], \quad \forall x \in \mathcal{S}. \qquad (9)$$
Since the baseline learning trains only K − 1 binary wSVMs, compared to the K(K − 1)/2 wSVM
classifiers in the pairwise coupling method, its computational saving is substantial for large K.
To obtain optimal results, the regularization parameter λ > 0 needs to be tuned
adaptively using the data at Step 2. In Section 6, we will discuss the parameter tuning in detail.
3.3 Theory
We now study asymptotic properties of the conditional probability estimators given by the
baseline learning (6) and establish its Fisher consistency. At Step 2, since the baseline learning
solves the sub-problems of the pairwise coupling, Lemma 2 in Wang et al. (2019) holds in our
case as well. Therefore, for any fixed j and π , the global minimizer of the weighted hinge loss
corresponding to (8) for pair (j, k ∗ ) is a consistent estimator of qj |(j,k∗ ) (X) − π . When the sample
size n → ∞ and the functional space F is sufficiently rich, with a properly chosen λ, the solution
to (8), q̂j |(j,k∗ ) (x), will converge to qj |(j,k∗ ) (x) asymptotically. Theorem 1 below shows that the
estimator from the baseline learning wSVM, given by (10), is asymptotically consistent under
certain regularity conditions. The proof of Theorem 1 is similar to those in Wang and Shen (2006)
and Wang et al. (2008) and is hence omitted.
Theorem 1. Consider the K-class classification problem with kernel weighted SVMs. Denote
the class probability estimators from (10) as p̂_j^{bs}(x), j ∈ {1, . . . , K} \ {k*}, and p̂_{k*}^{bs}(x), where k* is
the predefined baseline class. Define the series of weights 0 = π_1 < π_2 < · · · < π_M < π_{M+1} = 1,
and their maximum interval θ_π = max_{ℓ∈{1,...,M}} {Δπ_ℓ | Δπ_ℓ = π_{ℓ+1} − π_ℓ}. Then the estimators are
consistent, i.e., p̂_j^{bs}(x) → p_j(x) for j ∈ {1, . . . , K}, provided λ → 0, θ_π → 0, and n → ∞.
and estimation accuracy. Recall that the pairwise coupling requires training K(K − 1)/2 binary wSVMs
to obtain q̂_{j|(j,j′)}(x) for all possible (j, j′) pairs, which is computationally expensive. In the
following, we use the baseline learning solutions {q̂_{j|(j,k*)}(x), j ≠ k*} to mathematically derive
q̂_{j|(j,j′)}(x) without actually solving the optimization problems.
For any x, we would like to calculate q̂_{j|(j,j′)}(x) for {(j, j′) | 1 ⩽ j < j′ ⩽ K, j ≠ k*, j′ ≠ k*}.
Based on the definition of pairwise conditional probability in Wang et al. (2019), we have
q_{j|(j,j′)}(x) = P(Y = j | Y ∈ {j, j′}, X = x) = p_j(x)/(p_j(x) + p_{j′}(x)). Therefore, q̂_{j|(j,j′)}(x) can be recovered from q̂_{j|(j,k*)}(x) and q̂_{j′|(j′,k*)}(x) through the following relationship.
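Writing r̂_j(x) = q̂_{j|(j,k*)}(x)/q̂_{k*|(j,k*)}(x) for the estimated ratio p_j(x)/p_{k*}(x), a sketch of the derivation (assuming the denominators are bounded away from zero, as arranged by the choice of k*) gives

$$\hat q_{j|(j,j')}(x) = \frac{\hat r_j(x)}{\hat r_j(x) + \hat r_{j'}(x)} = \frac{\hat q_{j|(j,k^*)}(x)\,\hat q_{k^*|(j',k^*)}(x)}{\hat q_{j|(j,k^*)}(x)\,\hat q_{k^*|(j',k^*)}(x) + \hat q_{j'|(j',k^*)}(x)\,\hat q_{k^*|(j,k^*)}(x)}, \qquad j, j' \neq k^*.$$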
After reconstructing the pairwise conditional probabilities, we can estimate the class probabilities
by dynamically selecting the best baseline class for each data point, implementing the
optimization algorithm of the pairwise coupling method discussed in Section 7, which enjoys
linear time complexity. This new type of pairwise coupling enhanced by baseline learning, with
k* chosen by either Method 1 or Method 2, is denoted as BP1-SVM and BP2-SVM, respectively.
In Sections 7 and 8, we evaluate their performance and compare them with the enhanced
implementation of the pairwise coupling and the baseline learning.
$$\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\Big[(1 - \pi_m) \sum_{y_i = j} L\big(R_{j,A_j}(y_i) f(x_i)\big) + \pi_m \sum_{y_i \in A_j} L\big(R_{j,A_j}(y_i) f(x_i)\big)\Big] + \lambda J(f) \qquad (12)$$
over a series 0 < π_2 < · · · < π_M < 1. Denote the solution to (12) as f̂_{j,A_j,π_m}(·) for each
j and m. Furthermore, assume f̂_{j,A_j,π_1=0}(·) > 0 and f̂_{j,A_j,π_{M+1}=1}(·) < 0.
Step 3: For each class j , compute its estimated class probability as:
$$\hat{p}_j(x) = \frac{1}{2}\Big[\operatorname*{arg\,min}_{\pi_m}\{\hat{f}_{j,A_j,\pi_m}(x) < 0\} + \operatorname*{arg\,max}_{\pi_m}\{\hat{f}_{j,A_j,\pi_m}(x) > 0\}\Big], \quad \forall x \in \mathcal{S}. \qquad (13)$$
4.3 Theory
Wang et al. (2008) show that the binary wSVMs probability estimator is asymptotically con-
sistent, i.e., p̂+1 (x) asymptotically converges to p+1 (x) as n → ∞, when the functional space
F is sufficiently rich and λ is properly chosen. For the OVA-SVM approach, the class probability
estimator p̂_j(x) for each class j ∈ {1, . . . , K} is determined solely by the classifiers
f̂_{j,A_j,π_m}(·) obtained by solving (12). Theorem 2 below shows that the OVA-SVM probability estimators given
by (13) are asymptotically consistent for the true class probabilities pj (x)’s. The proof is similar
to that of Theorems 2 and 3 in Wang et al. (2008) and thus omitted.
Theorem 2. Consider the K-class classification problem using the kernel weighted SVM. Denote
the class probability estimators from (13) as p̂_j(x) for j ∈ {1, . . . , K}. Define the series of weights
0 = π_1 < π_2 < · · · < π_M < π_{M+1} = 1, and their maximum interval θ_π = max_{ℓ∈{1,...,M}} {Δπ_ℓ |
Δπ_ℓ = π_{ℓ+1} − π_ℓ}. Then the estimators are consistent, i.e., p̂_j(x) → p_j(x) for j ∈ {1, . . . , K}, if
λ → 0, θ_π → 0, and the sample size n → ∞.
5 Complexity Analysis
We perform rigorous analysis on the computational complexity of a variety of divide-and-conquer
learning schemes for multiclass wSVMs, including the pairwise coupling (Wang et al., 2019), the
proposed baseline learning in Section 3, and the proposed OVA-SVM in Section 4.
Without loss of generality, assume the K classes are balanced, i.e., n_j = n for any 1 ⩽ j ⩽ K.
When we randomly split the data into the training and tuning sets, we also keep the class sizes
balanced. For each class pair {(j, j′) | 1 ⩽ j ≠ j′ ⩽ K}, denote the training set and the tuning
set as S_train and S_tune, and their corresponding sample sizes as n_train and n_tune, respectively.
For each binary wSVM classifier, the optimization problem is solved by quadratic programming (QP) using the training set S_train. In R, we employ the quadprog package, which uses
the active set method and has polynomial time complexity ∼ O(n³_train) (Ye and Tse, 1989;
Cairano et al., 2013). The time complexity of label prediction on the tuning set of size n_tune by
the weighted SVM classifier is O(n_tune n_train). To train a sequence of O(m) wSVMs for pairwise
class probability estimation with n_tune data points, the time complexity is O(m)[O(n³_train) +
O(n_tune n_train)] + n_tune O(m log m). During the parameter tuning step, let n_par be the number of
tuning parameters and {γ_p | p ∈ {1, . . . , n_par}} be the numbers of grid points. Then the time
complexity for computing the pairwise conditional probability estimators is:

$$\prod_{p=1}^{n_{par}} \gamma_p \Big[ O(m)\big(O(n_{train}^3) + O(n_{train} n_{tune})\big) + n_{tune}\, O(m \log m) \Big]. \qquad (14)$$
In the following, we focus on the RBF kernel with two hyperparameters {λ, σ}, whose numbers
of grid points are γ_λ = γ_1 and γ_σ = γ_2. We treat m as a constant and set n_train = n_tune = n.
Then (14) reduces to O(γ_1 γ_2 n³). For the pairwise coupling method, we need to compute O(K²)
pairwise conditional probability estimators, so the total time complexity is O(γ_1 γ_2 K² n³).
For the baseline learning, we first choose the baseline class k ∗ , then compute O(K) pairwise
conditional class probabilities. The B1-SVM method chooses k ∗ as the largest class, and its
time complexity is O(Kn) + O(γ1 γ2 Kn3 ) = O(γ1 γ2 Kn3 ). The B2-SVM method selects k ∗ by
measuring the median class distance. Since the time complexity of computing K-class Dcp and
Dbc is O(Kn2 ), we need O(Kn2 ) to obtain k ∗ . Then the time complexity for B2-SVM is O(Kn2 )+
O(γ1 γ2 Kn3 ) = O(γ1 γ2 Kn3 ), which is the same as B1-SVM.
Now we consider the time complexity of the One-vs-All method (OVA-SVM), where we
calculate O(K) conditional class probabilities. Assume that n_train = n_tune = 0.5Kn.
From (14), each classifier costs O(γ_1 γ_2 K³ n³), so the total time complexity for OVA-SVM is O(γ_1 γ_2 K⁴ n³).
In summary, for K-class problems with wSVMs and fixed γ_1, γ_2, m, and n, the time
complexity of the baseline learning methods (B1-SVM and B2-SVM) is O(K), that of the pairwise
coupling is O(K²), and that of OVA-SVM is O(K⁴). The proposed baseline method is linear
in K and thus has the lowest computational cost among all of the divide-and-conquer wSVM
methods. The above analysis is based on a balanced scenario; for the unbalanced case, the
analysis is more complex mathematically, but the conclusion remains the same.
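The scaling can be read off from a back-of-the-envelope count of the QP costs (an illustrative R sketch using the O(n³_train) per-fit cost from (14); constants such as γ_1, γ_2, and m are dropped):

```r
# Relative cost of the three divide-and-conquer schemes for balanced classes.
wsvm_relative_cost <- function(K, n) {
  c(pairwise = choose(K, 2) * n^3,      # K(K-1)/2 binary wSVMs on ~n points each
    baseline = (K - 1) * n^3,           # K-1 binary wSVMs on ~n points each
    ova      = K * (0.5 * K * n)^3)     # K wSVMs, each on ~0.5*K*n points
}
costs <- wsvm_relative_cost(K = 9, n = 500)
round(costs / costs["baseline"], 1)     # pairwise ~ K/2, OVA ~ K^4 / (8(K-1)) times baseline
```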
6 Parameter Tuning
The proposed kernel learning method involves two tuning parameters, λ > 0 and σ (if an
RBF kernel is used). Their values are critical to ensure the optimal finite sample performance of
the wSVMs. To choose proper values for these tuning parameters, we randomly split the data
into two equal parts: the training set S_train, used to train the wSVM in (8) or (12) for a given
weight π and fixed (λ, σ), and the tuning set S_tune, used to select the optimal values of {λ, σ}.
For fixed {λ, σ}, we train M − 1 classifiers f̂^{λ,σ}_π(x) for π ∈ {(j − 1)/M | j = 2, . . . , M} with
S_train to obtain the pairwise conditional probability estimators q̂^{λ,σ}_{j|(j,j′)}, where j′ is the common
baseline class k* in the baseline learning and A_j for the OVA-SVM. We propose choosing the
tuning parameters by a grid search, based on the prediction performance of the q̂^{λ,σ}'s on the
tuning set S_tune. Assume the true pairwise conditional probabilities are q_{j|(j,j′)}(x). To quantify
the estimation accuracy of the q̂'s, we use the generalized Kullback–Leibler (GKL) loss function.
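For a pair (j, j′) and tuning candidate {λ, σ}, the GKL loss compares q_{j|(j,j′)} with q̂^{λ,σ}_{j|(j,j′)}; in its standard form (a sketch written conditionally on Y ∈ {j, j′}), it splits into an entropy term, which does not involve the estimator, and a cross-entropy term:

$$\mathrm{GKL}\big(q_{j|(j,j')}, \hat q^{\lambda,\sigma}_{j|(j,j')}\big) = E\Big[ q_{j|(j,j')}(X)\log q_{j|(j,j')}(X) + \big(1-q_{j|(j,j')}(X)\big)\log\big(1-q_{j|(j,j')}(X)\big)\Big] - E\Big[ q_{j|(j,j')}(X)\log \hat q^{\lambda,\sigma}_{j|(j,j')}(X) + \big(1-q_{j|(j,j')}(X)\big)\log\big(1-\hat q^{\lambda,\sigma}_{j|(j,j')}(X)\big)\Big].$$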
The first term (16) is unrelated to q̂^{λ,σ}_{j|(j,j′)}(X), so only the second term (17) is essential. Since the true
values of q_{j|(j,j′)}(X) are unknown in practice, we need to approximate GKL empirically based
on the tuning data. Define R_{j,j′}(Y) = 1 if Y = j and R_{j,j′}(Y) = −1 if Y = j′, which satisfies
E[½(R_{j,j′}(Y) + 1) | X, Y ∈ {j, j′}] = q_{j|(j,j′)}(X), so a computable proxy of GKL is given by

$$\mathrm{EGKL}\big(\hat q^{\lambda,\sigma}_{j|(j,j')}\big) = -\frac{1}{2 n_{jj'}} \sum_{i:\, y_i \in \{j, j'\}} \Big[\big(1 + R_{j,j'}(y_i)\big)\log \hat q^{\lambda,\sigma}_{j|(j,j')}(x_i) + \big(1 - R_{j,j'}(y_i)\big)\log\big(1 - \hat q^{\lambda,\sigma}_{j|(j,j')}(x_i)\big)\Big]. \qquad (18)$$
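A direct implementation of (18) on the tuning set is straightforward; the sketch below (an illustrative helper, with q_hat holding q̂^{λ,σ}_{j|(j,j′)}(x_i) and R holding R_{j,j′}(y_i) for the tuning points with y_i ∈ {j, j′}) also clips the estimates away from 0 and 1 to avoid infinite logarithms:

```r
egkl <- function(q_hat, R, eps = 1e-10) {
  q_hat <- pmin(pmax(q_hat, eps), 1 - eps)              # guard against log(0)
  -mean((1 + R) * log(q_hat) + (1 - R) * log(1 - q_hat)) / 2
}
```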
7 Numerical Studies
7.1 Experiment Setup
We evaluate the performance of the baseline learning (denoted by B-SVM) and the OVA-SVM
(denoted by A-SVM) schemes, and the pairwise coupling enhanced by the baseline learning
(denoted by BP-SVM). We compare them with our enhanced implementation of the pairwise
coupling algorithm in Wang et al. (2019), which dynamically chooses, for each data point, the
best baseline class k as the class receiving the maximum vote for larger pairwise conditional
probabilities (denoted by P-SVM). All of these methods are implemented in R, and the tuning
parameters are chosen using EGKL based on a tuning set. In addition, we include five popular
benchmark methods in the literature: multinomial logistic regression (MLG), multiclass linear
discriminant analysis (MLDA), classification tree (TREE, Breiman et al., 1984), random forest
(RF, Ho, 1995), and XGBoost (XGB, Chen and Guestrin, 2016), which are implemented in R
with hyperparameters tuned by cross-validation. All of the numerical experiments are performed
on the High Performance Computing (HPC) cluster at the University of Arizona, with Lenovo’s
Nextscale M5 Intel Haswell V3 28 core processors and 18GB memory.
In all the simulation examples, we set |S_train| = |S_tune| = n. The weights are specified as
π_j = (j − 1)/M, j ∈ {1, . . . , M + 1}, where M is a preset value controlling the estimation precision. Following Wang et al. (2008), we set M = √n. We use the RBF kernel and select
the optimal values of {λ, σ} using GKL or EGKL based on a grid search. The range of σ is
{σ_M/4, 2σ_M/4, 3σ_M/4, 4σ_M/4, 5σ_M/4, 6σ_M/4}, where σ_M = median_{y_s ≠ y_t} ‖x_s − x_t‖_2 is the median
pairwise Euclidean distance between the K classes (Wu et al., 2010), and λ ∈ {5.5 × 10⁻⁸, 1 ×
10⁻⁸, 5.5 × 10⁻⁷, . . . , 5.5 × 10⁷, 1 × 10⁸}. We report the results of both GKL and EGKL, showing that they give comparable performance. In real applications, since GKL is not available,
we will use EGKL for tuning.
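For concreteness, the tuning grids described above can be set up as follows (an illustrative R sketch; X and y denote the training inputs and labels, M is rounded up to an integer here, and the λ grid is a log-scale approximation of the values listed in the text):

```r
n <- nrow(X)
M <- ceiling(sqrt(n))                         # M = sqrt(n), rounded up here
pi_grid <- (seq_len(M + 1) - 1) / M           # pi_j = (j - 1)/M, j = 1, ..., M + 1

D <- as.matrix(dist(X))                       # pairwise Euclidean distances
sigma_M <- median(D[outer(y, y, "!=")])       # median between-class distance
sigma_grid <- sigma_M * (1:6) / 4             # sigma_M/4, ..., 6*sigma_M/4

lambda_grid <- sort(c(10^(-8:8), 5.5 * 10^(-8:7)))   # 1e-8, 5.5e-8, ..., 5.5e7, 1e8
```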
Figure 1: The scatter plots of the training data for Examples 1–4. Different symbols are for
different classes, and the optimal decision boundaries (Bayes rules) are solid lines.
To evaluate the estimation accuracy of the probability estimators p̂_j, we further generate
a test set S_test = {(x̃_i, ỹ_i), i = 1, . . . , ñ} of size |S_test| = ñ. The following metrics are used to
quantify the performance of the different procedures: (i) the 1-norm error $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\sum_{j=1}^{K} |\hat p_j(\tilde x_i) - p_j(\tilde x_i)|$;
(ii) the 2-norm error $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\sum_{j=1}^{K} \big(\hat p_j(\tilde x_i) - p_j(\tilde x_i)\big)^2$; (iii) the EGKL loss $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\sum_{j=1}^{K} p_j(\tilde x_i)\log\frac{p_j(\tilde x_i)}{\hat p_j(\tilde x_i)}$;
and (iv) the GKL loss $\frac{1}{\tilde n}\sum_{i=1}^{\tilde n}\sum_{j=1}^{K}\Big[p_j(\tilde x_i)\log\frac{p_j(\tilde x_i)}{\hat p_j(\tilde x_i)} + \big(1-p_j(\tilde x_i)\big)\log\frac{1-p_j(\tilde x_i)}{1-\hat p_j(\tilde x_i)}\Big]$. To evaluate the
classification accuracy of the learning methods, we report two misclassification rates based on
the maximum probability and max voting algorithms.
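The sketch below (an illustrative helper; p and p_hat are ñ × K matrices of true and estimated class probabilities, y the true test labels, and both matrices are clipped away from 0 and 1) computes these metrics together with the max-probability error rate:

```r
evaluate_estimator <- function(p, p_hat, y, eps = 1e-10) {
  p     <- pmin(pmax(p, eps), 1 - eps)
  p_hat <- pmin(pmax(p_hat, eps), 1 - eps)
  list(
    l1   = mean(rowSums(abs(p_hat - p))),                              # 1-norm error
    l2   = mean(rowSums((p_hat - p)^2)),                               # 2-norm error
    egkl = mean(rowSums(p * log(p / p_hat))),                          # EGKL loss
    gkl  = mean(rowSums(p * log(p / p_hat) +
                        (1 - p) * log((1 - p) / (1 - p_hat)))),        # GKL loss
    te1  = mean(max.col(p_hat) != y)                                   # max-probability error
  )
}
```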
We consider four examples, including two linear cases (Examples 1–2) and two nonlinear
cases (Examples 3–4). The number of classes is K ∈ {5, 7, 9}. MLG and MLDA are oracle methods
for Examples 1–2, but not for Examples 3–4.
Figure 1 illustrates the scatter plots of the training data points for Examples 1–4, along
with the optimal (Bayes) decision boundaries shown as solid lines. For all the examples, n = 500
and ñ = 10,000. For each example, we run N_r = 100 Monte Carlo simulations and report the
average performance measurements with standard errors.
Example 2 (K = 9, linear). The data are generated as follows: (i) Generate Y uniformly
from {1, 2, . . . , 9}; (ii) Given Y = y, generate X from the bivariate normal distribution
N(μ(y), Σ), where μ(y) = (2.5 cos(2yπ/9), 2.5 sin(2yπ/9))^T and Σ = 1.5² I_2.
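A minimal R sketch of this data-generating mechanism (the function name is ours):

```r
simulate_example2 <- function(n) {
  y  <- sample(1:9, n, replace = TRUE)                     # Y uniform on {1, ..., 9}
  mu <- cbind(2.5 * cos(2 * y * pi / 9), 2.5 * sin(2 * y * pi / 9))
  x  <- mu + matrix(rnorm(2 * n, sd = 1.5), n, 2)          # X | Y = y ~ N(mu(y), 1.5^2 I_2)
  list(x = x, y = y)
}
train <- simulate_example2(500)                            # n = 500, as in the simulations
```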
Examples 1–2 are two linear examples used to show that the baseline methods can handle
a large class size K ⩾ 5 with competitive performance and the best running time. Table 1
summarizes the performance of the baseline learning methods (B1-SVM, B2-SVM) and the pairwise
coupling enhanced by the baseline learning (BP1-SVM, BP2-SVM), based on the two methods
of choosing the baseline class k*, along with the OVA-SVM method (A-SVM), the pairwise coupling
with a dynamically chosen baseline class for each data point (P-SVM), and five benchmark methods:
MLG (Oracle-1), MLDA (Oracle-2), TREE, RF, and XGBoost. We consider two different class sizes
and report seven performance measures: running time, 1-norm error, 2-norm error, EGKL loss,
GKL loss, and two misclassification rates. The standard error (SE), calculated as σ_v̄ = σ_v/√N_r,
is given in parentheses.
We make the following observations. In general, the SVM-based methods outperform the other
methods and are closest to the oracle methods in all cases. The two B-SVM estimators have
probability estimation performance similar to the pairwise coupling methods, and they outperform all
three benchmark methods besides the oracle procedures (MLG and MLDA), showing a significant
gain in computational efficiency when K is large. The BP-SVM methods incur no additional
running cost since they involve no additional training, and they show better classification accuracy
in some cases. Depending on the data distribution, the two B-SVM methods may select different baseline classes
k*, which may give slightly different yet similar results. The OVA-SVM consistently gives the
best probability estimation and classification among all methods, but with the most expensive
computational cost, which agrees with the theoretical complexity analysis. The tuning criteria
GKL and EGKL give similar performance, showing that EGKL is a good approximation of
GKL for parameter tuning. Note that the max voting algorithm works only with the full pairwise
probability table, so it is not applicable to OVA-SVM and B-SVM, denoted as NA. The tree-based
methods, such as classification trees (TREE) and random forest (RF), may produce estimated
class probabilities of exactly 0 for some x and some classes, resulting in “Inf” entries in Tables 1 and 2 for both
EGKL and GKL; thus the SEs are unavailable and reported as NaN (“Not a Number”) in R.
Table 1 (continued): Performance measures on the test set for Example 1 and Example 2. Column order in each block: B1-SVM, BP1-SVM, B2-SVM, BP2-SVM, A-SVM, P-SVM, Oracle-1, Oracle-2, TREE, RF, XGB.

Example 1 (continued)
EGKL 31.8 (2.4) 31.8 (2.4) 33.9 (2.6) 33.9 (2.6) 26.8 (1.3) 30.3 (1.2) 22.5 (1.2) 22.1 (1.2) Inf (NaN) Inf (NaN) 124.6 (2.6)
GKL 25.3 (1.1) 25.3 (1.1) 26.8 (0.9) 26.8 (0.9) 15.2 (0.4) 24.6 (0.4) 14.8 (0.4) 14.4 (0.3) Inf (NaN) Inf (NaN) 162.0 (1.4)
TE1 60.8 (0.2) 60.6 (0.2) 61.2 (0.2) 60.9 (0.2) 59.9 (0.1) 60.3 (0.2) 57.6 (0.1) 57.7 (0.1) 61.9 (0.1) 64.7 (0.1) 65.7 (0.1)
TE2 NA (NA) 60.4 (0.2) NA (NA) 60.8 (0.3) NA (NA) 60.3 (0.2) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 1 1 5 5 NA NA NA NA NA NA NA

Example 2 (Nine-class Linear Example, MLG and MLDA as Oracle, denoted as Oracle-1 and Oracle-2)

Tuned by EGKL:
Time 0.8 (0.0) 0.8 (0.0) 0.8 (0.0) 0.8 (0.0) 31.6 (0.2) 3.4 (0.0) 0.0 (0.0) 0.0 (0.0) 0.4 (0.0) 1.8 (0.0) 7.2 (2.8)
L1 46.1 (0.3) 46.1 (0.3) 47.0 (0.3) 47.0 (0.3) 43.5 (0.1) 45.2 (0.1) 43.5 (0.2) 42.7 (0.2) 64.0 (0.2) 75.2 (0.1) 99.8 (0.1)
L2 12.4 (0.1) 12.4 (0.1) 12.3 (0.1) 12.3 (0.1) 11.1 (0.1) 12.0 (0.0) 10.6 (0.1) 10.4 (0.1) 14.5 (0.1) 21.7 (0.1) 40.0 (0.1)
EGKL 32.0 (2.4) 32.0 (2.4) 32.2 (2.2) 32.2 (2.2) 27.0 (1.6) 31.6 (1.2) 26.9 (1.4) 25.9 (1.3) Inf (NaN) Inf (NaN) 145.2 (2.8)
GKL 22.6 (1.1) 22.6 (1.1) 22.5 (1.2) 22.5 (1.2) 18.5 (0.6) 21.7 (0.5) 18.1 (0.4) 16.9 (0.4) Inf (NaN) Inf (NaN) 187.1 (1.8)
TE1 61.2 (0.2) 60.4 (0.2) 61.6 (0.3) 60.9 (0.3) 60.1 (0.1) 60.9 (0.1) 57.4 (0.1) 57.3 (0.1) 62.5 (0.1) 64.2 (0.1) 65.3 (0.1)
TE2 NA (NA) 60.3 (0.2) NA (NA) 60.6 (0.3) NA (NA) 60.9 (0.1) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 6 6 8 8 NA NA NA NA NA NA NA

Tuned by GKL:
Time 0.8 (0.0) 0.8 (0.0) 0.8 (0.0) 0.8 (0.0) 31.7 (0.2) 3.4 (0.0) 0.0 (0.0) 0.0 (0.0) 0.4 (0.0) 1.8 (0.0) 7.2 (2.8)
L1 46.1 (0.5) 46.1 (0.5) 47.1 (0.4) 47.1 (0.4) 43.1 (0.1) 45.2 (0.3) 43.5 (0.2) 42.7 (0.2) 64.0 (0.2) 75.2 (0.1) 99.8 (0.1)
L2 12.6 (0.2) 12.6 (0.2) 12.8 (0.1) 12.8 (0.1) 11.6 (0.1) 12.5 (0.1) 10.6 (0.1) 10.4 (0.1) 14.5 (0.1) 21.7 (0.1) 40.0 (0.1)
EGKL 32.0 (3.5) 32.0 (3.5) 32.9 (3.1) 32.9 (3.1) 27.1 (1.6) 31.6 (1.4) 26.9 (1.4) 25.9 (1.3) Inf (NaN) Inf (NaN) 145.2 (2.8)
GKL 22.5 (3.2) 22.5 (3.2) 22.3 (1.9) 22.3 (1.9) 18.1 (0.6) 21.5 (1.0) 18.1 (0.4) 16.9 (0.4) Inf (NaN) Inf (NaN) 187.1 (1.8)
TE1 62.0 (0.3) 61.9 (0.3) 62.0 (0.3) 61.8 (0.3) 60.7 (0.1) 61.3 (0.3) 57.4 (0.1) 57.3 (0.1) 62.5 (0.1) 64.2 (0.1) 65.3 (0.1)
TE2 NA (NA) 61.3 (0.3) NA (NA) 61.3 (0.3) NA (NA) 61.3 (0.2) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 6 6 8 8 NA NA NA NA NA NA NA

TABLE NOTE: The running time is measured in minutes. L1 (the 1-norm error), L2 (the 2-norm error), EGKL (the EGKL loss), GKL (the GKL loss), TE1 (the misclassification rate based on max probability), and TE2 (the misclassification rate based on max voting) are multiplied by 100 for both the mean and the standard error in parentheses. The baseline class k* for the baseline methods is reported as the statistical mode. MLG: multinomial logistic regression; MLDA: multiclass linear discriminant analysis; TREE: classification tree; RF: random forest; XGB: XGBoost. The same explanation applies to the results of all the other examples.
Table 2: Performance measure on the test set for Example 3 and Example 4.
Example 3 (Five-class Non-linear Example, MLG and MLDA are not Oracle)
EGKL B1-SVM BP1-SVM B2-SVM BP2-SVM A-SVM P-SVM MLG MLDA TREE RF XGB
Time 1.5 (0.0) 1.5 (0.0) 0.7 (0.0) 0.7 (0.0) 11.7 (0.0) 2.4 (0.0) 0.0 (0.0) 0.0 (0.0) 0.3 (0.0) 1.1 (0.0) 0.4 (0.0)
L1 19.2 (0.3) 19.2 (0.3) 21.1 (1.2) 21.1 (1.2) 15.2 (0.1) 18.9 (0.1) 115.9 (0.1) 116.3 (0.1) 27.9 (0.2) 22.4 (0.1) 27.9 (0.1)
L2 4.1 (0.2) 4.1 (0.2) 4.3 (0.7) 4.3 (0.7) 3.3 (0.1) 4.0 (0.1) 50.7 (0.1) 50.4 (0.1) 7.0 (0.1) 4.4 (0.1) 9.4 (0.1)
EGKL 12.5 (0.9) 12.5 (0.9) 13.7 (1.9) 13.7 (1.9) 9.2 (0.6) 12.2 (0.5) 60.7 (0.8) 60.8 (0.8) Inf (NaN) Inf (NaN) 28.6 (1.2)
GKL 22.6 (0.6) 22.6 (0.6) 26.9 (2.3) 26.9 (2.3) 15.6 (0.3) 20.8 (0.3) 88.5 (0.2) 89.7 (0.2) Inf (NaN) Inf (NaN) 49.7 (0.7)
k∗ 5 5 1 1 NA NA NA NA NA NA NA
Example 4 (K = 5, nonlinear). Generate X = (X_1, X_2)^T uniformly from a disc {x : x_1² + x_2² ⩽ 100}.
Define h_1(x) = −3x_1√5 + 3x_2, h_2(x) = −3x_1√5 − 3x_2, h_3(x) = x_2√3 − 1.2x_1, h_4(x) = 2x_2√3 + 1.2x_1,
and h_5(x) = |x_1x_2| + 1. Then consider the nonlinear transformation f_j(x) = Φ⁻¹(T_2(h_j(x)))
for j = 1, 2, 3, 4, 5, where Φ(·) is the cumulative distribution function (cdf) of the standard
normal distribution and T_2(·) is the cdf of the t_2 distribution. Then we set the probability
functions as p_j(x) = exp(f_j(x))/Σ_{l=1}^{5} exp(f_l(x)) for j = 1, 2, 3, 4, 5.
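A minimal R sketch of this construction (the h_j follow the definitions above; qnorm and pt give Φ⁻¹ and the t_2 cdf):

```r
example4_probs <- function(x1, x2) {
  h <- cbind(-3 * x1 * sqrt(5) + 3 * x2,
             -3 * x1 * sqrt(5) - 3 * x2,
              x2 * sqrt(3) - 1.2 * x1,
              2 * x2 * sqrt(3) + 1.2 * x1,
              abs(x1 * x2) + 1)
  f <- matrix(qnorm(pt(as.vector(h), df = 2)),   # f_j(x) = Phi^{-1}(T_2(h_j(x)))
              nrow = nrow(h), ncol = ncol(h))
  exp(f) / rowSums(exp(f))                       # p_j(x) = exp(f_j) / sum_l exp(f_l)
}
```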
Table 2 summarizes the performance of the proposed methods and compares them with the
pairwise coupling and the five benchmark methods. Similar to the linear cases, the proposed OVA-SVM
has the best performance among all the methods. The B-SVMs have performance close to the
pairwise coupling and outperform all the benchmark methods.
Table 3: Performance on the four benchmark data sets. For each data set, K is the number of classes, n is the training set size, and ñ is the test set size; the method columns follow the same order as in Table 2 (B1-SVM, BP1-SVM, B2-SVM, BP2-SVM, A-SVM, P-SVM, MLG, MLDA, TREE, RF, XGB).

E.coli (K = 4, n = 226, ñ = 110)
Time 0.1 (0.0) 0.1 (0.0) 0.1 (0.0) 0.1 (0.0) 0.3 (0.0) 0.2 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.1 (0.0) 1.3 (0.3)
TE1 7.3 (0.5) 7.1 (0.4) 6.0 (0.4) 5.8 (0.3) 4.6 (0.4) 5.6 (0.3) 9.0 (1.3) 6.7 (0.7) 11.5 (0.7) 6.4 (0.6) 9.2 (1.2)
TE2 NA (NA) 6.9 (0.3) NA (NA) 5.6 (0.3) NA (NA) 5.5 (0.3) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 4 4 2 2 NA NA NA NA NA NA NA

Yeast (K = 5, n = 990, ñ = 494)
Time 1.1 (0.0) 1.1 (0.0) 0.9 (0.1) 0.9 (0.1) 13.4 (0.0) 1.9 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.4 (0.0) 2.4 (0.2)
TE1 39.3 (0.5) 38.1 (0.5) 41.7 (1.3) 41.5 (1.4) 39.0 (0.6) 39.4 (0.6) 40.0 (0.3) 40.5 (0.4) 43.0 (0.6) 37.9 (0.5) 42.4 (0.5)
TE2 NA (NA) 37.9 (0.6) NA (NA) 41.8 (1.4) NA (NA) 39.7 (0.4) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 1 1 2 2 NA NA NA NA NA NA NA

Pendigits (1369) (K = 4, n = 2937, ñ = 1372)
Time 12.7 (0.1) 12.7 (0.1) 12.9 (0.3) 12.9 (0.3) 136.7 (0.9) 24.4 (0.3) 0.0 (0.0) 0.0 (0.0) 0.1 (0.0) 1.1 (0.0) 140.3 (4.7)
TE1 1.0 (0.0) 0.9 (0.0) 1.0 (0.0) 1.0 (0.0) 0.8 (0.0) 0.9 (0.0) 3.3 (0.3) 6.8 (0.1) 7.1 (0.6) 1.3 (0.0) 2.0 (0.0)
TE2 NA (NA) 0.9 (0.0) NA (NA) 1.0 (0.0) NA (NA) 0.9 (0.0) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 1 1 1 1 NA NA NA NA NA NA NA

Pendigits (Full) (K = 10, n = 7494, ñ = 3498)
Time 43.4 (0.9) 43.5 (0.9) 53.1 (1.6) 53.1 (1.6) 1536.7 (9.4) 192.4 (1.9) 0.0 (0.0) 0.0 (0.0) 0.2 (0.0) 4.4 (0.1) 382.4 (5.4)
TE1 4.1 (0.0) 3.9 (0.0) 4.5 (0.1) 4.4 (0.1) 3.7 (0.1) 4.0 (0.0) 10.6 (0.5) 17.1 (0.1) 25.9 (0.4) 5.3 (0.1) 5.2 (0.1)
TE2 NA (NA) 3.9 (0.0) NA (NA) 4.4 (0.1) NA (NA) 4.0 (0.0) NA (NA) NA (NA) NA (NA) NA (NA) NA (NA)
k* 2 2 1 1 NA NA NA NA NA NA NA

TABLE NOTE: For each data set block, K is the number of classes, n is the training set size, and ñ is the test set size; the remaining rows compare the performance of the proposed methods with the benchmarks.
n = 1,484 observations with 10 different localization sites. Based on biological functional similarity and
location proximity, we formulate a K = 5 class problem, including the class “membrane proteins”
with 258 observations, “CYT” with 463 observations, “NUC” with 429 observations, “MIT” with
244 observations, and the class “other” with 90 observations.
For the pen-based handwritten digits data set, we randomly split the data into training and tuning
sets of equal sizes 10 times and report the average misclassification rate evaluated on the test
set, along with the standard error. For the E.coli and yeast data sets, since the class sizes are highly
unbalanced, we first randomly split the data into training, tuning, and test sets of equal size
within each class, and then combine them across classes to form the final training, tuning,
and test sets; this is repeated 10 times. We report the average test error rates along with the standard errors.
Table 3 suggests that the OVA-SVM is overall the best classifier in all cases, but its
computational cost increases drastically with n and K. The two linear time methods consistently
give performance similar to pairwise coupling in terms of error rates, but with significantly
reduced computation time for large n and K, and they outperform the benchmark methods.
In some examples, the linear time method combined with pairwise reconstruction may outperform
the pairwise coupling and One-vs-All methods (e.g., BP1-SVM on the yeast and Pendigits data sets).
In practice, when n and K are small, One-vs-All should be used for its best performance (e.g.,
the E.coli data set); when K and n are large, the linear time methods have the best time complexity with
performance similar to the pairwise and One-vs-All methods and are hence suitable for scalable
applications.
The results suggest that the proposed One-vs-All method (OVA-SVM) consistently outperforms
all the other methods on all the data sets, but has a high computational cost; the proposed
baseline learning methods (B-SVM and BP-SVM) perform close to the pairwise coupling
method and consistently outperform the benchmark methods, with a significant reduction
in computational time. Similar observations are made across these four studies.
Figure 2: Estimated class probabilities (shown as bar heights) for 14 test data points from the
B-cell lymphoma data set, using three methods. From top to bottom, the rows correspond to
the pairwise coupling (P-SVM), OVA-SVM (A-SVM), and baseline (B1-SVM) methods. From
left to right, the columns show the estimated class probabilities for test samples from one
subclass each: DLBCL, CLL, and FL, labeled as 1, 2, and 3, respectively. The three colors
(red, azure, green) represent the estimated probabilities of a test data point belonging to
classes DLBCL, CLL, and FL.
For both data sets, we normalize the gene expression values to obtain their Z-scores and rank
the genes by the BW criterion of marginal relevance to the class label, calculated as the
ratio of each gene's between-group to within-group sums of squares, as specified in Dudoit et al.
(2002). We select the top 5 genes as predictor variables based on the order of this relevance measure.
Figure 2 plots the estimated class probabilities for the B-cell lymphoma data set on
14 test data points using the baseline method 1 (B1-SVM), OVA-SVM (A-SVM), and pairwise
coupling (P-SVM), with running times of 1, 3, and 2 seconds, respectively.
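The BW ranking used above can be computed directly (an illustrative sketch; expr is an n × p matrix of standardized expression values and y the class labels):

```r
# BW (between/within sums of squares) gene-ranking score of Dudoit et al. (2002).
bw_score <- function(expr, y) {
  overall <- colMeans(expr)
  num <- den <- numeric(ncol(expr))
  for (k in unique(y)) {
    Xk <- expr[y == k, , drop = FALSE]
    ck <- colMeans(Xk)
    num <- num + nrow(Xk) * (ck - overall)^2       # between-group sums of squares
    den <- den + colSums(sweep(Xk, 2, ck)^2)       # within-group sums of squares
  }
  num / den
}
# top5 <- order(bw_score(expr, y), decreasing = TRUE)[1:5]   # select the top 5 genes
```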
Figure 3: Estimated class probabilities (shown as bar heights) for 60 test data points from the
acute lymphoblastic leukemia data set, using three methods. From top to bottom, the rows
correspond to the pairwise coupling (P-SVM), OVA-SVM (A-SVM), and baseline (B2-SVM)
methods. From left to right, the columns show the estimated class probabilities for test
samples from one subclass each: T-ALL, T+B+H, MLL, and E2A-PBX1, labeled as 1, 2, 3, and 4,
respectively. The four colors (blue, orange, green, garnet) represent the estimated probabilities
of a test data point belonging to classes T-ALL, T+B+H, MLL, and E2A-PBX1.
Figure 3 plots the estimated class probabilities for the acute lymphoblastic leukemia data set
on 60 test data points using the baseline method 2 (B2-SVM), OVA-SVM (A-SVM), and pairwise
coupling (P-SVM), with running times of 5, 18, and 13 seconds, respectively. For both data sets, the two
baseline methods and OVA-SVM give class probability estimates for each test data point similar
to those of pairwise coupling, with no misclassification for any class (we illustrate one baseline
model for each data set), while the baseline methods have the fastest computation time.
With an appropriate variable selection method, our proposed linear algorithms and One-vs-All
method can scale to high-dimensional data with high accuracy in multiclass probability
estimation and classification.
9 Concluding Remarks
In this article, we propose two new learning schemes, the baseline learning and the One-vs-All
learning, to improve the computation time and estimation accuracy of the wSVMs methods. The
One-vs-All method shows consistently the best performance in probability estimation and classification,
but at the most expensive computational cost, which makes it suitable for problems with a small
sample size and a small number of classes (K ⩽ 5). The two baseline methods have performance comparable
to the pairwise coupling method while greatly reducing the time complexity, especially when the class size K
and the sample size n are large, which makes them a favorable approach for multiclass probability
estimation in real-world applications for both low-dimensional and high-dimensional data analysis.
We also establish consistency theory for the baseline and One-vs-All multiclass probability
estimators, similar to that of the pairwise coupling method. Since the baseline method trains
K − 1 independent binary wSVMs, it naturally lends itself to parallel computing and general-purpose
computing on graphics processing units (GPGPU), which could further reduce the computation
time for complex multiclass probability estimation problems. One interesting question is how to
use our proposed wSVMs framework for image classification, such as brain tumor MRI images,
for fast screening, since the widely used image classification approaches based on Convolutional
Neural Networks (CNNs) lack accurate class probability estimation (Guo et al., 2017; Minderer
et al., 2021), which is important in pre-cancer diagnosis for establishing the strength of confidence
in a classification.
Currently, we have only evaluated dense feature sets. The class probability estimation performance
would decrease significantly if the feature sets were sparse, with highly correlated predictor
variables and abundant noisy features (as shown for the high-dimensional data sets in Section 8.2). Since
all the methods discussed in this article are based on the L2-norm regularization of the binary
wSVMs optimization problem, the current framework does not perform automatic variable selection.
One important direction for future research is how to incorporate
variable selection into the proposed class probability estimation framework for wSVMs, which is
an important problem for data with sparse signals and highly correlated features. In many real
applications, abundant features are collected but only a few of them have predictive power. We
will give this topic a more thorough treatment in future work.
Supplementary Material
We have developed the R package MPEwSVMs for multiclass probability estimation based on the proposed
methods and have submitted it to the GitHub repository at https://github.com/zly555/
MPEwSVMs. Detailed instructions on how to use the package, as well as a numerical simulation
example, are provided in the README.md file.
References
Alimoglu F, Alpaydin E (1997). Combining multiple representations and classifiers for pen-
based handwritten digit recognition. In: Proceedings of the Fourth International Conference
on Document Analysis and Recognition (J Schürmann, ed.), volume 2, 637–640. Ulm, Germany.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. (2000). Distinct types
of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):
503–511.
Wu Y, Zhang HH, Liu Y (2010). Robust model-free multiclass probability estimation. Journal
of the American Statistical Association, 105: 424–436.
Ye Y, Tse E (1989). An extension of Karmarkar’s projective algorithm for convex quadratic
programming. Mathematical Programming, 44(1–3): 157–179.
Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. (2002). Classification,
subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by
gene expression profiling. Cancer Cell, 1(2): 133–143.
Zhang C, Liu Y (2013). Multicategory large-margin unified machines. Journal of Machine Learn-
ing Research, 14: 1349–1386.
Zhu J, Hastie T (2005). Kernel logistic regression and the import vector machine. Journal of
Computational and Graphical Statistics, 14: 185–205.
Zhu J, Rosset S, Hastie T, Tibshirani R (2003). 1-norm support vector machines. In: Proceedings
of the 16th International Conference on Neural Information Processing Systems (S Thrun, LK
Saul, B Schölkopf, eds.), 49–56. Whistler, Canada.