Expectation Maximisation Algorithm
ISSN: 1549-3636
© 2014 Science Publications
doi:10.3844/jcssp.2014.2124.2134 Published Online 10 (10) 2014 (http://www.thescipub.com/jcs.toc)
In this study we present a CUDA parallel implementation of the Expectation-Maximization algorithm for the estimation of Gaussian Mixture Models. The Gaussian mixture model is one of the most widely used statistical models for machine learning tasks, being the most flexible parametric model. We are particularly interested in verifying whether a more efficient global memory access can reduce the associated overhead, providing better usage of the CUDA cores and, thus, improving performance.

1.1. Expectation-Maximization Algorithm (EM)

Statistical models are used in many machine learning techniques. The maximum likelihood method (Maximum-Likelihood Estimation, or just MLE) can estimate the parameters of a statistical model from a set of sample data, for further use in classification tasks, for instance.

An important concern is what to do when some data samples are missing. It is still possible to estimate the model parameters: the Expectation-Maximization (EM) algorithm allows learning of the parameters that govern the distribution of the sample data even when some features are missing (Sujaritha and Annadurai, 2011; Poongothai and Sathiyabama, 2012; Malarvezhi and Kumar, 2013).

The MLE is defined as Equation 1:

\Theta_{ML}:\quad \sum_{k} \frac{\partial \ln p_y(y_k; \Theta)}{\partial \Theta} = 0 \qquad (1)

where y represents the full set of sample data. To deal with missing data, though, EM iteratively maximizes the expectation of the likelihood function, given the observed samples and the estimate Θ of the current iteration.

The EM algorithm consists of two steps. The E-step computes the expectation of the log-likelihood, conditioned on the set of observed data X and the current value of the parameters Θ^t (Equation 2):

Q(\Theta; \Theta^{t}) \equiv E\!\left[\sum_{k} \ln p_y(y_k; \Theta) \,\middle|\, X; \Theta^{t}\right] \qquad (2)

The M-step then chooses the parameters that maximize this expectation (Equation 3):

\Theta^{t+1} = \arg\max_{\Theta} Q(\Theta; \Theta^{t}) \qquad (3)

The algorithm starts from an initial estimate Θ^(0) (usually an arbitrary choice) and iterates through both steps until a stop criterion is satisfied. The most widely used criterion is the variation of the estimate between steps, defined as Equation 4:

\lVert \Theta^{t+1} - \Theta^{t} \rVert \le \varepsilon \qquad (4)

1.2. EM for GMM Estimation

Gaussian classifiers are the most widely used methods for supervised classification. However, these methods have limitations when dealing with problems where the classes are not linearly separable. Also, they cannot deal with non-Gaussian data, since their discriminant functions are linear or quadratic. A workaround for such limitations is to combine probability density functions (pdfs). Indeed, this approach is widely used because it is a parametric method that can be applied to non-linear classification problems. Such a technique is known as a Finite Mixture Model and its probability function is defined as Equation 5:

p(x) = \sum_{j=1}^{g} \pi_j \, p(x; \Theta_j) \qquad (5)

where g is the number of components (pdfs) of the mixture; π_j is the probability of the j-th component (commonly known as the weight of the component), such that \sum_{j=1}^{g} \pi_j = 1; and p(x; Θ_j) is the pdf of the component with respect to the parameters Θ_j.

When Gaussian models are used, each component assumes a multivariate normal distribution, with Θ_j = {µ_j, Σ_j}. This model is known as the Gaussian Mixture Model (GMM) (Shanmugapriya and Nallusamy, 2014; Ramalingam and Dhanalakshmi, 2014). Equation 5 can thus be rewritten as Equation 6:

p(x) = \sum_{j=1}^{g} \pi_j \, N(x; \mu_j, \Sigma_j) \qquad (6)
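To make Equation 6 concrete, the following is a minimal sketch (ours, not taken from the paper's implementation) of how a single weighted component probability π_j N(x; µ_j, Σ_j) could be evaluated in CUDA C, assuming the inverse covariance matrix and its determinant have been precomputed (as the φ-kernel described later also does); the function and parameter names and the fixed dimension are illustrative assumptions.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define DIM 2  /* illustrative data dimension */

/* Weighted multivariate normal density pi_j * N(x; mu_j, Sigma_j) of Equation 6.
   inv_sigma holds the precomputed inverse covariance (row-major DIM x DIM) and
   det_sigma its determinant, e.g., obtained from an LU decomposition. */
__host__ __device__
double weighted_gaussian(const double *x, const double *mu,
                         const double *inv_sigma, double det_sigma, double pi_j)
{
    double diff[DIM];
    for (int d = 0; d < DIM; ++d)
        diff[d] = x[d] - mu[d];

    /* Mahalanobis term (x - mu)^T * Sigma^{-1} * (x - mu) */
    double maha = 0.0;
    for (int r = 0; r < DIM; ++r) {
        double acc = 0.0;
        for (int c = 0; c < DIM; ++c)
            acc += inv_sigma[r * DIM + c] * diff[c];
        maha += diff[r] * acc;
    }

    double norm = pow(2.0 * M_PI, DIM / 2.0) * sqrt(det_sigma);
    return pi_j * exp(-0.5 * maha) / norm;
}

The full mixture density p(x) of Equation 6 is then simply the sum of this quantity over the g components; as discussed below, each such evaluation is independent of the others, which is what makes the estimation procedure attractive for GPU parallelization.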
For GMM estimation, the E-step reduces to computing, for each sample x_i and each component j, the posterior probability w_ij of the component given the sample (Equation 7):

w_{ij} = \frac{\pi_j \, N(x_i; \mu_j, \Sigma_j)}{\sum_{k=1}^{g} \pi_k \, N(x_i; \mu_k, \Sigma_k)} \qquad (7)

The M-step then re-estimates the weight, the mean vector and the covariance matrix of each component (Equations 8 to 10):

\pi_j = \frac{1}{n} \sum_{i=1}^{n} w_{ij} \qquad (8)

\mu_j = \frac{\sum_{i=1}^{n} w_{ij} \, x_i}{\sum_{i=1}^{n} w_{ij}} \qquad (9)

\Sigma_j = \frac{1}{n \pi_j} \sum_{i=1}^{n} w_{ij} (x_i - \mu_j)(x_i - \mu_j)^{T} \qquad (10)

As described above, the EM algorithm iterates until the convergence of the model likelihood (the stopping criterion). It is possible, though, that the algorithm becomes stuck in a local minimum, leading to non-optimal solutions. It is thus common practice to repeat the training process a few more times, initializing the parameters with different values, and in the end to choose the best solution (Webb and Copsey, 2011). Moreover, both the calculation of w and the calculation of the parameters π, µ and Σ iterate over all sample data. For a large dataset, the training time can be huge, especially when there is a high number of components. Despite such limitation, the calculations performed for each data point are independent of each other and thus fully parallelizable.

1.3. Previous Work on Parallel Implementation of EM and GMM Learning

In Tagare et al. (2010), the authors present a strategy to speed up the EM algorithm using domain reduction. The approach considers the use of three different kernels to compute the latent probabilities and the Riemann sums for the parameter updates. The EM is used for reconstructing 3D volumes from noisy Electron Cryomicroscopy images of single macromolecular particles. The work focuses on problems other than GMM.

In Chen et al. (2012), the authors derive an algorithmic method for incremental GMM learning from a […]

In Kumar et al. (2009), the authors spread the EM algorithm over six CUDA kernels for a fast parallel parametric estimation of GMM. The work focuses on speeding up the EM algorithm through improvements of the kernels and data organization, not on using it for specific problems.

In Azhari and Ergün (2011), the authors implement a CUDA version of the EM algorithm for speaker verification based on Gaussian Mixture Modeling-Universal Background Modeling (GMM-UBM). The major difference between this and the previously presented works is that it uses only two kernels for the EM, one for the E-step and one for the M-step. There is also a parallel implementation of the k-means algorithm.

In Machlica et al. (2011), the authors present an implementation of the parallel EM algorithm for GMM training. According to the paper, their approach offers better memory occupancy and greater speedup due to better coalesced memory access. Their results were obtained using adapted data taken from the 2008 NIST Speaker Recognition Evaluation.

2. MATERIALS AND METHODS

The method used to provide a parallelized implementation of EM for GMM is largely founded on how to deal with the specificities of CUDA. In order to provide a better understanding of the approach, we first depict CUDA itself.

2.1. Technological Background: CUDA

Multiprocessors are responsible for GPU internal processing and there may exist many of them, varying according to the graphics card's model. Every multiprocessor is composed of smaller processors, the so-called core processors. These core processors share the same instruction unit, which belongs to the multiprocessor.
This means that the CUDA™ architecture works as a Single Instruction Multiple Data (SIMD) system, where every multiprocessor is capable of processing only one instruction at a time. The basic parallel processing element is a thread, just like on the CPU, but there are two other important concepts: the block and the grid. A block is a composition of up to t threads, where t is the maximum value supported by the GPU. It is also the element seen by the multiprocessors, which are responsible for fully processing all the threads of a block before a new ready block is chosen. A grid, on the other hand, is an aggregation of multiple blocks.

Both the grid of blocks and the blocks of threads can be uni-, two- or three-dimensional. A kernel call needs to specify the dimensions of the grid and of the blocks and thus it is possible to run kernels with different arrangements of threads within the same application.

The memory hierarchy consists of local memory, global memory and shared memory. The local memory is a high-speed memory, private to each thread. The shared memory is larger and slower than the local memory, but it is accessible by all threads of the same block, allowing threads to work collaboratively within a single run. The global memory is the largest and slowest memory of the GPU, but it is accessible by any thread, thereby allowing different kernels to share common data.

2.2. Rationale of the Parallelization Approach

The calculation of w_ij in the E-step (Equation 7) and the calculations of the weights π_j, means µ_j and covariances Σ_j are extremely parallelizable, as they iterate over all the data and are independent of each other.

An important point to be considered is the transfer of data from the host (main memory) to the GPU memory. The bus transfer between these two memories is slow and its usage should be avoided. As the algorithm must run iteratively in order to satisfy a stopping criterion and all steps are parallelizable, it would be more effective if the whole main loop of the algorithm could run on the GPU, avoiding such data transfers. However, the arrangement of threads is statically defined in the kernel. This becomes an inconvenience, since the arrangement of threads is an important setting for better efficiency of the parallelization and each step of the EM algorithm requires a different arrangement.

Similarly to the approaches of Machlica et al. (2011) and Kumar et al. (2009), in our proposal the main loop of the algorithm is implemented sequentially and different CUDA kernels are in charge of running different steps of the algorithm.

The implemented CUDA kernels are depicted as follows:

• p-kernel: For each Gaussian component j, computes the probability of each data point x_i conditional to the parameters Θ_j, multiplied by the weight of the component π_j (see the simplified sketch below). In this kernel, the thread blocks are arranged in a j × m grid, where the m blocks of row j are responsible for the calculation for component j
• p̂-kernel: For each data point x_i, normalizes the probabilities computed in the previous kernel for each component j; this concerns the w_ij values of Equation 7. In this step, m blocks of threads are used and each block is responsible for normalizing the probabilities of one data point at a time, until the entire probability base is normalized
• m-kernel: For each Gaussian, estimates its marginal probability, that is, calculates the sum of the probabilities of the data related to each component j. j blocks of threads are used and each block is responsible for performing the sum of one component
• µ-kernel: For each Gaussian, re-estimates the mean vector µ that maximizes the likelihood, as described in Equation 9. Again, using j blocks of threads, each block is responsible for one component
• Σ-kernel: Re-estimates the covariance matrices Σ of the components. In this step, we use an arrangement of 2D blocks, where blocks of threads are organized in a square matrix of order N, where N is the dimension of the data. Thus, each block is responsible for re-estimating an element s_ij of each of the covariance matrices
• π-kernel: Re-estimates the weights π of the components. Since the weight of a given component is given by the normalized marginal probability, as described in Equation 8, this step contains only a single block of j threads, which performs the summation and normalization of the marginal probabilities
• φ-kernel: Calculates the determinants and inverse matrices of the covariance matrices, which are used to calculate the probabilities of the p-kernel step. In this step, the matrix decomposition technique called LU decomposition is used. In such a technique, the matrix is rewritten as the product of a lower triangular matrix (matrix L, lower) by an upper triangular matrix (matrix U, upper). Using j blocks of one thread only, each thread executes the LU decomposition algorithm sequentially

Figure 1 summarizes the distribution of the component parameters and the input data over the arrangement of blocks and threads of the grid.
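To make the arrangement described for the p-kernel more tangible, the following is a simplified sketch of such a kernel (ours, not the pseudocode provided by the paper): the grid has one row of blocks per Gaussian component j and one thread per data point x_i and, purely to keep the example short, the covariance matrices are assumed diagonal; all identifiers and the block size are illustrative assumptions.

/* Simplified p-kernel sketch: grid = dim3(m, g) with m = ceil(n / blockDim.x),
   so blockIdx.y selects the Gaussian component and each thread handles one x_i.
   It stores pi_j * N(x_i; mu_j, Sigma_j), i.e., the unnormalized w_ij of Eq. 7. */
__global__ void p_kernel(const float *data,    /* n x dim samples, row-major    */
                         const float *mu,      /* g x dim component means       */
                         const float *var,     /* g x dim diagonal covariances  */
                         const float *weight,  /* g component weights pi_j      */
                         float *prob,          /* g x n output probabilities    */
                         int n, int dim)
{
    const float LOG_2PI = 1.83787706641f;            /* ln(2*pi)        */
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* data index      */
    int j = blockIdx.y;                              /* component index */
    if (i >= n) return;

    float maha = 0.0f, logdet = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = data[i * dim + d] - mu[j * dim + d];
        maha   += diff * diff / var[j * dim + d];
        logdet += logf(var[j * dim + d]);
    }
    float logN = -0.5f * (maha + logdet + dim * LOG_2PI);
    prob[j * n + i] = weight[j] * expf(logN);
}

A matching host-side launch for this arrangement would look like p_kernel<<<dim3((n + 255) / 256, g), 256>>>(...), i.e., the j × m grid of blocks mentioned in the kernel description above.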
[…] where Ts is the execution time of the sequential algorithm and Tp is the execution time of the parallel algorithm.

3.3. Experimentation Sets

The GPU used in the first two sets of experiments was an NVidia GeForce GTS 250 with 16 processors. In the first experimentation set, the target number of Gaussians (components) has been varied, with a fixed quantity of 30 iterations to estimate the parameters. Figure 2 shows a comparison between the execution time of the parallel version and that of its respective sequential version. This execution has shown a speedup S16 = 7. The runtime of the parallelized version varies from 0.781 to 15.094 sec. This time results from the execution of the φ-kernel step, which actually runs sequentially on the GPU. The second set of experiments was conducted by varying the number of instances in the dataset, from 23,344 to 263,256. In this case, the algorithms have performed 30 iterations to estimate two Gaussians. Results of this step are shown in Fig. 3. In this case, the speedup of the parallel algorithm was S16 = 6.

3.4. Coalesced Access to Global Memory

We have carried out modifications to the code of the p-kernel and the p̂-kernel to provide better access to global memory and better shared memory usage. The kernels used in such arrangements have been rearranged according to the size of the warp, in order to ensure access aligned with the "cache line" and, at the same time, to keep all the CUDA cores employed as long as possible.

We have carried out modifications to the code in order to allow a more efficient memory access and a more appropriate arrangement of threads. An important issue in regards to memory access is the coalesced access to the global memory, i.e., the warp threads need to access adjacent memory blocks, aligned with the "cache line" (fixed blocks of memory that are loaded at once into the cache memory). In this way, the accesses required by the various warp threads are part of a single transaction, thereby reducing the overhead of the global memory access. The arrangement of threads in the grid and its blocks is also important for the coalesced access, since a better arrangement ensures that the accesses of the active warp threads are aligned on the cache-line size.

In order to verify the performance gain achieved with the changes, tests with an NVidia GeForce GT 555M GPU have been performed. An important metric to consider when evaluating the arrangement and the coalesced access is the occupancy, which refers to how effectively the hardware (the CUDA cores) is kept busy, i.e., the longer it is kept busy, the better the effective use of the hardware.

Experiments have been performed by varying the number of Gaussians and the size of the database, for both versions of the p-kernel and p̂-kernel codes (E-step). Firstly, we have fixed the database size at 23,344 instances and varied the number of Gaussians. In the second set of experiments, we have varied the size of the database for a fixed number of Gaussians (eight). Figures 4 and 5 show the execution time (in milliseconds) for both kernels by varying the number of Gaussians: 1, 2, 4, 8, 16, 24 and 32. Figures 6 and 7 illustrate the runtimes of both kernels by varying the size of the database. In this case, the average speedup was 19x and 30x for the p-kernel and the p̂-kernel, respectively. For both kernels, the largest observed speedups (~21x and ~32x) occurred with the database containing 152,526 instances.
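As a generic illustration of the kind of change discussed in this subsection (a sketch of the coalescing idea, not the modified kernel code itself), the two fragments below read the same n × dim dataset. In the first, each thread walks its own row of a row-major matrix, so consecutive threads of a warp touch addresses dim elements apart; in the second, the data are stored transposed, one feature per row, so consecutive threads read consecutive, cache-line-aligned addresses and each warp request can be served by a single transaction.

/* Strided version: thread i reads data[i*dim + d]; within a warp the accesses
   are dim floats apart, so one request spans several cache lines. */
__global__ void sum_features_rowmajor(const float *data, float *out, int n, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = 0.0f;
    for (int d = 0; d < dim; ++d)
        s += data[i * dim + d];
    out[i] = s;
}

/* Coalesced version: the same data stored transposed (dim x n); for each
   feature d, threads i, i+1, ... read consecutive addresses, so a warp's
   request maps onto a single aligned memory transaction. */
__global__ void sum_features_transposed(const float *dataT, float *out, int n, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float s = 0.0f;
    for (int d = 0; d < dim; ++d)
        s += dataT[d * n + i];
    out[i] = s;
}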
Fig. 2. Execution time for both parallel and sequential versions of EM as the number of Gaussians increases
Fig. 3. Execution time for both parallel and sequential versions of EM as the number of instances of the dataset varies
Fig. 4. Comparison of the execution time of the p-kernel with and without coalesced access by increasing the number of Gaussians
Fig. 5. Comparison of the runtime of the p̂-kernel with and without coalesced access by increasing the number of Gaussians
Fig. 6. Comparison of the execution time of the p-kernel with and without coalesced access by increasing the database size
Fig. 7. Comparison of the runtime of the p̂-kernel with and without coalesced access by increasing the database size
[…] the algorithm. Pseudocode for each CUDA kernel implementation is properly provided. Experiments performed over a UCI database with a varying number of Gaussians have shown a speedup of 7 compared to the sequential implementation of the EM algorithm. We have also provided modifications in two of the CUDA kernels in order to allow more coalesced access to global memory. As a result, we have achieved up to 56.4% of achieved occupancy. The proposed approach thus contributes to the state-of-the-art of the research in ASR by providing an effective algorithm for training GMMs.

Future work consists in providing modifications to the other three CUDA kernels in order to provide more coalesced access to global memory, in a similar manner to what we have done for the first two of them.

6. ACKNOWLEDGEMENT

The researchers thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq-Brasil) for the financial support [Universal 14/2012, Processo 483437/2012-3] and for granting a scholarship to M.V.O. Medeiros. The authors also thank the Fundação de Apoio à Pesquisa e à Inovação Tecnológica do Estado de Sergipe (FAPITEC-Sergipe-Brasil) for granting a scholarship to G.F. Araújo.

7. REFERENCES

Azhari, M. and C. Ergün, 2011. Fast Universal Background Model (UBM) training on GPUs using Compute Unified Device Architecture (CUDA). Int. J. Elec. Comput. Sci., 11: 49-55.
Chen, C., D. Mu, H. Zhang and B. Hong, 2012. A GPU-accelerated approximate algorithm for incremental learning of Gaussian mixture model. Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops, (WF '12), pp: 1937-1943.
Kumar, N.S.L.P., S. Satoor and I. Buck, 2009. Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, (CC '09), pp: 103-109.
Lee, S. and D. Park, 2012. Evaluation of CUDA for X-ray imaging system. In: Computational Intelligence and Intelligent Systems, Springer Berlin Heidelberg, pp: 621-625.
Machlica, L., J. Vanek and Z. Zajic, 2011. Fast estimation of Gaussian mixture model parameters on GPU using CUDA. Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, (AT '11), pp: 167-172.
Malarvezhi, P. and R. Kumar, 2013. A novel two stage carrier frequency offset estimation and compensation scheme in multiple input multiple output-orthogonal frequency division multiplexing system using expectation and maximization iteration. J. Comput. Sci., 9: 1526-1533. DOI: 10.3844/jcssp.2013.1526.1533
Meng, C., L. Wang, Z. Cao, X. Ye and L. Feng, 2013. Acceleration of a high order finite-difference WENO scheme for large-scale cosmological simulations on GPU. Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, (DPW '13), IEEE Computer Society, pp: 2071-2078.
Mielikainen, J., B. Huang, H. Huang and M. Goldberg, 2012. Improved GPU/CUDA based parallel Weather and Research Forecast (WRF) Single Moment 5-Class (WSM5) cloud microphysics. IEEE J. Selected Top. Applied Earth Observ. Remote Sens., 5: 1256-1265.
Pangborn, A., 2010. Scalable data clustering using GPUs. MSc Thesis, Rochester Institute of Technology.
Pham, V., P. Vo and V. Hung, 2010. GPU implementation of extended Gaussian mixture model for background subtraction. Proceedings of the International Conference on Computing and Communication Technologies, Research, Innovation and Vision for the Future, (VF '10), pp: 1-4.
Poongothai, K. and S. Sathiyabama, 2012. Efficient web usage miner using decisive induction rules. J. Comput. Sci., 8: 835-840. DOI: 10.3844/jcssp.2012.835.840
Ramalingam, T. and P. Dhanalakshmi, 2014. Speech/music classification using wavelet based feature extraction techniques. J. Comput. Sci., 10: 34-44. DOI: 10.3844/jcssp.2014.34.44
Santos, B. and H. Macedo, 2012. Improving CUDA™ C/C++ encoding readability to foster parallel application development. SIGSOFT Softw. Eng. Notes, 37: 1-5. DOI: 10.1145/2088883.2088897
Shanmugapriya, N. and R. Nallusamy, 2014. A new content based image retrieval system using GMM and relevance feedback. J. Comput. Sci., 10: 330-340. DOI: 10.3844/jcssp.2014.330.340
Subbaraj, P. and P. Sivakumar, 2012. Parallel memetic algorithm for VLSI circuit partitioning problem using graphical processing units. J. Comput. Sci., 8: 705-710. DOI: 10.3844/jcssp.2012.705.710
Sujaritha, M. and S. Annadurai, 2011. A new modified Gaussian mixture model for color-texture segmentation. J. Comput. Sci., 7: 279-283. DOI: 10.3844/jcssp.2011.279.283
Tagare, H., A. Barthel and F. Sigworth, 2010. An adaptive expectation-maximization algorithm with GPU implementation for electron cryomicroscopy. J. Struct. Biol., 171: 256-265. DOI: 10.1016/j.jsb.2010.06.004
Tharawadee, N., P. Terdtoon and N. Kammuang-lue, 2013. An investigation of thermal characteristics of a sintered-wick heat pipe with double heat sources. Am. J. Applied Sci., 10: 1077-1086. DOI: 10.3844/ajassp.2013.1077.1086
Webb, A. and K. Copsey, 2011. Statistical Pattern Recognition. 3rd Edn., John Wiley and Sons, Chichester, ISBN-10: 1119952964, pp: 672.