
Journal of Computer Science 10 (10): 2124-2134, 2014

ISSN: 1549-3636
© 2014 Science Publications
doi:10.3844/jcssp.2014.2124.2134 Published Online 10 (10) 2014 (http://www.thescipub.com/jcs.toc)

PARALLEL IMPLEMENTATION OF EXPECTATION-MAXIMISATION ALGORITHM FOR THE TRAINING OF GAUSSIAN MIXTURE MODELS

1Araújo, G.F., 2H.T. Macedo, 2M.T. Chella, 2C.A.E. Montesco and 2M.V.O. Medeiros

1Núcleo de Pós-Graduação em Ciência da Computação, UFS, São Cristóvão, Brasil
2Departamento de Computação, UFS, São Cristóvão, Brasil

Received 2013-11-18; Revised 2013-11-22; Accepted 2014-07-07


ABSTRACT
Most machine learning algorithms need to handle large data sets. This feature often leads to limitations on
processing time and memory. The Expectation-Maximization (EM) algorithm is one such algorithm; it is used
to train one of the most commonly used parametric statistical models, the Gaussian Mixture Model
(GMM). All steps of the algorithm are potentially parallelizable, since they iterate over the entire data set. In
this study, we propose a parallel implementation of EM for training GMMs using CUDA. Experiments are
performed with a UCI dataset and results show a speedup of 7 compared to the sequential version. We
have also carried out modifications to the code in order to provide better access to global memory and
better shared memory usage. We have achieved up to 56.4% achieved occupancy, regardless of the number of
Gaussians considered in the set of experiments.

Keywords: Expectation-Maximization (EM), Gaussian Mixture Models (GMM), CUDA

1. INTRODUCTION

Machine Learning (ML) algorithms are often costly, since learning is a task that requires a large amount of knowledge and its constant improvement, thus requiring massive data computation. A major problem of massive computing is the limitation of mainstream sequential processing in older computer architectures. Such limitation can be overcome using parallel processing of data provided on newer architectures.
One of these recent architectures is the NVIDIA™ CUDA™ architecture, which is a framework for developing general-purpose programs that use the power of Graphical Processing Units (GPUs) to perform execution. It is possible to use the CUDA-C programming language, for instance, to write parallelized source code.
GPUs have a high number of internal multiprocessors, optimized for performing several Computer Graphics calculations in parallel.
The clear advantage of using GPUs is their small cost compared to clusters or supercomputers and their processing power compared to multi-core processors. Even the former NVIDIA™ GeForce™ 8400 GS graphics card, for instance, is able to run up to 32 threads in parallel per clock cycle, under some restrictions.
The work on CUDA to provide parallelized implementations of important algorithms in different domains can be observed in recent scientific literature (Subbaraj and Sivakumar, 2012; Tharawadee et al., 2013; Meng et al., 2013; Mielikainen et al., 2012; Lee and Park, 2012). Results show average performance gains of up to 30 times compared to processing the same problem using conventional CPUs. There are also efforts to improve the readability of CUDA programs through the development of an Application Programming Interface (API) for C/C++ which automates the conversion of sequential code into parallelized code (Santos and Macedo, 2012).
Corresponding Author: Araújo, G.F., Núcleo de Pós-Graduação em Ciência da Computação, UFS, São Cristóvão, Brasil


In this study we present a CUDA parallel implementation of the Expectation-Maximization algorithm for the estimation of Gaussian Mixture Models. The Gaussian mixture model is one of the most widely used statistical models for machine learning tasks, being the most flexible parametric model.
We are particularly interested in verifying whether a more efficient global memory access can reduce the concerned overhead, providing better usage of CUDA cores and, thus, improving performance.

1.1. Expectation-Maximization Algorithm (EM)

Statistical models are used in many machine learning techniques. The maximum likelihood method (Maximum-Likelihood Estimation, or just MLE) can estimate the parameters of a statistical model from a set of sample data, for further usage in classification tasks, for instance.
An important concern is what to do when some data samples are missing. It is still possible to perform estimation of the model parameters. The Expectation-Maximization (EM) algorithm allows learning of the parameters that govern the distribution of the sample data with some missing features (Sujaritha and Annadurai, 2011; Poongothai and Sathiyabama, 2012; Malarvezhi and Kumar, 2013).
The MLE is defined as Equation 1:

\Theta_{ML}: \sum_k \frac{\partial \ln p_y(y_k; \Theta)}{\partial \Theta} = 0    (1)

where y represents the full set of sample data. To deal with missing data, though, the EM can iteratively maximize the expectation of the likelihood function, given the observed samples and the estimate of the current iteration Θ.
The EM algorithm consists of two steps.
The E-step computes the expectation of the log-likelihood, conditionally on the set of observed data and the current value of the parameters Θt, Equation 2:

Q(\Theta; \Theta^t) \equiv E\left[\sum_k \ln p_y(y_k; \Theta) \mid X; \Theta^t\right]    (2)

The M-step computes the (t+1)-th parameter vector Θ that maximizes Q(Θ; Θt), given by Equation 3:

\Theta^{t+1}: \frac{\partial Q(\Theta; \Theta^t)}{\partial \Theta} = 0    (3)

The algorithm starts from a Θ(0) (usually defined arbitrarily) and iterates through both steps until a stop criterion is satisfied. The widely used criterion is the variation of Q between steps, defined as Equation 4:

\left\| \Theta^{t+1} - \Theta^{t} \right\| \leq \varepsilon    (4)

1.2. EM for GMM Estimation

Gaussian classifiers are the most widely used methods for supervised classification. However, these methods have limitations when dealing with problems where the classes are not linearly separable. Also, they cannot deal with non-Gaussian data, since their discriminant functions are linear or quadratic. A workaround for such limitation is to combine probability density functions (pdfs). Indeed, this approach is widely used because it is a parametric method that can be applied to non-linear classification problems. Such a technique is known as a Finite Mixture Model and its probability function is defined as Equation 5:

p(x) = \sum_{j=1}^{g} \pi_j \, p(x; \Theta_j)    (5)

where g is the number of components (pdfs) of the mixture; πj is the probability of the component (commonly known as the weight of the component), such that \sum_{j=1}^{g} \pi_j = 1; and p(x; Θj) is the pdf of the component with respect to the parameters Θj.
When we use Gaussian models, each component assumes a multivariate normal distribution, where Θj = {µj, Σj}. This model is known as the Gaussian Mixture Model (GMM) (Shanmugapriya and Nallusamy, 2014; Ramalingam and Dhanalakshmi, 2014). Equation 5 can thus be rewritten as Equation 6:

p(x) = \sum_{j=1}^{g} \pi_j \, N(x; \mu_j, \Sigma_j)    (6)
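To make Equation 6 concrete, the following host-side sketch (ours, not taken from the paper) evaluates the mixture density for a single sample. It assumes the inverse covariance matrices and determinants are already available, much as the φ-kernel described later precomputes them; the struct fields and function names are illustrative assumptions.

#include <cmath>

/* Illustrative sketch of Equation 6: p(x) = sum_j pi_j * N(x; mu_j, Sigma_j).
   Inverse covariances and determinants are assumed precomputed; all names
   are ours and purely illustrative. */
struct Component {
    double weight;        /* pi_j */
    const double* mean;   /* mu_j, length dim */
    const double* invCov; /* Sigma_j^{-1}, row-major dim x dim */
    double detCov;        /* |Sigma_j| */
};

static double gaussianPdf(const double* x, const Component& c, int dim) {
    const double PI = 3.14159265358979323846;
    double maha = 0.0;  /* (x - mu)^T Sigma^{-1} (x - mu) */
    for (int r = 0; r < dim; ++r) {
        double acc = 0.0;
        for (int s = 0; s < dim; ++s)
            acc += c.invCov[r * dim + s] * (x[s] - c.mean[s]);
        maha += (x[r] - c.mean[r]) * acc;
    }
    double norm = std::pow(2.0 * PI, 0.5 * dim) * std::sqrt(c.detCov);
    return std::exp(-0.5 * maha) / norm;
}

/* Mixture density of Equation 6 for one sample x and g components. */
double mixturePdf(const double* x, const Component* comps, int g, int dim) {
    double p = 0.0;
    for (int j = 0; j < g; ++j)
        p += comps[j].weight * gaussianPdf(x, comps[j], dim);
    return p;
}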
 

But how to find the parameters that maximize the likelihood of the GMM? Typically, the parameters of the components of GMMs are estimated using the EM algorithm described in the previous section. For the GMM, the EM steps are defined as follows.
E-step: Calculate, for each given i, Equation 7:


w_{ij}^{t} = \frac{\pi_j^t \, N(x_i; \mu_j^t, \Sigma_j^t)}{\sum_k \pi_k^t \, N(x_i; \mu_k^t, \Sigma_k^t)}    (7)

where πj, µj and Σj are the weights, means and covariance matrices of component j at step t.
M-step: For each given j, update the parameters using Equations 8 to 10:

\pi_j = \frac{1}{n} \sum_{i=1}^{n} w_{ij}    (8)

\mu_j = \frac{\sum_{i=1}^{n} w_{ij} \, x_i}{\sum_{i=1}^{n} w_{ij}}    (9)

\Sigma_j = \frac{1}{n \pi_j} \sum_{i=1}^{n} w_{ij} (x_i - \mu_j)(x_i - \mu_j)^{T}    (10)
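For reference, the updates of Equations 8 to 10 can be written as the sequential host-side sketch below. This is our own illustrative code with row-major buffers allocated by the caller; it is not the paper's kernel code.

/* Sequential M-step of Equations 8-10 (illustrative only).
   x: n samples of dimension dim (row-major), w: n x g responsibilities,
   pi: g weights, mu: g x dim means, sigma: g x dim x dim covariances. */
void mStep(const double* x, const double* w, int n, int g, int dim,
           double* pi, double* mu, double* sigma) {
    for (int j = 0; j < g; ++j) {
        /* Equation 8: pi_j = (1/n) sum_i w_ij */
        double sumW = 0.0;
        for (int i = 0; i < n; ++i) sumW += w[i * g + j];
        pi[j] = sumW / n;

        /* Equation 9: mu_j = sum_i w_ij x_i / sum_i w_ij */
        for (int d = 0; d < dim; ++d) {
            double acc = 0.0;
            for (int i = 0; i < n; ++i) acc += w[i * g + j] * x[i * dim + d];
            mu[j * dim + d] = acc / sumW;
        }

        /* Equation 10: Sigma_j = (1/(n pi_j)) sum_i w_ij (x_i-mu_j)(x_i-mu_j)^T */
        for (int r = 0; r < dim; ++r) {
            for (int s = 0; s < dim; ++s) {
                double acc = 0.0;
                for (int i = 0; i < n; ++i) {
                    double dr = x[i * dim + r] - mu[j * dim + r];
                    double ds = x[i * dim + s] - mu[j * dim + s];
                    acc += w[i * g + j] * dr * ds;
                }
                sigma[(j * dim + r) * dim + s] = acc / (n * pi[j]);
            }
        }
    }
}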
As described above, the EM algorithm iterates until the convergence of the model likelihood (stopping criterion). It is possible, though, that the algorithm becomes stuck in a local minimum, leading to non-optimal solutions. It is thus a common practice to repeat the training process a few more times, initializing the parameters with different values and, in the end, choosing the best solution (Webb and Copsey, 2011).
Moreover, both the calculation of w and the calculation of the parameters π, µ and Σ iterate over all sample data. For a large dataset, the time of the training process can be huge, especially in cases where there is a high number of components. Despite such limitation, the calculations performed for each data point are independent and thus fully parallelizable.

1.3. Previous Work on Parallel Implementation of EM and GMM Learning

In Tagare et al. (2010), the authors present a strategy to speed up the EM algorithm using domain reduction. The approach considers the use of three different kernels to compute the calculation of latent probabilities and the Riemann sums for the parameter updates. The EM is used for reconstructing 3D volumes from noisy Electron Cryomicroscopy images of single macromolecular particles. The work focuses on problems other than GMM.
In Chen et al. (2012), the authors derive an algorithmic method for incremental GMM learning from a hypothesis-test and merging based algorithm. EM is not used. The most time-consuming part of the algorithm is accelerated by GPU. The Davies-Bouldin index is used to measure the cluster quality of the algorithm.
In Pham et al. (2010), the authors propose a GPU implementation of the Extended GMM for Background Subtraction (BGS), which is used in various computer vision problems.
In Pangborn (2010), the author presents a strategy to speed up the EM algorithm for clustering on single and multiple GPUs. The approach consists in breaking the E-step into two kernels and the M-step into three kernels. The work includes a parallel version of the c-means algorithm.
In Kumar et al. (2009), the authors spread the EM algorithm over six CUDA kernels for a fast parallel parametric estimation of GMM. The work focuses on speeding up the EM algorithm through improvements of the kernels and data organization, not on using it for specific problems.
In Azhari and Ergün (2011), the authors implement a CUDA version of the EM algorithm for speaker verification based on Gaussian Mixture Modeling-Universal Background Modeling (GMM-UBM). The major difference between this and the previously presented works is that it uses only 2 kernels for the EM, one for the E-step and one for the M-step. There is also a parallel implementation of the k-means algorithm.
In Machlica et al. (2011), the authors present an implementation of the parallel EM algorithm for GMM training. According to the paper, their approach offers better memory occupancy and greater speedup due to more coalesced access. Their results were obtained using adapted data taken from the 2008 NIST Speaker Recognition Evaluation.

2. MATERIALS AND METHODS

The method to provide a parallelized implementation of EM for GMM is largely founded on how to deal with the specificities of CUDA. In order to provide a better understanding of the approach, we firstly depict CUDA itself.

2.1. Technological Background: CUDA

Multiprocessors are responsible for GPU internal processing and there may exist many of them, varying according to the graphics card's model. Every multiprocessor is composed of smaller processors, the so-called core processors. These core processors share the same instruction chip, which belongs to the multiprocessor.


This means that the CUDA™ architecture works as a Single Instruction Multiple Data (SIMD) system, where every multiprocessor is capable of processing only one instruction at a time. The basic parallel processing element is a thread, just like on a CPU, but there are two other important concepts: The block and the grid. A block is a composition of up to t threads, where t is the maximum value supported by the GPU. It is also the element seen by the multiprocessor, which is responsible for fully processing all the threads of a block before a new ready block is chosen. A grid, on the other hand, is an aggregation of multiple blocks.
Both the grid of blocks and the blocks of threads can be uni-, two- or three-dimensional. A kernel call needs to specify the dimensions of grid and blocks and thus, it is possible to run kernels with different arrangements of threads within the same application.
The memory hierarchy consists of local memory, global memory and shared memory. The local memory is a high speed memory and private to each thread. The shared memory is larger and slower than the local memory, but it is accessible by all threads of the same block, allowing threads to work collaboratively within a single run. The global memory is the largest and slowest memory of the GPU, but it is accessible by any thread, thereby allowing different kernels to share common data.
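To make the grid/block vocabulary concrete, the toy kernel below (ours, not from the paper) is launched over a two-dimensional grid of two-dimensional blocks; the dimensions are chosen arbitrarily for illustration.

#include <cuda_runtime.h>

/* Toy kernel: each thread writes its own flattened index, only to show how
   grid and block dimensions map to threads. */
__global__ void toyKernel(int* out, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    out[row * width + col] = row * width + col;
}

int main() {
    const int width = 256, height = 128;
    int* d_out;
    cudaMalloc(&d_out, width * height * sizeof(int));

    dim3 block(16, 16);                           /* 256 threads per block */
    dim3 grid(width / block.x, height / block.y); /* 2D grid of blocks */
    toyKernel<<<grid, block>>>(d_out, width);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}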
2.2. Rationale of the Parallelization Approach

The calculation of wij in the E-step (Equation 7) and the calculations of the weights πj, means µj and covariances Σj are extremely parallelizable, as they iterate over all the data and are independent of each other.
An important point to be considered is the transfer of data from the host (main memory) to the GPU memory. The bus transfer between these two memories is slow and its usage should be avoided. As the algorithm must run iteratively in order to satisfy a stopping criterion and all steps are parallelizable, it would be more effective if the whole main loop of the algorithm could run on the GPU, to avoid such data transfer. However, the arrangement of threads is statically defined in the kernel. This becomes an inconvenience, since the arrangement of threads is an important setting for a better efficiency of parallelization and each step of the EM algorithm requires a different arrangement.
Similarly to the approaches of (Machlica et al., 2011) and (Kumar et al., 2009), in our proposal the main loop of the algorithm is implemented sequentially and different CUDA kernels are in charge of running different steps of the algorithm.
The implemented CUDA kernels are depicted as follows:

• p-kernel: For each Gaussian component j, computes the probability of each data point xi conditional on the parameters Θj, multiplied by the weight of the component πj. In this kernel, the thread blocks are arranged in a j×m grid, where the m blocks of row j are responsible for the calculation for component j
• p̂-kernel: For each data point xi, normalizes the probabilities computed in the previous kernel for each component j. It concerns the wij values of Equation 7. In this step, m blocks of threads are used and each block is responsible for normalizing the probabilities of a given data point at a time, until the entire probability base is normalized
• m-kernel: For each Gaussian, estimates its marginal probability, that is, calculates the sum of the probabilities of the data related to each component j. Here, j blocks of threads are used and each block is responsible for performing the sum of one component
• µ-kernel: For each Gaussian, re-estimates the mean vector µ that maximizes the likelihood, as described in Equation 9. Again, using j blocks of threads, each block is responsible for one component
• Σ-kernel: Re-estimates the covariance matrices Σ of the components. In this step, we use an array of 2D blocks, where blocks of threads are organized in a square matrix of order N, where N is the dimension of the data. Thus, each block is responsible for re-estimating one element sij of each of the covariance matrices
• π-kernel: Re-estimates the weights π of the components. Since the weight of a given component is given by the normalized marginal probability, as described in Equation 8, this step contains only a single block of j threads, which performs the summation and normalization of the marginal probabilities
• φ-kernel: Calculates the determinants and inverse matrices of the covariance matrices, which are used to calculate the probabilities of the p-kernel step. In this step, the matrix decomposition technique called LU decomposition is used. In such technique, we rewrite the matrix as the product of a lower triangular matrix (L matrix, lower) by an upper triangular matrix (U matrix, upper). Using j blocks of one thread only, the thread executes the LU decomposition algorithm sequentially (a sketch is given after this list)

Figure 1 summarizes the distribution of component parameters and input data on the arrangement of blocks and threads of the grid.
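As referenced in the φ-kernel item above, a single thread per component performs an LU decomposition sequentially. The sketch below is our own Doolittle-style rendering without pivoting, under the assumption of well-conditioned covariance matrices; it returns the determinant as the product of the diagonal of U, and the inversion by forward/back substitution that the φ-kernel also performs is omitted for brevity. All names are illustrative.

/* Hedged sketch of the phi-kernel's LU step: factor A (dim x dim, row-major)
   in place into L (unit lower) and U (upper), Doolittle style, no pivoting,
   and return det(A) as the product of U's diagonal. Illustrative only. */
__device__ double luDeterminantInPlace(double* A, int dim) {
    for (int k = 0; k < dim; ++k) {
        for (int i = k + 1; i < dim; ++i) {
            double factor = A[i * dim + k] / A[k * dim + k]; /* L entry */
            A[i * dim + k] = factor;
            for (int j = k + 1; j < dim; ++j)
                A[i * dim + j] -= factor * A[k * dim + j];   /* update U part */
        }
    }
    double det = 1.0;
    for (int k = 0; k < dim; ++k) det *= A[k * dim + k];
    return det;
}

/* One block per component, a single working thread per block. */
__global__ void phiKernelSketch(double* covariances, double* determinants, int dim) {
    if (threadIdx.x == 0) {
        double* A = covariances + blockIdx.x * dim * dim;
        determinants[blockIdx.x] = luDeterminantInPlace(A, dim);
    }
}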


Fig. 1. Distribution of component parameters and input data on the grid

2.3. The Algorithm

The EM for GMM estimation algorithm takes as input the samples from the dataset, the number of Gaussians to be estimated and the threshold used as the stopping criterion. First, the algorithm initializes the parameters of each Gaussian (weight π, mean µ and covariance matrix Σ). Then, at each iteration and for each sample, it estimates the likelihoods for each Gaussian and normalizes them. Finally, the parameters of the Gaussians are re-estimated using the likelihood values. Iterations occur until the stopping criterion is satisfied (Algorithm 1). Each thread, one per block, estimates the likelihood of sample j on Gaussian i, according to the position (i, j) of its block in the grid (Algorithm 2). The threads of a block perform the normalization of the likelihood values of a sample, using the reduction technique (Algorithm 3). The threads of a block perform the calculation of the marginal probability of a Gaussian (Algorithm 4). The threads of a block perform the re-estimation of the mean parameter µ of a Gaussian (Algorithm 5). The threads of a block perform the re-estimation of the covariance matrix parameter Σ of a Gaussian (Algorithm 6).

Algorithm 1 EM for GMM estimation
Input: Samples, Samplesnum, Gaussiannum, Thresholdmin
Output: πi, µi, Σi | i ∈ {1, 2, …, Gaussiannum}
for i ← 1, Gaussiannum do
    InitializeParameters(πi, µi, Σi);
end for
while not StopCondition() do
    for j ← 1, Samplesnum do
        for i ← 1, Gaussiannum do
            likelihoodij ← CalculateLikelihood(Samplej, πi, µi, Σi);
        end for
        likelihoodsj ← NormalizeLikelihoods(likelihoodsj);
    end for
    for i ← 1, Gaussiannum do
        πi ← UpdateWeight(likelihoodsi);
        µi ← UpdateMean(likelihoodsi, Samples, πi);
        Σi ← UpdateCovariance(likelihoodsi, Samples, πi, µi);
    end for
end while

Algorithm 2 CUDA Parallel p-kernel
Input: Samples, Samplesnum, πi, µi, Σi
Output: Likelihoods
i ← BlockIndex.Y;
j ← BlockIndex.X;
likelihoodij ← πi × N(Samplej; µi, Σi);
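A possible CUDA rendering of Algorithm 2 is sketched below. It assumes the φ-kernel has already stored the inverse covariances and determinants in global memory and that each block of the j×m grid handles a slice of the samples for one component; signatures and names are our assumptions, not the paper's actual code.

/* Hedged sketch of the p-kernel (Algorithm 2): likelihood[i][j] receives the
   weighted Gaussian density of sample j under component i. blockIdx.y selects
   the component, blockIdx.x/threadIdx.x select the sample. Names are ours. */
__global__ void pKernelSketch(const float* samples,   /* n x dim, row-major */
                              const float* pi,        /* g weights */
                              const float* mu,        /* g x dim means */
                              const float* invCov,    /* g x dim x dim inverses */
                              const float* det,       /* g determinants */
                              float* likelihood,      /* g x n output */
                              int n, int dim) {
    int i = blockIdx.y;                             /* component index */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* sample index */
    if (j >= n) return;

    /* Mahalanobis distance (x_j - mu_i)^T Sigma_i^{-1} (x_j - mu_i) */
    float maha = 0.0f;
    for (int r = 0; r < dim; ++r) {
        float acc = 0.0f;
        for (int s = 0; s < dim; ++s)
            acc += invCov[(i * dim + r) * dim + s] *
                   (samples[j * dim + s] - mu[i * dim + s]);
        maha += (samples[j * dim + r] - mu[i * dim + r]) * acc;
    }
    float norm = powf(2.0f * 3.14159265f, 0.5f * dim) * sqrtf(det[i]);
    likelihood[i * n + j] = pi[i] * expf(-0.5f * maha) / norm;
}

Under the j×m arrangement of Section 2.2, a launch such as pKernelSketch<<<dim3((n + 255) / 256, g), 256>>>(…) would assign one row of blocks per component.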


Algorithm 3 CUDA Parallel p̂-kernel
Input: Likelihoods, Samplesnum
Output: Likelihoods
i ← ThreadIndex.X;
j ← BlockIndex.X;
cachei ← likelihoodij;
SynchronizeThreads();
limit ← ThreadsPerBlock/2;
while limit ≠ 0 do
    if i < limit then
        cachei ← cachei + cachei+limit;
    end if
    SynchronizeThreads();
    limit ← limit/2;
end while
likelihoodij ← likelihoodij/cache0;

Algorithm 4 CUDA Parallel m-kernel
Input: Likelihoods, Samplesnum
Output: Marginals
j ← ThreadIndex.X;
i ← BlockIndex.X;
cachej ← likelihoodij;
SynchronizeThreads();
limit ← ThreadsPerBlock/2;
while limit ≠ 0 do
    if j < limit then
        cachej ← cachej + cachej+limit;
    end if
    SynchronizeThreads();
    limit ← limit/2;
end while
marginali ← cache0;

Algorithm 5 CUDA Parallel µ-kernel
Input: Likelihoods, Samples, Samplesnum, Marginals
Output: µi
j ← ThreadIndex.X;
i ← BlockIndex.X;
cachej ← Samplej × likelihoodij;
SynchronizeThreads();
limit ← ThreadsPerBlock/2;
while limit ≠ 0 do
    if j < limit then
        cachej ← cachej + cachej+limit;
    end if
    SynchronizeThreads();
    limit ← limit/2;
end while
µi ← cache0/marginali;

Algorithm 6 CUDA Parallel Σ-kernel
Input: Likelihoods, Samples, Samplesnum, Marginals, Gaussiannum, Dimensionnum
Output: Σk
l ← ThreadIndex.X;
i ← BlockIndex.X;
j ← BlockIndex.Y;
for k ← 1, Gaussiannum do
    sub1 ← Sampleli − µki;
    sub2 ← Samplelj − µkj;
    cachel ← sub1 × sub2 × likelihoodkl;
    SynchronizeThreads();
    limit ← ThreadsPerBlock/2;
    while limit ≠ 0 do
        if l < limit then
            cachel ← cachel + cachel+limit;
        end if
        SynchronizeThreads();
        limit ← limit/2;
    end while
    Σkij ← cache0/marginalk;
end for
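The reduction pattern shared by Algorithms 3 to 6 can be written in CUDA roughly as below. This is our own generic sketch (one block sums one row of values in shared memory), not the paper's kernel code, and it assumes the block size is a power of two.

/* Generic per-block reduction in the spirit of Algorithms 3-6: block b sums
   row b of a (rows x len) matrix into out[b]. Illustrative sketch only;
   assumes blockDim.x is a power of two and len <= blockDim.x. */
__global__ void rowSumSketch(const float* values, float* out, int len) {
    extern __shared__ float cache[];   /* one slot per thread */
    int tid = threadIdx.x;
    int row = blockIdx.x;

    cache[tid] = (tid < len) ? values[row * len + tid] : 0.0f;
    __syncthreads();

    for (int limit = blockDim.x / 2; limit != 0; limit /= 2) {
        if (tid < limit)
            cache[tid] += cache[tid + limit];
        __syncthreads();
    }
    if (tid == 0)
        out[row] = cache[0];           /* e.g., a marginal, as in the m-kernel */
}

A launch such as rowSumSketch<<<rows, 256, 256 * sizeof(float)>>>(…) would reduce each row with one 256-thread block.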
3. RESULTS

3.1. Dataset

We have used the Arabic Spoken Digit dataset from the UCI Repository in order to test the algorithm implementation. This dataset consists of instances with 13 Mel Frequency Cepstral Coefficients (MFCC), widely used to represent audio signals in speech processing systems, which commonly use GMMs to model the distribution of phones in the language. The database consists of 8800 instances: A training base with 6600 instances and a testing base with 2200 instances. These instances correspond to audios of 88 speakers (44 males and 44 females) pronouncing the digits 0 to 9 in Arabic.

3.2. Metrics

In parallel programming, Speedup (or Speed-up) is the most widely used metric to evaluate how much faster a parallel algorithm is than its sequential version. It is defined as Equation 11:

S_p = \frac{T_1}{T_p}    (11)

where p is the number of processors on which the parallel algorithm is running, T1 is the execution time of the sequential algorithm and Tp is the execution time of the parallel algorithm.


3.3. Experimentation Sets

The GPU used in the first two sets of experiments was an NVidia GeForce GTS 250 with 16 processors.
In the first experimentation set, the target number of Gaussians (components) has been varied with a fixed quantity of 30 iterations to estimate the parameters. Figure 2 shows a comparison between the time of the parallel execution and its respective sequential version. This execution has shown a speedup S16 = 7. The runtime of the parallelized version varies from 0.781 to 15.094 sec. This time results from the execution of the φ-kernel step, which actually performs sequentially on the GPU. The second set of experiments was conducted varying the number of instances in the dataset: 23,344 to 263,256. In this case, the algorithms have performed 30 iterations to estimate two Gaussians. Results of this step are shown in Fig. 3. In this step, the speedup of the parallel algorithm was S16 = 6.

3.4. Coalesced Access to Global Memory

We have carried out modifications to the code of the p-kernel and p̂-kernel to provide better access to global memory and better shared memory usage. The kernels used in such arrangements have been rearranged according to the size of the warp in order to ensure access aligned with the "cache line" and, at the same time, to keep all the CUDA cores employed as long as possible.
We have carried out modifications to the code in order to allow a more efficient memory access and a more appropriate arrangement of threads. An important issue in regards to memory access is the coalesced access to the global memory, i.e., the warp threads need to access adjacent memory blocks aligned with the "cache line" (fixed blocks of memory that are loaded at once into the cache memory). In this way, the accesses required by the various warp threads are part of a single transaction, thereby reducing the overhead of the global memory access. The arrangement of threads in the grid and its blocks is also important for the coalesced access, since a better arrangement ensures that the accesses of the active warp threads are aligned on the cache line size.
In order to verify the performance gain achieved with the changes, tests with an NVidia GeForce GT 555M GPU have been performed. An important metric to consider when reasoning about the arrangement and coalesced access is the occupancy, which refers to how effectively the hardware (the CUDA cores) is kept busy, i.e., the longer it is busy, the better the effective hardware use.
Experiments have been performed by varying the number of Gaussians and the size of the database, for both versions of the p-kernel and p̂-kernel codes (E-step). Firstly, we have fixed the database size at 23,344 instances and varied the number of Gaussians. In the second set of experiments, we have varied the size of the database for a fixed number of Gaussians (eight). Figures 4 and 5 show the execution time (in milliseconds) for both kernels by varying the number of Gaussians: 1, 2, 4, 8, 16, 24 and 32. Figures 6 and 7 illustrate the runtimes of both kernels varying the size of the database. In such case, the average speedup was 19x and 30x for the p-kernel and p̂-kernel, respectively. In both kernels, the largest observed speedups (~21x and ~32x) occurred with the database containing 152,526 instances.
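The difference between coalesced and non-coalesced access discussed above can be illustrated with the two toy kernels below (our own example, unrelated to the paper's data layout): in the first, consecutive threads of a warp read consecutive addresses and the accesses merge into few transactions; in the second, a stride scatters them across many cache lines.

/* Coalesced: thread k of a warp reads element base+k, so a warp touches one
   (or few) contiguous cache lines. */
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

/* Non-coalesced: consecutive threads read addresses 'stride' elements apart,
   so a warp may touch up to 32 different cache lines. */
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int src = (idx * stride) % n;   /* deliberately scattered reads */
    if (idx < n) out[idx] = in[src];
}

In spirit, the rearrangement of the p-kernel and p̂-kernel described above aims at the first pattern.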

Fig. 2. Execution time for both parallel and sequential versions of EM as the number of gaussians increases


Fig. 3. Execution time for both parallel and sequential versions of EM as the number of instances of dataset varies

Fig. 4. Comparison of the execution time of the p-kernel with and without coalesced access by increasing the number of Gaussians

Fig. 5. Comparison of the runtime of the p̂-kernel with and without coalesced access by increasing the number of Gaussians


Fig. 6. Comparison of the execution time of the p-kernel with and without coalesced access by increasing the database size

Fig. 7. Comparison of the runtime of the p̂-kernel with and without coalesced access by increasing the database size

4. DISCUSSION

Results have shown an increase from 38.8 to 50.2% in the average achieved occupancy for the p-kernel, ranging from 16.6 to 60.1% with the growth in the number of Gaussians. In the case of the p̂-kernel, the growth was from 16.2 to 56.4% of achieved occupancy, regardless of the number of Gaussians. Neither kernel has varied with the increase in the size of the database. As had been initially suspected, this increased occupancy directly reflects on the execution time of the kernels. This is clearly shown in Fig. 4-7.
The new version of the p-kernel code, for instance, has achieved an average speedup of ~22x compared to the previous one. In regards to the p̂-kernel, the average speedup is ~33x and the highest value has been observed for only one Gaussian.
These results show a clear performance gain when there is a greater control of hardware resources.
Furthermore, it is important to notice that the execution settings and resource usage need to be adjusted to the capabilities of the available hardware, since the architectures of GPUs with support for CUDA have undergone changes over the years, including in the cache memory management, which directly affects the transfer between the different available memories.

5. CONCLUSION

In this study, we propose an approach to provide a parallelized implementation of the Expectation-Maximization (EM) algorithm for training Gaussian Mixture Models (GMM). GMM is vastly used in Automatic Speech Recognition (ASR) systems, for instance.
In our approach for parallelization, the main loop of the algorithm is implemented sequentially and different CUDA kernels are in charge of running different steps of the algorithm.


Pseudocode for each CUDA kernel implementation is properly provided.
Experiments performed over a UCI database with a varying number of Gaussians have shown a speedup of 7 compared to the sequential implementation of the EM algorithm. We have also provided modifications in two of the CUDA kernels in order to allow more coalesced access to global memory. As a result we have achieved up to 56.4% of achieved occupancy.
The proposed approach thus contributes to the state-of-the-art of the research in ASR by providing an effective algorithm for training GMM.
Future work consists in providing modifications to the other three CUDA kernels in order to provide more coalesced access to global memory, in a similar manner to what we have done for the first two of them.

6. ACKNOWLEDGEMENT

The researchers thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq-Brasil) for the financial support [Universal 14/2012, Processo 483437/2012-3] and for granting a scholarship to M.V.O. Medeiros. The authors also thank the Fundação de Apoio à Pesquisa e à Inovação Tecnológica do Estado de Sergipe (FAPITEC-Sergipe-Brasil) for granting a scholarship to G.F. Araújo.

7. REFERENCES

Azhari, M. and C. Ergün, 2011. Fast Universal Background Model (UBM) training on GPUs using Compute Unified Device Architecture (CUDA). Int. J. Elec. Comput. Sci., 11: 49-55.
Chen, C., D. Mu, H. Zhang and B. Hong, 2012. A GPU-accelerated approximate algorithm for incremental learning of gaussian mixture model. Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium, (WF '12), pp: 1937-1943.
Kumar, N.S.L.P., S. Satoor and I. Buck, 2009. Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, (CC '09), pp: 103-109.
Lee, S. and D. Park, 2012. Evaluation of CUDA for X-ray imaging system. In: Computational Intelligence and Intelligent Systems, Springer Berlin Heidelberg, pp: 621-625.
Machlica, L., J. Vanek and Z. Zajic, 2011. Fast estimation of gaussian mixture model parameters on GPU using CUDA. Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, (AT '11), pp: 167-172.
Malarvezhi, P. and R. Kumar, 2013. A novel two stage carrier frequency offset estimation and compensation scheme in multiple input multiple output-orthogonal frequency division multiplexing system using expectation and maximization iteration. J. Comput. Sci., 9: 1526-1533. DOI: 10.3844/jcssp.2013.1526.1533
Meng, C., L. Wang, Z. Cao, X. Ye and L. Feng, 2013. Acceleration of a high order finite-difference WENO scheme for large-scale cosmological simulations on GPU. Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, (DPW '13), IEEE Computer Society, pp: 2071-2078.
Mielikainen, J., B. Huang, H. Huang and M. Goldberg, 2012. Improved GPU/CUDA based parallel Weather and Research Forecast (WRF) Single Moment 5-Class (WSM5) cloud microphysics. IEEE J. Selected Top. Applied Earth Observ. Remote Sens., 5: 1256-1265.
Pangborn, A., 2010. Scalable data clustering using GPUs. MSc Thesis, Rochester Institute of Technology.
Pham, V., P. Vo and V. Hung, 2010. GPU implementation of extended gaussian mixture model for background subtraction. Proceedings of the International Conference on Computing and Communication Technology Research, Innovation and Vision for the Future, (VF '10), pp: 1-4.
Poongothai, K. and S. Sathiyabama, 2012. Efficient web usage miner using decisive induction rules. J. Comput. Sci., 8: 835-840. DOI: 10.3844/jcssp.2012.835.840
Ramalingam, T. and P. Dhanalakshmi, 2014. Speech/music classification using wavelet based feature extraction techniques. J. Comput. Sci., 10: 34-44. DOI: 10.3844/jcssp.2014.34.44
Santos, B. and H. Macedo, 2012. Improving CUDA™ C/C++ encoding readability to foster parallel application development. SIGSOFT Softw. Eng. Notes, 37: 1-5. DOI: 10.1145/2088883.2088897


Shanmugapriya, N. and R. Nallusamy, 2014. A new content based image retrieval system using GMM and relevance feedback. J. Comput. Sci., 10: 330-340. DOI: 10.3844/jcssp.2014.330.340
Subbaraj, P. and P. Sivakumar, 2012. Parallel memetic algorithm for VLSI circuit partitioning problem using graphical processing units. J. Comput. Sci., 8: 705-710. DOI: 10.3844/jcssp.2012.705.710
Sujaritha, M. and S. Annadurai, 2011. A new modified gaussian mixture model for color-texture segmentation. J. Comput. Sci., 7: 279-283. DOI: 10.3844/jcssp.2011.279.283
Tagare, H., A. Barthel and F. Sigworth, 2010. An adaptive expectation-maximization algorithm with GPU implementation for electron cryomicroscopy. J. Struct. Biol., 171: 256-265. DOI: 10.1016/j.jsb.2010.06.004
Tharawadee, N., P. Terdtoon and N. Kammuang-lue, 2013. An investigation of thermal characteristics of a sintered-wick heat pipe with double heat sources. Am. J. Applied Sci., 10: 1077-1086. DOI: 10.3844/ajassp.2013.1077.1086
Webb, A. and K. Copsey, 2011. Statistical Pattern Recognition. 3rd Edn., John Wiley and Sons, Chichester, ISBN-10: 1119952964, pp: 672.

