Abstract—The design and implementation of the k-means clustering algorithm on an FPGA-accelerated computer cluster is presented. The implementation followed the map-reduce programming model, with both the map and reduce functions executing on multiple FPGAs autonomously from the CPU. A hardware/software framework was developed to manage gateware execution on multiple FPGAs across the cluster. Using this k-means implementation as an example, a system-level tradeoff study between computation and I/O performance in the target multi-FPGA execution environment was performed. When compared to a similar software implementation executing over the Hadoop MapReduce framework, a 15.5× to 20.6× performance improvement has been achieved across a range of input datasets.

I. INTRODUCTION

The k-means clustering algorithm is a simple yet powerful algorithm that forms the basis of many data analysis applications. It has found applications across many domains, including image processing, pattern recognition, machine learning, bioinformatics, data mining and business analytics. Because of its importance and high computational requirements, various acceleration attempts have been proposed over the years.

While researchers have demonstrated the performance benefit of implementing the k-means algorithm with FPGAs in various settings, most existing works focused on optimizing the hardware architecture to accelerate their specific application. These solutions worked well on their specific single-FPGA accelerators, but they are not readily applicable to problems that demand large-scale distributed cluster computing facilities.

To that end, we present the design and implementation of the k-means clustering algorithm on our multi-FPGA system. Our implementation was designed following the map-reduce programming model and was targeted to run on multiple FPGAs installed as accelerators across the cluster. Compared to previous works, our design is unique in that it is general-purpose, it executes across multiple FPGAs, and it is readily scalable to larger systems with additional FPGAs.

Using our k-means implementation as an example, this work explores the performance benefits and design considerations of utilizing FPGAs as accelerators in a distributed heterogeneous computer cluster. System-level performance was evaluated and compared against similar software implementations executing on top of the popular Hadoop MapReduce framework. As I/O performance is often the system performance bottleneck, extensive evaluations on inter- and intra-node communication were performed. In addition, by varying the number of available mappers and their distribution among the set of FPGAs in the system, the tradeoff between computation and I/O bandwidth was studied.

The remainder of this paper is organized as follows: Section II provides background information and discusses related work. Section III introduces the hardware and software architecture of the k-means implementation. Section IV presents the experimental results from comparing our FPGA k-means implementation against a software counterpart. Section V contains the system-level performance evaluation of our FPGA system. Section VI concludes this paper with future extensions to this work.

II. BACKGROUND AND RELATED WORK

A. k-means

The k-means algorithm is one of the most commonly used unsupervised clustering algorithms for data analysis. Its goal is to partition the input data into k clusters such that data within a cluster are similar to each other in some way while being dissimilar to data in other clusters. The algorithm proceeds in iterations. Each iteration begins with k centroids corresponding to the k clusters. Each input data object is then assigned to the cluster whose centroid is at minimal distance from it, according to a distance metric such as the Euclidean or Manhattan distance. Once all data objects are assigned, the centroid of each cluster is updated according to the new partitioning. The algorithm repeats until the centroids remain unchanged at the end of an iteration.
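The per-iteration structure just described can be summarized by the following sketch (plain Python with NumPy, added here purely as an illustration; it uses squared Euclidean distance and assumes no cluster ever becomes empty). The gateware described later realizes the same assignment and update phases with hardware map and reduce functions.

    import numpy as np

    def kmeans(points, k, max_iter=100):
        # points: an N x D array; the first k points serve as initial centroids.
        centroids = points[:k].copy()
        for _ in range(max_iter):
            # Assignment step: each point joins the cluster of its nearest centroid.
            dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points.
            updated = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(updated, centroids):
                return updated, labels          # centroids unchanged: converged
            centroids = updated
        return centroids, labels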
A number of previous works have already demonstrated the benefit of utilizing FPGAs in k-means implementations. Several early attempts were made to accelerate hyperspectral image clustering using FPGAs. In [6], [9], hardware-software systems combining a processor and reconfigurable fabric were proposed, demonstrating over 11× speedup over the corresponding software implementations. Subsequently, the authors in [5], [10] achieved similar speedups in processing multi-spectral and hyper-spectral images by utilizing an FPGA-optimized distance calculation in the k-means algorithm. Likewise, FPGA implementations of k-means have also been found useful in real-time image clustering [12].
More recently, in [7], [8], the FPGA implementation of k-means on a single FPGA was employed. Their focus was to accelerate the particular application using FPGA-specific hardware structures. In contrast, this work examines the use of multiple FPGAs to process large-scale k-means problems by systematically

(Figure: overall system organization. The head node and the compute nodes are connected by a general-purpose local network; each compute node hosts an FPGA over its PCIe bus, and the FPGAs are additionally linked by a dedicated inter-FPGA network through an Ethernet switch.)

(Figure: FPGA gateware organization. On each FPGA, a DMA core and a Scheduler exchange (key, value) pairs with the mappers (Mapper 1 to Mapper n) and the reducers (up to Reducer n); an Ethernet core provides the inter-FPGA link.)
payload to the Scheduler. Finally, the Scheduler looks up the key in the received pair and forwards it to the corresponding reducer in the FPGA. Final results are transferred back to the host through the PCIe bus.

Fig. 3: PC-to-FPGA communication. (Components shown: the Data Manager, Streamer and PCIe driver on the host; the DMA core on the FPGA; the PCIe bus between them; and the local hard disk holding the data.)
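To make the dataflow concrete, the following minimal sketch (ours, in Python, purely for illustration; the actual map and reduce functions are gateware modules) shows one way a k-means iteration can be expressed in this (key, value) form. Each mapper emits a pair whose key is the index of the nearest centroid, the grouping step stands in for the Scheduler routing pairs by key, and each reducer averages the values it receives into an updated centroid.

    def map_point(point, centroids):
        # Map: emit (key, value), where key indexes the nearest centroid.
        key = min(range(len(centroids)),
                  key=lambda j: sum((x - c) ** 2
                                    for x, c in zip(point, centroids[j])))
        return key, point

    def reduce_cluster(values):
        # Reduce: average all points routed to this reducer into a new centroid.
        n, dim = len(values), len(values[0])
        return [sum(p[d] for p in values) / n for d in range(dim)]

    def kmeans_iteration(data, centroids):
        # Stand-in for the Scheduler: group emitted pairs by key,
        # then hand each group to the corresponding reducer.
        groups = {}
        for point in data:
            key, value = map_point(point, centroids)
            groups.setdefault(key, []).append(value)
        return {key: reduce_cluster(values) for key, values in groups.items()}

For example, kmeans_iteration([(0.0, 0.0), (4.0, 4.0), (5.0, 5.0)], [(0.0, 0.0), (4.0, 4.0)]) returns {0: [0.0, 0.0], 1: [4.5, 4.5]}, i.e. one updated centroid per non-empty cluster.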
C. Hardware/Software Management System

To facilitate the execution of hardware/software applications with multiple FPGA accelerators in the cluster, a general-purpose management system has been developed. The system provides resource and process management across the cluster. It also manages the usage of the FPGA accelerators among users and facilitates efficient inter-FPGA communication autonomous from the hosting CPUs.

To run a hardware/software job in the cluster, as illustrated in Figure 4, a top-level program is needed to submit a job request to the head node following a simple server/client model. The job tracker program running at the head node handles the job request by searching a list of available compute nodes and allocating enough nodes for the job. The top-level program is also required to submit the job with a configuration file. The configuration file contains information such as the location of the executable of the Data Manager, the input data file, and so on. After the configuration file is parsed, the executables of both the Data Manager and the Streamer are copied to the allocated compute nodes and remotely executed using the Open MPI library [2]. The Streamer process is spawned and instructed to reset the underlying FPGA. Then, input data are retrieved and streamed to the FPGAs in the different compute nodes in parallel. When the job is completed, the job tracker carries out clean-up work such as freeing the used compute nodes.
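As a concrete illustration, a configuration file of the kind described above might look like the short listing below; the key names and paths are hypothetical, as the paper does not specify the actual file format.

    # hypothetical job configuration; key names and paths are illustrative only
    data_manager_exec = /opt/fpga-kmeans/bin/data_manager
    streamer_exec     = /opt/fpga-kmeans/bin/streamer
    input_data        = /data/kmeans/input_points.bin
    num_compute_nodes = 3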
As such, most of the hardware and software components described in the previous section are general-purpose, reusable modules. In the PC-to-FPGA communication, the communication channel between the Data Manager and the Streamer is a software FIFO. Therefore, the Streamer can be
(Fig. 4: Job execution. The head node runs the job tracker program; each compute node runs a Data Manager and a Streamer that feed data over PCIe to the node's FPGA.)

(Figure: performance speedup versus the number of clusters (3, 6 and 12) for the FPGA design with 2-D and 4-D input data.)

Fig. 7: Performance model of the k-means application. Input = 100M points and D = 2 in all cases. (Each of the three panels plots throughput in MBps against the number of mappers, M = 16, 32 and 64, for the k-means and map-only k-means designs, together with the host-to-FPGA channel throughput, the Ethernet line speed and the theoretical throughput.)
the second case only 1 FPGA was used with all 24 mappers implemented. In all experiments, various input data sizes and data dimensions were used, while K remained at 3. The input data set was equally divided into 3 subsets, which were individually stored on the compute nodes. Figure 8 shows their performance with various input parameters. It is apparent that distributing the same number of mappers across 3 FPGAs consistently outperforms the configuration using 1 FPGA.

We attribute the performance benefit of the multi-FPGA implementation to the reduced I/O bandwidth requirement on each mapper FPGA as we split the set of mappers across multiple FPGAs. This effect is particularly prominent with large D. Consider each pair of lines (dotted, dashed and solid): for the case of D = 4 with 90M data entries, the single mapper-FPGA version is more than 2 times slower, and with 90M 8-D data points, using only one mapper FPGA is about 4 times slower.
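A back-of-envelope model (ours, not a result reported in the paper) makes this trend plausible. Suppose each of the M mappers consumes r data points per second and every point carries D dimensions of s bytes each. A single FPGA hosting all M mappers must then be fed at roughly

    B_single ≈ M · r · D · s bytes/s,

whereas spreading the same M mappers over F FPGAs reduces the demand on each host-to-FPGA link to about B_single / F. Because the requirement grows linearly with D, the single-FPGA configuration saturates its PCIe link sooner as the data dimension increases, consistent with the widening gap between the 4-D and 8-D cases.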
Further experiments may be done in the future to better understand the balanced ratio between mappers and FPGAs.

C. Resource Consumption

Table III summarizes the resource consumption of the FPGA k-means clustering design. The Map and Reduce modules show the resource utilization of an individual map and reduce function. The Overall Map and Overall Reduce modules indicate the resource usage of the largest map and reduce designs, which were implemented with M = 32 and K = 12 respectively.

VI. CONCLUSIONS

In this paper, we have presented an implementation of the k-means algorithm using multiple FPGAs in a heterogeneous computer cluster. The implementation took advantage of the map-reduce programming model to allow easy scaling and parallelization across the distributed computer system. A dedicated inter-FPGA communication channel was built in order to allow efficient and autonomous data movement between FPGAs. A cluster management system was also developed to handle job requests and monitor the overall cluster. Performance evaluations using real-life statistics as input data were carried out, and the experimental results were compared with an equivalent software k-means solution running on a Hadoop cluster. Our performance results show that the multiple-FPGA design can outperform the Hadoop software version in all test cases, offering speedups from 15.5× to 20.6×. I/O bottleneck analysis was performed with several specifically designed experiments. It was found that the primary source of I/O limitation within the k-means application was the host-to-FPGA communication channel. Various studies show that, by taking advantage of multiple FPGAs, the I/O communication overhead can be relieved and greater performance improvement can be achieved.

In the future, we plan to increase the scale of the experiment to evaluate the case for utilizing multiple FPGAs in processing large data sets in data centers. Further experiments are planned to better understand the trade-off between I/O and computational performance limitations.

REFERENCES

[1] Apache Hadoop, http://hadoop.apache.org/.
[2] Open MPI, http://www.open-mpi.org/.
[3] K. Bache and M. Lichman, "UCI Machine Learning Repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[5] M. Estlick, M. Leeser, J. Theiler, and J. J. Szymanski, "Algorithmic transformations in the implementation of k-means clustering on reconfigurable hardware," in Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, ser. FPGA '01. ACM, 2001, pp. 103–110.
[6] M. Gokhale, J. Frigo, K. McCabe, J. Theiler, C. Wolinski, and D. Lavenier, "Experience with a hybrid processor: K-means clustering," The Journal of Supercomputing, vol. 26, no. 2, pp. 131–148, 2003.
[7] H. Hussain, K. Benkrid, H. Seker, and A. Erdogan, "FPGA Implementation of K-means Algorithm for Bioinformatics Application: An Accelerated Approach to Clustering Microarray Data," NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 248–255, Jun. 2011.
[8] H. Hussain, K. Benkrid, A. Erdogan, and H. Seker, "Highly parameterized k-means clustering on FPGAs: Comparative results with GPPs and GPUs," in Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, 2011, pp. 475–480.
[9] D. Lavenier, "FPGA implementation of the k-means clustering algorithm for hyperspectral images," Los Alamos National Laboratory LAUR, pp. 00–3079, 2000.
[10] M. Leeser, J. Theiler, M. Estlick, and J. Szymanski, "Design tradeoffs in a hardware implementation of the k-means clustering algorithm," in Proceedings of the 2000 IEEE Sensor Array and Multichannel Signal Processing Workshop, 2000, pp. 520–524.
[11] Z. Lin, C. Lo, and P. Chow, "K-means implementation on FPGA for high-dimensional data using triangle inequality," Field-Programmable Logic and Applications, pp. 437–442, Aug. 2012.
[12] T. Saegusa and T. Maruyama, "An FPGA implementation of k-means clustering for color images based on kd-tree," in International Conference on Field Programmable Logic and Applications, ser. FPL '06, 2006, pp. 1–6.
[13] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, and H. Yang, "FPMR: MapReduce framework on FPGA," in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '10, 2010, pp. 93–102.
[14] J. H. C. Yeung, C. C. Tsang, K. H. Tsoi, B. S. H. Kwan, C. C. C. Cheung, A. P. C. Chan, and P. H. W. Leong, "Map-Reduce as a Programming Model for Custom Computing Machines," in Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '08, 2008, pp. 149–159.