
Map-Reduce Processing of K-means Algorithm with FPGA-accelerated Computer Cluster


Yuk-Ming Choi, Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
The University of Hong Kong
{ymchoi, hso}@eee.hku.hk

Abstract—The design and implementation of the k-means clustering algorithm on an FPGA-accelerated computer cluster is presented. The implementation follows the map-reduce programming model, with both the map and reduce functions executing on multiple FPGAs autonomously from the CPU. A hardware/software framework was developed to manage gateware execution on multiple FPGAs across the cluster. Using this k-means implementation as an example, a system-level tradeoff study between computation and I/O performance in the target multi-FPGA execution environment was performed. When compared to a similar software implementation executing over the Hadoop MapReduce framework, a 15.5× to 20.6× performance improvement has been achieved across a range of input datasets.

I. INTRODUCTION

The k-means clustering algorithm is a simple yet powerful algorithm that forms the basis of many data analysis applications. It has found applications across many domains, including image processing, pattern recognition, machine learning, bioinformatics, data mining and business analytics. Because of its importance and high computational requirements, various acceleration attempts have been proposed over the years.

While researchers have demonstrated the performance benefit of implementing the k-means algorithm with FPGAs in various settings, most existing works focused on optimizing the hardware architecture to accelerate their specific application. These solutions worked well on specific single-FPGA accelerators, but they are not readily applicable to problems that demand large-scale distributed cluster computing facilities.

To that end, we present the design and implementation of the k-means clustering algorithm on our multi-FPGA system. Our implementation was designed following the map-reduce programming model and was targeted to run on multiple FPGAs installed as accelerators across the cluster. Compared to previous works, our design is unique in that it is general-purpose; it executes across multiple FPGAs, and it is readily scalable to larger systems with additional FPGAs.

Using our k-means implementation as an example, this work explores the performance benefits and design considerations of utilizing FPGAs as accelerators in a distributed heterogeneous computer cluster. System-level performance was evaluated and compared against similar software implementations executing on top of the popular Hadoop MapReduce framework. As I/O performance is often the system performance bottleneck, extensive evaluations of inter- and intra-node communication were performed. In addition, by varying the number of available mappers and their distribution among the set of FPGAs in the system, the tradeoff between computation and I/O bandwidth was studied.

The remainder of this paper is organized as follows: Section II provides background information and discusses related work. Section III introduces the hardware and software architecture of the k-means implementation. Section IV presents the experimental results from comparing our FPGA k-means implementation against a software counterpart. Section V contains a system-level performance evaluation of our FPGA system. Section VI concludes this paper with future extensions to this work.

II. BACKGROUND AND RELATED WORK

A. k-means

The k-means algorithm is one of the most commonly used unsupervised clustering algorithms for data analysis. The goal of the k-means algorithm is to partition the input data into k clusters such that data within a cluster are similar to each other in some way while being dissimilar to data in other clusters. The algorithm proceeds in iterations. Each iteration begins with k centroids corresponding to the k clusters. Each input data object is then assigned to the cluster whose centroid is at minimal distance to the data object according to a distance metric, such as the Euclidean or Manhattan distance. Once all data objects are assigned, the centroid of each cluster is updated according to the new partitioning. The algorithm repeats until the centroids remain unchanged at the end of an iteration.
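To make the iteration structure above concrete, the following minimal Python sketch spells out the assignment and update steps for dense numeric data. It is only an illustrative reference under the assumption of Euclidean distance; the function names and the convergence check are ours, and it reflects neither the FPGA gateware nor the Hadoop implementation evaluated later.

```python
import math
import random

def closest_centroid(point, centroids):
    """Return the index of the centroid with minimal Euclidean distance."""
    best_k, best_d = 0, float("inf")
    for k, c in enumerate(centroids):
        d = math.sqrt(sum((p - q) ** 2 for p, q in zip(point, c)))
        if d < best_d:
            best_k, best_d = k, d
    return best_k

def kmeans(points, k, max_iter=100):
    """Plain k-means: iterate until the centroids stop changing."""
    centroids = random.sample(points, k)          # k randomly chosen data points
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_centroid(p, centroids)].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:            # unchanged => converged
            break
        centroids = new_centroids
    return centroids
```

A call such as kmeans(points, k=3) returns the final centroid list once they stop changing between iterations.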
A number of previous works have already demonstrated the benefit of utilizing FPGAs in k-means implementations. Several early attempts were made to accelerate hyperspectral image clustering using FPGAs. In [6], [9], hardware-software systems combining a processor and reconfigurable fabric were proposed, demonstrating over 11× speedup over the corresponding software implementations. Subsequently, the authors in [5], [10] achieved similar speedups in processing multi-spectral and hyper-spectral images by utilizing an FPGA-optimized distance calculation in the k-means algorithm. Likewise, FPGA implementations of k-means have also proven useful in real-time image clustering [12].
More recently, in [7], [8], an FPGA implementation of the k-means algorithm for processing microarray data was examined and compared to a similar GPU implementation. Furthermore, the authors in [11] explored the use of FPGA-specific hardware structures to accelerate distance computation with high-dimension data in k-means. While these previous works have demonstrated respectable performance, only a single FPGA was employed, and their focus was on accelerating a particular application using FPGA-specific hardware structures. In contrast, this work examines the use of multiple FPGAs to process large-scale k-means problems by systematically following the map-reduce programming model.

B. Map-Reduce

In its most basic form, map-reduce is a simple programming model that systematically applies application-specific map and reduce pure functions to an input data list in stages. The map function is first applied to each element of the input list to produce an intermediate list, whose elements are subsequently combined by the reduce function. As there is no communication between individual instances of map, and reduce is fully associative and commutative, they offer an enormous opportunity for parallelization.

This powerful programming model was popularized by Google as the MapReduce framework [4], with a now de facto open-source implementation from the Apache Foundation called Hadoop [1]. While MapReduce also offers important features such as fault tolerance, the underlying operating principle is similar to the basic map-reduce model. The map function takes in a list of key-value pairs and generates an intermediate list of tuples. These intermediate key-value pairs are then processed by different instances of the reduce function, with optional sorting and grouping according to the keys.
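As a minimal software illustration of this model (not the Hadoop API and not the framework developed in this work), the sketch below applies a user-supplied map function to every input element independently, groups the intermediate values by key, and then folds each group with an associative reduce function; the grouping corresponds to the shuffle described above. The driver and the word-count example are our own.

```python
from collections import defaultdict
from functools import reduce as fold

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy map-reduce: map each input to (key, value) pairs, group by key,
    then combine each group's values with an associative reduce function."""
    # Map phase: every element is processed independently (parallelizable).
    intermediate = []
    for item in inputs:
        intermediate.extend(map_fn(item))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: each key's values are folded independently.
    return {key: fold(reduce_fn, values) for key, values in groups.items()}

# Example: word counting, the canonical map-reduce illustration.
if __name__ == "__main__":
    docs = ["map reduce on fpga", "fpga cluster", "map reduce"]
    counts = map_reduce(
        docs,
        map_fn=lambda doc: [(word, 1) for word in doc.split()],
        reduce_fn=lambda a, b: a + b,
    )
    print(counts)  # {'map': 2, 'reduce': 2, 'on': 1, 'fpga': 2, 'cluster': 2}
```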
As such a promising programming model, map-reduce has also found interest in the FPGA community. To facilitate application development, Yeung et al. proposed a map-reduce library for both GPU and FPGA accelerators [14]. Similarly, focusing on FPGA implementations, Shan et al. presented a MapReduce framework that virtualizes the data synchronization and communication among the task scheduler, mappers and reducers [13]. Both works aimed to improve designers' productivity and the portability of the generated MapReduce applications. Our work shares a similar goal of promoting designer productivity for large, multi-FPGA map-reduce applications, with a focus on CPU-FPGA and inter-FPGA communications.

C. Map-Reduce Processing of k-means

Our implementation of k-means was based loosely on the MapReduce programming model. Each iteration of the k-means algorithm was implemented as a MapReduce job, with the distance calculation implemented as map tasks and the recentering of the k centroids implemented as parallel reduce tasks.

III. FPGA-ACCELERATED CLUSTER IMPLEMENTATION

One key feature of our k-means implementation is that it runs on multiple FPGAs installed in a computer cluster. Here, we first describe our heterogeneous computer cluster, followed by the design of the FPGA-based MapReduce k-means algorithm and the software support system for this implementation.

A. Target Computer Cluster

Figure 1 shows an overview of our target heterogeneous computer cluster. The cluster consists of a homogeneous array of compute nodes, and each compute node is equipped with an FPGA accelerator. On top of basic communication and memory access to the host, the FPGAs are also connected by a dedicated communication network. This simple cluster architecture is scalable and backward compatible with existing systems, therefore enabling us to perform end-to-end performance comparisons with existing distributed software systems.

Fig. 1: Cluster overview. (A head node and a set of compute nodes are connected by a general-purpose local network; each compute node hosts a CPU with an FPGA attached over the PCIe bus, and the FPGAs are additionally linked by a dedicated inter-FPGA network.)

Each compute node is a heterogeneous computer with an FPGA accelerator connected through a standard PCIe connection. The host CPU is responsible for both computation and managing the attached FPGA. For instance, the host CPU is responsible for configuring the FPGA and for marshaling data between the general UNIX file system and the FPGA.

One of the compute nodes is designated, either physically or logically, as the head node. The main purpose of the head node is to perform job execution and monitoring in the cluster. In the case of our FPGA implementation, a custom management software framework has been developed to manage all FPGA execution from the head node. The head node maintains information about the compute nodes, such as their IP addresses and their availability for computation.

Finally, the inter-FPGA network allows direct FPGA-to-FPGA communication without involving any CPU. In fact, the software system may be completely unaware of this mode of inter-FPGA communication. While such a network is essential to provide low-latency and high-bandwidth inter-FPGA communication, it must also be carefully designed so that it is as scalable as the cluster itself. In our current version, we opt for standard Gigabit Ethernet as the physical implementation of this network as a tradeoff among performance, scalability and reliability.
B. Map-Reduce k-means Implementation

Each iteration of the k-means algorithm is formulated as a map-reduce job in our implementation.

Let N be the number of input data points and K be the number of clusters these data are partitioned into. Also, let C^(i) = {µ_k^(i) : k ∈ {1, 2, ..., K}} be the set of centers of gravity of the K clusters in iteration i, where i ∈ Z+. The set of initial centroids C^(0) consists of K randomly chosen data points from the input.

The map function takes one input data point and produces one key-value pair, where the key is the centroid closest to the input data point and the value is the input data point itself. To facilitate this computation, it is assumed that the map function receives C^(i−1) before the start of iteration i. In our current implementation, the Euclidean distance between the centroid and the data point is used. The intermediate key-value pairs are then grouped by their key in a shuffling stage and are collectively passed to the reduce function. The reduce function thus takes as input a set of key-value pairs with the same key and computes the new centroid, µ_k^(i), from them.
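The formulation above can be written down directly as a pair of pure functions. The following standalone Python sketch mirrors the key-value interface just described, not the gateware implementation; the helper names are ours, and the nearest-centroid search uses squared Euclidean distance, which selects the same centroid as the true distance.

```python
def kmeans_map(point, centroids):
    """map: emit one (key, value) pair, where key = index of the nearest
    centroid and value = the data point itself."""
    key = min(range(len(centroids)),
              key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
    return [(key, point)]

def kmeans_reduce(points_of_cluster):
    """reduce: given all points assigned to one cluster, return the new
    centroid mu_k as the per-dimension mean."""
    n = len(points_of_cluster)
    return [sum(dim) / n for dim in zip(*points_of_cluster)]

def kmeans_iteration(points, centroids):
    """One map-reduce job = one k-means iteration, C^(i-1) -> C^(i)."""
    # Shuffle: group the emitted values by key (cluster index).
    clusters = {}
    for point in points:
        (key, value), = kmeans_map(point, centroids)
        clusters.setdefault(key, []).append(value)
    # A cluster that received no points keeps its previous centroid.
    return [kmeans_reduce(clusters[k]) if k in clusters else centroids[k]
            for k in range(len(centroids))]
```

In the hardware design described next, the map function is replicated as M physical mappers on the mapper FPGAs and the reduce function as K reducers on the reducer FPGA.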
1) FPGA Computation: Both the map and reduce functions are implemented using FPGA fabrics. We call each physical implementation of the map function on an FPGA a mapper and that of the reduce function a reducer.

In theory, with N input data points, a maximum of N instances of map may be executed in parallel for best performance. In practice, only M physical mappers, where M ≪ N, are used to timeshare the workload of the N map instances. The M mappers may further be implemented across multiple FPGAs.

In our k-means implementation, the number of reduce instances required depends on the actual input data. Since K is relatively small in our design, exactly K reducers are used, each of them responsible for computing the centroid of one cluster. Currently, all K reducers are physically located within one single FPGA. However, multiple FPGAs may be designated as reducer FPGAs without changing the system architecture for large values of K.

Figure 2 shows a high-level block diagram of our k-means design using 3 FPGAs. Within a mapper FPGA, multiple mappers and a central scheduler are present. At the beginning of each iteration, the starting centroids, C^(i−1), are first sent from the host CPU to each on-chip mapper. Each mapper stores these centroids within its own BRAM. Subsequently, input data are continuously streamed from the host CPU through DMA transfers to an on-chip input buffer within the scheduler. Each input data point is a D-dimensional vector of double-precision floating point numbers. As soon as a mapper becomes available, it fetches data from this input buffer, computes the Euclidean distances against the stored centroids, and produces a key-value pair accordingly. The key part is an integer representing the closest centroid found and the value part is the data point. Finally, the computed key-value pair is stored in an output buffer, ready to be sent to the reducer FPGA.

On the reducer FPGA, a similar structure can be observed. Once the key-value pairs are received from the dedicated FPGA network, they are processed by the corresponding on-chip reducer module. K reducers are implemented such that each of them may be responsible exclusively for computing the µ_k^(i) of its own cluster. Each reducer has its own set of input and output buffers. It fetches key-value pairs from the input buffer and extracts the data point from each pair. Subsequently, it accumulates all the received data points and eventually computes µ_k^(i) based on the accumulated value. The newly computed centroid is then stored into the output buffer.
Fig. 2: System architecture block diagram. (Mapper FPGAs 1 and 2 each contain a DMA core feeding the input data stream to a scheduler with mappers 1..n; the resulting (key, value) pairs leave through an Ethernet core. An Ethernet switch forwards the pairs to FPGA 3, whose Ethernet core passes them to a scheduler with reducers 1..n, and a DMA core returns the output data stream to the host.)

2) Communication: There are two main communication channels in our k-means implementation. The first involves retrieving input data from the general file system to be processed by the mappers, while the other involves passing intermediate key-value pairs between mappers and reducers in multiple FPGAs. The output of the reducers contains only the K centroid locations and incurs negligible I/O bandwidth.

The input to our k-means algorithm is a large data file with up to 100 million data points. In a commercial computer cluster, such large data files are normally stored on a distributed file system, such as the Hadoop Distributed File System (HDFS), so that multiple processing nodes can process them concurrently. To provide similar facilities for our FPGA design, the input data files are also partitioned into smaller chunks and distributed to the respective hosting nodes of the FPGA accelerators before the algorithm executes.

During the execution of the k-means algorithm, the hardware/software system shown in Figure 3 is responsible for continuously streaming data from the host to the FPGA. A simple C++ program, called the Data Manager, is executed on each compute node; its responsibility is to retrieve data from the partitioned file in the general-purpose file system. Another software program, called the Streamer, interacts with the Data Manager. Whenever the Data Manager has a batch of input data, the Streamer takes over the batch and streams it to the FPGA with the help of a PCIe driver residing in the Linux kernel. The PCIe driver copies the batch of data from user-space memory to kernel-space memory. Then, the FPGA is instructed by the driver to perform a Direct Memory Access (DMA) read operation on the kernel-space memory. Through this DMA transfer, the data points are moved from main memory to the FPGA over the PCIe channel. The performance of this first communication path is limited by the combined effect of hard-disk speed, OS overhead, as well as PCIe bus transfer speed.

Fig. 3: PC-to-FPGA communication. (On each compute node, the Data Manager reads data from the hard disk and hands batches to the Streamer, which uses the PCIe driver to trigger DMA transfers from main memory to the DMA core on the FPGA over the PCIe bus.)

The other major communication facility in the system is the FPGA-to-FPGA communication path between the mappers and the reducers. This communication takes place over the direct inter-FPGA network mentioned in Section III-A. Referring back to Figure 2, the intermediate key-value pairs generated by the mappers are collected by the Scheduler. The pairs are then transmitted to the Ethernet core, which packetizes them into Ethernet frames. The MAC address of the reducer FPGA is inserted automatically into the header of each frame by the hardware support system. These frames are then transmitted to the Gigabit Ethernet switch, which routes them to the destination FPGA according to the frame header. The Ethernet core on the reducer FPGA de-packetizes the received frames and forwards the key-value pairs in the payload to the Scheduler. Finally, the Scheduler looks up the key in each received pair and forwards it to the corresponding reducer in the FPGA. Final results are transferred back to the host through the PCIe bus.
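The paper does not spell out the on-wire layout of a key-value pair, but the description above (an integer key naming the closest centroid plus a D-dimensional double-precision value) suggests a simple fixed-size record. The sketch below shows one plausible packing of such a record before it is handed to the Ethernet core; the 4-byte little-endian key and the field order are our assumptions, not the documented gateware format.

```python
import struct

def pack_pair(key, point):
    """Hypothetical layout: 4-byte unsigned key (cluster index) followed by
    D IEEE-754 double-precision values. Not the documented gateware format."""
    return struct.pack("<I%dd" % len(point), key, *point)

def unpack_pair(payload, dims):
    """Inverse of pack_pair for a D-dimensional record."""
    fields = struct.unpack("<I%dd" % dims, payload)
    return fields[0], list(fields[1:])

# A 2-D point assigned to cluster 3 occupies 4 + 2*8 = 20 bytes, so roughly
# 75 such records would fit in a standard 1500-byte Ethernet payload.
frame_payload = pack_pair(3, [0.218, 1.406])
print(unpack_pair(frame_payload, dims=2))  # (3, [0.218, 1.406])
```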
C. Hardware/Software Management System

To facilitate the execution of hardware/software applications with multiple FPGA accelerators in the cluster, a general-purpose management system has been developed. The system provides resource and process management across the cluster. It also manages the usage of the FPGA accelerators among users and facilitates efficient inter-FPGA communication autonomous to the hosting CPUs.

To run a hardware/software job in the cluster, as illustrated in Figure 4, a top-level program submits a job request to the head node following a simple server/client model. The job tracker program running at the head node handles the job request by searching a list of available compute nodes and allocating enough nodes for the job. The top-level program is also required to submit the job with a configuration file. The configuration file contains information such as the location of the Data Manager executable, the input data file, and so on. After the configuration file is parsed, the executables of both the Data Manager and the Streamer are copied to the allocated compute nodes and remotely executed using the Open MPI library [2]. The Streamer process is spawned and instructed to reset the underlying FPGA. Then, input data are retrieved and streamed to the FPGAs in the different compute nodes in parallel. When the job is completed, the job tracker carries out clean-up work such as freeing the used compute nodes.

Fig. 4: The cluster management system. (A top-level program sends a job request to the job tracker on the head node, which dispatches job execution to the allocated compute nodes; on each of them the Data Manager feeds data to the Streamer, which drives the FPGA through the PCIe driver.)

As such, most of the hardware and software components described in the previous section are general-purpose modules that are reusable. In the PC-to-FPGA communication path, the communication channel between the Data Manager and the Streamer is a software FIFO. Therefore, the Streamer can be reused if another application that handles input data differently in the Data Manager is to be implemented. Also, in the FPGA gateware design, the DMA core and the Ethernet core are designed to be reusable for other applications. The two cores communicate with the Scheduler using FIFO-like buffers. As long as the FIFO-like interface is followed, the logic design of the Scheduler is independent of the two cores.
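To make the job submission flow above concrete, the fragment below sketches what a top-level client might send to the job tracker. The field names, the JSON encoding and the port number are hypothetical; the paper only states that the configuration names items such as the Data Manager executable and the input data file, and that requests follow a simple server/client model.

```python
import json
import socket

# Hypothetical job description; only the kinds of fields are taken from the
# paper (Data Manager executable, input data file, node count). The names,
# paths and encoding are invented for illustration.
job_config = {
    "data_manager_exe": "/opt/cluster/bin/data_manager",
    "streamer_exe": "/opt/cluster/bin/streamer",
    "input_data": "/data/kmeans/household_power.part",
    "num_compute_nodes": 3,
    "bitstream": "kmeans_mapper.bit",
}

def submit_job(head_node, port, config):
    """Send a job request to the job tracker and wait for a short reply."""
    with socket.create_connection((head_node, port)) as sock:
        sock.sendall(json.dumps(config).encode("utf-8"))
        return sock.recv(4096).decode("utf-8")  # e.g. the assigned node list

# submit_job("head-node", 5000, job_config)
```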
IV. EXPERIMENTAL RESULTS

A. Experimental Setup

The performance of our FPGA k-means implementation was evaluated in the targeted heterogeneous computer cluster. The experiment was run on three compute nodes, each containing a KC705 FPGA board from Xilinx. Each KC705 board contains a Kintex-7 FPGA connected to the CPU through a ×4 gen. 1 PCIe connection. The boards were connected with a dedicated Gigabit Ethernet network. Two of the boards were configured as mappers and one was configured as a reducer, as shown in Figure 2. The maximum number of mappers, M = 64, was employed on the two mapper FPGAs, with 32 mappers executing on each FPGA. The final FPGA served as a reducer FPGA with K reducer modules implemented. Each FPGA design operated at 250 MHz.

The performance of our FPGA implementation was compared against a software implementation executing on a Hadoop cluster (version 2.0.0) with 3 compute nodes and 1 head node connected with Gigabit Ethernet. All computer nodes run CentOS (kernel 2.6.32) and are equipped with an Intel Core i5 and 8 GB of RAM. Each i5 CPU has 4 cores with a 2.90 GHz clock speed. The software implementation of k-means closely followed the original multiple-FPGA design so as to provide a one-to-one comparison.

The input data set for both the hardware and software designs was chosen from the UCI Machine Learning Repository [3]. The data set is the individual household electric power consumption data set. It contains a 4-year record of the electric power consumption of a household with a sampling rate of one minute. The data set consists of 2,075,259 entries of 9 attributes.

Based on the UCI data set, we created two sets of input data for our k-means implementation. The first input data set consists of 2 attributes extracted from the UCI data set, namely global active power and global reactive power. For the second input data set, 4 attributes from the UCI data set are used: global active power, sub metering 1, sub metering 2 and sub metering 3. By clustering the two input data sets with our k-means algorithm, seasonal or monthly power consumption patterns can be observed. Both data sets were stored on the compute nodes as sets of D-dimensional data points, with each dimension being a 64-bit double-precision floating point number. As mentioned, the data set used in the multiple-FPGA version was manually partitioned into two halves such that the two mapper FPGAs shared the workload, while the data set in the Hadoop version was stored on the Hadoop Distributed File System (HDFS).

In both the FPGA and Hadoop versions, the processing time was measured between the reading of the first byte of data by the mappers and the writing of the final results to the CPU by the reducer.

B. Experimental Results

Fig. 5: Performance speedup of the k-means implementation on multiple FPGAs compared to the software version on Hadoop. M = 64 in all cases. (Speedup versus the number of clusters, K = 3, 6 and 12, for the 2-D and 4-D input data sets.)

Figure 5 shows the performance speedup of our FPGA k-means implementation against the software implementation. From the results, it can be seen that the performance of the FPGA implementation is 15.5× to 20.6× faster than its software counterpart.
Unfortunately, the speedup advantage of the FPGA implementation over software shrinks as the input data dimension increases. We expect this small decrease in speedup is due to the increased overhead of streaming data in the FPGA implementation when compared to the optimized software implementation.

In the software k-means implementation, data are retrieved from HDFS, copied to main memory and processed by the CPU directly. On the other hand, the FPGA implementation has the additional overhead of copying the entire data set from main memory to the FPGA through the PCIe bus. When compared to the software implementation, this additional memory copying overhead only grows as the input data size increases, diminishing the speedup advantage of the FPGA implementation.
V. SYSTEM-LEVEL PERFORMANCE EVALUATION

Despite the fact that up to 20.6× performance speedup is obtained with our k-means implementation on multiple FPGAs, it is important to identify the I/O bottleneck in the system for future improvement. To do so, a micro-benchmark designed to measure the I/O performance of the system was carried out, together with a baseline experiment executing our k-means algorithm on synthetic data. The baseline experiment accepted input data sets ranging from 100 k to 100 M 2-D data points that were randomly generated. The same hardware setup as in the previous experiment was used, with 3 KC705 boards as mapper and reducer FPGAs.

The micro-benchmark also involves experiments evaluating the performance of the major data paths so that the I/O bottleneck in our system can be pinpointed. The major data paths lie in the host-to-FPGA and inter-FPGA communication channels. The host-to-FPGA communication includes data movement from the hard disk, through main memory, and down to the FPGA via the PCIe bus. The inter-FPGA channel refers to the mapper-reducer FPGA communication through the Gigabit Ethernet switch. It is believed that these two channels contribute a large portion of the communication overhead in the system.
A. I/O Performance Evaluation

Figure 6 shows the throughput performance of the baseline experiment as K varies. It can be seen that the throughput for all three values of K stays roughly constant for M = 64. In all 3 cases, it is conjectured that the performance is limited by data I/O in the system. On the other hand, the performance for M = 16 is more likely limited by the smaller number of mappers. As K increases, the performance is particularly vulnerable to the increased complexity.

Fig. 6: Effect of varying K on system throughput performance. D = 2 in all cases. (Three panels, (a) K = 3, (b) K = 6, (c) K = 12: throughput in MBps versus the number of data points, 100k to 100M, for M = 64, 32 and 16.)

To pinpoint the system I/O bottleneck, the following specifically designed experiments were used to measure the throughput performance of the major data paths:
1) Host-to-FPGA test
2) Ethernet line speed test
3) k-means map-only test
The first test aimed to specifically evaluate the performance of the data channel between the host PC and the FPGA board. Input data were stored on two compute nodes and streamed to the attached FPGA. On the mapper FPGAs, the data streamed in from the host were discarded so that the overhead of the k-means algorithm would not be counted; the measurement therefore solely represents the performance of the data channel. Table I shows the throughput of the data streaming process. For the case of 100 M data points, the throughput achieved was 99.44 megabytes per second (MBps). In other words, the maximum data streaming capacity in the FPGA system was 99.44 MBps.

TABLE I: Processing time of the data streaming process on two mapper FPGAs.

  No. of input data points   Processing time   Throughput
  100 k                      0.5665 sec        2.69 MBps
  1 M                        0.7035 sec        21.69 MBps
  10 M                       2.2721 sec        67.16 MBps
  100 M                      15.3445 sec       99.44 MBps
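The streaming throughput in Table I can be cross-checked from the record size: each 2-D data point occupies 2 × 8 bytes, and the table appears to report 2^20-byte megabytes per second. A quick calculation under those two assumptions reproduces the tabulated figures.

```python
BYTES_PER_POINT = 2 * 8          # 2-D point, 64-bit double per dimension
MIB = 2 ** 20                    # Table I appears to use 2^20-byte megabytes

table_i = {                      # points streamed -> measured time (s)
    100_000: 0.5665,
    1_000_000: 0.7035,
    10_000_000: 2.2721,
    100_000_000: 15.3445,
}

for n, seconds in table_i.items():
    throughput = n * BYTES_PER_POINT / seconds / MIB
    print(f"{n:>11,d} points: {throughput:6.2f} MBps")

# Output matches Table I within rounding:
#     100,000 points:   2.69 MBps
#   1,000,000 points:  21.69 MBps
#  10,000,000 points:  67.16 MBps
# 100,000,000 points:  99.44 MBps
```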

The second test was to measure the throughput of the Ethernet channel between the mapper and reducer FPGAs. Key-value pairs were artificially generated by the two mapper FPGAs and transmitted to the reducer FPGA. The overall transmission time for the key-value pairs in the mapper-reducer communication was measured and is shown in Table II. It can be seen that the maximum performance of the mapper-reducer communication was 111.67 MBps.

TABLE II: Processing time of key-value pair transmission in the mapper-reducer communication.

  No. of input data points   Processing time   Throughput
  100 k                      0.0181 sec        105.50 MBps
  1 M                        0.1720 sec        110.91 MBps
  10 M                       1.7091 sec        111.60 MBps
  100 M                      17.0805 sec       111.67 MBps

The third test was the same as the baseline experiment, except that the intermediate key-value pairs were discarded at the output of all mapper modules. The key-value pairs were not transmitted to the reducer FPGA for performing the reduce function, and hence the latency overhead from the Ethernet communication was removed. The overall time for streaming data to the FPGA and generating all key-value pairs was measured.

Figure 7 summarizes the results from the previous experiments. It shows a model of the factors determining the performance of the k-means application. For simplicity, only the case of the largest input data set, 100 M input vectors, is considered. The theoretical throughput shown in the figure was measured using the number of hardware cycles required by a mapper to compute the closest centroid in our current implementation. For a small number of available mappers, such as M = 16, the computational power of the mapper modules is the major limiting factor for the k-means application. This effect is clearly indicated in Figure 7b and Figure 7c, where the three solid lines are very close at M = 16, implying that the k-means application performance is heavily limited by the compute capacity of the FPGA system. As the number of mappers increases, the host-to-FPGA communication becomes the major bottleneck to the application performance. For the cases of M = 32 and M = 64 in Figure 7a, the solid line (k-means map-only) overlaps with the dotted line (host-to-FPGA channel), pointing out that the k-means performance is bounded by the data streaming capacity. Therefore, the performance of our current implementation of the k-means application cannot be maximized. Our immediate next step is accordingly to develop a more efficient implementation of the host-to-FPGA communication.

Fig. 7: Performance model of the k-means application. Input = 100 M points and D = 2 in all cases. (Three panels, (a) K = 3, (b) K = 6, (c) K = 12: throughput in MBps versus the number of mappers, M = 16, 32 and 64, for the k-means, k-means map-only, host-to-FPGA channel, Ethernet line speed and theoretical throughput curves.)
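The qualitative model in Figure 7 can be summarized as: delivered throughput is roughly the minimum of the aggregate mapper compute rate and the two channel capacities measured above. The sketch below expresses that bound; the cycles-per-point figure is a placeholder chosen only to illustrate the crossover from compute-bound to I/O-bound, not the paper's measured value.

```python
def kmeans_throughput_bound(num_mappers, clock_hz=250e6,
                            cycles_per_point=1000,       # placeholder, not measured
                            bytes_per_point=16,          # 2-D double-precision point
                            host_to_fpga_mbps=99.44,     # Table I, 100 M points
                            ethernet_mbps=111.67):       # Table II, 100 M points
    """Upper bound on sustained throughput (MBps, 2^20-byte MB): the slowest of
    the aggregate mapper compute rate, the host-to-FPGA channel and the
    mapper-to-reducer Ethernet channel."""
    compute_mbps = num_mappers * clock_hz / cycles_per_point * bytes_per_point / 2**20
    return min(compute_mbps, host_to_fpga_mbps, ethernet_mbps)

for m in (16, 32, 64):
    print(m, "mappers ->", round(kmeans_throughput_bound(m), 2), "MBps")
# With this placeholder: 16 mappers -> 61.04 (compute-bound),
# 32 and 64 mappers -> 99.44 (bounded by the host-to-FPGA channel).
```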

B. Effect of Number of FPGAs

Finally, the benefit of utilizing multiple FPGAs to ease the data streaming workload in large data processing is explored. So far, all the experiments in the previous subsections have been performed with 2 mapper FPGAs and 1 reducer FPGA. To show the advantage of employing multiple FPGAs, two different system setups were evaluated. In both cases, 24 mappers were employed in the system. However, in the first case, 3 FPGAs each containing 8 mappers were used, while in the second case only 1 FPGA was used with all 24 mappers implemented. In all experiments, various input data sizes and data dimensions were used, while K remained at 3. The input data set was equally divided into 3 subsets, which were individually stored on the compute nodes. Figure 8 shows their performance with various input parameters. It is apparent that distributing the same number of mappers across 3 FPGAs consistently outperforms using 1 FPGA.

Fig. 8: Execution time of the FPGA design on a variable number of FPGAs. (Run time in seconds versus the number of data points, 90k to 90M, for 2-D, 4-D and 8-D data with 24 mappers on 1 FPGA versus 8 mappers on each of 3 FPGAs; K = 3.)
We attribute the performance benefit of the multi-FPGA implementation to the reduced I/O bandwidth requirement on each mapper FPGA as we split the set of mappers across multiple FPGAs. This effect is particularly prominent with large D. Consider each pair of lines (dotted, dashed and solid) in Figure 8. For the case of D = 4 with 90M data entries, the single mapper FPGA version is more than 2 times slower. With 90M 8-D data points, the performance of using only one mapper FPGA is about 4 times slower.

Further experiments may be done in the future in order to better understand the balanced ratio between mappers and FPGAs.
C. Resource Consumption

Table III summarizes the resource consumption of the FPGA k-means clustering design. The Map and Reduce rows show the resource utilization of an individual map and reduce function. The Overall Map and Overall Reduce rows indicate the resource usage of the largest map and reduce designs, which were implemented with M = 32 and K = 12 respectively.

TABLE III: Resource consumption of the map and reduce functions on FPGA (Xilinx Kintex-7 XC7K325T). Overall resource consumption for M = 32 and K = 12.

  Modules          Registers       LUTs            DSP48E1s    BRAM
  Map              4188 (1%)       4108 (2%)       13 (1%)     72 kB (1%)
  Reduce           14422 (3%)      13358 (6%)      24 (2%)     36 kB (1%)
  Overall Map      147343 (36%)    140654 (69%)    416 (49%)   5328 kB (33%)
  Overall Reduce   179554 (44%)    145159 (71%)    288 (34%)   3960 kB (24%)

VI. CONCLUSIONS

In this paper, we have presented an implementation of the k-means algorithm using multiple FPGAs in a heterogeneous computer cluster. The implementation took advantage of the map-reduce programming model to allow easy scaling and parallelization across the distributed computer system. A dedicated inter-FPGA communication channel was built in order to allow efficient and autonomous data movement between FPGAs. A cluster management system was also developed to handle job requests and monitor the overall cluster. Performance evaluations using real-life statistics as input data were carried out and the experimental results were compared with an equivalent software k-means solution running on a Hadoop cluster. Our performance results show that the multiple-FPGA design can outperform the Hadoop software version in all test cases, offering speedups from 15.5× to 20.6×. I/O bottleneck analysis was performed with several specifically designed experiments. It was found that the primary source of I/O limitation within the k-means application was the host-to-FPGA communication channel. Our studies show that, by taking advantage of multiple FPGAs, the I/O communication overhead can be relieved and greater performance improvements can be achieved.

In the future, we plan to increase the scale of the experiment to evaluate the case for utilizing multiple FPGAs in processing large data sets in data centers. Further experiments are planned to better understand the trade-off between I/O and computational performance limitations.

REFERENCES

[1] "Apache Hadoop," http://hadoop.apache.org/.
[2] "Open MPI," http://www.open-mpi.org/.
[3] K. Bache and M. Lichman, "UCI Machine Learning Repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[5] M. Estlick, M. Leeser, J. Theiler, and J. J. Szymanski, "Algorithmic transformations in the implementation of k-means clustering on reconfigurable hardware," in Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, ser. FPGA '01. ACM, 2001, pp. 103–110.
[6] M. Gokhale, J. Frigo, K. Mccabe, J. Theiler, C. Wolinski, and D. Lavenier, "Experience with a hybrid processor: K-means clustering," The Journal of Supercomputing, vol. 26, no. 2, pp. 131–148, 2003.
[7] H. Hussain, K. Benkrid, H. Seker, and A. Erdogan, "FPGA Implementation of K-means Algorithm for Bioinformatics Application: An Accelerated Approach to Clustering Microarray Data," in NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Jun. 2011, pp. 248–255.
[8] H. Hussain, K. Benkrid, A. Erdogan, and H. Seker, "Highly parameterized k-means clustering on FPGAs: Comparative results with GPPs and GPUs," in Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, 2011, pp. 475–480.
[9] D. Lavenier, "FPGA implementation of the k-means clustering algorithm for hyperspectral images," Los Alamos National Laboratory LAUR, pp. 00–3079, 2000.
[10] M. Leeser, J. Theiler, M. Estlick, and J. Szymanski, "Design tradeoffs in a hardware implementation of the k-means clustering algorithm," in Proceedings of the 2000 IEEE Sensor Array and Multichannel Signal Processing Workshop, 2000, pp. 520–524.
[11] Z. Lin, C. Lo, and P. Chow, "K-means implementation on FPGA for high-dimensional data using triangle inequality," in Field-Programmable Logic and Applications, Aug. 2012, pp. 437–442.
[12] T. Saegusa and T. Maruyama, "An FPGA implementation of k-means clustering for color images based on kd-tree," in International Conference on Field Programmable Logic and Applications, ser. FPL '06, 2006, pp. 1–6.
[13] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, and H. Yang, "FPMR: MapReduce framework on FPGA," in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '10, 2010, pp. 93–102.
[14] J. H. C. Yeung, C. C. Tsang, K. H. Tsoi, B. S. H. Kwan, C. C. C. Cheung, A. P. C. Chan, and P. H. W. Leong, "Map-reduce as a Programming Model for Custom Computing Machines," in Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '08, 2008, pp. 149–159.
