2012 IEEE Computer Society Annual Symposium on VLSI
SCOC IP Cores for Custom Built Supercomputing Nodes
Venkateswaran Nagarajan*, Rajagopal Hariharan‡, Vinesh Srinivasan‡, Ram Srivatsa Kannan‡, Prashanth Thinakaran‡,
Vigneshwaren Sankaran‡, Bharanidharan Vasudevan‡, Ravindhiran Mukundrajan†, Nachiappan Chidambaram Nachiappan†,
Aswin Sridharan†, Karthikeyan Palavedu Saravanan†, Vignesh Adhinarayanan† and Vignesh Veppur Sankaranarayanan†
*Director, ‡WARFT Research Trainee, †Previously affiliated with WARFT
Waran Research Foundation [WARFT], India
Email: [email protected]
Abstract—A high performance and low power node architecture becomes crucial in the design of future generation
supercomputers. In this paper, we present a generic set of
cells for designing complex functional units that are capable
of executing an algorithm of reasonable size. They are called
Algorithm Level Functional Units (ALFUs) and a suitable VLSI
design paradigm for them is proposed in this paper. We provide
a comparative analysis of many core processors based on
ALFUs against ALUs to show the reduced generation of control
signals and lesser number of memory accesses, instruction
fetches along with increased cache hit rates, resulting in better
performance and power consumption. ALFUs have led to the
inception of the SuperComputer On Chip (SCOC) IP core
paradigm for designing high performance and low power
supercomputing clusters. The proposed SCOC IP cores are
compared with the existing IP cores used in supercomputing
clusters to bring out the improved features of the former.
Keywords-Supercomputing; heterogeneous cores; SCOC IP
Cores; complex functional units; ALFUs
I. INTRODUCTION
The future of computational sciences depends on the delivery of exascale computing power in the foreseeable future. However, the global energy crisis places a huge constraint on achieving this goal with traditional off-the-shelf processors. The onus of delivering exascale computing by overcoming energy consumption concerns lies
with computer architects. A well known solution to obtain
high energy efficiency is the use of application specific
custom processors. However, the design effort and the high
cost overhead associated with ASICs prevent them from
mainstream use. Therefore, the twin constraints of energy efficiency and infrastructure cost force us to explore new
frontiers in the design space where custom specialization
co-exists with an existing architecture.
In order to exploit this emerging design space, a novel node
architecture model [1] based on the use of large functional
units and other architectural elements was proposed. In this
paper, we present the design and analysis of a more generic
set of cells used in building Algorithm Level Functional
Units (ALFUs). These ALFUs are capable of executing a
complete algorithm of reasonable size driven by a single
instruction. The corresponding Algorithm Level Instruction Set Architecture (ALISA) is a superset of other instruction sets, such as vector, CISC and VLIW instructions, which are used in various multi-core/many core processors. The term ALFU based design, frequently used in this paper, refers to the design of heterogeneous many core processors using ALFUs and scalars. The ALFUs are designed for a wide variety of numeric, semi-numeric and non-numeric algorithms.
Algorithm Level Functional Units (ALFUs) are designed by hardwiring a set of scalars based on the characteristics of an algorithm. A balanced mix of these ALFUs and scalars can be used to execute applications that are computationally intensive. The use of such units provides a variety of advantages, such as a reduction in overall power consumption and increased performance. A significant drop in the number of control sequences associated with each instruction, the number of memory accesses and instruction fetches, and the overall control complexity is observed. Additionally, the cache performance is improved.
The simultaneous execution of multiple applications without space or time sharing (SMAPP) at the supercomputing
cluster level would enable cost sharing, while avoiding
substantial performance sharing amongst multiple users [2].
A new class of IP cores called SuperComputer On Chip
(SCOC) IP cores for low power supercomputing clusters
based on ALFUs is introduced in this paper. The design
of low power, yet high performance many core processors
which support the simultaneous execution of multiple applications without space or time sharing (SMAPP) at the
supercomputing cluster level is simplified by the use of the
SCOC IP cores. These IP cores are customizable to a great
extent and can be designed for any architectural set. Section
V elucidates the design of SCOC IP cores for architectures
based on the CUBEMACH design paradigm [3].
The use of large functional units that provide ASIC-like functionality to the cores has earlier been advocated
in the works of [1] and [4]. Our paper provides the design
methodology and a systematic analysis of the use of ALFUs
in heterogeneous many core processors.
Table I
TYPES OF CELLS USED TO DESIGN ALFUS

Cell | Input | Output | Functionality | Employed In
DACSRAM A | Ai, Bi | Ci | Ci = Ai ⊕ Bi | Adder, max/min finder
DACSRAM B | Ai, Bi | Ci | Ci = Ai + Bi | Multiplier, inner product
DACSRAM C | Ai, Bi | Ci | Ci = Ai Bi | Comparator unit, sorter, BFS and DFS
DAASRAM | Ai, Bi, Ci | Sum, Ci | Sum = Ai ⊕ Bi ⊕ Ci; Carry = (Ai ⊕ Bi)Ci + Ai Bi | Multiple operand adder, KL graph unit
DACSRAM A1 | Ai, Bi | Pi, Gi | Pi = Ai + Bi, Gi = Ai Bi | Adder/subtractor, comparator, matrix adder
DACSRAM A2 | Ai, Bi, Ci | Sum | Sum = Ai ⊕ Bi ⊕ Ci | Adder/subtractor, matrix multiplier
DACSRAM B1 | Gi,j, Gj+1,k, Pi,j, Pj+1,k | Gi,k, Pi,k | Pi,k = Pi,j · Pj+1,k; Gi,k = (Gi,j · Pj+1,k) + Gj+1,k | Adder/subtractor, comparator, sorter
DACSRAM B2 | Pi,j, Gi,j, Ci | Ci,j | Ci,j = (Pi,j · Ci) + Gi,j | Adder/subtractor, Crout unit

Figure 1. Cell based generic ALFU architecture
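To make the composition of these cells concrete, the sketch below chains the DACSRAM A1 (propagate/generate), DACSRAM B2 (carry) and DAASRAM (sum) equations from Table I into a small adder. This is only an illustrative software model under our own naming and interface assumptions; the Boolean equations themselves are the ones listed in the table.

```python
# Illustrative software model of how the Table I cell equations compose into
# a small adder. Function names and the bit-list interface are assumptions;
# the Boolean equations follow the table.

def dacsram_a1(a, b):
    # Per-bit propagate and generate: Pi = Ai + Bi (OR), Gi = Ai·Bi (AND)
    return a | b, a & b

def dacsram_b2(p, g, c_in):
    # Carry equation from Table I: C_out = (P·C_in) + G
    return (p & c_in) | g

def daasram_sum(a, b, c_in):
    # Sum bit of the full-adder cell: Sum = Ai xor Bi xor Ci
    return a ^ b ^ c_in

def add_bits(a_bits, b_bits, carry=0):
    # Chain the cells over little-endian bit lists (index 0 = least significant bit).
    out = []
    for a, b in zip(a_bits, b_bits):
        p, g = dacsram_a1(a, b)
        out.append(daasram_sum(a, b, carry))
        carry = dacsram_b2(p, g, carry)
    return out, carry

# 6 + 3 = 9: [0, 1, 1, 0] + [1, 1, 0, 0] -> ([1, 0, 0, 1], 0)
print(add_bits([0, 1, 1, 0], [1, 1, 0, 0]))
```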
II. RELATED WORK
It is important not to confuse the ALFUs with accelerators. Processors relying on accelerators for higher performance seldom have binary compatibility and require a standalone module for decoupling the Instruction Set Architecture [5]. Unlike accelerators that are stand-alone units, the
ALFUs are an integral part of the processor itself, thereby
eliminating the issues of compatibility that the accelerators
bring in.
Another interesting work that can be compared with the ALFU is the Dynamically Specialized Datapaths work [6]. The overheads due to switching in the DySER blocks are ones that ALFUs do not face, and hence the ALFUs are expected to offer better performance. The fusion of various instructions into Macro-ops means that the Instruction Decode units still have to decode as many instructions, even though they have been fused into a single Macro-op. A single ALISA instruction, in contrast, triggers the execution of an ALFU. Hence, the fetch and decode complexities of ALISA are lower.
Lawrence Berkeley National Laboratory recently adopted
the Tensilica System-On-Chip IP cores which are partially
customizable, to design a supercomputing cluster for climate
modeling [7]. The Xtensa IP cores are customizable with
the provision of adding a single application specific block
and its associated instructions. The SCOC IP cores, on the other hand, provide complete customization of the design of
all the ALFUs or scalars, the communication backbone and
even the compiler/instruction scheduler.
Figure 2. ALFU architecture of the Minimum Spanning Tree Algorithm

III. ARCHITECTURE OF ALGORITHM LEVEL FUNCTIONAL UNITS

A set of cells capable of performing basic operations has been designed; these cells form the building blocks of ALFUs. The cells developed and their functionalities are shown in Table I.
ALFUs are designed for a wide class of algorithms (Numeric, Semi-Numeric and Non-Numeric). Some of the cells that have most commonly been used for designing ALFUs are shown in Table I.
A set of appropriate cells is selected from the cell library to build pipelinable stages with suitable interconnection nets to form the ALFUs and scalars. The cell based generic architecture of an ALFU is shown in Figure 1. Understandably, the cell with the maximum delay decides the delay of a particular stage. Suitable latching is provided to match the delays of the cells. The cell with the maximum delay across all the stages decides the pipelining rate of the ALFU. By suitable arrangement of the cells, the pipelining delay of the ALFU can be reduced.
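As a quick illustration of the timing rule just described, the following sketch computes a stage delay as the delay of its slowest cell and the pipelining rate from the slowest stage. The per-cell delays are placeholder numbers chosen only for the example, not measured values.

```python
# Stage delay = delay of the slowest cell in the stage; the slowest stage
# fixes the clock period and hence the pipelining rate of the ALFU.
# Cell delays below are hypothetical placeholders (in picoseconds).

CELL_DELAY_PS = {"DACSRAM_A": 45, "DACSRAM_B": 45, "DACSRAM_C": 40,
                 "DAASRAM": 70, "DACSRAM_A1": 50, "DACSRAM_B2": 60}

def stage_delay_ps(stage_cells):
    return max(CELL_DELAY_PS[cell] for cell in stage_cells)

def pipelining_rate_ghz(stages):
    period_ps = max(stage_delay_ps(stage) for stage in stages)
    return 1000.0 / period_ps            # 1 / (period in ns)

stages = [["DACSRAM_A1", "DACSRAM_A"],   # stage 1
          ["DACSRAM_B2", "DAASRAM"],     # stage 2 (slowest cell: 70 ps)
          ["DACSRAM_C"]]                 # stage 3
print(f"pipelining rate ~ {pipelining_rate_ghz(stages):.2f} GHz")
```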
A. ALFU for Minimum Spanning Tree Algorithm
The ALFU designed for the Minimum Spanning Tree (MST) algorithm, shown in Figure 2, is a table based architecture. The control circuit present in the ALFU generates the appropriate control signals to search the table for the corresponding node, check for the formation of cycles and choose the shortest edge from a selected entry. The table contains the set of node and edge weights, whose values are compared against the source node, checked for
formation of cycles and the shortest edge is chosen using
the Minimum Operand Finder shown in Figure 3. This is
iterated, taking every node of the graph as the source node, to find Prim's Minimum Spanning Tree. This effectively
minimizes the number of control sequences associated with
the corresponding instructions. The architecture was verified
for a small problem size using Verilog HDL as shown in
Figure 4.
The number of pipeline stages of an ALFU is not fixed; the pipelining depth can be varied as per requirement to suit the simultaneous execution of multiple instructions from multiple applications in a single ALFU. The description of other ALFU architectures can be found in [8].
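A software analogue of the MST ALFU's data flow may help fix the idea: the edge table is scanned for edges leaving the partial tree (the cycle check), and the shortest candidate is selected, which in hardware is the job of the Minimum Operand Finder. This is ordinary Prim's algorithm in Python under assumed data structures, not a model of the hardware itself.

```python
# Software analogue of the table-based MST ALFU: search the edge table,
# exclude edges that would form a cycle, and pick the minimum-weight edge
# (the Minimum Operand Finder step in hardware).

def prim_mst(edge_weight, source=0):
    """edge_weight: dict {(u, v): w} describing an undirected graph."""
    nodes = {u for edge in edge_weight for u in edge}
    in_tree = {source}                       # cycle check: nodes already in the tree
    mst = []
    while in_tree != nodes:
        # candidate "table entries": edges with exactly one endpoint in the tree
        candidates = [(w, u, v) for (u, v), w in edge_weight.items()
                      if (u in in_tree) ^ (v in in_tree)]
        w, u, v = min(candidates)            # minimum operand selection
        mst.append((u, v, w))
        in_tree.add(v if u in in_tree else u)
    return mst

edges = {(0, 1): 4, (0, 2): 1, (1, 2): 2, (1, 3): 5, (2, 3): 8}
print(prim_mst(edges))   # [(0, 2, 1), (1, 2, 2), (1, 3, 5)]
```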
Figure 3. Cell level architecture for Minimum Operand Finder Unit
Figure 4. The working of the MST architecture was verified functionally.

Table II
GENERIC SET OF EQUATIONS FOR MOV AND COMPUTE INSTRUCTIONS ASSOCIATED WITH ALFUS AND ALUS. HERE, N IS THE PROBLEM SIZE OF THE APPLICATION AND α IS THE PROBLEM SIZE OF THE ALFU.

Algorithm | Type of Instruction | ALU | ALFU
Matrix Multiplication | MOV | N^3 | N^3/α^3
Matrix Multiplication | COMPUTE | N^2(N+1) | N^2(N+1)/α^2
Matrix Addition | MOV | N^2 | N^3/α^3
Matrix Addition | COMPUTE | N^2 | N^2/α^2
Max/Min Finder | MOV | N | N
Max/Min Finder | COMPUTE | N | N/α
Minimum Spanning Tree | MOV | N | N
Minimum Spanning Tree | COMPUTE | N | N/α
Multiple Operand Adder | MOV | N | N − 1
Multiple Operand Adder | COMPUTE | N/2 | 1
Inner Product | MOV | N | N/α
Inner Product | COMPUTE | N + 1 | N/α + 1
Odd Even Transposition Sorting | MOV | 3N(N+1)/2 | 3N(N/α + 1)/2α
Odd Even Transposition Sorting | COMPUTE | N(N−1)/2 | N(N/α − 1)/2α
Kernighan Lin Graph Partitioning | MOV | (N^2 + 2N + 16)/8 | (N^2/α^2 + 2N/α + 16)/8
Kernighan Lin Graph Partitioning | COMPUTE | (N^2 + 10N + 8)/8 | (N^2/α^2 + 10N/α + 8)/8
Graph Traversal | MOV | N − 1 | N/α − 1
Graph Traversal | COMPUTE | N − 1 | N/α − 1

IV. ALGORITHM LEVEL FUNCTIONAL UNITS: PERFORMANCE AND POWER ANALYSIS
A. Impact of ALFUs on instruction complexity
The nature of the ALISA used in ALFU based processors
implies that the number of instructions to be executed by the
functional units to run a particular application is inherently
lower in comparison with ALU based processors. This
would imply that the power consumed due to instruction
fetch and decode in ALFU based processors is significantly
lower. The circuitry used in association with instruction fetch
and decode is used less often, amounting to lower power
consumption. From the tabulated results in Table II, it is observed that there is a drop in power consumption with
respect to the instruction fetch.
A single ALISA instruction is inherently equivalent to
several dependent and independent ALU instructions. In this
sense, a single ALISA instruction is equivalent to several
VLIW or vector instructions. In effect, ALISA instructions
are such that a single instruction is equivalent to 10s or
even 100s of scalar instructions. Since a single ALISA instruction computes a complete algorithm, ALISA is a superset of all other ISAs, whether VLIW, vector or otherwise.
Table II contains a generalized set of equations derived for the number of ALISA instructions (COMPUTE) that are required to execute a particular algorithm using ALFU based and ALU based cores. Along with these equations, the number of instructions associated with move operations (MOV) is also estimated. Here, N represents the problem size of the algorithm being executed and α represents the problem size for which the ALFU is designed.
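As a worked example of these equations, the sketch below evaluates the matrix multiplication row of Table II (as reconstructed here) for a problem of size N executed on an ALFU designed for sub-problems of size α. The point is only the scaling of the instruction counts; the exact expressions for other algorithms differ.

```python
# Instruction counts for an N x N matrix multiplication, following the
# matrix multiplication row of Table II: an ALU-based core issues one
# instruction per scalar operation, while an ALFU built for an alpha x alpha
# sub-problem issues one ALISA instruction per sub-block.

def alu_counts(n):
    mov = n ** 3
    compute = n ** 2 * (n + 1)
    return mov, compute

def alfu_counts(n, alpha):
    mov = n ** 3 // alpha ** 3
    compute = n ** 2 * (n + 1) // alpha ** 2
    return mov, compute

n, alpha = 64, 8
print("ALU  (MOV, COMPUTE):", alu_counts(n))
print("ALFU (MOV, COMPUTE):", alfu_counts(n, alpha))
# The much smaller fetch/decode volume is what drives the power savings
# discussed in this subsection.
```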
Table III
COMPARISON OF COMPLETE SET OF CONTROL SEQUENCES OF ALFUS WITH MINIMUM NUMBER OF EXPLICIT CONTROL SEQUENCES OF ALUS

Algorithm | Problem Size | No. of control sequences (ALU) | No. of control sequences (ALFU)
Matrix Multiplication | 2×2 | 54 (Pipelined), 46 (Parallel) | 38 (Pipelined), 32 (Parallel)
Matrix Addition | 2×2 | 42 (Pipelined), 40 (Parallel) | 34 (Pipelined), 32 (Parallel)
Crouts | 2×2 | 24 | 18
Matrix Inverse | 3×3 | 86 | 64
Minimum Spanning Tree | 8 Node | 164 | 108
Max/Min Finder | 8 Operands | 22 | 17
Multiple Operand Adder | 9 Operands | 28 | 22
KL Graph Partitioning | 4 Node | 140 | 108
Inner Product | 8 Operands | 47 | 38
Sorting | 8 Operands | 29 | 22
B. Impact of ALFUs on control complexity
Another important aspect of the usage of ALFUs in many
core processors is that the number of control sequences is
considerably reduced in comparison with their ALU based
counterparts. This is particularly because the ALFUs are
made up of several hardwired scalar units. As a result,
many of the control sequences are absorbed within the large functional unit.
By studying the nature of the algorithms, the number of control sequences needed for each of them has been computed. The complete analysis of the control sequences associated with the ALFUs is done by estimating the control sequences associated with each operation. There are no specific benchmarks developed to estimate the number of control sequences associated with ALU instructions, so the minimum number of explicit control sequences associated with the execution of an algorithm on ALU based cores is considered for comparison. This was done in order to reduce the complexity of the analysis, and it was found that the control sequences associated with the ALFU based processors were significantly fewer than even this minimum set of explicit sequences.
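The relative drop can be read straight off Table III; the short sketch below computes the percentage reduction for the rows whose values are unambiguous in the table.

```python
# Percentage reduction in explicit control sequences, using the (ALU, ALFU)
# pairs from Table III. Only rows with a single value per column are listed.

table_iii = {
    "Crouts": (24, 18),
    "Matrix Inverse": (86, 64),
    "Max/Min Finder": (22, 17),
    "Multiple Operand Adder": (28, 22),
    "Inner Product": (47, 38),
    "Sorting": (29, 22),
}

for name, (alu, alfu) in table_iii.items():
    print(f"{name:22s} {100 * (alu - alfu) / alu:5.1f}% fewer control sequences")
```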
C. Impact of ALFUs on cache hit in a heterogeneous many core processor
Caches of varying sizes (shown in Table IV) were employed, adopting 4-way set associative mapping and a heuristic based replacement policy. The data requirement of each ALFU is quite large, because the operands that each ALFU operates on are large. The increased hit ratio of the ALFU based architecture over the ALU based one is due to the dependencies across instructions, which get localized as a consequence of the hardwired scalar units. A set of cache replacement heuristics has been developed for the CUBEMACH design paradigm to suit the execution of multiple applications simultaneously without space or time sharing [3]. This reduces the number of conflict misses as well as capacity misses, thereby showing a considerable improvement in cache hit ratios as compared to conventional ALU based heterogeneous many core processors.
Figure 5. The cache hit ratio for each of the SPEC equivalent benchmarks that were provided as workload inputs to the WIMAC simulator [3]
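For readers who want to experiment with the cache behaviour described above, the following is a minimal 4-way set associative cache model. The CUBEMACH replacement heuristics are not specified here, so plain LRU is used as a stand-in, and the set count, block size and access stream are arbitrary choices for illustration.

```python
# Minimal 4-way set associative cache model with an LRU stand-in policy.

class SetAssociativeCache:
    def __init__(self, num_sets=256, ways=4, block_bytes=64):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        self.sets = [[] for _ in range(num_sets)]   # per set: tags in LRU order
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.block_bytes
        tags = self.sets[block % self.num_sets]
        if block in tags:                 # hit: refresh LRU position
            tags.remove(block)
            tags.append(block)
            self.hits += 1
        else:                             # miss: evict the LRU victim if the set is full
            if len(tags) == self.ways:
                tags.pop(0)
            tags.append(block)

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0

# ALFU-style access streams touch large contiguous operand blocks, so locality
# (and hence the hit ratio) improves compared with fine-grained scalar access.
cache = SetAssociativeCache()
for i in range(10000):
    cache.access((i % 2048) * 4)          # synthetic stream, not a SPEC trace
print(f"hit ratio: {cache.hit_ratio():.2f}")
```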
D. Impact of ALFUs on the overall performance figures
The overall performance of the ALFU based cores is found to be higher than that of the ALU based cores. Figure 6 shows the simulation results of a CUBEMACH design paradigm [3] based architecture. The WIMAC simulator [3] has been used, whose workload inputs are SPEC equivalent benchmarks. The overall performance of the ALFU based cores is understandably higher due to the reduced number of memory accesses resulting from the use of ALFUs. Figure 6 is not to scale.
Figure 6. A comparison of the overall performance metric of ALFU based processors against their ALU based counterparts

E. Estimation of power consumption of individual ALFUs
Design of ALFU based heterogeneous cores needs to be
done very meticulously, keeping in mind various constraints
such as the interconnection between ALFUs, the power
consumed by the ALFUs etc. A wide range of power
estimation methods have been discussed in [9]. The method
used by us is a generic model to estimate the dynamic
power consumption of any functional unit. The development
of tools that can effectively estimate the dynamic power
consumption for different architectures would sufficiently
simplify the task of the designer.
A probabilistic model has been developed to estimate the total activity factor of each ALFU, with the inputs to an ALFU following any distribution of choice. The model has been evolved in a bottom up manner. As shown in Figure 1, the stages of the ALFUs are designed using the standard cells.
The results of the power analysis for a simple ALFU architecture are shown here. Figure 7 shows the architecture of a Batcher's Odd Even Transposition Sorter. The architecture of the Sorter ALFU is scalable to any number of elements, but the architecture given in Figure 7 is an 8 element Sorter. Based on the model, the activity factors of the cells used to make up the Sorter ALFU are obtained and the dynamic power consumption of the ALFU is estimated. The inputs to the ALFU are considered to be a set of stochastic variables based on a distribution of a particular type. Table V shows the results of power estimation of the Sorter ALFU with the inputs to the ALFU assumed to be normally distributed.
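A hedged sketch of how such a bottom-up estimate can be turned into a dynamic power figure is given below. The activity factor and gate count are taken from the 32-bit row of Table V and the 800 MHz clock from Table IV; the per-gate capacitance and supply voltage are assumed placeholder values, not parameters reported in the paper.

```python
# Bottom-up dynamic power estimate: per-gate switching power is
# alpha * C * Vdd^2 * f, summed over the gates of the ALFU.

def dynamic_power_watts(activity, gates, c_gate_f=2e-15, vdd=1.0, freq_hz=800e6):
    # P_dyn = alpha * C_total * Vdd^2 * f  (c_gate_f and vdd are assumed values)
    return activity * gates * c_gate_f * vdd ** 2 * freq_hz

activity_factor, gate_count = 0.164774, 5760     # 32-bit word length row of Table V
print(f"{dynamic_power_watts(activity_factor, gate_count) * 1e3:.3f} mW")
```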
Table IV
CUBEMACH DESIGN PARADIGM BASED ARCHITECTURAL SPECIFICATION WHICH IS AN INPUT TO THE WIMAC SIMULATOR

Architectural Parameters | astar | gcc | bzip2 | mcf | omnetpp | h264ref (SPEC equivalent workload)
Cores | 4 | 4 | 4 | 4 | 4 | 4
Cache Size L1 | 32kB | 64kB | 32kB | 32kB | 32kB | 64kB
Cache Size L2 | 256kB | 512kB | 256kB | 256kB | 256kB | 512kB
Cache Size L3 | 4MB | 8MB | 4MB | 4MB | 4MB | 8MB
SubLocal Router Stages (Input/Output) | 4/4 | 8/8 | 4/4 | 4/4 | 4/4 | 8/8
Local Router Stages (Input/Output) | 8/4 | 12/6 | 8/4 | 8/4 | 8/4 | 12/6
Global Router Stages (Input/Output) | 12/6 | 18/12 | 12/6 | 12/6 | 12/6 | 18/12
Instruction word Buffer Size | 16kB | 32kB | 16kB | 16kB | 16kB | 32kB
Cache Organization | Network Based (all workloads)
Clock Frequency | 800MHz (all workloads)

Table V
POWER ESTIMATION RESULTS FOR THE 8 NODE SORTER ARCHITECTURE GIVEN ABOVE

Word-length | Activity Factor | Gate count | Number of idle gates
8 | 0.165367 | 1152 | 961
16 | 0.164885 | 2688 | 2244
32 | 0.164774 | 5760 | 4810
64 | 0.164748 | 11904 | 9942
128 | 0.164742 | 24192 | 3985

Figure 7. The architecture of the sorter ALFU. The architecture shown here is an 8 element Batcher's Odd Even Algorithm based Sorter.
Figure 8. Structure of the proposed SCOC IP core

V. DESIGN OF SCOC IP CORES FOR HETEROGENEOUS MANY CORE PROCESSORS USING ALFUS
In the design of the ALFU based processors, the cost
factor should not be a deterrent. A class of IP cores for
heterogeneous many core processors can be designed using
ALFUs and are called SuperComputer On Chip (SCOC) IP
cores. These IP cores are a large library of scalable and
customizable cores which can be designed at different levels
of abstraction.
The structure of the IP core is as shown in Figure 8. The SCOC IP core comprises three main elements: the architectural components of the CUBEMACH design paradigm [3] (the Compilation Accelerator on Silicon, the On Core Network, the on chip memory organization and the ALFUs/scalars); the WARFT India MAny Core (WIMAC) simulator; and the Optimizer Engine. The SCOC IP core presented here is a plug and play module and will not require additional configuration.
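Purely as an illustration of the degree of customization being described, an SCOC IP core specification could be captured as a configuration object along the following lines. The field names and example values are our own assumptions; only the list of components follows the text above.

```python
# Hypothetical configuration object for an SCOC IP core specification.
from dataclasses import dataclass

@dataclass
class SCOCIPCoreSpec:
    alfus: list            # ALFU types instantiated in each core (e.g. ["MST", "Sorter"])
    scalars: int           # number of scalar units per core
    ocn: str               # On Core Network configuration (MIN based, circuit switched)
    memory_org: dict       # on-chip memory organisation (cache sizes, buffers)
    cas: bool = True       # Compilation Accelerator on Silicon netlist included
    simulator: str = "WIMAC"
    optimizer: str = "Game Theory + Simulated Annealing"

spec = SCOCIPCoreSpec(
    alfus=["MST", "Sorter", "Inner Product"],
    scalars=4,
    ocn="3-stage MIN, circuit switched",
    memory_org={"L1": "32kB", "L2": "256kB", "L3": "4MB", "instruction_buffer": "16kB"},
)
print(spec)
```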
The Tensilica Xtensa IP core that is used in the Green
Flash supercomputing cluster, on the other hand, consists
of the Base CPU, a cycle accurate simulator, the application
specific datapath, the set of registers and Floating Point Unit.
The SCOC IP cores support simultaneous execution of
multiple applications (SMAPP) at the supercomputing cluster level. This means that the pipeline of the ALFUs in
the SCOC IP cores can contain instructions from multiple
applications in different stages. SMAPP inherently enables cost
and hardware sharing across multiple applications run by
multiple users or multiple applications run by a single user.
The Compilation Accelerator on Silicon (CAS) architecture, detailed in [10], is a customizable hardware code generator cum dynamic scheduler. The architecture specifications and the netlist of the CAS are an important
part of the IP core. In comparison with the compiler in
the Xtensa IP core used in the Green Flash supercomputer,
which is a vectorizing compiler that is software based, the
hardware based CAS offers a multitude of advantages. The
instruction issue rate and scheduling rate are much higher due
to the hardware instruction generator and dynamic scheduler.
The On Core Network (OCN) is a circuit switched network that forms the communication backbone across the
many core processor. The OCN structure is based on the
Multi-stage Interconnection Network (MIN) and is completely scalable and customizable in accordance with the
specification.
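Since the OCN is MIN based, its depth grows logarithmically with the number of connected endpoints. The helper below computes the stage count of a generic MIN built from k×k switches; it illustrates the scaling only and is not the exact OCN configuration of the SCOC IP core.

```python
# Stage count of a generic multi-stage interconnection network (MIN) built
# from k x k switches: enough stages so that k^stages covers all endpoints.

def min_stages(endpoints, switch_radix=4):
    stages, reach = 0, 1
    while reach < endpoints:
        reach *= switch_radix
        stages += 1
    return stages

for n in (16, 64, 256):
    print(f"{n:4d} endpoints -> {min_stages(n)} stages of 4x4 switches")
```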
A balanced mix of ALFUs and scalars is employed for computation purposes. The design of ALFUs has already been elaborated in Section III. In comparison, the Xtensa IP cores used in the Green Flash supercomputing cluster offer customization for only one application-specific block of the
core. The SCOC IP core provides complete customization
with respect to the kind of units that should be present in
every core.
The aforementioned architectural components are provided as inputs to the WARFT India MAny Core Simulator
(WIMAC), which is a cycle accurate simulator tuned for
the CUBEMACH design paradigm. The WIMAC simulator
is tightly coupled with an Optimizer Engine based on Game Theory and Simulated Annealing, which prunes the design
space search of the CUBEMACH based architectures and
also contributes to the core formation based on the KL graph
partitioning algorithm.
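As a rough sketch of how a simulated annealing based optimizer can prune such a design space, the skeleton below anneals a single hypothetical parameter (the number of ALFUs per core) against a toy cost function. It is a generic template, not the actual Game Theory and Simulated Annealing engine coupled to WIMAC.

```python
# Generic simulated annealing skeleton over one architectural parameter.
import math, random

def toy_cost(alfus_per_core):
    # Hypothetical trade-off: more ALFUs raise performance but also power/area.
    return (alfus_per_core - 6) ** 2 + 3      # minimum at 6 ALFUs per core

def anneal(start=1, steps=500, t0=10.0, cooling=0.99):
    state, cost = start, toy_cost(start)
    temp = t0
    for _ in range(steps):
        cand = max(1, state + random.choice((-1, 1)))   # +/-1 neighbourhood move
        c = toy_cost(cand)
        if c < cost or random.random() < math.exp((cost - c) / temp):
            state, cost = cand, c             # accept better, or worse with Boltzmann prob.
        temp *= cooling
    return state, cost

print(anneal())    # typically converges to (6, 3)
```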
VI. CONCLUSION
Future generation supercomputers would ideally have high performance per watt. To achieve this, processors should be designed with ASIC-like efficiency. Algorithm Level Functional Units (ALFUs), proposed in this paper, can aid in achieving such high performance while maintaining reasonable energy efficiency. The design of these ALFUs is completely cell based and is performed by hardwiring a suitable set of scalar units based on their respective parallel algorithms. The use of ALFUs improves processor efficiency by generating a reduced number of control signals, memory accesses and instruction fetches, along with better cache hit rates. The power consumption of the associated instruction fetch and control circuitry is also reduced significantly. Our experimental evaluations show that a considerable improvement in performance can be observed when ALFU based cores are used instead of ALU based cores. Further, we also show how the ALFUs can be implemented in a CUBEMACH based architecture as SCOC IP cores, and their effectiveness when multiple applications are executed simultaneously without space or time sharing.
REFERENCES
[1] N. Venkateswaran, D. Srinivasan, M. Manivannan, T. P. R.
Sai Sagar, S. Gopalakrishnan, V. Elangovan, K. Chandrasekar,
P. K. Ramesh, V. Venkatesan, A. Babu, and Sudharshan,
“Future generation supercomputers i: a paradigm for node
architecture,” SIGARCH Comput. Archit. News, vol. 35,
no. 5, pp. 49–60, Dec. 2007. [Online]. Available: http:
//doi.acm.org/10.1145/1360464.1360466
[2] N. Venkateswaran, V. Elangovan, K. Ganesan, T. Sagar,
S. Aananthakrishanan, S. Ramalingam, S. Gopalakrishnan,
M. Manivannan, D. Srinivasan, V. Krishnamurthy, K. Chandrasekar, V. Venkatesan, B. Subramaniam, V. Sangkar, A. Vasudevan, S. Ganapathy, S. Murali, and M. Thyagarajan, “On
the concept of simultaneous execution of multiple applications on hierarchically based cluster and the silicon operating
system,” in Parallel and Distributed Processing, 2008. IPDPS
2008. IEEE International Symposium on, April 2008, pp. 1–8.
[3] “Cubemach design paradigm simulator,” June 2012. [Online].
Available: http://www.warftindia.org/cas/cas.pdf
[4] R. Hameed, W. Qadeer, M. Wachs, O. Azizi,
A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis,
and M. Horowitz, “Understanding sources of inefficiency in
general-purpose chips,” in Proceedings of the 37th annual
international symposium on Computer architecture, ser. ISCA
’10. New York, NY, USA: ACM, 2010, pp. 37–47. [Online].
Available: http://doi.acm.org/10.1145/1815961.1815968
[5] N. Clark, A. Hormati, and S. Mahlke, “Veal: Virtualized
execution accelerator for loops,” in Proceedings of the 35th
Annual International Symposium on Computer Architecture,
ser. ISCA ’08. Washington, DC, USA: IEEE Computer
Society, 2008, pp. 389–400. [Online]. Available: http:
//dx.doi.org/10.1109/ISCA.2008.33
[6] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, “Dynamically specialized datapaths for energy efficient computing,”
in High Performance Computer Architecture (HPCA), 2011
IEEE 17th International Symposium on, Feb. 2011, pp. 503–514.
[7] D. Donofrio, L. Oliker, J. Shalf, M. F. Wehner, C. Rowen,
J. Krueger, S. Kamil, and M. Mohiyuddin, “Energyefficient computing for extreme-scale science,” Computer,
vol. 42, no. 11, pp. 62–71, Nov. 2009. [Online]. Available:
http://dx.doi.org/10.1109/MC.2009.353
[8] “Description of alfu architectures,” June 2012. [Online].
Available: http://www.warftindia.org/alfu architecture.pdf
[9] F. N. Najm, “A survey of power estimation techniques in
VLSI circuits,” IEEE Trans. Very Large Scale Integr. Syst.,
vol. 2, no. 4, pp. 446–455, Dec. 1994. [Online]. Available:
http://dx.doi.org/10.1109/92.335013
[10] N. Venkateswaran, V. Srinivasan, R. S. Kannan, P. Thinakaran, R. Hariharan, B. Vasudevan, K. Saravanan,
N. Nachiappan, A. Sridharan, V. Sankaran, V. Adhinarayanan,
V. Sankaranarayanan, and R. Mukundrajan, in IEEE Computer Society Annual Symposium on VLSI, August 2012.