E103.d 2020pap0015

RVCoreP: An Optimized RISC-V Soft Processor of Five-Stage Pipelining
Article in IEICE Transactions on Information and Systems · December 2020

DOI: 10.1587/transinf.2020PAP0015

IEICE TRANS. INF. & SYST., VOL.E103–D, NO.12 DECEMBER 2020
2494
PAPER Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking
RVCoreP: An Optimized RISC-V Soft Processor of Five-Stage Pipelining
Hiromu MIYAZAKI†a) , Student Member, Takuto KANAMORI†b) , Md Ashraful ISLAM†c) , Nonmembers,
and Kenji KISE†d) , Member
SUMMARY RISC-V is a RISC-based, open and royalty-free instruction set architecture which has been developed since 2010, and can be used for cost-effective soft processors on FPGAs. The basic 32-bit integer instruction set in RISC-V is defined as RV32I, which is sufficient to support an operating system environment and suits embedded systems. In this paper, we propose an optimized RV32I soft processor named RVCoreP adopting five-stage pipelining. Three effective methods are applied to the processor to improve the operating frequency. These methods are instruction fetch unit optimization, ALU optimization, and data memory optimization. We implement RVCoreP in Verilog HDL and verify the behavior using Verilog simulation and an actual Xilinx Artix-7 FPGA board. We evaluate IPC (instructions per cycle), operating frequency, hardware resource utilization, and processor performance. From the evaluation results, we show that RVCoreP achieves 30.0% performance improvement compared with VexRiscv, which is a high-performance and open source RV32I processor selected from some related works.
key words: soft processor, FPGA, RISC-V, RV32I, Verilog HDL, five-stage pipelining

Manuscript received January 7, 2020.
Manuscript revised May 22, 2020.
Manuscript publicized September 7, 2020.
† The authors are with the School of Computing, Tokyo Institute of Technology, Tokyo, 152–8552 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
d) E-mail: [email protected]
DOI: 10.1587/transinf.2020PAP0015
Copyright © 2020 The Institute of Electronics, Information and Communication Engineers

1. Introduction

RISC-V [1] is becoming popular as an open and royalty-free instruction set architecture (ISA) which has been developed at the University of California, Berkeley since 2010. It can be used for cost-effective soft processors on FPGAs like MicroBlaze [2] and Nios II [3]. The purpose of our research is to design and implement a cost-effective RISC-V scalar processor adopting a typical pipeline configuration.

The RISC-V ISA is defined as some basic integer instruction sets and some extended instruction sets. We can select the necessary instruction sets according to the application requirements [4]. The basic 32-bit integer instruction set is defined as RV32I. Other typical extended instruction sets are defined as M for integer multiplication and division instructions, F for single-precision floating-point ones, D for double-precision floating-point ones, and A for atomic ones. In addition to these, a 32-bit general-purpose instruction set is defined as RV32G as the set of RV32I, M, A, F, and D. This is an instruction set architecture for general-purpose computing systems of a broad range. RV64G is a 64-bit version of a general-purpose instruction set.

Among these instruction sets, we focus on RV32I in this paper because it is sufficient to support an operating system environment and suits embedded systems. RV32I can emulate other extensions of M, F, and D, and can be configured with fewer hardware resources than processors supporting RV32G. Although several soft processors that support RV32I have been released [5], they are not highly optimized for FPGAs.

In this paper, we propose an optimized RV32I soft processor named RVCoreP of five-stage pipelining which is highly optimized for FPGAs. The main contributions of this paper are as follows.

• We propose an optimized RV32I soft processor of five-stage pipelining highly optimized for FPGAs. To improve the operating frequency, three optimization methods are applied to the processor. They are instruction fetch unit optimization, ALU optimization, and data memory optimization.
• We implement the proposal in Verilog HDL and evaluate IPC (instructions per cycle), operating frequency, hardware resource utilization, and processor performance. From the evaluation results, we show that the proposed processor achieves much better performance than VexRiscv, which is a high-performance and open source RV32I processor.

2. Related Works

Rocket Core [6] is a RISC-V in-order scalar processor developed by the University of California, Berkeley. It is a pipelined processor supporting RV32G and RV64G. It supports processing of privilege levels, and has an MMU (memory management unit) with virtual memory and data cache, and a branch prediction unit. Because of this rich functionality and the difficulty of customization, it is not suitable for embedded systems.

Rocket Core has another drawback. It is written in Chisel [7], a domain-specific language based on Scala. Because Chisel is a relatively new hardware description language, introduced in 2012, it may be difficult for hardware developers who have not mastered Chisel to change the design effortlessly. According to the work [5], Verilog HDL and SystemVerilog are the dominant languages used to implement the processors, and they may be the best choice for easy-to-use processor
implementations. Therefore, we implement our processors in Verilog HDL, a dominant hardware description language.

VexRiscv [8] is a RISC-V pipelined soft processor supporting RV32I. The integer multiplication and division, other extensions, and the MMU with instruction cache and data cache can be added as options. In addition, the branch prediction scheme, implementation choice of shift instruction, data forwarding path, and so on can be tuned for implementation. VexRiscv is written in an open source and new hardware description language called SpinalHDL [9], and the corresponding RTL description can be generated as a Verilog HDL file. Since the generated Verilog HDL code is not hierarchical, debugging and understanding this generated code is not easy.

VexRiscv won 1st place in the highest-performance implementation category of the RISC-V SoftCPU Contest in 2018 hosted by the RISC-V Foundation [10]. Therefore, it is a soft processor optimized for high performance, and the highest-performance RV32I soft processor available as open source as far as we know. We use VexRiscv as a reference for making the comparison with our proposed processors.

There are other RISC-V processors for education such as riscv-mini [11] and Sodor Processor [12], both developed by the University of California, Berkeley, and Clarvi [13] developed by the University of Cambridge. These educational RISC-V processors are easy to use, but their performance is not as high as that of VexRiscv because they are not highly optimized for high performance.

3. Design of a Typical Five-Stage Pipelined Processor

We design a typical five-stage pipelined processor with branch prediction with reference to [14], and this design is used as a baseline for the proposal.

Figure 1 shows a block diagram of the baseline, which consists of five stages: the instruction fetch stage (If stage), instruction decode stage (Id stage), instruction execution stage (Ex stage), memory access stage (Ma stage), and write back stage (Wb stage).

Fig. 1 A block diagram of typical five-stage pipelined processor (baseline).

The green rectangles are registers that are updated at the positive clock edge. The yellow rectangles are modules including the memory which is composed of block RAM on Xilinx FPGA. The gray rectangle is a register file consisting of 32 registers that is read asynchronously. The red modules are an ALU or adders, and the other blue modules are combinational circuits.

The baseline has an instruction memory named m_imem shown at the bottom left of the figure, and a data memory named m_dmem shown at the right of the figure.

The branch prediction scheme is gshare [15] which contains a branch history register (BHR) named r_BHR, a pattern history table (PHT) named m_PHT, and a branch target buffer (BTB) named m_BTB. To mitigate the data hazard, it has two forwarding paths. The red path from the Ma stage to the Ex stage provides the register value for the next dependent instruction. Similarly, the blue path provides the register value from the Wb stage to the Ex stage.

In the If stage, the instruction is fetched from the instruction memory using the program counter (PC) as an address. The register for PC named r_pc is updated in every cycle with the next PC value named w_npc.

There are four candidates for w_npc in the following descending priority order. The highest priority one is the correct PC value named ExMa_pc_true from the Ma stage. The second priority one is the current PC value from r_pc in case of pipeline stalling. The third priority one is the branch target address named w_btb which is output from the BTB. The lowest priority one is r_pc+4 for the instruction of the next address.

There are three control signals to select the proper one among four candidates. The first signal is named w_bmis
which indicates whether a branch misprediction has occurred. The second one is named w_stall for pipeline stalling due to the data dependency on the load instruction. The third one is named w_btkn from the branch predictor to provide a prediction result as predicted taken or not taken.

In the baseline, the path that determines the next PC value from the four candidates through a multiplexer using three control signals is the critical path that determines the maximum operating frequency. The next critical path is the data path to store the executed result in the Ex stage from the ALU which uses two data forwarding values. Another slow path is aligning and sign-extending the values read from the data memory in the Ma stage which will be stored into the MaWb pipeline register.

In our proposed processor, the operating frequency is improved by optimizing these critical paths.

Code 1 The simplified description of a typical ALU.
 1  module ALU (in1, in2, sel, rslt);
 2    input  wire [31:0] in1, in2;
 3    input  wire [2:0]  sel;
 4    output wire [31:0] rslt;
 5    reg [31:0] r_rslt;
 6    always @(*) begin
 7      case(sel)
 8        0 : r_rslt = in1 + in2;        // add
 9        1 : r_rslt = in1 - in2;        // sub
10        2 : r_rslt = in1 ^ in2;        // ex-or
11        3 : r_rslt = in1 | in2;        // or
12        4 : r_rslt = in1 & in2;        // and
13        5 : r_rslt = in1 << in2[4:0];  // shift left
14        6 : r_rslt = in1 >> in2[4:0];  // shift right
15        default : r_rslt = 0;
16      endcase
17    end
18    assign rslt = r_rslt;
19  endmodule
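The selection of w_npc described in Section 3 can be summarized with a small behavioural model. The following Python sketch is only an illustration and is not the authors' Verilog HDL: the function name select_next_pc is hypothetical, and the pairing of each control signal with one candidate (w_bmis with ExMa_pc_true, w_stall with r_pc, w_btkn with w_btb) is an assumption inferred from the priority order given in the text.

```python
MASK32 = 0xFFFFFFFF  # PC values are modelled as 32-bit unsigned integers


def select_next_pc(r_pc, ExMa_pc_true, w_btb, w_bmis, w_stall, w_btkn):
    """Behavioural model of the next-PC choice (hypothetical helper).

    The checks follow the descending priority order stated in the text;
    which signal guards which candidate is an assumption.
    """
    if w_bmis:       # misprediction detected: redirect to the corrected PC
        return ExMa_pc_true
    if w_stall:      # pipeline stall: hold the current PC
        return r_pc
    if w_btkn:       # predicted taken: use the branch target from the BTB
        return w_btb
    return (r_pc + 4) & MASK32  # otherwise fetch the next sequential address


# Example: a predicted-taken branch at 0x104 whose BTB target is 0x130.
print(hex(select_next_pc(0x104, 0x200, 0x130, False, False, True)))
```

In hardware this chain corresponds to a priority multiplexer driven by three control signals, which is the path the paper identifies as critical in the baseline.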
4. Design and Implementation of RVCoreP

In this section, we propose an optimized RV32I soft processor named RVCoreP (RISC-V core pipelined version). Firstly, we describe three optimization methods. Then, we describe the design and implementation of our proposal.

4.1 ALU Optimization

The data path to store the executed result in ALU using two data forwarding values to the ExMa pipeline register is the critical path in the baseline design. To mitigate the delay of this critical path, we discuss the ALU optimization scheme.

Code 2 The simplified description of the optimized ALU.
 1  module ALU_opt (in1, in2, sel, rslt);
 2    input  wire [31:0] in1, in2;
 3    input  wire [7:0]  sel;
 4    output wire [31:0] rslt;
 5    wire [31:0] w0 = (sel[0]) ? in1 + in2 : 0;
 6    wire [31:0] w1 = (sel[1]) ? in1 - in2 : 0;
 7    wire [31:0] w2 = (sel[2]) ? in1 ^ in2 : 0;
 8    wire [31:0] w3 = (sel[3]) ? in1 | in2 : 0;
 9    wire [31:0] w4 = (sel[4]) ? in1 & in2 : 0;
10    wire [31:0] w5 = (sel[5]) ? in1 >> in2[4:0] : 0;
11    wire [31:0] w6 = (sel[6]) ? in1 << in2[4:0] : 0;
12    assign rslt = w0 ^ w1 ^ w2 ^ w3 ^ w4 ^ w5 ^ w6;
13  endmodule
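The idea behind the two ALU listings (a binary-encoded multiplexer versus one-hot gating followed by exclusive OR) can be checked with a short behavioural model. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the names OPS, alu_mux, and alu_xor are hypothetical, operands are treated as 32-bit unsigned values, and bit i of the one-hot selector is assumed to enable operation i in the numbering of Code 1. Note that Code 2 as printed assigns the right shift to sel[5] and the left shift to sel[6], the reverse of Code 1's numbering; the sketch uses Code 1's order for both models.

```python
MASK32 = 0xFFFFFFFF  # operands and results are modelled as 32-bit unsigned values

# Operations in the order of Code 1 (sel = 0..6).
OPS = [
    lambda a, b: (a + b) & MASK32,          # add
    lambda a, b: (a - b) & MASK32,          # sub
    lambda a, b: a ^ b,                     # ex-or
    lambda a, b: a | b,                     # or
    lambda a, b: a & b,                     # and
    lambda a, b: (a << (b & 31)) & MASK32,  # shift left by in2[4:0]
    lambda a, b: a >> (b & 31),             # logical shift right by in2[4:0]
]


def alu_mux(in1, in2, sel):
    """Binary-encoded select: one multiplexer picks a single result."""
    return OPS[sel](in1, in2) if sel < len(OPS) else 0


def alu_xor(in1, in2, sel_onehot):
    """One-hot select: gate every result with its own bit, then XOR them all."""
    rslt = 0
    for i, op in enumerate(OPS):
        w = op(in1, in2) if (sel_onehot >> i) & 1 else 0
        rslt ^= w  # XOR with 0 leaves the single enabled result unchanged
    return rslt


# With exactly one selector bit set, both models agree.
print(alu_mux(7, 3, 1) == alu_xor(7, 3, 1 << 1))
```

Because all deselected terms are 0, the exclusive OR simply passes the one enabled result through, so no wide multiplexer is needed on the output.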
According to the related work [16], the circuit speed is improved by using exclusive OR instead of a multiplexer to select the operation result for the ALU optimization on FPGA. Therefore, in this design of RVCoreP, exclusive OR is used to select the 32-bit executed result of ALU.

As mentioned in the related work [17], one-hot encoding is used instead of the usual binary encoding for the control signal generation to select the ALU calculation result. Because only one bit of the bit vector is 1 and the other bits are 0, the control decisions are determined by the corresponding flip-flop bits in parallel. Therefore, the proposal adopts a one-hot encoding for ALU.

The code 1 is the simplified description of a typical ALU in the baseline where some operations of RV32I are excluded. The register named r_rslt is the executed result of ALU. This value is selected by the 3-bit signal named sel, which is described in a case statement from line 7 to line 16. Since this description is mapped to hardware as a multiplexer that selects one from eight values, this circuit takes a certain time through several LUTs.

The code 2 is the simplified description of the optimized ALU equivalent to the previous description in the code 1. The executed result of ALU named rslt is selected from eight values including 0 using exclusive OR on line 12. Each selected value is determined in advance using small multiplexers by a one-hot encoded 8-bit selection signal named sel. Since this scheme can select a value without using a large multiplexer, this circuit is faster than the typical one.

The preliminary evaluation of the operating frequency of the ALU alone targeting Xilinx Artix-7 FPGA showed that the frequency of the typical ALU was 230MHz while the frequency of the optimized ALU was 240MHz. This optimization is expected to improve the operating frequency of ALU by about 10MHz.

4.2 Alignment and Sign-Extension Optimization

After applying the ALU optimization and the instruction fetch unit optimization described later, the critical path contains the data memory access, the alignment, and the sign-extension. The combinational circuit of the alignment and the sign-extension is named align/extend on the Ma stage in Fig. 1.

RV32I has five load instructions which are load byte (LB) to load 8-bit signed data, load byte unsigned (LBU) to load 8-bit unsigned data, load halfword (LH) to load 16-bit signed data, load halfword unsigned (LHU) to load 16-bit unsigned data, and load word (LW) to load 32-bit data. Therefore, the align/extend unit has to align the loaded data by shifting 8, 16, or 24 bits right depending on the memory address and operation code of the load instructions. Then,
sign-extension or zero extension is needed for load byte and load halfword instructions. Finally, the unit selects a proper value using a large multiplexer depending on the operation code of the instruction.

We optimized the alignment and sign-extension using an approach similar to the ALU optimization, that is, one-hot encoding and exclusive OR for value selection.

4.3 Instruction Fetch Unit Optimization

We propose the two-stage pipelining of the branch predictor to improve the operating frequency of the instruction fetch unit, which contains the critical path on the baseline processor.

The related works [18], [19] have shown that the pipelining of the branch predictor can improve the operating frequency of the soft processor when a complex branch predictor is used. A similar approach is applied to the proposed branch prediction including gshare and BTB.

Figure 2 (a) shows a block diagram of a general branch predictor and BTB in the baseline where the prediction is made in a single cycle. In the gshare branch predictor, the index to access the PHT named m_PHT is obtained by exclusive OR of PC and BHR named r_BHR. If the fetched instruction is predicted as a conditional branch instruction using the BTB, it updates the BHR speculatively using the branch prediction in the If stage.

Fig. 2 A general configuration and the two-stage pipelined one for the branch prediction mechanism including gshare, BTB and instruction memory.

The combinational circuit named join shifts BHR left by 1 bit, and connects the branch prediction result to the least significant bit of the BHR. If the branch prediction missed, the BHR is updated with the correct branch history. The combinational circuit named comb that receives the value read from the BTB and the value read from the PHT generates the branch prediction named w_btkn.

The critical path of a general branch prediction mechanism is the red data path in Fig. 2 (a) which includes the access of the BTB, three LUTs, a multiplexer, wiring delays, and clock skew. In our preliminary evaluation, the access of the BTB composed of block RAM on a Xilinx Artix-7 FPGA takes about 2.54ns on the red path. Since the access of one LUT takes about 0.12ns, the access to the three LUTs necessary on the red path takes about 0.36ns. Also, the access to a multiplexer implemented using hard macro takes about 0.24ns, and other wiring delays and clock skew take about 3ns. Adding the above delays, the total delay of the path exceeds 6.1ns (165MHz).

To improve the operating frequency for the proposed processor, we split this critical path by two registers. Figure 2 (b) shows the block diagram of the pipelined gshare and pipelined BTB for RVCoreP. The red critical path in Fig. 2 (a) is divided into three paths by two inserted registers named r_btb and r_pcx. The data acquired from the BTB is stored in the register r_btb, and the register r_pcx is inserted before exclusive OR to generate the index of PHT.

It takes two cycles to determine the value of the next PC in the instruction fetch stage. In the first cycle in the preIf stage, accessing the BTB and exclusive OR processing to determine the PHT index are performed. In the second cycle in the If stage, the value of the next PC is determined by using the results from the preIf stage and the instruction is fetched from the instruction memory.

Figure 3 shows the pipeline diagram of instruction fetching using the pipelined branch prediction in RVCoreP. The rectangles written as preIf represent the processing of the preIf stage, and the rectangles written as If represent the processing of the If stage. Assume that four instructions are fetched in the order of Inst A, Inst B, Inst M, and Inst N. Inst A and Inst N are add and sub instructions, and their instruction addresses are 0x100 and 0x134, respectively. Inst B and Inst M are beq (branch if equal) and bne (branch if not equal) instructions, and their instruction addresses are 0x104 and 0x130, respectively. The next PC of Inst B is 0x130 when the branch is taken.

Fig. 3 The pipeline diagram of instruction fetching using pipelined branch prediction mechanism in RVCoreP.

In the clock cycle 1 when the value of PC is 0x100, the If stage for Inst A and the preIf stage for the next instruction are executed. In the preIf stage, the value of BTB and PHT index used in the If stage for the next instruction 0x104 are prepared by using the current PC value of 0x100.

In the clock cycle 2 when the value of PC is 0x104, the
Fig. 4 The block diagram of the proposed processor named RVCoreP.
If stage for Inst B and the preIf stage for the next instruction are executed. In the If stage for Inst B, the next PC value is determined by using the value prepared in the preIf stage one cycle before. In the preIf stage, the value of BTB and PHT index used in the If stage for the next instruction 0x108 are prepared by using the current PC value of 0x104. Since Inst B is a conditional branch and assuming that it is predicted as taken, the next PC value is 0x130.
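The timing of the preIf and If stages in this example can be traced with a small model. The Python sketch below is a simplified illustration, not the authors' design: the function name prediction_trace is hypothetical, the treatment of the very first fetch (no prepared data) is an assumption, and the validity check is the rule the paper states for clock cycle 3, namely that the prepared data may be used only when the previous PC plus 4 equals the current PC.

```python
def prediction_trace(pcs):
    """Trace which address the preIf data was prepared for in every cycle.

    Returns one tuple per cycle: (current PC, address the prepared BTB/PHT
    data belongs to, whether that data is usable in the If stage).
    """
    trace = []
    prev_pc = None
    for pc in pcs:
        if prev_pc is None:
            # Assumption: nothing has been prepared before the first fetch.
            prepared_for, valid = None, False
        else:
            # preIf ran one cycle earlier and assumed sequential fetching.
            prepared_for = prev_pc + 4
            valid = prepared_for == pc
        trace.append((pc, prepared_for, valid))
        prev_pc = pc
    return trace


# PC sequence of the example: Inst A, Inst B (predicted taken), Inst M, Inst N.
for pc, prepared_for, valid in prediction_trace([0x100, 0x104, 0x130, 0x134]):
    target = "none" if prepared_for is None else hex(prepared_for)
    print(hex(pc), "prepared for", target, "valid" if valid else "invalid")
```

In this trace the data used at 0x104 and 0x134 is valid, while the data available at 0x130 was prepared for 0x108 and is therefore discarded, which is the source of the small accuracy loss of the pipelined predictor.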
In the clock cycle 3 when the value of PC is 0x130, the If stage for Inst M and the preIf stage for the next instruction are executed. In the If stage for Inst M, the next PC value is determined as well but the branch prediction is invalid, because the value prepared in the preIf stage one cycle before is for the instruction whose address is 0x108, and this value cannot be used in the If stage for Inst M whose address is 0x130. Therefore, if the value obtained by adding 4 to the previous PC does not match the current PC value, the branch prediction is invalid.

From the above, the PC value used for BTB access is the one cycle earlier value of PC, and the PC value used for PHT access is the value one cycle before the branch prediction is output. As a result, gshare outputs a prediction in 2 cycles. The BTB entry is updated using a value obtained by subtracting 4 from the PC value of the branch instruction. When updating a PHT entry, we have to keep the PHT index value used for the prediction and update the PHT entry using this index when the actual branch outcome becomes available.

The prediction accuracy might drop slightly due to the adverse effect of this optimization, which makes a prediction and updates the index with the one cycle earlier value of PC.

4.4 RVCoreP Soft Processor

Figure 4 shows the block diagram of RVCoreP which is a five-stage pipelined processor including an instruction memory, a data memory, pipelined gshare and pipelined BTB. The ALU optimization, the alignment and sign-extension optimization, and the instruction fetch unit optimization are applied to the proposal. The unit named ALU_opt in Fig. 4 is the optimized ALU.

The detection timing of the load-use dependency between a load instruction and the following instruction using the loaded data is changed from the Id stage on the baseline in Fig. 1 to the If stage using the combinational circuit named Load-use. To support the detection, a part of the instruction decoder named decoder_if is implemented in the If stage. decoder_if decodes two source registers and one destination register for one instruction, and generates the write signals for the register file and data memory. This partial decoding of the instruction in the If stage allows us to detect the data dependency including load-use dependency in advance.

Fig. 5 The pipeline diagrams of the proposed processor.

Figure 5 shows the pipeline diagrams of the proposed
processor. Figure 5 (a) shows the case where the pipeline is flushed due to a branch prediction miss. In the branch prediction mechanism, the branch target address from the BTB is used when the BTB is hit and the branch is predicted to be taken. The calculation of the correct branch destination address and the check of whether the branch prediction is correct are executed in the Ex stage and stored in the ExMa pipeline register. If the branch instruction is at the Ma stage and the branch prediction missed, the instructions in the If stage, Id stage, and Ex stage are flushed, which incurs a 3-cycle penalty.

Figure 5 (b) shows the case where the pipeline stalls due to the load-use dependency. In that case, the dependency is avoided by stalling the instruction following the load instruction. Using decoder_if to partially decode the instruction in the If stage helps to detect the dependency between a load instruction in the Id stage and an instruction in the If stage, and the detection result is stored in the IdEx pipeline register. If the load instruction is in the Ex stage and there is a data dependency on the load instruction, this processor inserts a bubble in the IdEx pipeline register, and stalls the instructions in the If stage and Id stage, which incurs a one-cycle penalty.

5. Verification and Evaluation

5.1 Verification

We verified the implemented RTL code by Verilog simulation. A RISC-V processor simulator modeling a conservative multi-cycle processor named SimRV that we implemented in C++ is used as the reference model.

SimRV outputs the PC value, the executed instruction, and the 32 values stored in the register file, when a RISC-V program binary is given. By executing the same binary using SimRV and Verilog simulation for our designed processors, log files of the same format can be output. We executed the two benchmark binaries used in the evaluation described later, and compared each log file. We have confirmed that the values in the two log files match and the programs are executing correctly.

In addition to the verification through simulations, we verified the behavior of the designed processor using an FPGA board. The same RISC-V program binary used for Verilog simulation is executed on the actual Xilinx Artix-7 FPGA board, and we have confirmed that the ASCII character output of the execution results via serial communication matched the correct result, and that the numbers of execution cycles and executed instructions also matched.

5.2 Evaluation Environment

We implement four versions of the proposed processor in Verilog HDL and evaluate them in terms of IPC, operating frequency, hardware resource utilization, and processor performance. We also make two configurations for the VexRiscv processor that supports RV32I, and use these configurations for comparative evaluation with the proposals. The code used for RVCoreP is Ver.0.4.5. The code version of the VexRiscv processor used for the evaluation is SpinalHDL/VexRiscv@ca228a3 committed on 26 September 2019 in GitHub page [8].

The four versions of RVCoreP are named as follows. The version that applies all the optimizations described above is called RVP-optALL, and the simple version that does not apply any optimizations is defined as RVP-simple. The version that applies only the ALU optimization and the alignment and sign-extension optimization is specified as RVP-optALU, and the version that applies only the instruction fetch unit optimization is defined as RVP-optIF.

For VexRiscv, VR-nobp denotes the configuration without the branch prediction, and VR-bp denotes the configuration with the branch prediction. We set the parameters of VexRiscv as follows to make the configuration as close as possible to RVCoreP. They are reading the register file asynchronously, using a shift instruction implemented with a full barrel shifter that completes in one cycle, and utilizing the data forwarding path.

VR-bp implements a bimodal branch predictor and a BTB. The prediction scheme of the proposal is a gshare branch predictor, which achieves higher prediction accuracy than the bimodal predictor of VR-bp.

Our comparison between VR-bp and the proposal is fair because the two prediction schemes are implemented by using the same size of block RAM.

For the proposal, the number of BTB entries is 512, and the number of PHT entries for gshare is 8,192. They are implemented as 4KB block RAM in total. The prediction scheme of VR-bp is configured using the option called Prediction DYNAMIC_TARGET in BranchPlugin. For VR-bp, BTB and PHT for bimodal are implemented as a single table. The number of its entries is set to 512 using the parameter called historyRamSizeLog2. It is also implemented as 4KB block RAM.

To execute the RISC-V program with RVCoreP, we create a system including the proposed processor. This system includes the proposed processor RVCoreP as shown in Fig. 4, an instruction memory, a data memory, and the modules for RS-232C serial communication with a communication buffer. This system reads the RISC-V program binary and operates for Verilog simulation. Also, the same program for Verilog simulation runs on a Nexys 4 DDR board with Xilinx Artix-7 FPGA [20] that receives the same binary through the serial communication module. This system can output the same characters as the simulation by serial communication. The number of lines of code for this system is 1,487, of which the processor RVCoreP has 832 lines of code. This system is used to evaluate IPC, operating frequency, and hardware resource utilization. By replacing the processor part of this system with the VexRiscv processor, VexRiscv is evaluated in the same environment.

IPC is evaluated by Verilog simulation using Dhrystone [21] and Coremark [22] as benchmarks. We used the Dhrystone source code published in riscv-tests [23] and
2500
NUMBER OF RUNS was set to 2000. The number of ex-

ecuted instructions for Dhrystone is 909,443. We used the 5.3 Evaluation Results
Coremark source code [24] released for RISC-V and ITER-
ATIONS was set to 2. The number of executed instructions Table 1 shows the evaluation results of IPC and branch ac-
for Coremark is 1,481,298. The source codes of each bench- curacy obtained by Verilog simulation. This shows IPC and
mark are compiled by using the RISC-V RV32I cross com- the number of prediction hit and miss, and prediction hit
piler. The RISC-V gcc cross compiler version 8.3.0 has been rate for each of the two benchmarks, and the average IPC of
used, and the used optimization flag was -O2. For bench- these two benchmarks.
mark program simulation to evaluate IPC, the size of both Regarding IPC and the branch prediction hit rate of
instruction memory and data memory was set as 32KB for each benchmark, the four versions of RVCoreP outperform
all processor configurations because a processor with 16KB the two versions of VexRiscv. Because the predictors of the
data memory can not execute Coremark. same memory size are used for VR-bp and the proposal,
The operating frequency and hardware resource utilization are evaluated targeting the Nexys 4 DDR board [20], which has an xc7a100tcsg324-1 FPGA of the Xilinx Artix-7 family. Xilinx Vivado 2017.2 is used to evaluate the operating frequency and hardware resource utilization. The Flow_PerfOptimized_high strategy is used for logic synthesis, and the Performance_ExplorePostRoutePhysOpt strategy is used for placement and routing. We performed logic synthesis and placement and routing while changing the clock constraint in 5MHz increments. The highest frequency that satisfies the constraints is used as the operating frequency of the processor. For the hardware resource evaluation, we used the result of placement and routing at the maximum operating frequency. To measure the highest operating frequency, the size of the instruction memory and of the data memory is set to 4KB for all processor configurations, because this is the smallest memory size obtained by using one block RAM. To stabilize the operating frequency of the evaluated system, placement and routing are performed using only one clock region of the FPGA.

The processor performance is calculated by multiplying the average IPC by the operating frequency.

RVCoreP, which has the better gshare prediction scheme, achieves a better hit rate than VR-bp. RVP-simple achieves higher IPC than VR-bp because it has fewer stalled clock cycles due to load and shift instructions and fewer branch mispredictions than VR-bp.

VexRiscv stalls two cycles when a load-use dependency occurs, while RVCoreP stalls just one cycle for that dependency. VexRiscv stalls one cycle when an executed shift instruction has a data dependency with the following instruction, while RVCoreP has no stall for shift instructions.

Note that the prediction accuracy of the branch predictor drops due to the pipelined branch prediction. Therefore, the IPC of RVP-simple and RVP-optALU is higher than the IPC of RVP-optIF and RVP-optALL.

Figure 6 shows the IPC for each configuration obtained by Verilog simulation. The orange bars are used for Dhrystone, the yellow bars for Coremark, and the green bars for the average. As a whole, Dhrystone has a simpler program structure than Coremark and a higher branch prediction hit rate. Therefore, Dhrystone tends to have a higher IPC than Coremark. From this figure, we confirm that the four versions of RVCoreP outperform the two versions of VexRiscv.
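The frequency search described in the evaluation setup (stepping the clock constraint in 5MHz increments and keeping the highest frequency that still meets timing) can be pictured as a simple sweep. The sketch below is only an illustration under stated assumptions: `meets_timing` is a hypothetical callback standing in for a full synthesis and place-and-route run at the given constraint, and the sweep bounds are made-up defaults rather than values from the paper.

```python
from typing import Callable, Optional


def clock_period_ns(freq_mhz: float) -> float:
    """Convert a target frequency in MHz into a clock period constraint in ns."""
    return 1000.0 / freq_mhz


def find_max_frequency(meets_timing: Callable[[int], bool],
                       start_mhz: int = 100,   # assumed lower bound of the sweep
                       stop_mhz: int = 300,    # assumed upper bound of the sweep
                       step_mhz: int = 5) -> Optional[int]:
    """Return the highest swept frequency (MHz) that meets timing, or None.

    meets_timing is a hypothetical stand-in for running logic synthesis and
    placement and routing with the given constraint and checking the result.
    """
    best = None
    for freq in range(start_mhz, stop_mhz + 1, step_mhz):
        # Every step is tried, because a passing result at one constraint does
        # not guarantee that all lower constraints pass as well.
        if meets_timing(freq):
            best = freq
    return best


if __name__ == "__main__":
    # Toy stand-in: pretend the design closes timing up to 190 MHz.
    print(find_max_frequency(lambda f: f <= 190))
```

Scanning the whole range instead of stopping at the first failure matches the stated rule of taking the highest passing frequency, at the cost of more tool runs.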
Table 1 The evaluation results of IPC and branch prediction accuracy obtained by Verilog simulation.

| Label | Dhrystone IPC | Dhrystone prediction hit | Dhrystone prediction miss | Dhrystone hit rate | Coremark IPC | Coremark prediction hit | Coremark prediction miss | Coremark hit rate | Average IPC |
|---|---|---|---|---|---|---|---|---|---|
| VR-nobp | 0.661 | N/A | N/A | N/A | 0.591 | N/A | N/A | N/A | 0.626 |
| VR-bp | 0.836 | 146,180 | 29,452 | 0.832 | 0.766 | 348,010 | 109,701 | 0.760 | 0.801 |
| RVP-simple | 0.946 | 205,127 | 12,507 | 0.943 | 0.828 | 366,726 | 91,247 | 0.801 | 0.887 |
| RVP-optALU | 0.946 | 205,127 | 12,507 | 0.943 | 0.828 | 366,726 | 91,247 | 0.801 | 0.887 |
| RVP-optIF | 0.935 | 201,153 | 16,481 | 0.924 | 0.823 | 363,439 | 94,534 | 0.794 | 0.879 |
| RVP-optALL | 0.935 | 201,153 | 16,481 | 0.924 | 0.823 | 363,439 | 94,534 | 0.794 | 0.879 |
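As a quick consistency check on Table 1, the hit rate can be recomputed as hits / (hits + misses) and the average IPC as the arithmetic mean of the two benchmark IPCs. The sketch below assumes exactly those definitions; they agree with the tabulated values to three decimal places, but the formulas are not spelled out in the text, and all function and variable names are illustrative only.

```python
# IPC values and prediction counts copied from Table 1 (one row per distinct result).
TABLE1 = {
    # label: (Dhrystone IPC, hits, misses, Coremark IPC, hits, misses)
    "VR-bp":      (0.836, 146_180, 29_452, 0.766, 348_010, 109_701),
    "RVP-simple": (0.946, 205_127, 12_507, 0.828, 366_726, 91_247),
    "RVP-optIF":  (0.935, 201_153, 16_481, 0.823, 363_439, 94_534),
}


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of branch predictions that were correct."""
    return hits / (hits + misses)


def average_ipc(dhrystone_ipc: float, coremark_ipc: float) -> float:
    """Arithmetic mean of the two benchmark IPCs (assumed definition)."""
    return (dhrystone_ipc + coremark_ipc) / 2


def summarize(table):
    """Return {label: (Dhrystone hit rate, Coremark hit rate, average IPC)}."""
    return {
        label: (hit_rate(d_hit, d_miss),
                hit_rate(c_hit, c_miss),
                average_ipc(d_ipc, c_ipc))
        for label, (d_ipc, d_hit, d_miss, c_ipc, c_hit, c_miss) in table.items()
    }


if __name__ == "__main__":
    for label, (d_rate, c_rate, avg) in summarize(TABLE1).items():
        print(f"{label:<11} {d_rate:.3f} {c_rate:.3f} {avg:.3f}")
```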
Table 2 The evaluation results of frequency, hardware resource utilization, and performance where 4KB memories are used.

| Label | Operating frequency (MHz) | Slice LUT | Slice register | Slice | Increase rate of slice | Average IPC | Processor performance | Normalized performance |
|---|---|---|---|---|---|---|---|---|
| VR-nobp | 205 | 936 | 562 | 284 | 1.000 | 0.626 | 128.4 | 1.000 |
| VR-bp | 140 | 944 | 611 | 300 | 1.056 | 0.801 | 112.1 | 0.873 |
| RVP-simple | 160 | 1,020 | 715 | 349 | 1.229 | 0.887 | 141.9 | 1.105 |
| RVP-optALU | 170 | 1,070 | 730 | 375 | 1.320 | 0.887 | 150.8 | 1.174 |
| RVP-optIF | 180 | 1,044 | 749 | 390 | 1.373 | 0.879 | 158.2 | 1.232 |
| RVP-optALL | 190 | 1,073 | 764 | 397 | 1.398 | 0.879 | 167.0 | 1.300 |
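The derived columns of Table 2 follow from the rule that processor performance is the average IPC multiplied by the operating frequency, with VR-nobp as the baseline for normalization. The sketch below recomputes them from the rounded table entries; the names are illustrative, and small differences in the last digit (for example about 128.3 instead of 128.4 for VR-nobp) are to be expected, presumably because the published figures were computed from unrounded IPC values.

```python
# Operating frequency (MHz), slice count, and average IPC copied from Table 2.
TABLE2 = {
    "VR-nobp":    (205, 284, 0.626),
    "VR-bp":      (140, 300, 0.801),
    "RVP-simple": (160, 349, 0.887),
    "RVP-optALU": (170, 375, 0.887),
    "RVP-optIF":  (180, 390, 0.879),
    "RVP-optALL": (190, 397, 0.879),
}


def derive(table, baseline="VR-nobp"):
    """Recompute performance, normalized performance, and slice increase rate."""
    base_freq, base_slices, base_ipc = table[baseline]
    base_perf = base_freq * base_ipc
    rows = {}
    for label, (freq_mhz, slices, ipc) in table.items():
        perf = freq_mhz * ipc  # IPC x MHz = million instructions per second
        rows[label] = {
            "performance": perf,
            "normalized": perf / base_perf,
            "slice_increase": slices / base_slices,
        }
    return rows


if __name__ == "__main__":
    for label, row in derive(TABLE2).items():
        print(f"{label:<11} {row['performance']:6.1f} "
              f"{row['normalized']:.3f} {row['slice_increase']:.3f}")
```

Passing the 32KB figures in the same `(frequency, slices, IPC)` layout would give the corresponding columns for that configuration.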
Table 2 summarises the evaluation results of operating frequency, hardware resource utilization, and processor performance for Artix-7 FPGA where 4KB memories are used. From the hardware resource utilization in this table, VexRiscv is more resource-saving than RVCoreP as a whole. This is because the proposed processor has a more complicated architecture to achieve higher performance than VexRiscv, and no specific optimizations to reduce hardware resources are applied.

The following three architectural modifications from VexRiscv seem to increase the hardware resources of our processor: a change from bimodal to gshare, a change from binary encoding to one-hot encoding, and a change of the locations of some modules among pipeline stages to improve the operating frequency.

The increase rate of slice shown in the 6th column is normalized with VR-nobp as 1. The slice usage of RVP-optALL is 397, which is a 39.8% increase compared to VR-nobp. 48 LUTs are used as memory (LUTRAM), which are inferred for the register file of the processor. Except for VR-nobp, which has no branch prediction, only one block RAM is used for the tables in branch prediction. In all configurations, the 4KB instruction memory and data memory consist of two block RAMs.

Figure 7 shows the maximum operating frequency for each configuration on Artix-7 FPGA. The configuration of VR-nobp has the highest operating frequency of 205MHz. It can be seen that the operating frequency of the four configurations of RVCoreP is improved by applying each optimization. The best frequency of RVCoreP is 190MHz on RVP-optALL. Note that among the configurations with branch prediction, RVP-optALL achieves a much better operating frequency than VR-bp, which runs at 140MHz.

Figure 8 shows the processor performance by IPC assuming that the operating frequency is the same, where VR-nobp is normalized as 1. RVP-simple and RVP-optALU have the highest performance. RVP-optALL achieves a 40.3% performance improvement compared to VR-nobp because VR-nobp does not have branch prediction and has low IPC. RVP-optALL achieves a 9.74% performance improvement compared to VR-bp. The other configurations of RVCoreP achieve almost the same IPC performance.

Figure 9 shows the processor performance on Artix-7 FPGA where VR-nobp is normalized as 1. This processor performance considers the operating frequency, and each value in the graph is the performance improvement rate from VR-nobp. RVP-optALL achieves 30.0% performance
Fig. 6 The IPC for each configuration obtained by Verilog simulation.
Fig. 7 The maximum operating frequency for each configuration on Artix-7 FPGA.
Fig. 8 The processor performance by IPC assuming that the operating frequency is the same where VR-nobp is normalized as 1.
Fig. 9 The processor performance on Artix-7 FPGA where VR-nobp is normalized as 1.
Table 3 The evaluation results of frequency, hardware resource utilization, and performance where 32KB memories are used.

| Label | Operating frequency (MHz) | Slice LUT | Slice register | Slice | Increase rate of slice | Average IPC | Processor performance | Normalized performance |
|---|---|---|---|---|---|---|---|---|
| VR-nobp | 195 | 938 | 537 | 337 | 1.000 | 0.626 | 122.2 | 1.000 |
| VR-bp | 135 | 948 | 611 | 322 | 0.955 | 0.801 | 108.1 | 0.885 |
| RVP-simple | 145 | 1,028 | 711 | 434 | 1.288 | 0.887 | 128.6 | 1.053 |
| RVP-optALU | 160 | 1,095 | 726 | 437 | 1.297 | 0.887 | 141.9 | 1.162 |
| RVP-optIF | 165 | 1,045 | 745 | 368 | 1.092 | 0.879 | 145.0 | 1.187 |
| RVP-optALL | 175 | 1,075 | 760 | 469 | 1.392 | 0.879 | 153.8 | 1.259 |
improvement compared to VR-nobp, which is the highest-performance configuration of VexRiscv, because the proposed processor has a higher IPC than VR-nobp while the three optimizations minimize its decrease in operating frequency. The other configurations of RVCoreP achieve a performance improvement of 10% or more compared to VR-nobp.

The performance of RVP-optALL is 1.176 times that of RVP-simple. Therefore, we achieve a 17.6% performance improvement by using the three proposed optimizations.

We also evaluate the operating frequency and hardware resources when the size of each memory is 32KB. The result is shown in Table 3, where RVP-optALL with 32KB memories achieves a 25.9% performance improvement compared to VR-nobp when running medium-sized benchmark programs such as Coremark.

6. Discussion

The instruction fetch unit optimization is a novel method in that it proposes a fetch unit including a pipelined gshare branch predictor which uses the previous PC.

The related work [25] proposes a similar fetch unit in the BOOM processor, but it differs from our research in that it uses the current PC. That method increases the overhead of processing branch instructions, because it requires several cycles to get the result of branch prediction. The method proposed in the related work [26] also differs from our method in that it does not use the previous PC.

Another related work [19] proposes a similar method, but it differs from our optimization both in targeting a TAGE branch predictor and in not showing the configuration as an instruction fetch unit. TAGE generally has higher prediction accuracy than gshare, but it has the drawback that the operating frequency decreases due to its complicated configuration.

The performance improvement of RVP-optALL compared to VR-nobp is reasonable in light of Pollack's law, which describes the relationship between hardware resources and processor performance. According to the law, performance is proportional to the square root of the hardware resources, so the 40% increase in slice usage compared to VR-nobp corresponds to an empirical improvement of 18% (√1.40 ≈ 1.18). The obtained performance improvement of 30% is higher than this empirical improvement.

Our proposed processor differs from some RISC-V soft processors [27], [28] because they adopt out-of-order execution for high performance, while the target of our proposal is a cost-effective scalar processor with a typical pipeline configuration. Performance comparisons with these soft processors are left as future work.

7. Conclusion

We propose a RISC-V soft processor adopting five-stage pipelining highly optimized for FPGAs. In the proposed processor, the instruction fetch unit optimization, the ALU optimization, and the alignment and sign-extension optimization are applied as effective methods to improve the operating frequency. We implement the proposed processor in Verilog HDL and evaluate its IPC, operating frequency, hardware resource utilization, and processor performance in comparison with the VexRiscv processor.

From the evaluation results, the proposed processor RVP-optALL, to which all optimizations are applied, achieves a 30.0% improvement in processor performance, which takes the operating frequency into account, compared with VR-nobp, the highest-performance configuration of VexRiscv. In addition, the proposed optimization methods achieve a 17.6% performance improvement in RVCoreP.

Acknowledgments

This work is supported by JSPS KAKENHI Grant Number JP16H02794.

References

[1] RISC-V Foundation, “RISC-V | Instruction Set Architecture (ISA).” https://riscv.org/.
[2] Xilinx, MicroBlaze Processor Reference Guide, v2018.2 ed., June 2018.
[3] Intel, Nios II Processor Reference Guide, April 2018.
[4] A. Waterman, Y. Lee, D.A. Patterson, and K. Asanović, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.1,” Tech. Rep. UCB/EECS-2016-118, EECS Department, University of California, Berkeley, May 2016.
[5] R. Höller, D. Haselberger, D. Ballek, P. Rössler, M. Krapfenbauer, and M. Linauer, “Open-Source RISC-V Processor IP Cores for FPGAs — Overview and Evaluation,” 2019 8th Mediterranean Conference on Embedded Computing (MECO), pp.1–6, June 2019.
[6] K. Asanović, R. Avizienis, J. Bachrach, et al., “The Rocket Chip Generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, April 2016.
[7] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, “Chisel: Constructing hardware in a Scala embedded language,” DAC Design Automation Conference 2012, pp.1212–1221, June 2012.
[8] SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation.” https://github.com/SpinalHDL/VexRiscv.
[9] SpinalHDL, “SpinalHDL: An open source high-level hardware description language.” https://github.com/SpinalHDL/SpinalHDL.
[10] RISC-V Foundation, “RISC-V SoftCPU Contest, October 8, 2018.” https://riscv.org/2018/10/risc-v-contest/.
[11] University of California, Berkeley, “riscv-mini: Simple RISC-V 3-stage Pipeline in Chisel.” https://github.com/ucb-bar/riscv-mini.
[12] University of California, Berkeley, “The Sodor Processor: educational microarchitectures for risc-v isa.” https://github.com/ucb-bar/riscv-sodor.
[13] University of Cambridge, “Clarvi: simple RISC-V processor for teaching.” https://github.com/ucam-comparch/clarvi.
[14] D.A. Patterson and J.L. Hennessy, Computer Organization and Design The Hardware / Software Interface, RISC-V Edition, Morgan Kaufmann, 2018.
[15] S. McFarling, “Combining branch predictors,” Tech. Rep., Technical Report TN-36, Digital Western Research Laboratory, 1993.
[16] P. Metzgen, “A High Performance 32-bit ALU for Programmable Logic,” Proc. 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA ’04, New York, NY, USA, pp.61–70, ACM, 2004.
[17] Xilinx, HDL Synthesis for FPGAs Design Guide, 1995.
[18] D.A. Jimenez, “Reconsidering complex branch predictors,” Proc. The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003, pp.43–52, Feb. 2003.
[19] K. Matsui, M.A. Islam, and K. Kise, “An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA,” 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.108–115, Oct. 2019.
[20] Digilent, Inc., Nexys 4 DDR Reference Manual, rev.c ed., 2016.
[21] R.P. Weicker, “Dhrystone: A Synthetic Systems Programming Benchmark,” Commun. ACM, vol.27, no.10, pp.1013–1030, Oct. 1984.
[22] EEMBC, “CoreMark | CPU Benchmark – MCU Benchmark.” https://www.eembc.org/coremark/.
[23] RISC-V Foundation, “riscv-tests.” https://github.com/riscv/riscv-tests.
[24] UC Berkeley Architecture Research, “Setup scripts and files needed to compile CoreMark on RISC-V.” https://github.com/riscv-boom/riscv-coremark.
[25] C. Celio, “A Highly Productive Implementation of an Out-of-Order Processor Generator,” Ph.D. thesis, EECS Department, University of California, Berkeley, Dec. 2018.
[26] S. Manne, A. Klauser, and D. Grunwald, “Pipeline gating: speculation control for energy reduction,” Proc. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235), pp.132–141, 1998.
[27] E. Matthews, Z. Aguila, and L. Shannon, “Evaluating the Performance Efficiency of a Soft-Processor, Variable-Length, Parallel-Execution-Unit Architecture for FPGAs Using the RISC-V ISA,” 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp.1–8, 2018.
[28] S. Mashimo, A. Fujita, R. Matsuo, S. Akaki, A. Fukuda, T. Koizumi, J. Kadomoto, H. Irie, M. Goshima, K. Inoue, and R. Shioya, “An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor,” 2019 International Conference on Field-Programmable Technology (ICFPT), pp.63–71, 2019.

Hiromu Miyazaki received the B.E. degree from the Department of Computer Science, Tokyo Institute of Technology, Japan, in 2019. He is currently a master course student of the Graduate School of Computing, Tokyo Institute of Technology, Japan. His research interest is computer architecture and FPGA computing. He is a student member of IEICE.

Takuto Kanamori is currently a bachelor course student of the School of Computing, Tokyo Institute of Technology, Japan. His research interest is computer architecture and FPGA computing.

Md Ashraful Islam graduated from the University of Rajshahi, Bangladesh. He is a 1st-year doctoral student at the Tokyo Institute of Technology. He has 8 years of experience in the semiconductor industry in ASIC and SoC design and verification. His research interest is in computer architecture, especially processor design and memory sub-system design.

Kenji Kise received the B.E. degree from Nagoya University in 1995, and the M.E. degree and the Ph.D. degree in information engineering from the University of Tokyo in 1997 and 2000, respectively. He is currently an associate professor of the School of Computing, Tokyo Institute of Technology. His research interests include computer architecture and parallel processing. He is a member of ACM, IEEE, IEICE, and IPSJ.