Kenji Kise
Tokyo Institute of Technology
PAPER Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking
Manuscript received January 7, 2020.
Manuscript revised May 22, 2020.
Manuscript publicized September 7, 2020.
† The authors are with the School of Computing, Tokyo Institute of Technology, Tokyo, 152–8552 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
d) E-mail: [email protected]
DOI: 10.1587/transinf.2020PAP0015
Copyright © 2020 The Institute of Electronics, Information and Communication Engineers

SUMMARY RISC-V is a RISC-based, open and royalty-free instruction set architecture which has been developed since 2010, and can be used for cost-effective soft processors on FPGAs. The basic 32-bit integer instruction set in RISC-V is defined as RV32I, which is sufficient to support the operating system environment and suits embedded systems. In this paper, we propose an optimized RV32I soft processor named RVCoreP adopting five-stage pipelining. Three effective methods are applied to the processor to improve the operating frequency. These methods are instruction fetch unit optimization, ALU optimization, and data memory optimization. We implement RVCoreP in Verilog HDL and verify the behavior using Verilog simulation and an actual Xilinx Artix-7 FPGA board. We evaluate IPC (instructions per cycle), operating frequency, hardware resource utilization, and processor performance. From the evaluation results, we show that RVCoreP achieves 30.0% performance improvement compared with VexRiscv, which is a high-performance and open source RV32I processor selected from some related works.
key words: soft processor, FPGA, RISC-V, RV32I, Verilog HDL, five-stage pipelining

1. Introduction

RISC-V [1] is becoming popular as an open and royalty-free instruction set architecture (ISA) which has been developed at the University of California, Berkeley since 2010. It can be used for cost-effective soft processors on FPGAs like MicroBlaze [2] and Nios II [3]. The purpose of our research is to design and implement a cost-effective RISC-V scalar processor adopting a typical pipeline configuration.

The RISC-V ISA is defined as some basic integer instruction sets and some extended instruction sets. We can select the necessary instruction sets according to the application requirements [4]. The basic 32-bit integer instruction set is defined as RV32I. Other typical extended instruction sets are defined as M for integer multiplication and division instructions, F for single-precision floating-point ones, D for double-precision floating-point ones, and A for atomic ones. In addition to these, a 32-bit general-purpose instruction set is defined as RV32G, the set of RV32I, M, A, F, and D. This is an instruction set architecture for a broad range of general-purpose computing systems. RV64G is a 64-bit version of a general-purpose instruction set.

Among these instruction sets, we focus on RV32I in this paper because it is sufficient to support the operating system environment and suits embedded systems. RV32I can emulate the M, F, and D extensions, and can be configured with fewer hardware resources than processors supporting RV32G. Although several soft processors that support RV32I have been released [5], they are not highly optimized for FPGAs.

In this paper, we propose an optimized RV32I soft processor named RVCoreP of five-stage pipelining which is highly optimized for FPGAs. The main contributions of this paper are as follows.

• We propose an optimized RV32I soft processor of five-stage pipelining highly optimized for FPGAs. To improve the operating frequency, three optimization methods are applied to the processor. They are instruction fetch unit optimization, ALU optimization, and data memory optimization.
• We implement the proposal in Verilog HDL and evaluate IPC (instructions per cycle), operating frequency, hardware resource utilization, and processor performance. From the evaluation results, we show that the proposed processor achieves much better performance than VexRiscv, which is a high-performance and open source RV32I processor.

2. Related Works

Rocket Core [6] is a RISC-V in-order scalar processor developed by the University of California, Berkeley. It is a pipelined processor supporting RV32G and RV64G. It supports processing of privilege levels, and has an MMU (memory management unit) with virtual memory and data cache, and a branch prediction unit. Because of this rich functionality and the difficulty of customizing it, it is not suitable for embedded systems.

Rocket Core has another drawback. It is written in Chisel [7], a domain-specific language based on Scala. Because Chisel is a relatively new hardware description language, introduced in 2012, it may be difficult for hardware developers who have not mastered Chisel to change the design effortlessly. According to the work [5], Verilog HDL and SystemVerilog are the dominant languages used to implement the processors, and they may be the best choice for easy-to-use processor
MIYAZAKI et al.: RVCOREP: AN OPTIMIZED RISC-V SOFT PROCESSOR OF FIVE-STAGE PIPELINING
implementations. Therefore, we implement our processors in Verilog HDL, a dominant hardware description language.

VexRiscv [8] is a RISC-V pipelined soft processor supporting RV32I. The integer multiplication and division, other extensions, and the MMU with instruction cache and data cache can be added as options. In addition, the branch prediction scheme, the implementation choice of shift instructions, the data forwarding path, and so on can be tuned for implementation. VexRiscv is written in an open source and new hardware description language called SpinalHDL [9], and the corresponding RTL description can be generated as a Verilog HDL file. Since the generated Verilog HDL code is not hierarchical, debugging and understanding this generated code is not easy.

VexRiscv won first place in the highest-performance implementation category of the RISC-V SoftCPU Contest in 2018 hosted by the RISC-V Foundation [10]. Therefore, it is a soft processor optimized for high performance and, as far as we know, the highest-performance RV32I soft processor available as open source. We use VexRiscv as a reference for comparison with our proposed processors.

There are other RISC-V processors for education such as riscv-mini [11] and Sodor Processor [12], both developed by the University of California, Berkeley, and Clarvi [13], developed by the University of Cambridge. These educational RISC-V processors are easy to use, but their performance is not as high as that of VexRiscv because they are not highly optimized for high performance.

3. Design of a Typical Five-Stage Pipelined Processor

We design a typical five-stage pipelined processor with branch prediction, referring to [14], and this design is used as a baseline for the proposal.

Figure 1 shows a block diagram of the baseline, which consists of five stages: the instruction fetch stage (If stage), instruction decode stage (Id stage), instruction execution stage (Ex stage), memory access stage (Ma stage), and write back stage (Wb stage).

The green rectangles are registers that are updated at the positive clock edge. The yellow rectangles are modules including the memory which is composed of block RAM on Xilinx FPGA. The gray rectangle is a register file, read asynchronously, consisting of 32 registers. The red modules are an ALU or adders, and the other blue modules are combinational circuits.

The baseline has an instruction memory named m_imem shown at the bottom left of the figure, and a data memory named m_dmem shown at the right of the figure.

The branch prediction scheme is gshare [15], which contains a branch history register (BHR) named r_BHR, a pattern history table (PHT) named m_PHT, and a branch target buffer (BTB) named m_BTB. To mitigate data hazards, it has two forwarding paths. The red path from the Ma stage to the Ex stage provides the register value for the next dependent instruction. Similarly, the blue path provides the register value from the Wb stage to the Ex stage.

In the If stage, the instruction is fetched from the instruction memory using the program counter (PC) as an address. The register for PC named r_pc is updated in every cycle with the next PC value named w_npc.

There are four candidates for w_npc, listed here in descending priority order. The highest priority one is the correct PC value named ExMa_pc_true from the Ma stage. The second priority one is the current PC value from r_pc in case of pipeline stalling. The third priority one is the branch target address named w_btb which is output from the BTB. The lowest priority one is r_pc+4 for the instruction at the next address.

There are three control signals to select the proper one among four candidates. The first signal is named w_bmis
which indicates whether a branch misprediction has occurred. The second one is named w_stall for pipeline stalling due to the data dependency on the load instruction. The third one is named w_btkn from the branch predictor to provide a prediction result as predicted taken or not taken.

In the baseline, the path that determines the next PC value from the four candidates through a multiplexer using three control signals is the critical path that determines the maximum operating frequency. The next most critical path is the data path in the Ex stage that stores the result of the ALU, whose inputs use the two data forwarding values. Another slow path is aligning and sign-extending the data read from the data memory in the Ma stage, which is stored into the MaWb pipeline register.

In our proposed processor, the operating frequency is improved by optimizing these critical paths.

Code 1 The simplified description of a typical ALU.
    module ALU (in1, in2, sel, rslt);
      input  wire [31:0] in1, in2;
      input  wire  [2:0] sel;
      output wire [31:0] rslt;
      reg [31:0] r_rslt;
      always @(*) begin
        case (sel)
          0 : r_rslt = in1 + in2;        // add
          1 : r_rslt = in1 - in2;        // sub
          2 : r_rslt = in1 ^ in2;        // ex-or
          3 : r_rslt = in1 | in2;        // or
          4 : r_rslt = in1 & in2;        // and
          5 : r_rslt = in1 << in2[4:0];  // shift left
          6 : r_rslt = in1 >> in2[4:0];  // shift right
          default : r_rslt = 0;
        endcase
      end
      assign rslt = r_rslt;
    endmodule
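The next-PC selection described in this section behaves like a priority multiplexer. The following Python sketch is a behavioural model only, not the Verilog of the baseline. The function name `next_pc` and the pairing of each control signal with one candidate (w_bmis with ExMa_pc_true, w_stall with r_pc, w_btkn with w_btb) are assumptions read off the order in which the text lists the candidates and the signals.

```python
MASK32 = 0xFFFFFFFF  # the PC is modelled as a 32-bit value


def next_pc(r_pc, exma_pc_true, w_btb, w_bmis, w_stall, w_btkn):
    """Behavioural model of the next-PC multiplexer that drives w_npc.

    Candidates are checked in descending priority order. The pairing of
    control signals with candidates is an assumption made for illustration.
    """
    if w_bmis:    # misprediction: redirect to the correct PC from the Ma stage
        return exma_pc_true & MASK32
    if w_stall:   # pipeline stall: keep the current PC
        return r_pc & MASK32
    if w_btkn:    # predicted taken: use the branch target output by the BTB
        return w_btb & MASK32
    return (r_pc + 4) & MASK32  # default: the instruction at the next address
```

For instance, `next_pc(0x104, 0x200, 0x130, False, False, True)` selects the BTB target 0x130, while asserting `w_bmis` in the same call overrides every other candidate. In hardware all four candidates and three control signals feed one multiplexer in a single cycle, which is why the text identifies this path as the critical one.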
If stage for Inst B and the preIf stage for the next instruction are executed. In the If stage for Inst B, the next PC value is determined by using the value prepared in the preIf stage one cycle before. In the preIf stage, the values of the BTB and the PHT index used in the If stage for the next instruction 0x108 are prepared by using the current PC value of 0x104. Since Inst B is a conditional branch and assuming that it is predicted as taken, the next PC value is 0x130.

In clock cycle 3, when the value of PC is 0x130, the If stage for Inst M and the preIf stage for the next instruction are executed. In the If stage for Inst M, the next PC value is determined as well, but the branch prediction is invalid, because the value prepared in the preIf stage one cycle before is for the instruction whose address is 0x108, and this value cannot be used in the If stage for Inst M whose address is 0x130. Therefore, if the value obtained by adding 4 to the previous PC does not match the current PC value, the branch prediction is invalid.

From the above, the PC value used for BTB access is the one cycle earlier value of PC, and the PC value used for PHT access is the value one cycle before the branch prediction is output. As a result, gshare outputs a prediction in 2 cycles. The BTB entry is updated using a value obtained by subtracting 4 from the PC value of the branch instruction. When updating a PHT entry, we have to keep the PHT index value used for the prediction and update the PHT entry using this index when the actual branch outcome becomes available.

The prediction accuracy might drop slightly because this optimization makes the prediction and updates the index with the one cycle earlier value of PC.

4.4 RVCoreP Soft Processor

Figure 4 shows the block diagram of RVCoreP, which is a five-stage pipelined processor including an instruction memory, a data memory, pipelined gshare and pipelined BTB. The ALU optimization, the alignment and sign-extension optimization, and the instruction fetch unit optimization are applied to the proposal. The unit named ALU_opt in Fig. 4 is the optimized ALU.

The detection timing of the load-use dependency between a load instruction and the following instruction using the loaded data is changed from the Id stage on the baseline in Fig. 1 to the If stage using the combinational circuit named Load-use. To support the detection, a part of the instruction decoder named decoder_if is implemented in the If stage. decoder_if decodes two source registers and one destination register for one instruction, and generates the write signals for the register file and data memory. This partial decoding of the instruction in the If stage allows us to detect the data dependency, including the load-use dependency, in advance.

Fig. 5 The pipeline diagrams of the proposed processor.

Figure 5 shows the pipeline diagrams of the proposed
processor. Figure 5 (a) shows the case where the pipeline is flushed due to a branch prediction miss. In the branch prediction mechanism, the branch target address from the BTB is used when the BTB is hit and the branch is predicted to be taken. The calculation of the correct branch destination address and the check of whether the branch prediction is correct are executed in the Ex stage, and the results are stored in the ExMa pipeline register. If the branch instruction is at the Ma stage and the branch prediction missed, the instructions in the If stage, Id stage, and Ex stage are flushed, which incurs a 3-cycle penalty.

Figure 5 (b) shows the case where the pipeline stalls due to the load-use dependency. In that case, the dependency is avoided by stalling the instruction following the load instruction. Using decoder_if to partially decode the instruction in the If stage helps to detect the dependency between a load instruction in the Id stage and an instruction in the If stage, and the detection result is stored in the IdEx pipeline register. If the load instruction is in the Ex stage and there is a data dependency on the load instruction, this processor inserts a bubble in the IdEx pipeline register and stalls the instructions in the If stage and Id stage, which incurs a one-cycle penalty.

5. Verification and Evaluation

5.1 Verification

We verified the implemented RTL code by Verilog simulation. A RISC-V processor simulator named SimRV, which models a conservative multi-cycle processor and which we implemented in C++, is used as the reference model.

SimRV outputs the PC value, the executed instruction, and the 32 values stored in the register file, when a RISC-V program binary is given. By executing the same binary using SimRV and Verilog simulation for our designed processors, log files of the same format can be output. We executed the two benchmark binaries used in the evaluation described later, and compared each log file. We have confirmed that the values in the two log files match and the programs are executing correctly.

In addition to the verification through simulations, we verified the behavior of the designed processor using an FPGA board. The same RISC-V program binary used for Verilog simulation is executed on the actual Xilinx Artix-7 FPGA board, and we have confirmed that the ASCII character output of the execution results via serial communication matched the correct result, and that the numbers of execution cycles and executed instructions also matched.

5.2 Evaluation Environment

We implement four versions of the proposed processor in Verilog HDL and evaluate them in terms of IPC, operating frequency, hardware resource utilization, and processor performance. We also make two configurations for the VexRiscv processor that supports RV32I, and use these configurations for comparative evaluation with the proposals. The code used for RVCoreP is Ver.0.4.5. The code version of the VexRiscv processor used for the evaluation is SpinalHDL/VexRiscv@ca228a3 committed on 26 September 2019 in the GitHub page [8].

The four versions of RVCoreP are named as follows. The version that applies all the optimizations described above is called RVP-optALL, and the simple version that does not apply any optimizations is defined as RVP-simple. The version that applies only the ALU optimization and the alignment and sign-extension optimization is specified as RVP-optALU, and the version that applies only the instruction fetch unit optimization is defined as RVP-optIF.

For VexRiscv, VR-nobp denotes the configuration without the branch prediction, and VR-bp denotes the configuration with the branch prediction. We set the parameters of VexRiscv as follows to make the configuration as close as possible to RVCoreP: reading the register file asynchronously, implementing the shift instructions with a full barrel shifter that completes in one cycle, and utilizing the data forwarding path.

VR-bp implements a bimodal branch predictor and a BTB. The prediction scheme of the proposal is a gshare branch predictor, which achieves higher prediction accuracy than the bimodal predictor of VR-bp.

Our comparison between VR-bp and the proposal is fair because the two prediction schemes are implemented by using the same size of block RAM.

For the proposal, the number of BTB entries is 512, and the number of PHT entries for gshare is 8,192. They are implemented as 4KB block RAM in total. The prediction scheme of VR-bp is configured using the option called Prediction DYNAMIC_TARGET in BranchPlugin. For VR-bp, BTB and PHT for bimodal are implemented as a single table. The number of its entries is set to 512 using the parameter called historyRamSizeLog2. It is also implemented as 4KB block RAM.

To execute the RISC-V program with RVCoreP, we create a system including the proposed processor. This system includes the proposed processor RVCoreP as shown in Fig. 4, an instruction memory, a data memory, and the modules for RS-232C serial communication with a communication buffer. This system reads the RISC-V program binary and operates in Verilog simulation. Also, the same program used for Verilog simulation runs on a Nexys 4 DDR board with a Xilinx Artix-7 FPGA [20] that receives the same binary through the serial communication module. This system can output the same characters as the simulation by serial communication. The number of lines of code for this system is 1,487, of which the processor RVCoreP has 832 lines of code. This system is used to evaluate IPC, operating frequency, and hardware resource utilization. By replacing the processor part of this system with the VexRiscv processor, VexRiscv is evaluated in the same environment.

IPC is evaluated by Verilog simulation using Dhrystone [21] and Coremark [22] as benchmarks. We used the Dhrystone source code published in riscv-tests [23] and
IEICE TRANS. INF. & SYST., VOL.E103–D, NO.12 DECEMBER 2020
Table 1 The evaluation results of IPC and branch prediction accuracy obtained by Verilog simulation.

                                  Dhrystone                                          Coremark
Label        IPC    prediction hit  prediction miss  hit rate    IPC    prediction hit  prediction miss  hit rate    Average IPC
VR-nobp      0.661  N/A             N/A              N/A         0.591  N/A             N/A              N/A         0.626
VR-bp        0.836  146,180         29,452           0.832       0.766  348,010         109,701          0.760       0.801
RVP-simple   0.946  205,127         12,507           0.943       0.828  366,726         91,247           0.801       0.887
RVP-optALU   0.946  205,127         12,507           0.943       0.828  366,726         91,247           0.801       0.887
RVP-optIF    0.935  201,153         16,481           0.924       0.823  363,439         94,534           0.794       0.879
RVP-optALL   0.935  201,153         16,481           0.924       0.823  363,439         94,534           0.794       0.879
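The derived columns of Table 1 can be recomputed from the raw counts. The sketch below assumes that the hit rate is hits divided by the sum of hits and misses, and that the average IPC is the arithmetic mean of the two benchmark IPCs. Both definitions are inferred from the numbers in the table rather than stated in this part of the text, and the names `hit_rate` and `average_ipc` are illustrative.

```python
# (prediction hits, prediction misses) copied from Table 1
COUNTS = {
    "VR-bp":      {"Dhrystone": (146_180, 29_452), "Coremark": (348_010, 109_701)},
    "RVP-simple": {"Dhrystone": (205_127, 12_507), "Coremark": (366_726, 91_247)},
    "RVP-optALL": {"Dhrystone": (201_153, 16_481), "Coremark": (363_439, 94_534)},
}

# (Dhrystone IPC, Coremark IPC) copied from Table 1
IPC = {
    "VR-bp":      (0.836, 0.766),
    "RVP-simple": (0.946, 0.828),
    "RVP-optALL": (0.935, 0.823),
}


def hit_rate(hits, misses):
    """Fraction of branch predictions that were correct (assumed definition)."""
    return hits / (hits + misses)


def average_ipc(dhrystone_ipc, coremark_ipc):
    """Arithmetic mean of the two benchmark IPC values (assumed definition)."""
    return (dhrystone_ipc + coremark_ipc) / 2


if __name__ == "__main__":
    for label, per_bench in COUNTS.items():
        rates = {b: round(hit_rate(*c), 3) for b, c in per_bench.items()}
        print(label, rates, round(average_ipc(*IPC[label]), 3))
```

Under these assumptions the recomputed values agree with the table to three decimal places, and they make the cost of the instruction fetch unit optimization visible: RVP-optALL has more mispredictions than RVP-simple on both benchmarks.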
Table 2 The evaluation results of frequency, hardware resource utilization, and performance where 4KB memories are used.

Label        Operating frequency (MHz)  Slice LUT  Slice register  Slice  Increase rate of slice  Average IPC  Processor performance  Normalized performance
VR-nobp      205                        936        562             284    1.000                   0.626        128.4                  1.000
VR-bp        140                        944        611             300    1.056                   0.801        112.1                  0.873
RVP-simple   160                        1,020      715             349    1.229                   0.887        141.9                  1.105
RVP-optALU   170                        1,070      730             375    1.320                   0.887        150.8                  1.174
RVP-optIF    180                        1,044      749             390    1.373                   0.879        158.2                  1.232
RVP-optALL   190                        1,073      764             397    1.398                   0.879        167.0                  1.300
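The processor performance column of Table 2 is consistent with the operating frequency in MHz multiplied by the average IPC, and the normalized column with that product divided by the VR-nobp value. The sketch below recomputes both under that reading, which is inferred from the numbers in the table. The helper names are illustrative, and small differences in the last digit are expected because the table lists IPC rounded to three decimals.

```python
# label: (operating frequency in MHz, average IPC, processor performance)
# values copied from Table 2
TABLE2 = {
    "VR-nobp":    (205, 0.626, 128.4),
    "VR-bp":      (140, 0.801, 112.1),
    "RVP-simple": (160, 0.887, 141.9),
    "RVP-optALU": (170, 0.887, 150.8),
    "RVP-optIF":  (180, 0.879, 158.2),
    "RVP-optALL": (190, 0.879, 167.0),
}


def performance(freq_mhz, ipc):
    """Instructions per microsecond: cycles per microsecond times IPC."""
    return freq_mhz * ipc


def normalized(label, baseline="VR-nobp"):
    """Reported performance of `label` relative to `baseline`."""
    return TABLE2[label][2] / TABLE2[baseline][2]


if __name__ == "__main__":
    for label, (freq, ipc, reported) in TABLE2.items():
        print(f"{label:11s} {performance(freq, ipc):6.1f} "
              f"(reported {reported}) x{normalized(label):.3f}")
```

Calling `normalized("RVP-optALL", "RVP-simple")` gives the ratio between the fully optimized and the unoptimized RVCoreP, which isolates the effect of the three optimizations from the comparison with VexRiscv.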
Table 2 summarises the evaluation results of operating frequency, hardware resource utilization, and processor performance for Artix-7 FPGA where 4KB memories are used. From the hardware resource utilization in this table, VexRiscv is more resource-saving than RVCoreP as a whole. This is because the proposed processor has a more complicated architecture to achieve higher performance than VexRiscv, and no specific optimizations to reduce hardware resources are applied.

The following three architectural modifications from VexRiscv seem to increase the hardware resources of our processor: a change to gshare from bimodal, a change to the one-hot encoding from the binary encoding, and a change of locations of some modules among pipeline stages to improve the operating frequency.

The increase rate of slice shown in the 6th column is normalized with VR-nobp as 1. The slice usage of RVP-optALL is 397, which is a 39.8% increase compared to VR-nobp. 48 LUTs are used as memory (LUTRAM), which are inferred for the register file of the processor. Except for VR-nobp, which does not have a branch prediction, only one block RAM is used for the tables in branch prediction. In all configurations, the instruction memory and data memory of 4KB consist of two block RAMs.

Figure 7 shows the maximum operating frequency for each configuration on Artix-7 FPGA. The configuration of VR-nobp has the highest operating frequency of 205MHz. It can be seen that the operating frequency of the four configurations of RVCoreP is improved by applying each optimization. The best frequency of RVCoreP is 190MHz on RVP-optALL. Note that among configurations with branch prediction, RVP-optALL achieves a much better operating frequency than VR-bp, which runs at 140MHz.

Fig. 7 The maximum operating frequency for each configuration on Artix-7 FPGA.

Figure 8 shows the processor performance by IPC assuming that the operating frequency is the same, where VR-nobp is normalized as 1. RVP-simple and RVP-optALU have the highest performance. RVP-optALL achieves 40.3% performance improvement compared to VR-nobp because VR-nobp does not have a branch prediction and has low IPC. RVP-optALL achieves 9.74% performance improvement compared to VR-bp. The other configurations of RVCoreP achieve almost the same IPC performance.

Table 3 The evaluation results of frequency, hardware resource utilization, and performance where 32KB memories are used.

Label        Operating frequency (MHz)  Slice LUT  Slice register  Slice  Increase rate of slice  Average IPC  Processor performance  Normalized performance
VR-nobp      195                        938        537             337    1.000                   0.626        122.2                  1.000
VR-bp        135                        948        611             322    0.955                   0.801        108.1                  0.885
RVP-simple   145                        1,028      711             434    1.288                   0.887        128.6                  1.053
RVP-optALU   160                        1,095      726             437    1.297                   0.887        141.9                  1.162
RVP-optIF    165                        1,045      745             368    1.092                   0.879        145.0                  1.187
RVP-optALL   175                        1,075      760             469    1.392                   0.879        153.8                  1.259

Fig. 9 The processor performance on Artix-7 FPGA where VR-nobp is normalized as 1.

Figure 9 shows the processor performance on Artix-7 FPGA where VR-nobp is normalized as 1. This processor performance considers the operating frequency, and each value in the graph is the performance improvement rate from VR-nobp. RVP-optALL achieves 30.0% performance
improvement compared to VR-nobp, which is the highest performance configuration of VexRiscv, because the proposed processor has a higher IPC than VR-nobp while the three optimizations minimize the decrease in its operating frequency. The other configurations of RVCoreP achieve performance improvement of 10% or more compared to VR-nobp.

The performance ratio of RVP-optALL to RVP-simple is 1.176. Therefore, we achieve 17.6% performance improvement by using the three proposed optimizations.

We also evaluate the operating frequency and hardware resources when the size of each memory is 32KB. The result is shown in Table 3, where RVP-optALL with 32KB memories achieves 25.9% performance improvement compared to VR-nobp in case of running medium sized benchmark programs like Coremark.

6. Discussion

The instruction fetch unit optimization is a novel method in terms of proposing the fetch unit including the pipelined gshare branch predictor which uses the previous PC.

The related work [25] proposes a similar fetch unit in the BOOM processor, but this work differs from our research in terms of using the current PC. This method increases the overhead of processing branch instructions, because it requires several cycles to get the result of branch prediction. The method proposed in the related work [26] also differs from our method in terms of not using the previous PC.

Another related work [19] proposes a similar method, but this method differs from our optimization in terms of both targeting a TAGE branch predictor and not showing the configuration as an instruction fetch unit. TAGE generally has higher prediction accuracy than gshare, but it has the drawback that the operating frequency decreases due to its complicated configuration.

The performance improvement of RVP-optALL compared to VR-nobp is reasonable referring to Pollack's law, which describes the relationship between hardware resources and processor performance. According to the law, performance is proportional to the square root of the hardware resources, so the 40% increase in slice usage compared to VR-nobp corresponds to an empirical improvement of 18% (the square root of 1.40 is about 1.18). The obtained performance improvement of 30% is higher than this empirical improvement.

Our proposed processor differs from some RISC-V soft processors [27], [28] because they adopt out-of-order execution for high performance while the target of our proposal is a cost-effective scalar processor with a typical pipeline configuration. The performance comparisons with these soft processors are our future works.

7. Conclusion

We propose a RISC-V soft processor adopting five-stage pipelining highly optimized for FPGAs. In the proposed processor, the instruction fetch unit optimization, the ALU optimization, and the alignment and sign-extension optimization are applied as effective methods to improve the operating frequency. We implement this proposed processor in Verilog HDL and evaluate IPC, operating frequency, hardware resource utilization, and processor performance compared with the VexRiscv processor.

From the evaluation results, the proposed processor RVP-optALL, which applies all optimizations, achieves 30.0% improvement in processor performance considering operating frequency compared with VR-nobp, which is the highest performance configuration of VexRiscv. In addition, the proposed optimization methods achieve 17.6% performance improvement in RVCoreP.

Acknowledgments

This work is supported by JSPS KAKENHI Grant Number JP16H02794.

References

[1] RISC-V Foundation, “RISC-V | Instruction Set Architecture (ISA).” https://riscv.org/.
[2] Xilinx, MicroBlaze Processor Reference Guide, v2018.2 ed., June 2018.
[3] Intel, Nios II Processor Reference Guide, April 2018.
[4] A. Waterman, Y. Lee, D.A. Patterson, and K. Asanović, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.1,” Tech. Rep. UCB/EECS-2016-118, EECS Department, University of California, Berkeley, May 2016.
[5] R. Höller, D. Haselberger, D. Ballek, P. Rössler, M. Krapfenbauer, and M. Linauer, “Open-Source RISC-V Processor IP Cores for FPGAs — Overview and Evaluation,” 2019 8th Mediterranean Conference on Embedded Computing (MECO), pp.1–6, June 2019.
[6] K. Asanović, R. Avizienis, J. Bachrach, et al., “The Rocket Chip Generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, April 2016.
[7] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, “Chisel: Constructing hardware in a Scala embedded language,” DAC Design Automation Conference 2012, pp.1212–1221, June 2012.
[8] SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation.” https://github.com/SpinalHDL/VexRiscv.
[9] SpinalHDL, “SpinalHDL: An open source high-level hardware description language.” https://github.com/SpinalHDL/SpinalHDL.
[10] RISC-V Foundation, “RISC-V SoftCPU Contest, October 8, 2018.” https://riscv.org/2018/10/risc-v-contest/.
[11] University of California, Berkeley, “riscv-mini: Simple RISC-V 3-stage Pipeline in Chisel.” https://github.com/ucb-bar/riscv-mini.
[12] University of California, Berkeley, “The Sodor Processor: educational microarchitectures for risc-v isa.” https://github.com/ucb-bar/riscv-sodor.
[13] University of Cambridge, “Clarvi: simple RISC-V processor for teaching.” https://github.com/ucam-comparch/clarvi.
[14] D.A. Patterson and J.L. Hennessy, Computer Organization and Design The Hardware / Software Interface, RISC-V Edition, Morgan Kaufmann, 2018.
[15] S. McFarling, “Combining branch predictors,” Tech. Rep., Technical Report TN-36, Digital Western Research Laboratory, 1993.
[16] P. Metzgen, “A High Performance 32-bit ALU for Programmable Logic,” Proc. 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA ’04, New York, NY, USA, pp.61–70, ACM, 2004.
[17] Xilinx, HDL Synthesis for FPGAs Design Guide, 1995.
[18] D.A. Jimenez, “Reconsidering complex branch predictors,” Proc.
The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003, pp.43–52, Feb. 2003.
[19] K. Matsui, M.A. Islam, and K. Kise, “An Efficient Implementation of a TAGE Branch Predictor for Soft Processors on FPGA,” 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.108–115, Oct. 2019.
[20] Digilent, Inc., Nexys 4 DDR Reference Manual, rev.c ed., 2016.
[21] R.P. Weicker, “Dhrystone: A Synthetic Systems Programming Benchmark,” Commun. ACM, vol.27, no.10, pp.1013–1030, Oct. 1984.
[22] EEMBC, “CoreMark | CPU Benchmark – MCU Benchmark.” https://www.eembc.org/coremark/.
[23] RISC-V Foundation, “riscv-tests.” https://github.com/riscv/riscv-tests.
[24] UC Berkeley Architecture Research, “Setup scripts and files needed to compile CoreMark on RISC-V.” https://github.com/riscv-boom/riscv-coremark.
[25] C. Celio, “A Highly Productive Implementation of an Out-of-Order Processor Generator,” Ph.D. thesis, EECS Department, University of California, Berkeley, Dec. 2018.
[26] S. Manne, A. Klauser, and D. Grunwald, “Pipeline gating: speculation control for energy reduction,” Proc. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235), pp.132–141, 1998.
[27] E. Matthews, Z. Aguila, and L. Shannon, “Evaluating the Performance Efficiency of a Soft-Processor, Variable-Length, Parallel-Execution-Unit Architecture for FPGAs Using the RISC-V ISA,” 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp.1–8, 2018.
[28] S. Mashimo, A. Fujita, R. Matsuo, S. Akaki, A. Fukuda, T. Koizumi, J. Kadomoto, H. Irie, M. Goshima, K. Inoue, and R. Shioya, “An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor,” 2019 International Conference on Field-Programmable Technology (ICFPT), pp.63–71, 2019.

Md Ashraful Islam graduated from the University of Rajshahi, Bangladesh. He is a 1st-year doctoral student at the Tokyo Institute of Technology. He has 8 years of experience in the semiconductor industry in ASIC and SoC design and verification. His research interest is in computer architecture, especially processor design and memory sub-system design.

Kenji Kise received the B.E. degree from Nagoya University in 1995, and the M.E. degree and the Ph.D. degree in information engineering from the University of Tokyo in 1997 and 2000, respectively. He is currently an associate professor of the School of Computing, Tokyo Institute of Technology. His research interests include computer architecture and parallel processing. He is a member of ACM, IEEE, IEICE, and IPSJ.