
External Memory Controller for Virtex II Pro

2006, 2006 International Symposium on System-on-Chip

An implementation of an On Chip Memory (OCM) based Double Data Rate external memory controller (OCM2DDR) for Virtex II Pro is described. The proposed OCM2DDR controller comprises a Data Side OCM (DSOCM) bus interface module, read and write control logic, a halt read module and the Xilinx DDR controller IP core. The presented design supports 16 MB of external DDR memory and 32-to-64-bit data conversion for single read and write operations. Our implementation uses 1063 slices of the Virtex2Pro FPGA and runs at 100 MHz. The major benefits of the proposed design are high bandwidth to external memory with reduced and more predictable access times compared to the Xilinx PLB DDR controller implementation. More specifically, our read and write accesses are 2.44 and 4.25 times faster, respectively, than those of the PLB-based solution.

Blagomir Donchev
Department of Microelectronics, Technical University-Sofia
8, Kliment Ohridski, Bl.2, 1000, Sofia, Bulgaria
Email: [email protected]a.bg

Georgi Kuzmanov, Georgi N. Gaydadjiev
Computer Engineering, EEMCS, Delft University of Technology
Mekelweg 4, 2628CD Delft, The Netherlands
Email: {g.kuzmanov,g.n.gaydadjiev}@ewi.tudelft.nl

I. INTRODUCTION

The PowerPC (PPC) hard cores embedded in the Virtex II Pro Field Programmable Gate Arrays (FPGA) have two bus interfaces that can be used for memory access: the Processor Local Bus (PLB) and the On-Chip Memory controller (OCM) bus. The OCM bus supports an interface to on-chip Block RAM (BRAM) only. This type of RAM has short and uniform access times; however, its capacity is limited to the memory available on a single chip [1]. To access larger data volumes, a dedicated interface to external RAMs is needed but is not currently supported. PLB is the only solution provided by Xilinx for connecting external memories to the Virtex II Pro FPGA. Although PLB supports a variety of external memory types, such as SRAM, SDRAM, and DDR, and addresses larger storage capacities compared to OCM, it has one major drawback: PLB is not a dedicated memory interface but is based on the shared bus concept. This implies that each PLB-connected memory module has to compete for the bus resources with the other peripheral modules attached, which potentially leads to performance degradation.

The goal of this paper is to propose a dedicated memory design solution that solves both the access time limitation of the PLB and the storage capacity limitation of the OCM. The proposed solution to these design challenges is a memory controller hereafter referred to as the OCM2DDR controller. For our design, we consider Double Data Rate (DDR) dynamic RAM due to its best performance/cost ratio compared to static memories (SRAMs) and other dynamic memory types (e.g., SDRAM). The OCM2DDR controller consists of a module for input and output 32/64-bit data conversion, a Xilinx DDR controller (v1.11) [2], an addressing module and a control unit. The system is implemented on the Digilent XUP V2P development platform [3], which embeds a Virtex-II Pro XC2VP30 FPGA and 256 MB DDR RAM.

The key features of the proposed controller are:
• Communication with external DDR memory through the Data Side OCM Controller (DSOCM);
• Run-time adjustable read and write access times;
• 100 MHz operational frequency;
• Low resource utilization: 7.8% of the slices and 3.8% of the flip-flops of the XC2VP30 device;
• 4.25 times write speedup and 2.44 times read speedup compared to the Xilinx PLB DDR implementation.

The remainder of this paper is organized as follows: The motivation for this work is presented in Section II. Section III introduces the OCM2DDR controller organization and provides a short discussion of its modules and of the specific clock generation strategy utilized. The implementation results of the OCM2DDR controller are presented in Section IV. Finally, Section V summarizes the findings and presents the conclusions.

Fig. 1. Block diagram of the OCM to DDR controller

Fig. 2. Clock architecture and initialization chain

II. MOTIVATION

The PowerPC cores in the Virtex2Pro are supported by two memory interfaces: the OCM and the PLB. The timing and the protocols of these interfaces are conceptually different. In this section, we briefly discuss the differences between these two interfaces. Based on their advantages and drawbacks, we motivate the need for a controller that combines some advantages of both the OCM and the PLB.

OCM provides a dedicated interface between the PowerPC core and the on-chip BRAMs. Some key features of this interface are: separate Instruction Side OCM (ISOCM) and Data Side OCM (DSOCM); short and fixed access time to the BRAM memory.

PLB is based on IBM's 64-bit CoreConnect technology and uses an arbitration policy to control the slave devices attached to the bus. Some key features of this bus are: a 64-bit wide data bus; a 32-bit wide address bus; and 8-word cache line transfers. Xilinx provides several PLB-based external memory solutions, including a DDR SDRAM controller, which is a soft IP core with the following features [2]:
• PLB interface;
• Auto-refresh cycle generation;
• Single-beat and burst memory transactions;
• 32-bit and 64-bit DDR data widths;
• Error correction code (ECC).

Despite all PLB advantages, there exist two essential drawbacks: 1) low speed and 2) non-deterministic memory access times. A short comparison between the PLB and the OCM is presented below (for a more elaborate comparison one can refer to [4]):

Operating frequency: The PLB operating speed depends on the maximum operating frequency of the PLB arbiter and the FPGA IP blocks that are connected to it.
On the other hand, the OCM speed depends only on the amount of on-chip memory that is connected to it.

Shared vs. Dedicated: The PLB is a shared bus and allows up to sixteen masters and sixteen slaves. All devices connected to the PLB have to share the available bus bandwidth. There is no arbitration on the OCM bus because it is a dedicated interface.

Non-deterministic vs. Deterministic timing: The fact that the PLB must share its bandwidth among many masters and slaves makes its access times unpredictable. Because the OCM is a dedicated interface, it has deterministic timing.

It can be concluded that one considerable drawback of the PLB is the speed limitation imposed by the bus arbitration. Another severe PLB drawback is that the bus bandwidth is shared among all attached devices, which results in non-deterministic latencies. A positive feature of the PLB is its support for large memory sizes. In contrast to the PLB, the OCM bus speed depends only on the amount of connected BRAM. The OCM bus is dedicated and its timing is deterministic. A serious drawback of the OCM is that the supported memory capacity is limited to the available on-chip BRAMs. Moreover, Xilinx does not provide any dedicated interface to external memories similar to the one it provides to the internal ones through the OCM. This causes severe problems when fast and uniform access to external memory is required. The above observations indicate the origin of the serious design problems that arise when fast external memory accesses are required.

These design problems motivated our research towards finding a performance-efficient interface solution between the Virtex2Pro embedded PPCs and external memories with large storage capacities. More specifically, we propose a design that combines the high speed and deterministic timing of the OCM interface on the one hand with the PLB's support for external memory on the other.

III. OCM2DDR CONTROLLER ORGANIZATION

The block diagram of the OCM2DDR controller is shown in Figure 1. The OCM2DDR controller consists of the following modules:

DSOCM interface: DSOCM is a data memory controller, which is integrated in the PPC. It accepts an address and associated control signals from the processor during a load instruction, and passes a valid address to the OCM2DDR controller. For store instructions, valid addresses from the processor are accompanied by the data and by the associated control signals.

Control unit: Consists of logic for generating read/write requests to the DDR, the chip select and read/write signals to the DDR, and the halt logic driving the PPC. Read and write operations are determined by the OCM EN and OCM BW signals. During a read operation the PPC has to be halted until the DDR provides valid data.

Driver unit: Provides address conversion from the DSOCM format to the format required by the DDR controller.

Input/Output Data Buffer: This buffer converts data between the 32-bit OCM data bus format and the 64-bit DDR data bus format, and is managed by the Control unit.

The main function of the OCM2DDR controller is to provide data communication between the PowerPC Core (PPC) and external DDR memory through the DSOCM. In case of a memory write, the PPC provides the data, the address and a write request through the DSOCM to the OCM2DDR controller. The OCM2DDR controller generates all required signals, with the required timing, for writing the data to the DDR memory. In case of a memory read, the PPC provides the address and a read request through the DSOCM, and the OCM2DDR controller generates the read request to the DDR.

Design considerations: The DSOCM controller is implemented in a setup with a single PPC. In our design, both the data and the instruction side are used: the instruction side is used to store the instruction segment of the program and the data side is connected to the OCM2DDR controller.
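The address conversion of the Driver unit and the width conversion of the Input/Output Data Buffer can be illustrated with a short sketch. The bit layout (two zero least significant bits, 22 DSOCM address bits, eight zero upper bits) is the one this paper gives in its signal translation discussion. The function names, the example values and the lane-selection rule (big-endian lane numbering, selected by address bit 2) are assumptions made for illustration only and are not taken from the OCM2DDR implementation.

```python
from typing import Tuple

DSOCM_ADDR_BITS = 22  # DSOCM address width stated in the paper
# Each DSOCM address selects one 32-bit word, so the reachable space is:
ADDRESSABLE_BYTES = (1 << DSOCM_ADDR_BITS) * 4


def dsocm_to_ipif_addr(dsocm_addr: int) -> int:
    """Map a 22-bit DSOCM word address to a 32-bit byte address.

    Bits [1:0] are zero (32-bit alignment), bits [23:2] carry the DSOCM
    address, and the remaining upper 8 bits stay zero.
    """
    if not 0 <= dsocm_addr < (1 << DSOCM_ADDR_BITS):
        raise ValueError("DSOCM address must fit in 22 bits")
    return dsocm_addr << 2


def place_word_on_64bit_bus(ipif_addr: int, word: int) -> Tuple[int, int]:
    """Return (data64, byte_enable8) for a single 32-bit write.

    Assumption (not from the paper): big-endian lane numbering, so the
    word at the lower address (bit 2 == 0) uses the upper bus half.
    """
    if not 0 <= word < (1 << 32):
        raise ValueError("word must fit in 32 bits")
    if (ipif_addr & 0x4) == 0:
        return word << 32, 0b11110000
    return word, 0b00001111


if __name__ == "__main__":
    addr = dsocm_to_ipif_addr(0x200000)  # hypothetical example address
    data64, byte_enable = place_word_on_64bit_bus(addr, 0xFF00AB00)
    print(hex(addr), hex(data64), format(byte_enable, "08b"))
```

Under these assumptions the 22-bit word address space covers 16 MB, which matches the capacity quoted in the abstract.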
Clock Architecture: There are two clock schemes recommended by application notes for Virtex II Pro DDR designs [5], [6]. In our design implementation, DCM circuits with local inversion [7] are used, as illustrated in Figure 2. The first DCM starts automatically at power on. When the first DCM is initialized, the second DCM starts. Additional DCM cores are linked together in this fashion to ensure that all clock signals are stable before the system boots up. By inserting the OCM2DDR controller into this chain, the system boot can be delayed until the DDR has been initialized.

Signal Translation: The OCM2DDR controller has to translate the signals provided by the OCM controller into the corresponding Intellectual Property Interface (IPIF) signals (supported by the Xilinx DDR controller) [8] and vice versa. This leads to the signal translation diagram shown in Figure 3. The IPIF has an address width of 32 bits, whereas the DSOCM has only a 22-bit address bus. Since the IPIF addresses are byte aligned and the DSOCM is 32-bit aligned, the two least significant bits of the IPIF address are set to zero and the 22 bits of the DSOCM address are placed directly above them. The remaining 8 bits are constantly set to zero. This means that every address of the DSOCM address space is mapped to a respective address of the DDR controller.

Fig. 3. Signal translation concept

The IPIF protocol uses a scheme called "Byte Steering". This means that the peripheral can address the memory space with byte alignment, but the data must be provided in compliance with the base bit alignment of the bus. In other words, the address is given as a byte address, but the byte mask and data are aligned to the width of the data bus (64 bits). The address generated by the DSOCM is always aligned to 32 bits. This conversion holds for both incoming and outgoing data, and the data mask has to be shifted accordingly. The masks of both the IPIF and the DSOCM hide the data on a byte level. The byte mask of the IPIF specifies the bytes that contain valid data. The DSOCM mask determines the bytes to be written to the BRAM. For write operations, this means that the byte mask can simply be copied. For read operations, however, the DSOCM byte mask is kept empty, while all the data bits on the bus are expected to be valid. The IPIF bus has separate read/write indicator signals, and its byte mask validates the data for both read and write operations. This means that in the case of a read operation the DSOCM byte mask is empty, but the translated IPIF byte mask should be completely asserted.

Because of read/write timing differences between the DSOCM and the DDR, it is necessary to halt the processor during a read operation for the time required by the DDR memory to provide valid data. A special logic circuit is developed to implement this feature.

IV. VIRTEX II PRO MAPPING

The proposed design has been implemented using Xilinx Platform Studio 7.1i [9]. Initially, the design was simulated with ModelSim 6.0 SE using a reference functional timing model of the Micron DDR 256 MB memory, provided by the vendor.

The implementation results of the OCM2DDR controller, presented in Table I, suggest that the hardware costs are trivial with respect to the available reconfigurable resources (8%). The reported delays suggest a maximum speed of 159.9 MHz. After implementation in an XC2VP30-7 (Digilent's XUP V2Pro board) [3], the design was tested at 100 MHz with two synthetic applications that write to and read from the DSOCM address space. One of them consists of single word (32-bit) write and read operations, and the second one consists of loops of memory initializations and linear write/read operations for 20 32-bit words.

Figure 4 and Figure 5 depict the simulation results of the OCM2DDR with the DSOCM in single cycle mode. Position 1 on both figures clearly indicates that the DDR access is completed within the OCM bus assertion. The DDR memory used and its simulation timing model have a CAS latency of two clock cycles. Because the DDR CS signal must be kept asserted longer than the DSOCM Enable signal, an internal counter was used, indicated by position 2 on Figure 5. Because of the difference between the times required for read/write operations by the PPC and by the DDR, it is necessary to halt the PPC during a read operation. The halt lasts for the time required by the DDR memory to provide the data, depicted by position 3 on Figure 4. This feature is implemented using simple logic based on a clock multiplexer primitive (BUFGMUX) [10]. The proposed solution follows the recommended techniques for clock synchronization given by Xilinx [11].

A severe concern is the fact that the DDR access time can vary greatly. To solve this problem, a run-time adjustable circuit for read and write operations was developed. The execution time of both operations is determined by the acknowledge signal generated by the internal DDR controller. Its behavior is indicated by position 4 on Figure 4.

For debugging purposes, an Input/Output interface based on the Xilinx OPB UART Lite IP core [12] is designed. Its parameters are the following: 115200 bit/s, 8 data bits, no parity check and no hardware/software corrections. In this implementation the CPU and the OCM2DDR controller run at 100 MHz, with an additional fixed phase shift of 60 degrees in the second DCM, applied for compensation purposes.
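To put the clock figures quoted in this section into time units, the following minimal sketch converts a clock frequency to a period and a phase shift in degrees to a delay. It only restates arithmetic on numbers given in the text (100 MHz, 60 degrees, 159.9 MHz); the helper names are hypothetical and it is not the phase shift calculation procedure of [13].

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


def phase_shift_ns(freq_mhz: float, degrees: float) -> float:
    """Delay in nanoseconds of a phase shift given in degrees of one period."""
    return period_ns(freq_mhz) * degrees / 360.0


if __name__ == "__main__":
    # Operating point reported for the board test: 100 MHz, 60 degree shift.
    print(f"period at 100 MHz:      {period_ns(100.0):.3f} ns")
    print(f"60 degree shift:        {phase_shift_ns(100.0, 60.0):.3f} ns")
    # Maximum speed suggested by the reported delays: 159.9 MHz.
    print(f"period at 159.9 MHz:    {period_ns(159.9):.3f} ns")
```

At 100 MHz one period is 10 ns, so the 60 degree shift corresponds to roughly 1.67 ns, and a maximum speed of 159.9 MHz corresponds to a clock period of about 6.254 ns.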
Specifically, the phase shift compensates for the delay of the external wires in the clock path. More details on how to calculate the proper phase shift are given in [13].

Fig. 4. Read data from DDR (simulation waveform; traced signals: sys_clk_pin, docm_bramdsocmrddbus, docm_dsocmbramen, docm_dsocmbrambytewrite, docm_dsocmbramabus, docm_dscntlvalue, dsocm_clk, isaligned, pulse, pulsgen_t, pulsgen, bus2ip_rdreq, ip2bus_rdack, ip2bus_data, bus2ip_be, bus2ip_rnw)

Fig. 5. Write data to DDR (simulation waveform; traced signals: sys_clk_pin, docm_dsocmbramwrdbus, docm_dsocmbramen, docm_dsocmbrambytewrite, docm_dsocmbramabus, docm_dscntlvalue, dsocm_clk, isaligned, pulse, pulsgen_t, pulsgen, ip2bus_data, bus2ip_wrreq, ip2bus_wrack, bus2ip_be, bus2ip_data, bus2ip_rnw, bus2ip_addr)

Table I presents the synthesis results for the proposed memory controller and provides a comparison to the Xilinx PLB DDR controller. The synthesis results indicate substantial savings of design resources, in the range of 17%-30%. The reason is the lower complexity of the OCM interface compared to the PLB. The last row of Table I suggests that our design exhibits a 30% shorter critical path and can therefore run at an approximately 1.6 times higher frequency than the PLB controller.

TABLE I
IMPLEMENTATION RESULTS

Used resources                OCM2DDR     PLB DDR     Difference
Number of Slices              1063        1246        17 % less
Number of Slice Flip Flops    1052        1367        30 % less
Number of 4 input LUTs        803         971         21 % less
Minimum clock period          6.254 ns    9.968 ns    60 % faster

Regarding performance, experimental results suggest that our OCM DDR controller takes 4 clock cycles for a single write operation and 9 clock cycles for a single read operation. In comparison, the PLB DDR controller takes 17 clock cycles for a write operation and 22 clock cycles for a read operation. The timing results of both implementations are compared in Table II. Note that we consider the worst-case scenario for the comparison, in which no bus arbitration takes place and only one PLB DDR controller is attached to the PLB bus. If bus arbitration is considered, the PLB latencies are expected to increase dramatically.

TABLE II
TIMING RESULTS AFTER SYNTHESIS

Timing parameters              OCM2DDR     PLB DDR      Speed up
Duration of write operation    4 cycles    17 cycles    4.25
Duration of read operation     9 cycles    22 cycles    2.44

V. CONCLUSION AND FUTURE WORK

In this paper we proposed a design of a controller that provides a dedicated interface to external DDR memory connected to the PowerPC cores of the Xilinx Virtex II Pro FPGAs. More specifically, we proposed high speed access to a large external storage capacity through the dedicated DSOCM bus of the Virtex2Pro. Compared to the traditional shared-bus approach (provided by the chip vendor) for connecting external memories, our dedicated controller performs in the worst case 2.44 times faster for read and 4.25 times faster for write operations. Synthesis results suggest a trivial hardware cost, measured at 8% of the XC2VP30. The proposed solution can be extended in the future with a cache module implementation, running as an L2 caching subsystem of reconfigurable processors such as MOLEN [14], [15]. The performance can be improved further by implementing burst access to the external memory and ECC functionality. The OCM2DDR controller can also be considered a universal solution for connecting IPIF compatible external memories (static and dynamic).

ACKNOWLEDGMENT

This research has been partially supported by the National Science Fund, Bulgarian Ministry of Education and Science, Project MU-X-02/29.07.2005.

REFERENCES

[1] “Virtex II Pro and Virtex II Pro X platform FPGAs: Introduction and overview,” Xilinx Corporation, DS083, Oct. 2005.
[2] “PLB Double Data Rate (DDR) Synchronous DRAM (SDRAM) Controller,” in Product Specification, Xilinx Corporation, DS425, Aug. 2004.
[3] “Xilinx University Program Virtex-II Pro Development System,” Hardware Reference Manual, UG069, Mar. 2005.
[4] K. Lund, “PLB vs. OCM Comparison Using the Packet Processor Software,” Xilinx Corporation, XAPP644, Oct. 2004.
[5] H. Winkler, “Clocking Strategy for a Virtex II Pro DDR SDRAM Controller,” in Array Electronics, http://www.array-electronics.de/doc.
[6] C. Cain, “Reference system: Mch opb ddr sdram with opb central dma,” Xilinx Corporation, XAPP912, Nov. 2005.
[7] “High-Speed Clock Architecture for DDR Designs Using Local Inversion,” Xilinx Corporation, XAPP685, Apr. 2004.
[8] “PLB IPIF (v2.02a),” Xilinx Corporation, DS448, Apr. 2005.
[9] “Embedded system tools reference manual,” Xilinx Corporation, UG111, Feb. 2005.
[10] “Libraries guide,” Xilinx Corporation, ISE 6.3, Sept. 2005.
[11] “PowerPC 405 processor block reference guide,” Xilinx Corporation, UG018, July 2005.
[12] “Opb uart lite (v1.00b),” Xilinx Corporation, DS422, May 2005.
[13] “Determining the Optimal DCM Phase Shift for the DDR Feedback Clock,” Xilinx Corporation, XAPP806, May 2005.
[14] S. Vassiliadis, S. Wong, G. N. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. M. Panainte, “The Molen Polymorphic Processor,” IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363–1375, Nov 2004.
[15] S. Vassiliadis, S. Wong, and S. D. Cotofana, “The molen ρµ-coded processor,” in 11th International Conference on Field-Programmable Logic and Applications (FPL), Springer-Verlag Lecture Notes in Computer Science (LNCS) Vol. 2147, August 2001, pp. 275–285.