
Data-flow multiprocessor with deterministic architecture


Novatsky A. A. 1), Glushko Je. V. 1), Chemeris A. A. 2), Pugachov O. S. 1)
1) National Technical University of Ukraine "KPI", Kiev, Ukraine
2) Pukhov Institute for Modeling in Energy Engineering, NASc of Ukraine, Kiev, Ukraine
[email protected]

Abstract

The structure of a multiprocessor with a modified deterministic data-flow architecture is proposed. Variants of the multiprocessor depending on the processor memory structure are considered. Simulation results are presented in the paper.

1. Introduction

It is clear that improving the performance of data processing, or speeding up calculations, requires the use of parallel computing tools and systems. This paper presents a deterministic data-flow architecture in which the flow of instructions is created by the compiler before program execution [1]. This architecture includes the main features of data-flow computers, but its determinism makes it simpler to use as an embedded system realized on an FPGA platform. A bottleneck in different processor architectures is the exchange of data between memory and processor. This paper therefore focuses on a more effective memory architecture designed for the modified data-flow multiprocessor.

2. Deterministic data-flow multiprocessor

The main idea of data-flow architecture was described in [2, 3]. This idea is used today in projects of well-known scientists; see, in particular, [4].

As the transfer of data between processor and memory strongly influences processor performance, let us consider some ways to modify the data-flow processor structure with respect to RAM usage. We consider three architecture modifications, presented in Figures 1, 2 and 3. The architectures differ in where memory is placed and how it operates within the processor.
We have proposed an architecture that is based on the main data-flow principles, but they are realized during the compilation process. Static analysis of the data-dependence graph gives a sequence of instructions to be executed. The compiler places instructions in memory as soon as their operands are ready, building the data-flow graph. This graph is then divided into parts, and the translator performs scheduling and mapping according to the processor structure and the number of processor elements.

Figure 1. The processor's structure with RAM of results

The following abbreviations are used: SM — system memory; SB — system bus; BI — bus interface; CU — control unit; DAM — destination-address memory (the set of addresses of the instructions to which a result has to be sent); PU1..PUn — processor units; MR — memory of results; MO — memory of operands; R1..Rn — registers; OB — bus of operands; RB — bus of results; BOR — bus of operands-results.

First of all, the compiled program is written to the SM; then, during program execution, the instructions travel to the CU via the SB and BI. Each of the considered data-flow architectures has its own memory features.

In the first structure (Fig. 1), the CU sends the operands of an instruction to processor unit PUi over the OB, followed by the corresponding instruction code. After executing the instruction, PUi writes the result to the MR over the RB. The CU then reads the result and, when required, sends it as an operand of subsequent instructions to various PUs. The CU gets the destination addresses from the DAM.

The structure in Fig. 2 differs from the first in that a set of registers is placed next to every PU. These registers replace the memory where the operands are stored. In an FPGA or ASIC realization, the access frequency of registers is higher than that of external RAM.

The third architecture has separate memories for operands and results. After executing an instruction, the PU writes the result to the MR. The CU then sends it to the corresponding cells of the MO. The CU takes the destination addresses from the special memory unit DAM. All these memories are external.
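As a minimal illustration of the compile-time step described above (our own sketch, not the authors' translator), the following Python fragment marks an instruction ready as soon as all of its operands are ready, which yields a static data-flow schedule grouped into waves of instructions that could run on the PUs in parallel. The instruction names and the graph representation are hypothetical.

```python
# Sketch of static data-flow scheduling: an instruction becomes ready
# once every instruction producing one of its operands has executed.
from collections import deque

def dataflow_schedule(instructions):
    """instructions: dict name -> list of producer-instruction names.
    Returns the instructions grouped into parallel 'waves'."""
    remaining = {name: set(deps) for name, deps in instructions.items()}
    # consumers[d] plays the role of the DAM: where result d must be sent
    consumers = {}
    for name, deps in instructions.items():
        for d in deps:
            consumers.setdefault(d, []).append(name)
    ready = deque(n for n, deps in remaining.items() if not deps)
    waves = []
    while ready:
        wave = sorted(ready)      # everything ready right now
        ready.clear()
        waves.append(wave)
        for done in wave:         # a finished result enables its consumers
            for c in consumers.get(done, []):
                remaining[c].discard(done)
                if not remaining[c]:
                    ready.append(c)
    return waves

# c(1,1) = a(1,1)*b(1,1) + a(1,2)*b(2,1): two independent multiplications,
# then an addition that depends on both products.
g = {"m1": [], "m2": [], "add": ["m1", "m2"]}
print(dataflow_schedule(g))   # → [['m1', 'm2'], ['add']]
```

The first wave contains both multiplications, so with two PUs they would be mapped to PU1 and PU2 in the same step, and the addition follows once both results are available.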
3. Simulation of performance

Figure 2. Architecture with the set of registers

To find the most effective data-flow architecture among the three presented, we simulate them using tests that are algorithms of matrix calculations. To compare the architectures we use an effectiveness coefficient: the ratio of the number of instructions executed by the processor to the execution time (in ticks). The simulation was done with the Stateflow tool of the Matlab environment. The simulation scheme for two processors is presented in Figure 4.

Figure 3. Architecture with RAMs of results and operands

We illustrate the simulation details with the matrix multiplication example, where

c(i, j) = Σ_{k=1}^{n} a(i, k) · b(k, j).

Assume that the multiprocessor has two PUs. In this example there are two instructions: addition and multiplication. Each instruction is executed by the processor in one cycle. In the first and third architectures, a read/write operation to the external memory takes ten processor cycles. The second architecture stores the result in a register; since the register and the PU are on the same chip, writing data to the register takes place in the same cycle as the instruction execution. In the first and third architectures, by contrast, the MR and MO chips are external memory, and access to such memory requires much more time than access to a register.

Figure 4 shows the model in Matlab. The units PU1, PU2 and Control Unit are Stateflow blocks. Each block contains a state machine that simulates two states, "busy" and "free". PU1 and PU2 are busy while executing instructions, and the Control Unit is busy while performing read/write accesses to the external memory. Transitions between states are driven by the flow of instructions (addition, multiplication) coming from the blocks Commands1, Commands2 and Commands3. If no instructions arrive at PU1, PU2 or the Control Unit, the unit remains in the "free" state.
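The cycle costs above can be turned into a back-of-the-envelope model of the effectiveness coefficient (this is our own simplification, not the paper's Stateflow model): every instruction takes one cycle, each result is written to memory and read back once by the CU, external memory access costs ten cycles in architectures 1 and 3, and a register access in architecture 2 is free within the execution cycle. The instruction counts follow directly from the matrix-product formula.

```python
# Simplified cost model for c(i,j) = sum_k a(i,k)*b(k,j) on one PU.
# Assumptions: 1 cycle per instruction; each result written once and
# read back once by the CU at mem_access_cycles per access.

def effectiveness(n, mem_access_cycles):
    """Effectiveness coefficient = instructions / ticks."""
    muls = n ** 3                # n^2 output elements, n products each
    adds = n ** 2 * (n - 1)      # n-1 additions per output element
    instructions = muls + adds
    ticks = instructions * (1 + 2 * mem_access_cycles)
    return instructions / ticks

# Architectures 1 and 3: 10-cycle external memory.
print(effectiveness(4, 10))   # ≈ 0.0476, i.e. 1/21
# Architecture 2: on-chip registers, no extra access cost.
print(effectiveness(4, 0))    # → 1.0
```

Even this crude model reproduces the qualitative outcome reported below: keeping operands and results in registers on the same chip improves the coefficient by an order of magnitude or more.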
Status information for PU1, PU2 and the Control Unit is displayed on the Scope1 unit. As mentioned above, the effectiveness coefficient is the ratio of the number of executed instructions to the execution time. For the second processor architecture this coefficient is the highest, which indicates the most effective calculations. This confirms that memory elements such as registers have to be designed on the same crystal as the processor elements. It means that locality of calculations is very important, and that developing optimizing compilers with parallelizing capabilities is a relevant task.

4. Conclusions

Three processor architectures applying the deterministic data-flow principle were considered. Using the example of matrix multiplication, it was shown that registers inside the processors increase the effectiveness of calculations considerably, approximately ten times or more. The proposed architecture assumes that the main work of compilation, parallelization and scheduling of calculations is done by the compiler. In further work it is important to design an FPGA implementation as well.

Figure 4. The multiprocessor model for matrix multiplication in the Matlab environment

Figure 5. The coefficient of efficiency for the presented processor architectures

To perform multiplication of two matrices, the compiler first builds the data-flow graph and schedules the instructions in the order of their readiness. Then, during execution, the processor sends them to the PUs according to the algorithm.

5. References

[1] W. Beletskyy, A. Chemeris, Je. Glushko, V. Viter. Organizations of determined calculations in data-flow computers // Modeling and Information Technologies, No. 11 (in Russian). Kiev, 2001, pp. 106-114.
[2] Dennis J. Data Flow Supercomputers // Computer, Vol. 13, No. 11, pp. 48-56, 1980.
[3] Dennis J. Data flow ideas for supercomputers // IEEE Computer Society, 28th International Conference, San Francisco, 1984, pp. 15-19.
So in the example the compiler builds the data-flow graph and cuts it into parts that are assigned to the processing elements. The chart in Figure 5 shows the coefficient of efficiency for the first, second and third architectures.

[4] Klimov A.V., Levchenko N.N., Okunev A.S. Model calculations with data-flow control as a means of solving the problems of large distributed systems // Second All-Russian Scientific Conference "Supercomputer Technology" (SKT-2012), 24-29 September 2012, Divnomorskoe, Gelendzhik area, pp. 303-307.