
Design and Optimization of Legacy Compatible Microprocessors

Technical Report No. CS-TR-02-002
December 2002

Brian F. Veale*, John K. Antonio*, and Monte P. Tull†
* School of Computer Science
† School of Electrical and Computer Engineering
University of Oklahoma, Norman, OK 73019
Tel: 405-325-8446  Fax: 405-325-4044
E-mail: {veale, antonio, tull}@ou.edu

Table of Contents

Abstract
List of Figures
List of Tables
1. Introduction
   1.1. Static Microprocessors
   1.2. Reconfigurable Microprocessors
   1.3. Summary
2. The IBM DAISY Microprocessor
   2.1. Overview
   2.2. Re-translation of Binary Machine Code
   2.3. Optimization of Tree Groups
   2.4. Special Hardware and Control Mechanisms
   2.5. Performance Evaluation
   2.6. Summary
3. The Transmeta Crusoe Microprocessor
   3.1. Overview
   3.2. Re-Translation of Instructions
   3.3. Optimization of Crusoe Instructions
      3.3.1. Removing the X86 Segmentation Process
      3.3.2. Removing Upper Boundary Memory Checks
      3.3.3. Common Sub-Expression Elimination
      3.3.4. Removing Commit Operations
      3.3.5. Register Renaming
      3.3.6. Code Motion
      3.3.7. Data Aliasing
      3.3.8. Copy Propagation
      3.3.9. Using Alias Hardware
   3.4. Special Hardware and Control Mechanisms
   3.5. Exception Handling
   3.6. Summary
4. Comparison of the DAISY and Crusoe Microprocessors
   4.1. Overview
   4.2. Difficulties in Comparing the DAISY and Crusoe Microprocessors
   4.3. The Architectures of the DAISY and Crusoe Microprocessors
   4.4. The Re-translation Processes
   4.5. Scheduling Re-translated Operations
   4.6. Optimization of Re-translated Machine Code
   4.7. Handling of Special Situations
   4.8. Summary
5. Proposed Future Research Directions
   5.1. Overview
   5.2. An Architecture to Support Dynamic Translation with Reconfigurable Computing
   5.3. Instruction Set Analysis
      5.3.1. Overview
      5.3.2. Detecting Instruction Set Partitions with Clustering
      5.3.3. Overall Results of the Experiments and Future Work
   5.4. Summary
6. Conclusions
References

Abstract

Microprocessors can be divided into two main categories: (1) those implemented using static hardware, and (2) those implemented using reconfigurable hardware. In microprocessors that use reconfigurable hardware, the instructions supported and the circuitry that performs the instructions can be changed after fabrication. Before a program is run on a microprocessor, it is translated into binary machine code for the microprocessor.
There are two approaches to the translation of program code into binary machine code: (1) static and (2) dynamic. In the static translation approach, the program is translated (i.e., compiled) into binary machine code and then the microprocessor executes it directly. In the dynamic translation approach, the microprocessor executes programs that have been initially translated into binary machine code for a different microprocessor, by re-translating the initial binary machine code at execution time. At the beginning of this report, a taxonomy of the different types of microprocessors (based on these classifications) is presented. The focus of this report is the design and implementation of the IBM DAISY and Transmeta Crusoe microprocessors. Both of these microprocessors use the dynamic translation process to execute programs originally compiled for the PowerPC and Intel X86 microprocessors, respectively. This presentation of the DAISY and Crusoe microprocessors is followed by a comparison of these two microprocessors. Finally, areas for future research are identified and discussed at the end of this report.

List of Figures

Figure 1. A taxonomy of microprocessors and the translation processes they use.
Figure 2. The static translation process for a static microprocessor.
Figure 3. The dynamic translation process for a static microprocessor.
Figure 4. The static translation process for reconfigurable microprocessors.
Figure 5. The components of a DAISY microprocessor [6].
Figure 6. The clustered VLIW processor core of the DAISY microprocessor [10, 6].
Figure 7. The tree-based instruction flow control model and instruction format [10].
Figure 8. The DAISY Instruction Pipeline [10].
Figure 9. The dynamic translation process used by the DAISY microprocessor derived from [4, 6].
Figure 10. Example PowerPC code and corresponding VLIW instructions and tree group [6].
Figure 11. Example of Copy Propagation between PowerPC instructions with no real dependencies [12].
Figure 12. Example of Load-Store Telescoping optimizing PowerPC code [12].
Figure 13. Example PowerPC code and the corresponding translated VLIW Code [6].
Figure 14. The components of a Crusoe based system [7].
Figure 15. Architecture of the Crusoe microprocessor [7].
Figure 16. The gated store buffer used to buffer writes to memory and its associated registers [7].
Figure 17. The dynamic translation process used by the Crusoe microprocessor derived from [7].
Figure 18. Example C program and corresponding assembly-level code [7].
Figure 19. Example re-translated X86 code (in bold) with Crusoe operations required for each X86 instruction [7].
Figure 20. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of the X86 segmentation process [7].
Figure 21. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of upper boundary memory checks [7].
Figure 22. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of all commit operations except the one at the end of the code segment [7].
Figure 23. The segmentation of a simple dynamic translation process in which each segment represents a different configuration of the same reconfigurable hardware.
Figure 24. A microprocessor core that includes a reconfigurable execution unit.
Figure 25. A high-level view of a system that uses dynamic translation and reconfigurable hardware.
Figure 26. The K-Means clustering algorithm derived from [19].
Figure 27. The image created using POV-Ray for the clustering experiments.
Figure 28. Results of the first run of the K-Means clustering algorithm.
Figure 29. Results of the second run of the K-Means clustering algorithm.
Figure 30. The ten most frequently executed instructions.
Figure 31. The fifty most frequently executed instructions.
Figure 32. Results of the third run of the K-Means clustering algorithm.

List of Tables

Table 1. VLIW processor core configurations explored for the DAISY microprocessor [6].
Table 2. A comparison summary of the DAISY and Crusoe microprocessors.

1. Introduction

Microprocessor hardware can be divided into two main categories: (1) microprocessors implemented in static hardware; and (2) microprocessor implementations that include reconfigurable hardware. In a microprocessor implemented in static hardware, the circuitry is fixed and implements the original set of operations for which it was fabricated. However, in a microprocessor implemented using reconfigurable hardware, the operations performed by the reconfigurable circuitry can be changed after fabrication by configuring the reconfigurable hardware. A microprocessor based on reconfigurable hardware can be partially or completely implemented in reconfigurable circuitry; e.g., only the circuitry that performs arithmetic operations might be implemented using reconfigurable circuitry.

The rest of this section presents an overview of the microprocessor taxonomy illustrated in Figure 1. In addition to categorizing the type of hardware used to implement the microprocessor, a distinction is made in how code is translated, i.e., statically or dynamically.
Figure 1. A taxonomy of microprocessors and the translation processes they use.

1.1. Static Microprocessors

In a static microprocessor, the instruction set that can be executed is fixed and the architecture of the underlying hardware is fixed. Examples of static microprocessors include the Intel X86 family of microprocessors [1] and the PowerPC microprocessor [2].

The static translation process, which is the typical code development and execution process for static microprocessors, is shown in Figure 2. The source code is constructed using a high-level language, e.g., C++. The compilation process takes in source code and produces binary machine code (commonly referred to as machine code) for the target microprocessor. In the model of Figure 2, note that the process of translating source code into machine code occurs before execution begins on the static microprocessor.

Figure 2. The static translation process for a static microprocessor.

In addition to the typical static translation process, there exist static microprocessors that perform the translation process dynamically, at the same time that execution of the machine code occurs. The generic code development and execution process for a microprocessor that performs dynamic translation is shown in Figure 3.

Figure 3. The dynamic translation process for a static microprocessor.

In dynamic translation, as shown in Figure 3, the source code is developed as before using a high-level language. The compilation process takes in the source code and produces machine code for an initial target microprocessor. This initial target may be associated with an actual physical microprocessor or it may be associated with a virtual microprocessor. (For example, Java source code is initially targeted to binary Java Virtual Machine (JVM) code [3].) The machine code for the initial target microprocessor is re-translated into machine code for the final target microprocessor and optimized. Re-translation refers to the process of translating the machine code for the initial target microprocessor into machine code for the final target microprocessor; optimization refers to techniques used to change and re-order the execution of instructions contained in machine code in order to speed up execution of the instructions. The re-translation and optimization step can be performed in software or hardware, as illustrated in Figure 1.

Two examples of systems that perform the re-translation and optimization step in software are the JVM [3] and Dynamo [4]. When a Java program is executed on a static microprocessor, the initial machine code, which is called Java byte code, is re-translated into the machine code for the target microprocessor using the JVM, which is implemented in software [5]. In the Dynamo system, the initial and final targeted microprocessors are actually the same. However, when the initially compiled code is executed, the Dynamo software dynamically re-translates and optimizes the initial machine code, with the objective of producing code that executes faster [4].
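To make the dynamic translation loop concrete, the following is a minimal, self-contained Python sketch. The toy instruction set, the closure-based "native code," the translation cache, and all names are hypothetical illustrations, not the mechanism of any of the systems discussed in this report.

```python
# A toy dynamic translator: each initial-target instruction is re-translated
# on first encounter and the result is cached for later executions. In a
# program with loops, later iterations would hit the cache instead of
# re-translating.

def translate(instr):
    """Re-translate one source instruction into 'native' code (a closure)."""
    op, *args = instr
    if op == "LI":                    # load immediate: LI reg, value
        r, v = args
        return lambda regs: regs.__setitem__(r, v)
    if op == "ADD":                   # ADD dest, src1, src2
        d, a, b = args
        return lambda regs: regs.__setitem__(d, regs[a] + regs[b])
    raise ValueError(f"unknown opcode {op}")

def run(program):
    cache = {}                        # translation cache: pc -> translated code
    regs = {}
    for pc, instr in enumerate(program):
        if pc not in cache:           # translate only on first encounter
            cache[pc] = translate(instr)
        cache[pc](regs)               # later encounters reuse the translation
    return regs

print(run([("LI", "r1", 2), ("LI", "r2", 3), ("ADD", "r3", "r1", "r2")]))
# {'r1': 2, 'r2': 3, 'r3': 5}
```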
The DAISY (Dynamically Architected Instruction Set from Yorktown) [6] and Crusoe [7] microprocessors are examples of static microprocessors that perform the re-translation and optimization step of the dynamic translation process in hardware. In these systems, the source code is not initially compiled for the DAISY or Crusoe microprocessor, but for a different static microprocessor. When the initial machine code is executed by DAISY or Crusoe, it is re-translated into machine code for the DAISY or Crusoe microprocessor and then executed by the microprocessor [6, 7]. This re-translation is performed in hardware. A main focus of this report is to overview and compare the DAISY and Crusoe systems (Sections 2 and 3).

1.2. Reconfigurable Microprocessors

In contrast to a static microprocessor, the instruction set and the underlying architecture of a reconfigurable microprocessor can be dynamic. This means that the instruction set, and the circuitry implementing particular instructions or functionality of the microprocessor, can be changed after fabrication of the microprocessor. An example of a reconfigurable microprocessor is the SPYDER (Reconfigurable Processor DEvelopment SYstem) microprocessor [8]. In the SPYDER microprocessor, the circuitry implementing all of the instructions is dynamic. New instructions can be created, and the implementation of current instructions can be changed, by providing a hardware description for the instructions in the form of binary configuration code that specifies how to configure the reconfigurable hardware [8].

The static translation process, which is the typical code development and execution process for reconfigurable microprocessors, is shown in Figure 4. The source code is constructed using a high-level language. The compilation process takes in source code and produces: (1) machine code for the target microprocessor and (2) a description of instructions to be implemented in the reconfigurable hardware to support the machine code. After the compilation process is finished, the synthesis process converts the descriptions of the instructions to be implemented in reconfigurable hardware into binary configuration code for the reconfigurable hardware. In the model of Figure 4, note that the process of translating source code into machine code and binary configuration code occurs before execution begins on a reconfigurable microprocessor.

Unlike the category of static microprocessors, there are no known examples of a reconfigurable microprocessor that uses a dynamic translation process. At the end of this report, future work is outlined in the direction of examining reconfigurable microprocessor architectures capable of dynamic translation.

Figure 4. The static translation process for reconfigurable microprocessors.

1.3. Summary

For the purpose of this study, microprocessors are implemented in either static or reconfigurable hardware. Two possible translation processes are defined: static and dynamic. In the static translation approach, the source code is compiled before execution on the microprocessor begins. In the dynamic translation approach, initial machine code is re-translated and/or optimized during execution on the microprocessor.

Microprocessors that perform dynamic translation have the advantage that they can execute machine code that was initially compiled for a different microprocessor. Microprocessors that perform static translation do not have to perform the re-translation and optimization step found in dynamic translation, and therefore may execute faster than a microprocessor that uses dynamic translation to execute the same machine code. Reconfigurable microprocessors have the potential advantage of being able to dynamically alter their instruction set and the way that instructions are performed.
However, current technology that supports reconfigurable microprocessors is slower than the technology used to create static microprocessors. The slower execution of reconfigurable hardware is one reason why reconfigurable technology has not been widely applied to microprocessors in the commercial market.

This report focuses on microprocessors based on the dynamic translation approach to source code compilation. The majority of the material is presented by providing details on the design of the hardware architectures of the DAISY [6] and Crusoe [7] microprocessors. Copies of [6] and [7] can be found in Appendices A and B. At the end of this report, a research idea dealing with the design of instruction sets and machine architectures is discussed. A research idea of how to combine the dynamic translation process with a reconfigurable microprocessor is also presented.

2. The IBM DAISY Microprocessor

2.1. Overview

The DAISY microprocessor [6] is a static microprocessor, developed by IBM, that uses the dynamic translation process of Figure 3. The goal of the DAISY microprocessor is to be completely compatible with the binary machine code of an existing commercial microprocessor; it was the first microprocessor developed exclusively for this purpose [6]. For the purpose of this study, the DAISY microprocessor presented is completely compatible with the machine code of the PowerPC microprocessor. However, the techniques used in the PowerPC version of the DAISY microprocessor can be applied to a host of different microprocessors, such as the Intel X86 and the IBM System/390, as well as virtual microprocessors such as the JVM [6].

A high-level component view of the DAISY microprocessor is shown in Figure 5. The architecture of the DAISY microprocessor is based on a VLIW (Very Long Instruction Word) processor core and is built on top of the PowerPC memory model and register file [6]. The white areas of Figure 5 represent PowerPC components of the system, and the black areas represent the DAISY-specific components of the system. Note that there are no PowerPC execution units; all processing is done in the block labeled VLIW Processor Core.

A microprocessor based on a VLIW processor core (such as the DAISY) packages multiple independent operations into one "very long" instruction for parallel execution in hardware [9]. Each operation is executed using a hardware circuit called an execution unit, also referred to as an Arithmetic Logic Unit (ALU), which can perform several different operations. The operation that is performed at any single point in time by an execution unit is specified using an operation code that is embedded in the VLIW instruction being executed. Such microprocessors use multiple execution units that can independently perform operations, allowing them to perform many operations at the same time. This approach of executing multiple operations in parallel allows for a high degree of ILP (Instruction Level Parallelism).

Figure 5. The components of a DAISY microprocessor [6].

The execution units in the VLIW processor core used by DAISY are clustered, as shown in Figure 6. Each of the clusters contains four execution units and two load/store units. A cluster is the basic building block within DAISY, and the processor core used in this study has four clusters [6]. The advantages of the clustered design are that: (1) the processor core has a high execution bandwidth; and (2) it can run at a high clock frequency.
One disadvantage of this approach is that if an operation is encountered that depends on another operation scheduled on a different cluster, a one-cycle delay occurs in the processing of the dependent operation [6].

In the DAISY microprocessor, instructions are tree-based and implement a multi-way path selection scheme [6]. The flow control model for a tree-based instruction is given in Figure 7. The multi-way path selection scheme allows the dynamic translation process to aggressively re-translate and optimize programs that contain multiple paths of flow and benefit from branch prediction. Each DAISY VLIW instruction can specify up to sixteen concurrent operations [6]. In the model of Figure 7, each path can consist of any subset of the sixteen operations. The condition codes (ccA, ccB, and ccC) determine which path is taken and which instruction is performed next [10].

Figure 6. The clustered VLIW processor core of the DAISY microprocessor [10, 6].

Figure 7. The tree-based instruction flow control model and instruction format [10].
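The multi-way path selection of Figure 7 can be sketched as follows. The condition-code names (ccA, ccB) follow the figure, but the tree encoding and everything else here is a hypothetical illustration, not DAISY's instruction format.

```python
# A sketch of multi-way path selection in a tree-based instruction: condition
# codes are tested in tree order to pick which subset of the instruction's
# operations takes effect (i.e., which path is taken).

def select_path(cc, tree):
    """tree: ('leaf', ops) or ('test', cc_name, taken_subtree, fallthrough)."""
    if tree[0] == "leaf":
        return tree[1]                       # the operations on this path commit
    _, cc_name, taken, fallthrough = tree
    return select_path(cc, taken if cc[cc_name] else fallthrough)

tree = ("test", "ccA",
        ("leaf", ["op1", "op2"]),
        ("test", "ccB", ("leaf", ["op3"]), ("leaf", ["op4", "op5"])))
print(select_path({"ccA": False, "ccB": True}, tree))   # ['op3']
```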
The execution process for DAISY VLIW instructions is shown in Figure 8. The process is implemented in hardware, as a pipeline, and is segmented into sets of tasks, called stages. In the first stage, called the instruction fetch (IF) stage, a block of four consecutive instructions is read in from memory and a 4×1 multiplexer chooses the instruction that is to be performed. The next stage, called the execute (EX) stage, combines three tasks: (1) the fetching of the operands from the register file; (2) execution of the sixteen operations; and (3) evaluation of the tree form, which takes place in the branch unit. This stage determines the path of the tree-based instruction that is taken. In the final stage of the pipeline, the write back (WB) stage, the results of the operations on the taken path are written back to the register file [10]. The stages are performed concurrently, thereby increasing the instruction throughput of the processor core and helping to increase the overall speed of the microprocessor.

Figure 8. The DAISY Instruction Pipeline [10].

The DAISY microprocessor performs the re-translation and optimization step of Figure 3 by re-translating machine code (compiled for a different static microprocessor) into groups of DAISY instructions (called instruction groups) that are in the form of machine code for the DAISY microprocessor. As execution of machine code on the DAISY microprocessor continues, if previously re-translated instruction groups are encountered frequently, then they are optimized. This process of re-translation and optimization is depicted in Figure 9 [6].

Overviews of the underlying hardware architecture of the DAISY microprocessor and the dynamic translation process have been presented in this subsection. In the next subsection, a discussion of how the system performs the re-translation of machine code is provided. Subsection 2.3 defines the process of optimization, followed by an overview of special hardware and control mechanisms that are provided in the DAISY system in Subsection 2.4. Finally, a performance evaluation of the DAISY microprocessor is presented in Subsection 2.5.

2.2. Re-translation of Binary Machine Code

The DAISY microprocessor uses a Virtual Machine Monitor (VMM), shown in Figure 5, to handle the re-translation process. The VMM also handles control of the microprocessor, including exception handling, and is transparent to the binary machine code of the initial target microprocessor [6].

Figure 9. The dynamic translation process used by the DAISY microprocessor derived from [4, 6].

In the DAISY microprocessor, instruction groups take the form of a tree and are called tree groups. A tree group is a high-level abstraction of a group of VLIW instructions that models the natural flow of instruction execution (the control path) through a program. This control path defines the tree properties of a tree group. Control paths can only merge on the boundary between tree groups (the transition from one tree group to another). Each of the leaves of the tree corresponds to an exit point in the tree and is called a "tip." By knowing which of the tips was used to exit the tree, the system can determine the path taken through the tree [6].

An example segment of PowerPC code and the corresponding VLIW instructions and tree group are shown in Figure 10. In the figure, the PowerPC code is packed into four VLIW instructions. The contents of the VLIW instructions depend on where branches occur in this example. Note that a tree group does not necessarily contain only four VLIW instructions. This particular example shows a group of VLIW instructions, which is independent of the four paths within a given VLIW instruction (see Figure 7). The resulting instructions and tree group, shown in the figure, have not been optimized at this point in the execution process.

The first time a segment of machine code is encountered, it is re-translated from PowerPC code into machine code for DAISY and is added to an instruction group, as shown in Figure 9. Once a stopping point for the group is found that meets certain requirements, the group is executed by the DAISY microprocessor. The selection of stopping points for tree groups in DAISY is governed by a set of simple principles. A tree group can end at the target of a backward branch (which is usually the beginning of a loop), at a subroutine entry, or at a subroutine exit. Note that subroutine entries and exits can only be determined heuristically, by examining the branch, link, and register-indirect branch instructions of the PowerPC code. Additionally, tree groups can span processor pages, protection domains, and indirect branches (which are handled by using runtime information to replace the branch with a set of conditional branches) [6].

Figure 10. Example PowerPC code and corresponding VLIW instructions and tree group [6].

Finding a stopping point does not mean that the tree group will end at such a point. Either the desired ILP must have been reached for the tree group, or the tree group must reach a certain "window size," before the tree group is ended. The term "window size" refers to the number of PowerPC operations found on the path being considered from the root of the tree group. The condition on the ILP is aimed at attaining maximum performance, and the condition on the "window size" is meant to limit code explosion. Both of these limits are dynamically adjusted according to the frequency of code execution. This approach also has the benefit of implicitly performing loop unrolling [6].
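A sketch of the stopping rule just described is below. The threshold values, field names, and operation categories are illustrative assumptions, not values from the DAISY implementation [6]; in DAISY both limits are adjusted dynamically according to execution frequency.

```python
from dataclasses import dataclass

# A sketch of the tree-group stopping rule: a group may end only at a
# candidate stopping point, and only once the ILP goal is met (performance)
# or the window limit is reached (limits code explosion).

@dataclass
class Op:
    kind: str   # e.g. "add", "backward_branch_target", "subroutine_entry"

STOP_CANDIDATES = {"backward_branch_target", "subroutine_entry", "subroutine_exit"}

def should_end_group(op: Op, path_ilp: float, window_size: int,
                     ilp_goal: float = 3.0, max_window: int = 24) -> bool:
    if op.kind not in STOP_CANDIDATES:
        return False                      # not a legal stopping point
    return path_ilp >= ilp_goal or window_size >= max_window

print(should_end_group(Op("subroutine_entry"), path_ilp=3.4, window_size=10))  # True
```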
Tree groups are used as the unit of translation in the re-translation process. This helps to simplify the scheduling of speculative operations, because any predecessor instruction dominates all its successors. Additionally, tree groups can have at most one reaching definition [6], meaning that if a variable is defined at any point within a tree group, then it cannot be re-defined within the same tree group [11]. This helps to simplify optimization (and scheduling) approaches [6].

Originally, processor pages were used as the unit of translation in DAISY. However, tree groups were adopted later, because it was discovered that paths through the program that are not frequently executed ended up being re-translated. This re-translation of infrequently executed code led to a large amount of unnecessary code and limited the ILP achieved by the microprocessor [6].

With the advent of tree groups, when a segment of code is encountered that has already been re-translated, the system merely branches to the corresponding tree group. In this situation, re-translation is not necessary. As before, if re-translated code is executed frequently, then it is optimized [6].

The process of re-translating code a certain number of times before it is optimized is beneficial to the overall performance of the system. First, the re-translation process acts as a filter that keeps rarely executed code from being optimized. The cost to optimize such code is wasted and will never be regained, because the system will not benefit from faster execution of the code in the future. Second, the re-translation process can be used to gather data about how to guide the optimization process. After a tree group has been encountered a set number of times, it is optimized [6].

2.3. Optimization of Tree Groups

As shown in Figure 9, once a threshold on the number of times to execute an un-optimized segment of PowerPC code is reached, the associated tree group is optimized. The goal of the optimization algorithms used in DAISY is to attain a significant level of ILP with a low overhead cost. The scheduling approaches are adaptive and a function of execution frequency and behavior. The optimizations used in these approaches include copy propagation and load-store telescoping [6].

As each operation is optimized, it is examined in-order (i.e., non-speculatively) and immediately placed into a VLIW instruction. At the same time, DAISY performs global VLIW scheduling on multiple paths and across loop iterations. If the resulting operations are scheduled in-order, then the results will be in the correct destination register after the operation is executed. However, if an operation is scheduled out-of-order (i.e., speculatively), then the result is placed into a hidden register that can only be seen by the DAISY microprocessor and not by the emulated PowerPC microprocessor. It is later copied into the correct destination register associated with the original in-order execution of the program [6].

Tree groups are initially created with moderate ILP and "window size" parameters. If the time spent on a path in a tree group is above a certain threshold, then the tree group will be extended and optimized again using a higher ILP goal and a larger "window size." This allows the translator to spend more time very aggressively optimizing frequently executed code, while still optimizing less frequent code at a moderate level [6].

The optimization process used by DAISY performs several optimizations. Two of the optimizations performed are copy propagation and load-store telescoping [6].

Copy propagation is a code transformation that first searches for operations following a copy operation that use the destination register of the copy operation as a source register. When such an operation is found, the source register of the operation is replaced with the source register of the original copy operation [11]. In the DAISY system, copy propagation is also used to recognize when instructions that use the same registers do not have any real dependence between them. For example, the PowerPC instructions in the code shown in Figure 11(a) use the same registers, but have no real dependence between them. Because there is actually no dependence between these instructions, they can be performed in parallel in the single VLIW instruction of Figure 11(b) [12].

Figure 11. Example of Copy Propagation between PowerPC instructions with no real dependencies [12].
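As an illustration of the transformation (not DAISY's implementation), the following toy pass propagates copies through straight-line three-address code; the tuple encoding is a hypothetical format.

```python
# A toy copy-propagation pass. Each instruction is (op, dest, *srcs); a
# 'copy' op has one source. After propagation, an operation that read the
# copy's destination reads the copy's source instead, so the false
# dependence disappears and the two can issue in the same VLIW instruction.

def copy_propagate(code):
    alias = {}                              # register -> the register it copies
    out = []
    for op, dest, *srcs in code:
        # Replace each source by the source of the copy that defined it.
        srcs = [alias.get(s, s) for s in srcs]
        # dest now holds a new value: drop any stale alias involving dest.
        alias = {d: s for d, s in alias.items() if d != dest and s != dest}
        if op == "copy":
            alias[dest] = srcs[0]           # record dest as an alias of src
        out.append((op, dest, *srcs))
    return out

before = [("copy", "r4", "r3"), ("add", "r5", "r4", "r2")]
print(copy_propagate(before))
# [('copy', 'r4', 'r3'), ('add', 'r5', 'r3', 'r2')] - the add no longer
# depends on the copy, as in Figure 11.
```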
Load-store telescoping is an optimization that looks for load operations that correspond to previous store operations. When such patterns are found, the dependencies among the instructions involved in the load-store chain can be re-arranged such that no load or store operations need to be performed. Figure 12 provides an example of using load-store telescoping to optimize PowerPC code. Assuming that none of the instructions between stw and lwz write to r1, and that no other store instructions between these two instructions write to 8(r1), the code of Figure 12(a) can be re-written as the VLIW instructions shown in Figure 12(b) [12].

Figure 12. Example of Load-Store Telescoping optimizing PowerPC code [12].

Load-store telescoping has the benefit of being able to remove load and store operations for values maintained in memory that are re-used every time the values are needed. This removes these operations from the critical path of the program, allowing programs optimized with this single technique to approach the performance of fully optimized code [12].
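A comparable toy sketch of load-store telescoping follows. It assumes straight-line code with no intervening writes to the base register or the stored address, which is the same assumption stated for Figure 12; the instruction shapes are hypothetical, not a real encoding.

```python
# A minimal sketch of load-store telescoping: ("store", val, off, base)
# writes register val to memory at off(base); ("load", dest, off, base)
# reads it back. A load matching an earlier store is rewritten as a copy
# of the stored register, removing the memory round trip.

def telescope(code):
    stored = {}                        # (off, base) -> register last stored
    out = []
    for ins in code:
        if ins[0] == "store":
            _, val, off, base = ins
            stored[(off, base)] = val
            out.append(ins)
        elif ins[0] == "load" and (ins[2], ins[3]) in stored:
            # Forward the stored register instead of reloading from memory.
            out.append(("copy", ins[1], stored[(ins[2], ins[3])]))
        else:
            out.append(ins)
    return out

code = [("store", "r3", 8, "r1"), ("add", "r5", "r5", "r6"), ("load", "r4", 8, "r1")]
print(telescope(code))
# [('store', 'r3', 8, 'r1'), ('add', 'r5', 'r5', 'r6'), ('copy', 'r4', 'r3')]
```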
The optimization process also performs re-scheduling of operations in order to increase the ILP of the optimized code. If a speculatively executed operation results in an incorrect execution (i.e., not the original in-order behavior), then an exception is raised and the results are corrected. Each time this happens a counter is incremented, and if a tree group has a large number of poorly scheduled speculative instructions, then the entire tree group will be re-scheduled conservatively, with these speculative operations scheduled in-order [6]. This adaptive approach allows the DAISY system to make mistakes in scheduling, due to the aggressiveness of the process, and still gracefully recover from and correct such mistakes.

The scheduling approaches also support optimistically re-arranging the order of load instructions, and must handle incorrectly scheduled loads appropriately. An exception is raised on a load operation whose target memory location has been altered between the point when the load is executed and when the result is committed. When an exception of this type is caught, the system takes corrective actions to ensure the load occurs correctly [6].

In Figure 13, the re-translated VLIW code for the PowerPC code segment of Figure 10 is shown. This example shows how the xor operation can be performed in the first VLIW instruction, and how the four VLIW instructions of Figure 10 can be compressed into two VLIW instructions. The movement of the xor operation shows how the DAISY re-translation process performs operations as early as possible, with the result placed in a renamed register (r63 in Figure 13) if the operation is moved to an earlier VLIW instruction. The results of the moved instruction are then placed into the correct PowerPC register (r4 in Figure 13) at the correct place in the re-translated code, as seen with the r4 = r63 operation. This mechanism allows the microprocessor to perform precise exception handling [6].

Figure 13. Example PowerPC code and the corresponding translated VLIW Code [6].

As a result of the optimization process, programs that are flat and do not have comparatively highly executed code fragments will not be optimized aggressively. This helps to preserve cache resources and reduce translation overhead [6].

2.4. Special Hardware and Control Mechanisms

There are several areas in which special support is provided to make the DAISY microprocessor completely compatible with the binary machine code of the PowerPC microprocessor without encountering performance degradation. Among these areas are exception handling and context switching mechanisms, support for handling register-indirect branches, and the ability to detect and handle self-modifying and self-referential program code.

An exception is an event that requires special processing that changes the normal flow of execution, e.g., division by zero [13]. An important feature of DAISY is its precise exception handling mechanism. When an exception is encountered while executing program code, the VMM determines the PowerPC instruction that was being performed when the exception occurred. Next, the actions that would be required by the PowerPC are performed. Finally, the microprocessor branches to the operating system code that handles the exception. However, if the instruction that caused the exception was being speculatively performed, special processing of the exception must be performed [6].

If a speculative operation causes an exception, then the register it writes to is tagged by setting an "exception tag" bit included in the register. This tag bit tells the VMM that the result in the associated register is incorrect. Then, if a non-speculative operation uses a tagged register, an exception is raised and the VMM handles the exception appropriately. This approach allows the optimization process to aggressively schedule instructions without affecting the exception behavior of the initial PowerPC machine code [6].
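The tag-bit mechanism can be sketched as follows, modeling each register as a value plus an exception tag; the class and function names are illustrative, not DAISY's.

```python
# A sketch of the speculative "exception tag" mechanism: a faulting
# speculative operation tags its destination register instead of trapping
# immediately; the deferred exception is raised only if a non-speculative
# operation later uses the tagged register.

class SpeculativeException(Exception):
    pass

regs = {}   # register name -> (value, exception_tag)

def write(reg, value, speculative, faulted):
    regs[reg] = (value, speculative and faulted)

def read(reg, speculative):
    value, tagged = regs[reg]
    if tagged and not speculative:
        # The deferred fault reaches in-order execution; the VMM would now
        # handle it precisely.
        raise SpeculativeException(f"deferred fault reaches {reg}")
    return value

write("r63", 0, speculative=True, faulted=True)    # speculative op faulted
try:
    read("r63", speculative=False)                 # in-order use: trap now
except SpeculativeException as e:
    print(e)
```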
A context switch occurs when the microprocessor switches from executing one program to another, saving the context of the current program to memory and loading the context of the new program in from memory [13]. The DAISY supports this mechanism by only using non-PowerPC registers as destination registers for speculative operations. As speculative operations write to registers, non-PowerPC registers are used, and their values are copied to the correct PowerPC register at the point in time that the operation would have written to the register if the PowerPC code were being executed in-order. This feature, combined with the precise exception handling mechanism, removes the need to save or restore non-PowerPC registers when a context switch occurs [6]. This means that DAISY does not have to do any special processing when a context switch occurs, and such an event is handled solely by the operating system.

Another area in which DAISY provides a specialized control mechanism is in the handling of register-indirect branches. A register-indirect branch is a branch in which the target of the branch is specified in a register. When this type of branch is encountered, the system uses the data in the specified register to determine the branch's target location within the machine code. When scheduling a register-indirect branch, the microprocessor does not know the branch target until the branch is executed. This can cause the optimization process to schedule such operations exclusively in-order (such that the branch is the only operation being performed) [9], which significantly impedes performance [6]. To avoid such serializations, the DAISY converts register-indirect branches into a series of conditional branches, followed by a register-indirect branch to ensure that the branch occurs correctly if the target is not provided by one of the conditional branches. If additional branch targets are discovered in the future, then the series of conditional branches is updated to test for the additional targets [6].

Self-referential and self-modifying program code can cause problems in emulated microprocessors, because the code is re-translated from one binary machine format to another and the self-referential or self-modifying code is not aware of the changes made. Examples of self-referential code include code that performs a checksum on itself, code that has constants intermixed within it, and relative branches. The handling of such code in the DAISY microprocessor is straightforward, because the PowerPC code can only refer to itself through the PowerPC registers; and in DAISY these registers contain the values they would contain if the program code were running on the microprocessor for which it was initially compiled [6].

The handling of self-modifying code is more complicated than that of self-referential code. The DAISY microprocessor handles this situation through the use of a "read-only" bit included in every unit of memory allocated to the PowerPC microprocessor. This "read-only" bit is hidden from the PowerPC microprocessor being emulated and tells the VMM when the tree group(s) associated with the unit of memory should be invalidated [6].

When machine code in memory is re-translated, the "read-only" bit for the unit of memory holding the code is set. Then, if a store operation occurs whose destination lies within a unit of memory whose "read-only" bit is set, the store is committed and the execution of the re-translated code is interrupted. Next, the VMM invalidates the tree group(s) associated with the modified memory (the destination memory of the store). Finally, the PowerPC code resumes execution with the instruction immediately following the store instruction, resuming the re-translation-optimization-execution cycle; and when the modified code is to be executed in the future, it will be re-translated again with the modifications in place [6].
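A sketch of this invalidation protocol, at an assumed page granularity and with illustrative names (DAISY's actual memory units and data structures are not specified at this level in [6]):

```python
# A sketch of the "read-only"-bit protocol for self-modifying code: pages
# holding re-translated code are marked, and a store into a marked page
# commits normally but invalidates the stale tree groups so the modified
# code is re-translated when next executed.

read_only = set()        # pages whose contents have been re-translated
translations = {}        # page -> tree groups translated from that page

def mark_translated(page, groups):
    read_only.add(page)
    translations[page] = groups

def on_store(page, memory, addr, data):
    memory[addr] = data                  # the store itself is still committed
    if page in read_only:
        translations.pop(page, None)     # invalidate stale tree groups
        read_only.discard(page)

mem = {}
mark_translated(page=3, groups=["tg1", "tg2"])
on_store(page=3, memory=mem, addr=0x3008, data=0x60000000)
print(translations)                      # {} - tree groups invalidated
```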
In the studies performed in [6], re-translation of code was used to filter out portions of code that are not frequently executed from the translation cache (this process was not used to gather program profiling information to help guide tree group formation and optimization in these studies). However, filtering of infrequently executed code was not found to result in better cache performance, as might be expected. The result of filtering is larger segments of machine code for the regions of the initial machine code that were ultimately re-translated. This increase in code segment sizes make the performance of the instruction cache a more important factor in the performance of the system than when filtering of code is not performed negating the savings in time from filtering out infrequently executed code [6]. The studies also discovered that where the tree groups are terminated has an effect on the dynamic path length of the tree groups. The dynamic path length is the average number of PowerPC instructions between the root and leaves of a tree group. This measure is important because longer paths give the translator more opportunity to speculatively schedule VLIW instructions [6] and increase the ILP achieved. However, these speculative operations are only useful if they lie on the path taken at execution time. Additionally, incorrectly predicted speculative instructions can reduce the dynamic path length. This can occur when the number of paths through the program code exceeds what can be covered by tree groups [6]. Thus, the selection of good stopping points for tree groups is directly correlated with performance of the system. Several different configurations of the VLIW processor core were studied for the DAISY project. The different configurations considered are listed in Table 1 and range from 4-issue processor cores to 16-issue processor cores. All of the execution units have support for arithmetic and logic operations, and one or two units per cluster can perform memory operations [6]. As might be expected, the wider configurations of the VLIW processor core were found to provide a significant improvement over competing microprocessors in terms of high clock frequency. However, the interesting result is that the narrower configurations also performed well. This result is due to the lower translation overhead of narrower configurations. Because of the lower translation overhead, these configurations have a very good CPI (CPI stands for clock 17 Cycles Per Instruction and is a measure of the average time needed to perform an instruction [9]) compared to current superscalar microprocessors [6]. (Superscalar microprocessors execute a varying numbers of instructions at the same time that are statically scheduled at compile time or dynamically scheduled by the microprocessor at execution time, while VLIW-based microprocessors attempt to execute a fixed number of instructions at the same time that are typically statically scheduled at compile time [9].) Additionally, the simplistic hardware of the narrower configurations, if implemented in silicon, should result in a higher frequency microprocessor than the implementation of the wider configurations [6]. Table 1. VLIW processor core configurations explored for the DAISY microprocessor [6]. 
                                          Configurations
                                   1     2     3     4     5     6
  Number of Clusters               1     1     2     2     4     4
  Number of ALUs/Cluster           4     4     4     4     4     4
  Number of L/S Units/Cluster      1     2     1     2     1     2
  Number of Branch Units           1     1     2     2     3     3
  I-Cache Size                     8K    8K    16K   16K   32K   32K

The performance studies of DAISY, presented in [6], indicate that the filtering out of infrequently executed machine code before it is optimized does not necessarily improve system performance, and that the optimization of DAISY machine code is expensive. Also, it was found that the ILP achieved is directly affected by how tree groups are formed. Experiments simulating different configurations of the DAISY microprocessor have also shown that both the wide and narrow configurations of Table 1 perform well. These studies have shown that most of the approaches used in the DAISY microprocessor are useful, while the filtering of machine code may not always be beneficial to the overall performance of the system.

2.6. Summary

The DAISY microprocessor uses the dynamic translation process of Figure 3 solely for the purpose of executing binary machine code compiled for a different microprocessor [6]. As a result of the approach taken to dynamic translation in the DAISY microprocessor, the processes used to re-translate and execute machine code are transparent to the machine code of the initial microprocessor, and the resulting microprocessor is completely compatible with the initial microprocessor.

The keys to the success of the approaches used in DAISY are that the system performs ILP extraction at execution time [6] and run-time profiling of program code. This results in a high level of performance due to the ability of the microprocessor to dynamically adapt the re-translated instruction code. This is a major improvement over the heuristic and profile-based approaches that static VLIW compilers use, which result in trade-offs being made to improve performance [6].

The DAISY project is innovative in its combination of a clustered VLIW processor core, tree-based VLIW instructions, tree groups as a unit of translation, and its scheduling and exception handling mechanisms. This work represents a new direction in which legacy-compatible microprocessor design may go in the future. In fact, Transmeta Corporation has already taken this general approach in producing a commercial line of Intel X86 compatible microprocessors [7], which is the topic of the next section.

3. The Transmeta Crusoe Microprocessor

3.1. Overview

The Crusoe microprocessor [7], developed and marketed by Transmeta Corporation, is in the same class of microprocessors as DAISY; i.e., it is a static microprocessor that performs the re-translation and optimization step of the dynamic translation process in hardware. This microprocessor is associated with the same high-level translation process as DAISY, which is illustrated in Figure 3. The goals of the Crusoe microprocessor are to be completely compatible with the machine code of the Intel X86 family of microprocessors [1] and to directly compete with these microprocessors in the marketplace. The Crusoe microprocessor achieves these goals with a unique hardware architecture, which includes enhanced support for re-translating X86 machine code into Crusoe machine code and executing the resulting machine code [7].
A high-level view of a Crusoe based system is shown in Figure 14. Similar to the DAISY, the Crusoe microprocessor is based on a VLIW processor core and is built on top of the X86 register file and memory model. A Crusoe based system can be divided into four parts: (1) the target application, which was initially compiled for an X86 microprocessor; (2) the target operating system (also initially compiled for an X86 microprocessor); (3) the Code Morphing process, which handles the re-translation and optimization of machine code, the maintenance of re-translated machine code in a translation buffer located in memory, and system control; and (4) the Morph host, which is the VLIW processor core of the microprocessor [7].

Figure 14. The components of a Crusoe based system [7].

The registers in the Crusoe consist of the same registers as the Intel X86, called official registers, and a set of working registers, some of which duplicate (or shadow) the official registers, as seen in Figure 15. As the Crusoe performs operations, it uses the working registers and preserves the previous state (i.e., the official state) of the emulated X86 microprocessor in the official registers. When a code segment boundary (e.g., a subroutine entry or exit) in the X86 machine code is encountered, the official state of the emulated X86 microprocessor is updated by copying the values of the working registers to the official registers. This mechanism is supported by an extra stage in the instruction pipeline of the microprocessor to avoid slowing down operation of the microprocessor [7]. This approach to executing re-translated machine code differs from the DAISY in that the Crusoe does not work directly on the registers containing the official state of the microprocessor (making rollback of operations performed on registers trivial), whereas the DAISY only uses extra registers for the speculative execution of operations.

Figure 15. Architecture of the Crusoe microprocessor [7].

Another important component of the Crusoe is the gated store buffer, shown in Figure 15, which buffers writes to memory by holding the address and data for each store to memory. This queue of memory stores temporarily holds memory state changes before they are committed to the official memory state of the emulated X86 microprocessor, as illustrated in Figure 16. This mechanism ensures that the state of the emulated microprocessor is correct at the time of interrupts and exceptions. The stores between the head of the queue and the gate pointer have already been committed to memory, and those between the gate pointer and the tail of the queue are those that have not been committed [7].

Figure 16. The gated store buffer used to buffer writes to memory and its associated registers [7].

Commit operations occur on code segment boundaries. When a commit operation occurs, the uncommitted stores are committed to memory and the gate pointer is moved to the tail of the queue. If a rollback operation is needed, e.g., for processing an exception, then the uncommitted stores are removed from the queue and the tail pointer is moved to the position of the gate pointer [7]. In contrast, the DAISY microprocessor does not provide a mechanism similar to the gated store buffer and must manually roll back any writes to memory, whereas the Crusoe only has to change the value contained in one register.
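The commit and rollback behavior of the gated store buffer can be sketched with a simple queue and a gate index; the structure below is an illustration of the idea, not Transmeta's implementation.

```python
# A sketch of the gated store buffer: stores queue up as uncommitted state,
# a commit moves the gate to the tail, and a rollback discards everything
# past the gate.

class GatedStoreBuffer:
    def __init__(self):
        self.queue = []      # (address, data) pairs, oldest first
        self.gate = 0        # entries before the gate are committed

    def store(self, addr, data):
        self.queue.append((addr, data))          # buffered, not yet official

    def commit(self, memory):
        for addr, data in self.queue[self.gate:]:
            memory[addr] = data                  # make the stores official
        self.gate = len(self.queue)              # move the gate to the tail

    def rollback(self):
        del self.queue[self.gate:]               # drop uncommitted stores

mem = {}
buf = GatedStoreBuffer()
buf.store(0x100, 7); buf.commit(mem)             # committed at a boundary
buf.store(0x104, 9); buf.rollback()              # e.g. an exception occurred
print(mem)                                       # {256: 7}
```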
The resulting machine code (in the translation buffer) is executed by the VLIW processor core. When a previously re-translated instruction is encountered again, the microprocessor can recall the corresponding operation(s) from the buffer and execute them without further re-translation [7]. The use of a translation buffer approach greatly improves the speed of the microprocessor because it does not have to fetch, decode, re-translate, optimize, re-order, and schedule operations every time they are executed [7]. This mechanism is similar to and serves the same purpose as the instruction cache found in the DAISY. The structure used for the translation buffer may be implemented in hardware (e.g., as an instruction cache) or in software (e.g., as a data structure residing in memory). As X86 machine code is executed on the Crusoe microprocessor, if the instruction being executed at any given time has not been re-translated (and does not exist in the translation buffer), then it is re-translated into machine code for the Crusoe microprocessor and optimized. This process is the re-translation step of Figure 3 and is shown in Figure 17 for the Crusoe [7]. Figure 17. The dynamic translation process used by the Crusoe microprocessor derived from [7]. The performance of the Crusoe microprocessor comes from the reduction in the amount hardware in the microprocessor (compared to the Intel X86 microprocessor) and the caching of re-translated machine code. This results in a possible speed up of the execution of program code and a reduction in the power consumption of the microprocessor [14]. Transmeta has not made conclusive performance benchmark results for the Crusoe microprocessor readily available. Instead of proclaiming a faster execution time for applications and operating systems, Transmeta focuses on proclaiming the lower power consumption rate of the Crusoe microprocessor compared to state of the art compatible microprocessors. According 22 to Transmeta, the Crusoe microprocessor consumes 60%-70% less power than other conventional microprocessors. Due to this reduction in power consumption, Transmeta focuses their efforts on the lightweight mobile computer and handheld markets [14]. An overview of the dynamic translation process and the underlying hardware architecture of the Crusoe microprocessor have been presented in this subsection. In the next two subsections, a discussion of how the system performs re-translation (Subsection 3.2) and optimization (Subsection 3.3) is provided. In Subsection 3.4, special control mechanisms and support is presented, followed by a discussion of exception handling in Subsection 3.5. 3.2. Re-Translation of Instructions The Crusoe microprocessor uses the Code Morphing process, shown in Figure 14, to perform the re-translation process. The Code Morphing process also optimizes the resulting machine code and handles control of the system, including exception handling [7]. Just as with the DAISY VMM, this process is transparent to the X86 machine code being executed on the Crusoe. The first time an X86 instruction is encountered it is re-translated into a sequence of Crusoe operations, as shown in Figure 17. As instructions are re-translated the different segments of Crusoe machine code that are generated are linked together so that they do not branch back to the Code Morphing process if the next segment to be executed has already been re-translated. 
This linking eliminates most of the branches back to the Code Morphing process and serves to enhance the speed of the emulated X86 microprocessor. Once the system has reached a steady state, it is estimated that a re-translation will only be necessary for one in every million X86 instructions executed over the life of a running program [7].

An example C program and the corresponding Intel X86 assembly code are shown in Figure 18. As the code is executed by the Crusoe microprocessor, each X86 instruction is re-translated into a series of Crusoe operations. These operations perform the X86 segmentation process, memory bound checking, and the operations required to create the results specified by the X86 instruction [7]. Figure 19 shows each X86 instruction followed by the necessary Crusoe operations. Unlike DAISY, the Crusoe does not use tree groups as the unit of translation, but instead uses X86 instructions. Additionally, although the instruction format used by the Crusoe is not specified in [7], there is no indication that the Crusoe uses an instruction format similar to the tree-based instructions of DAISY.

Figure 18. Example C program and corresponding assembly-level code [7].

In addition to re-translating X86 machine code into Crusoe operations, the Code Morphing process also optimizes the operations in an attempt to speed up the execution of instructions as much as possible. The optimization process is presented in the next subsection.

3.3. Optimization of Crusoe Instructions

The Code Morphing process not only re-translates X86 instructions into Crusoe operations, it also optimizes the operations using several techniques, including common sub-expression elimination, speculative removal of commit operations, and copy elimination. Such optimizations are performed on a re-translation only if the re-translation is executed frequently, because the time needed to re-translate and optimize infrequently executed instructions is greater than the time required to re-translate and execute the instructions without optimization.

Although [7] does not specify in detail how the Crusoe microprocessor determines which re-translations should be optimized, the following three criteria are discussed in [7]. First, a count of how many times a re-translation is executed is kept. If this count reaches some threshold, then an exception can be raised and the re-translation can be optimized at that time; this mechanism can be embedded into the re-translations as software (a sketch follows below). Second, the Code Morphing process can interrupt the execution of re-translations at a specified frequency and optimize the re-translation running at the time the system is interrupted, if it has not already been optimized. Finally, the Code Morphing process can simply optimize certain types of operations or sequences of operations (e.g., loops) [7].
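The first criterion can be pictured with the following minimal Python sketch. The threshold value and the optimize() hook are assumptions made for illustration; [7] specifies neither.

    OPTIMIZE_THRESHOLD = 50   # assumed value; [7] does not give one

    execution_counts = {}

    def on_translation_executed(x86_address):
        # Software counting embedded in the translation: bump a counter and
        # trigger optimization the first time the threshold is crossed.
        count = execution_counts.get(x86_address, 0) + 1
        execution_counts[x86_address] = count
        if count == OPTIMIZE_THRESHOLD:
            optimize(x86_address)

    def optimize(x86_address):
        # Placeholder: apply the optimizations of this subsection to the
        # cached translation for x86_address.
        pass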
Figure 19. Example re-translated X86 code (in bold) with Crusoe operations required for each X86 instruction [7].

The rest of this subsection presents each of the different optimizations that are specifically addressed in [7]. These optimizations are: speculatively removing the X86 segmentation process, speculatively removing upper boundary memory checks, common sub-expression elimination, speculatively removing commit operations, register renaming, code motion, data aliasing, copy elimination, and the use of alias hardware [7]. Note that the Crusoe optimization process may perform more optimization on code than what is described in [7].

3.3.1. Removing the X86 Segmentation Process

In a segmented memory model, the memory allocated to an application is divided into a group of independent sections of memory, called segments. In the segmented memory model used by the X86 microprocessor, the code, data, and stacks are assigned to separate segments. In contrast, in a flat memory model the memory is presented to the application as a single section of memory, called a linear address space, in which all of the code, data, and stacks are contained. The segmentation of memory increases the reliability of applications because it prevents a program from overwriting memory that has not been allocated to it. When a program using the segmented memory model accesses memory, the microprocessor must translate the address appropriately and ensure the correct segment is present in main memory. This process of translation and the mechanisms required to maintain the segments are transparent to an application running on an X86 microprocessor [1].

The Crusoe microprocessor does not include support for transparently translating logical addresses into addresses for a segmented address space, or the mechanisms to transparently maintain a segmented memory space. Therefore, the Crusoe inserts appropriate operations to perform these tasks as X86 machine code is initially re-translated. This optimization removes those operations by speculatively mapping all segments to the same linear address space. Note that the speculative removal of these operations requires the assumption that the program currently being optimized is written for a flat memory model, which removes the need for the segmentation process.

The example code in Figure 19 (which is the result of the re-translation process being performed on the X86 code of Figure 18) includes the operations necessary to support a segmented memory model. Assuming this code uses a flat memory model, the optimization process removes these extra operations, resulting in the code segment shown in Figure 20 [7]. As seen in these figures, this removes up to three Crusoe operations per X86 instruction.

Figure 20. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of the X86 segmentation process [7].

If the speculation that the application being optimized uses a flat memory model is wrong, then the execution of the optimized code will fail. When this happens, the Code Morphing process will roll back the state of the emulated microprocessor, and the operations that support the segmented memory model (that were removed) are re-inserted into the code segment. Finally, the code is re-executed with the newly inserted operations in place [7].
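The work performed by the inserted segmentation operations can be sketched as follows. This is a simplified Python model under stated assumptions (a base-and-limit segment with no presence or privilege checks); the field names are illustrative.

    class Segment:
        def __init__(self, base, limit):
            self.base = base      # linear address at which the segment starts
            self.limit = limit    # size of the segment in bytes

    def to_linear(segment, offset):
        # Translate a logical (segment-relative) address to a linear address,
        # checking the offset against the segment's limit.
        if offset > segment.limit:
            raise MemoryError("segment limit violated")
        return segment.base + offset

    # Under the flat-model speculation, every segment is assumed to have base 0
    # and a limit covering the whole address space, so to_linear() degenerates
    # to the identity function and the inserted operations can be removed.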
3.3.2. Removing Upper Boundary Memory Checks

Pages are virtual units of memory that allow the memory space of the microprocessor to be extended beyond that provided by the semiconductor-based memory devices used to implement main memory. When a page is needed, it is loaded into memory from a secondary memory device (e.g., a hard drive), and when it is no longer needed, it is stored back out to secondary storage [9]. It may be the case that a data unit does not completely fit in one page and is split among multiple pages. If such a splitting occurs, then the data is said to be unaligned; and when the associated data is referenced, all corresponding pages must be present in main memory.

The operations included in the re-translated X86 code include an operation that checks the logical address created by the code against the upper boundary of the address space for the memory segment being used. These memory checks are not removed as part of the optimization that removes the X86 segmentation process, because the memory reference may refer to unaligned memory, in which case the microprocessor must ensure that the correct virtual memory pages are present in main memory [7].

Under the assumption that the instructions and data of the application being optimized are correctly aligned, and assuming the application uses a flat memory model, the operations that perform the upper memory boundary checks (the operation chku) can be removed [7]. These operations are shown in the example code of Figure 20, which is the re-translated code corresponding to the X86 code of Figure 18 after the removal of the segmentation process. The further optimized code, without the upper boundary memory checks, is shown in Figure 21. This optimization removes one Crusoe operation per Intel X86 instruction.

If the speculation that the instructions and data of the application are correctly aligned is wrong (or if the assumption that the application uses a flat memory model is wrong), then the execution of the optimized code will fail. When this happens, the state of the emulated X86 microprocessor is rolled back and the operations removed during the optimization process are re-inserted into the code. Finally, the code is re-executed with the upper boundary memory checks included [7].

Figure 21. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of upper boundary memory checks [7].

3.3.3. Common Sub-Expression Elimination

The third optimization presented is common sub-expression elimination [7]. This optimization reduces the number of operations in a sequence by removing unnecessary re-computation of the same expressions [11]. This removal of common sub-expressions is performed on the re-translated machine code in order to further optimize the Crusoe machine code. Common sub-expression elimination does not require any assumptions about the nature of the application being re-translated, and the resulting re-translation will always execute correctly [7].
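A small before/after pair makes the idea concrete; the Python functions and variable names below are invented purely for illustration.

    def before(a, b, c, d):
        x = (a + b) * c   # (a + b) is computed here...
        y = (a + b) * d   # ...and needlessly re-computed here
        return x, y

    def after(a, b, c, d):
        t = a + b         # the common sub-expression is computed once
        x = t * c
        y = t * d
        return x, y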
3.3.4. Removing Commit Operations

The fourth optimization used by the Crusoe microprocessor is the speculative removal of commit operations from the re-translated Crusoe machine code. This optimization assumes that the associated machine code segment will not cause an exception, which allows the removal of the commit operations that update the state of the official registers and move uncommitted memory stores in the gated store buffer to memory [7].

The removal of commit operations is possible because the state of the emulated X86 microprocessor only needs to be correct when the operating system accesses the state of the microprocessor due to the occurrence of an exception. Thus, the commit operation only needs to occur at the end of a sequence of X86 instructions instead of after each instruction. This removes one Crusoe operation for every re-translated X86 instruction and replaces them with a single commit operation at the end of the sequence of re-translated X86 instructions [7]. These commit operations (commit) are shown in the code listing of Figure 21, and the resulting optimized code without them is shown in Figure 22.

When an exception does occur, the microprocessor invalidates the uncommitted state of the emulated microprocessor and re-translates the machine code executed since the last commit. In the new re-translation, a commit operation is performed after every sequence of Crusoe operations corresponding to each X86 instruction. Then, when the exception is encountered again, the state of the microprocessor is correct and the exception can be properly handled [7].

3.3.5. Register Renaming

The fifth optimization used by the Crusoe microprocessor is register renaming [7]. This optimization targets name dependencies between operations, which occur when multiple operations utilize the same register(s) but are not dependent on the data computed by the other operations involved [9]. When this occurs, the registers used by such operations can be renamed, removing any hardware dependencies between the operations and allowing them to be executed in parallel on different execution units [7, 9]. This optimization can have a significant impact on the level of ILP, depending on how many registers are available for renaming [9]. DAISY performs the same type of optimization by using copy propagation to detect name dependencies and register renaming to remove them so that operations can be performed in parallel.

3.3.6. Code Motion

Code motion is the sixth optimization used by the Crusoe microprocessor [7]. This is a loop optimization that targets expressions (operations) found in the body of a loop that yield the same result in each iteration. By moving such expressions outside the loop body, the number of operations executed to complete the loop is reduced [11].

Figure 22. Example re-translated X86 code (in bold) with the Crusoe operations required for each instruction after removal of all commit operations except the one at the end of the code segment [7].

3.3.7. Data Aliasing

The seventh optimization used by the Crusoe microprocessor is data aliasing. This optimization recognizes when several operations access the same locations in memory. When such a situation occurs, the Code Morphing process loads the values at the referenced addresses into registers. Then only a register-to-register copy needs to be performed for each operation that references any of the associated memory addresses. The copies of data from memory, residing in the registers, are marked as aliased, and when a change to the underlying memory is detected, an exception occurs. This optimization is beneficial because every load operation that involves moving data from the affected memory addresses is changed to a simple register-to-register copy, which executes much faster than a load from memory [7].

3.3.8. Copy Propagation

The eighth optimization performed is copy propagation. For copy propagation, the Code Morphing process removes unnecessary register-to-register copies by using the register in which the data originally existed wherever copies of that register are used. (Note that once this has been done, the original copy operation is removed, if it is no longer needed, using an optimization called dead code elimination [11].) This effectively reduces the number of cycles required to execute a segment of code [7].
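As with common sub-expression elimination, a before/after pair illustrates the effect; the Python below is an invented example, not Crusoe code.

    def before(a, c):
        b = a         # register-to-register copy
        return b + c  # uses the copy

    def after(a, c):
        # The use of b is replaced with a (copy propagation); the copy b = a
        # is then dead and is removed by dead code elimination.
        return a + c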
3.3.9. Using Alias Hardware

The final optimization considered in [7] is the use of alias hardware to remove store operations from loops. This optimization is similar to data aliasing, except that here the aliased registers are used to hold data to be stored into memory. This allows the store operations within the loop body to be replaced with register-to-register copies or register references instead of memory stores. Additionally, the actual stores to memory are moved such that they occur immediately after the end of the loop. This optimization speeds up the processing of loops by either replacing memory stores with register transfers or eliminating the store operations from the body of the loop altogether [7]. The optimization is similar to the load-store telescoping optimization used by the DAISY microprocessor [12].

3.4. Special Hardware and Control Mechanisms

The Crusoe microprocessor includes special hardware and control mechanisms for several different types of X86 program support: rollback operations, memory-mapped I/O (Input/Output), and self-modifying program code. The Crusoe supports rollback operations through the use of shadow registers and the gated store buffer, both presented in Subsection 3.1. The present subsection presents the support that the Crusoe provides for memory-mapped I/O and self-modifying code.

Memory-mapped I/O refers to the way that I/O devices are attached to the microprocessor: an I/O device can be mapped to a memory address so that the programmer can access it using typical memory reads and writes [9]. Due to the nature of memory-mapped I/O, it is impossible to distinguish ordinary memory instructions from memory-mapped I/O instructions [7]. Because memory-mapped I/O operations often must be performed in the precise order specified in the X86 machine code, a system normally must treat all memory operations with conservative assumptions so as to minimize the effect on memory-mapped I/O. In order to allow optimization of true memory operations, an A/N (Abnormal/Normal) protection bit is included in every address translation in the Translation Look-aside Buffer (TLB). This protection bit specifies whether accesses to the associated memory address are abnormal (i.e., accesses to memory-mapped I/O) or normal (i.e., accesses to regular memory). This mechanism allows true memory accesses to be speculatively re-translated [7].

When an access to memory is re-translated, the Code Morphing process initially treats it as a regular access to memory, which can be speculatively scheduled. Then, after the re-translation executes, the memory access type of the operation is compared against the A/N bit. If the access type used in the re-translation and the A/N bit disagree, then a memory-mapped I/O operation was performed and an exception occurs. When this happens, the Code Morphing process corrects the access type of the re-translation and rolls back any operations performed. Then, the correct re-translation is performed, scheduling the memory access operations in-order [7]. In contrast to the Crusoe, the DAISY microprocessor attempts to detect and schedule all memory-mapped I/O operations in-order. If an operation is not detected correctly, then the DAISY detects the operation when it is executed and correctly re-translates the affected operations [4].

In addition to the A/N bit, another type of protection bit, called a T-bit, is provided in every address translation in the TLB. The T-bit helps to guard against the effects of self-modifying code by specifying for which memory pages a translation exists. If self-modifying code is encountered, then the corresponding translation(s) must be invalidated and the new code re-translated before the code segment is executed again [7]. If a memory write occurs on a memory page for which the corresponding T-bit in the TLB is set, then an exception occurs and the microprocessor invalidates the translation(s). When the corresponding Intel X86 code segment is to be executed again, the translator is called and the code segment is re-translated. This mechanism can also be used to specify which memory pages the re-translations depend upon not being accessed by write operations [7]. The DAISY microprocessor supports self-modifying code by including a special "read-only" bit in every unit of memory [4].
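The two protection bits can be pictured with the following Python sketch. The TLB entry layout and the exception types are illustrative assumptions; the real bits live in the hardware address translations.

    class TLBEntry:
        def __init__(self, abnormal=False, translated=False):
            self.abnormal = abnormal      # A/N bit: True for memory-mapped I/O
            self.translated = translated  # T-bit: True if a translation covers
                                          # (or depends on) this page

    def on_memory_access(entry, assumed_normal, is_write):
        if assumed_normal and entry.abnormal:
            # A speculatively scheduled access hit memory-mapped I/O; the Code
            # Morphing process must roll back and re-translate with the memory
            # operations scheduled in-order.
            raise RuntimeError("A/N mismatch")
        if is_write and entry.translated:
            # A write to a page that translations depend on: invalidate them
            # so the (possibly self-modified) code is re-translated before it
            # runs again.
            raise RuntimeError("T-bit fault")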
3.5. Exception Handling

When an X86 exception is detected, the Crusoe microprocessor must ensure that the state of the emulated X86 microprocessor is exactly the same as it would be in the Intel X86 microprocessor. This is accomplished by rolling back the state of the working registers (copying the values in the official registers into the working registers, if needed) and removing any uncommitted stores from the gated store buffer. The uncommitted stores are removed by replacing the value in the register holding the tail pointer of the buffer with the value stored in the gate pointer [7].

After the working registers and the gated store buffer are successfully rolled back, the state of the emulated microprocessor is the same as it would be in the Intel X86 microprocessor at the beginning of the code segment that caused the exception. Next, the Code Morphing process re-translates, re-executes, and commits the results of the re-translation of each X86 instruction (in-order) until the exception occurs again. Once the exception occurs, the official state of the microprocessor is correct and the exception can be properly handled [7]. At the end of this process, the resulting re-translations can be stored in the translation buffer because they correctly handle the exception. Then, if the exception occurs in the future, the system will handle it correctly without having to re-translate the associated X86 code again [7].
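The recovery sequence can be sketched in Python as follows, reusing the GatedStoreBuffer model from Subsection 3.1. The register dictionaries and the retranslate_one() placeholder are assumptions made for the illustration.

    def retranslate_one(address):
        # Placeholder: return an executable translation of the single X86
        # instruction at `address`.
        return lambda: None

    def recover_from_exception(working, official, buffer, addresses):
        working.update(official)        # roll the working registers back
        buffer.rollback()               # discard uncommitted stores
        # Re-translate and execute one X86 instruction at a time, committing
        # after each one, so the official state is exact when the exception
        # recurs and is handed to the operating system.
        for address in addresses:       # instructions since the last commit
            retranslate_one(address)()
            official.update(working)    # commit register state
            buffer.commit()             # commit memory state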
3.6. Summary

The Transmeta Crusoe microprocessor is a successful commercial product that has gained a share of the Intel X86-compatible microprocessor market. Transmeta has been successful in marketing the Crusoe for use in laptop computers and handheld devices. Transmeta has used dynamic translation and execution of programs to reduce the power requirements of the Crusoe by 60%-70% compared to compatible microprocessors [14]. This has been accomplished by reducing the complexity of the underlying hardware architecture of the microprocessor [7].

The Crusoe is innovative in its combination of simple hardware and a dynamic translation process to create a system capable of emulating another microprocessor. This approach is a new direction in microprocessor design that may allow designers to create more sophisticated microprocessors that are compatible with legacy microprocessors.

4. Comparison of the DAISY and Crusoe Microprocessors

4.1. Overview

Both the DAISY and Crusoe microprocessors are static microprocessors that are completely compatible with the binary machine code of other microprocessors. These two microprocessors utilize the dynamic translation process of Figure 3 to re-translate program code compiled for different microprocessors into machine code for their own VLIW processor cores. In both microprocessors, if a re-translated portion of machine code is frequently executed, it is optimized [6, 7]. Although the DAISY and Crusoe appear similar from a high-level viewpoint, their detailed implementations are quite different, as shown in Table 2.

• The DAISY has been developed for research purposes [6], while the Crusoe has been developed as a commercial product [7].
• They use different approaches to re-translation and optimization of machine code.
• The DAISY microprocessor uses optimizations that center around branch analysis [6].
• The Crusoe microprocessor performs specialized optimizations only valid on Intel X86 machine code [7].

Because the DAISY and Crusoe are designed to be compatible with different microprocessors, comparison of the two is not as straightforward as it would have been otherwise. An overview of some similarities and differences between the two microprocessors has been presented in this subsection. In the next subsection, the difficulty encountered in making direct comparisons of certain aspects of these two microprocessors is presented. Subsection 4.3 presents a comparison between the architectures of the two microprocessors, followed by a comparison of how they re-translate machine code in Subsection 4.4. After this, the differences in the approach taken to scheduling by the two microprocessors are presented in Subsection 4.5, followed by a look at the different optimizations utilized by the microprocessors in Subsection 4.6. Finally, the different ways in which the microprocessors handle special situations (e.g., self-modifying code) are presented in Subsection 4.7.

Table 2. A comparison summary of the DAISY and Crusoe microprocessors.

Objective:
  DAISY: research based.
  Crusoe: commercial product.
Goal:
  DAISY: completely compatible with an existing microprocessor.
  Crusoe: completely compatible with the Intel X86; less expensive.
Architecture:
  DAISY: VLIW based; clustered; hardware and software based; official PowerPC registers; clustered cache; support for profiling; tree-based instructions.
  Crusoe: VLIW based; hardware and software based; official X86 registers; gated store buffer; specialized TLB.
Translation:
  DAISY: goal of high ILP with low overhead cost due to compilation; performs an initial re-translation on the first occurrence of a PowerPC instruction; further re-translates and optimizes code with moderate and aggressive approaches if necessary; performs code profiling; uses tree groups as the unit of translation.
  Crusoe: performs re-translation on the first occurrence of an X86 instruction; further re-translates code with less speculation and optimization if necessary.
Scheduling:
  DAISY: speculative load operations; memory-mapped I/O not speculatively scheduled.
  Crusoe: speculative memory operations.
Optimization:
  DAISY: copy propagation; load-store telescoping; branch analysis.
  Crusoe: removes the X86 segmentation process; removes memory boundary checks; sub-expression elimination; removal of commit operations; register renaming; code motion; data aliasing; copy elimination; use of alias hardware.
Provides for:
  DAISY: exception handling; self-modifying code; self-referential code.
  Crusoe: exception handling; self-modifying code.

4.2. Difficulties in Comparing the DAISY and Crusoe Microprocessors

Comparing the performance of the DAISY and Crusoe microprocessors is difficult because they are targeted at emulating different microprocessors.
The DAISY system emulates the RISC-like PowerPC microprocessor [6], and the Crusoe microprocessor emulates the CISC-like Intel X86 microprocessor [7]. Adding to the difficulties in comparing these two microprocessors is the lack of information on the performance of the Crusoe. This is due to Transmeta's objective of marketing their microprocessor based on power consumption rather than performance in terms of speed. Even if a computer system based on the Crusoe were purchased and benchmarked, there is no computer system based on the DAISY to compare it against, because the DAISY has never been fabricated [6]. Additionally, if a DAISY microprocessor were available, other factors would impede a true comparison, e.g., peripheral hardware support.

In light of these difficulties, it is still instructive to examine how the DAISY works compared to the Crusoe. The designers of the DAISY have a research focus and are interested in furthering the state of microprocessor emulation and exploring novel techniques [6], while the Crusoe was designed from the beginning to be a commercial product [7]. By analyzing the design of the Crusoe, it is informative to note the techniques its designers chose as opposed to those chosen by the designers of the DAISY.

4.3. The Architectures of the DAISY and Crusoe Microprocessors

The VLIW processor core found in the DAISY is clustered and supports tree-based instructions. This type of processor core relies heavily on branch analysis and works well with programs in which there are several different paths for the flow of control to follow [6]. Thus, the microprocessor may be well suited for decision-based applications (e.g., artificial intelligence applications), but perhaps less suited for computationally intense applications (e.g., matrix multiplication). The DAISY microprocessor is designed not only for emulating the PowerPC microprocessor, but also as a base microprocessor that can be modified and extended to emulate any conventional microprocessor [6]. In contrast, the Crusoe microprocessor is designed only with the intent of emulating the Intel X86 microprocessor [7].

Transmeta has not released information that specifies the level of sophistication of the VLIW processor core of the Crusoe. Instead, they have provided information explaining the gated store buffer, the TLB, and the translation buffer (which serves the same purpose as the instruction cache used by the DAISY) found in the microprocessor. The gated store buffer represents a novel hardware approach to updating the state of the emulated microprocessor. This approach preserves the official state of the system and allows an automatic rollback of the unofficial state to the official state when needed [7], whereas the DAISY must manually roll back any writes to memory. Additionally, the Crusoe provides shadow registers for the official X86 registers, simplifying the rollback of the registers. In contrast, the DAISY uses extra registers to store changes to the state of the emulated microprocessor and copies the values to the official registers in the order that the instructions were to be executed in the original machine code, making the rollback of the registers more complex.

Because Transmeta has not released much information detailing the architectural aspects of the Crusoe, it is difficult to determine the precise level of difference between its architecture and that of the DAISY.
Because the Crusoe is designed specifically to emulate the Intel X86 microprocessor, it might be expected that the VLIW core of the Crusoe resembles the processor core of the X86 microprocessor.

4.4. The Re-translation Processes

The basic approach taken to re-translation in the DAISY and Crusoe microprocessors is similar. Both convert machine code compiled for the emulated microprocessor into machine code for a VLIW-based processor core. When a segment of code is re-translated, both immediately convert the code into the appropriate operations and store them for optimization at a later time. In both microprocessors, this process is transparent to the machine code being executed on the emulated microprocessor.

Although both microprocessors re-translate code the first time it is encountered, they approach the process differently. The DAISY not only re-translates machine code, but also builds tree groups and collects profiling information on the machine code to help guide optimization at a later time [6]. In contrast, instead of using tree groups, the Crusoe re-translates machine code one instruction at a time and performs optimizations on straight-line segments of code [7]; there is no indication that it uses an instruction format similar to the tree-based instructions of the DAISY.

As both microprocessors re-translate code, they link the translations together to remove calls to the re-translation process as much as possible. This helps to speed up the execution of re-translated code. Once a segment of code has been re-translated, both microprocessors optimize it after it has been determined to be a frequently executed segment [6, 7].

The DAISY has more steps in its re-translation process and employs a more sophisticated approach by using tree groups as the unit of translation [6]. Even though the re-translation process of the DAISY system is more complex than that of the Crusoe microprocessor, the Crusoe has shown that such sophistication is not mandatory in implementing a commercially viable microprocessor that employs dynamic translation.

4.5. Scheduling Re-translated Operations

Both the DAISY and Crusoe microprocessors perform speculative scheduling of certain types of operations. The main difference in their approaches is that the Crusoe speculatively schedules all memory operations [7], while the DAISY only speculatively schedules memory stores that do not affect memory-mapped I/O devices [6]. The DAISY detects most memory-mapped I/O operations at re-translation time and schedules them in-order [6], whereas the Crusoe detects such operations at run-time and then re-translates the affected code, scheduling the memory-mapped I/O operations in-order [7].

4.6. Optimization of Re-translated Machine Code

The optimizations used in the DAISY and Crusoe microprocessors vary greatly. The optimizations that the DAISY performs (as specified in [6]) are copy propagation and load-store telescoping [6]. These optimizations are fairly generic in that they can be applied to sequences of DAISY operations no matter what microprocessor the DAISY system is targeted to emulate. The optimizations performed by the Crusoe include both generic and system-specific optimizations. Among the generic optimizations are sub-expression elimination, removal of commit operations, register renaming, code motion, data aliasing, copy elimination, and the use of alias hardware [7].
In the DAISY, copy propagation is used to perform the same optimization as register renaming, and load-store telescoping is similar to the Crusoe's use of alias hardware. The generic optimizations performed by the Crusoe give it an edge over the DAISY with regard to the efficiency of compiled code. However, the branch analysis approach of the DAISY may negate the potential performance advantage that the Crusoe gains from these optimizations.

The Crusoe also performs optimizations that are specific to Intel X86 machine code: the removal of the X86 segmentation process and the removal of upper boundary memory checks [7]. These optimizations help the Crusoe compete with the Intel X86 microprocessor, but if the Crusoe were re-targeted to emulate a different microprocessor, some of them would become invalid.

4.7. Handling of Special Situations

The DAISY and Crusoe microprocessors both provide facilities for handling special situations, e.g., self-modifying code. The microprocessors handle PowerPC (DAISY) and Intel X86 (Crusoe) exceptions by ensuring the state of the emulated microprocessor is correct, performing any required actions, and then allowing the operating system to handle the exceptions as needed. If the exceptions are DAISY- or Crusoe-specific, then they are handled internally [6, 7].

The DAISY provides facilities for dealing with both self-modifying code and self-referential code [6], while (according to [7]) the Crusoe microprocessor only addresses self-modifying code. The mechanism used for handling such code is similar in both microprocessors. The DAISY handles self-modifying code by adding a "read-only" bit to every memory unit allocated to the emulated PowerPC microprocessor. This "read-only" bit tells the DAISY VMM whether a translation should be invalidated due to a store to the memory location it covers [6]. Similarly, the Crusoe handles self-modifying code by adding a T-bit to every address translation in the TLB. Then, if an address with its T-bit set is written to, the Crusoe can handle the situation accordingly [7].

4.8. Summary

The designers of the Crusoe microprocessor started with a basic VLIW processor core (possibly resembling the core of the Intel X86 microprocessor's multiple-issue execution unit) and added only the most necessary hardware components to make the microprocessor run as fast and as power-aware as possible [7]. On the other hand, the architects of the DAISY microprocessor chose to build their design on a sophisticated clustered tree-based VLIW processor core, to which they added a clustered cache system [6]. Although this design works well on programs that benefit from branch analysis, it is complex and may require more hardware to implement than the Crusoe microprocessor.

5. Proposed Future Research Directions

5.1. Overview

Several research projects have attempted to harness and capitalize on the flexibility of reconfigurable hardware, often realizing significant performance improvements for target applications such as DNA matching, target recognition, pattern searching, and encryption/decryption [15, 16, 17]. However, as impressive as these performance improvements are, computing using reconfigurable hardware (often referred to as configurable or reconfigurable computing) has remained a custom solution to a relatively small set of problems and has yet to make a significant impact in the general-purpose computing community [17].
This has led to an effort to merge general-purpose computing and reconfigurable hardware. This section presents two research ideas that merge general-purpose microprocessors and reconfigurable hardware. The first idea is an architectural approach to using reconfigurable hardware in a microprocessor that performs dynamic translation (i.e., a DAISY-like microprocessor). The second research idea is to analyze the instruction set of a current microprocessor. The goal of this analysis is to develop an analytical approach to designing microprocessors that employ reconfigurable hardware. In this work, instruction set analysis is performed using instruction set partitioning, which classifies instructions based on how frequently and how closely they are executed with respect to other instructions. If partitions exist, then it is conceivable to provide configurations for reconfigurable hardware that best match the characteristics of the instructions in each partition.

In the next subsection, an overview of the research idea of combining a DAISY-like microprocessor and reconfigurable hardware is presented. Subsection 5.3 presents an approach to instruction set partitioning, an overview of a microprocessor that could make use of instruction set partitions, and preliminary research that has been performed in this area.

5.2. An Architecture to Support Dynamic Translation with Reconfigurable Computing

In order for a microprocessor to be marketable and able to compete with current microprocessors, it must be compatible with a successful microprocessor in today's market [7]. This is the motivation behind the Crusoe and DAISY microprocessors. The DAISY and Crusoe have made breakthroughs in emulating microprocessors at a hardware level using a dynamic translation process that is transparent to the applications running on the microprocessor. However, they must re-translate program code compiled for the emulated microprocessor. The goal of the optimizations and scheduling approaches used in the translation process is to execute the re-translated machine code fast enough to negate the effect of having to re-translate the initial machine code.

It may be possible to further speed up the re-translation process by implementing the algorithms used in these processes, and (in certain circumstances) the resulting machine code, in reconfigurable hardware. This concept follows closely what has been done in the DISC (Dynamic Instruction Set Computer) microprocessor [18]. In DISC, each instruction is implemented as a stand-alone circuit module. Then, as the instructions are executed, the circuit for the current instruction is configured into the reconfigurable hardware and executed [18]. A similar approach can be taken to implementing the re-translation processes in reconfigurable hardware: the re-translation process can be synthesized into circuits and then partitioned into segments that can be implemented in reconfigurable hardware. When a segment finishes and control needs to pass to the next segment of the process, the reconfigurable hardware is re-configured to implement the new segment. To illustrate, the segmentation of a simple dynamic translation process is shown in Figure 23.

Another improvement, which may be beneficial to a microprocessor that uses dynamic translation to emulate an existing microprocessor, is to add a reconfigurable execution unit to the core of the microprocessor, as shown in Figure 24.
In this approach, the algorithm that controls the re-translation, optimization, and execution of the program code can choose to optimize an instruction or set of instructions by synthesizing them into circuits and implementing them in the reconfigurable execution unit. With an arrangement of execution units similar to that shown in Figure 24, the microprocessor can implement instructions in both static hardware and reconfigurable hardware.

Figure 23. The segmentation of a simple dynamic translation process in which each segment represents a different configuration of the same reconfigurable hardware.

Several mechanisms can be used to determine if an instruction or set of instructions should be synthesized and targeted for reconfigurable hardware. A simple mechanism for making this determination is an analysis of how often the instruction(s) are executed and how much time the microprocessor spends in the associated segments of program code. However, the time taken to synthesize the instructions into circuits and the time required to re-configure the reconfigurable execution unit must also be considered.

Figure 24. A microprocessor core that includes a reconfigurable execution unit.

A system that implements the re-translation process in reconfigurable hardware and/or synthesizes certain instructions or groups of instructions into reconfigurable hardware may perform better than microprocessors such as the DAISY or Crusoe, which rely only on static hardware. Whether it does is likely to depend on the time required to reconfigure the reconfigurable logic in the microprocessor and the time required to synthesize instructions into hardware.

With the modifications presented in this subsection, the system architecture of the resulting microprocessor would resemble Figure 25. These approaches may prove infeasible with current technology, due to the reconfiguration times of existing reconfigurable devices. However, reconfigurable technology continues to increase in logic capacity and decrease in reconfiguration time. Given these ongoing improvements, the future may provide opportunities to implement such a system for general-purpose computing.

Figure 25. A high-level view of a system that uses dynamic translation and reconfigurable hardware.

5.3. Instruction Set Analysis

5.3.1. Overview

The research idea presented in this section is in the area of instruction set analysis and partitioning. The motivation behind instruction set analysis is to develop a formal approach to microprocessor design that is soundly based in a mathematical context. The presentation of instruction set partitioning here is based on how closely instructions are executed with respect to each other. The existence of instruction set partitions in existing instruction set architectures may be useful in the design of new microprocessors (based on reconfigurable hardware) or of reconfigurable execution units for use within a microprocessor (e.g., Figure 24).

A microprocessor (or execution unit) designed around instruction set partitions implements one partition of the instruction set in circuitry at a time. Instructions that fall outside the currently supported partition can be emulated using the instructions found in the current partition. Alternatively, when the control process of the microprocessor determines that the configurable hardware needs to be changed (i.e., because the partition containing the majority of the instructions being executed by the current program has changed), the microprocessor re-configures to implement the appropriate partition in circuitry, as sketched below.
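A hedged Python sketch of this reconfiguration policy follows: track which partition recent instructions belong to, and reconfigure when the majority shifts. The partition assignments, the sampling window, and the reconfigure() hook are all illustrative assumptions.

    from collections import Counter, deque

    WINDOW = 1000                    # assumed sampling window, in instructions

    class PartitionController:
        def __init__(self, partition_of, initial_partition):
            self.partition_of = partition_of   # maps opcode -> partition id
            self.current = initial_partition
            self.recent = deque(maxlen=WINDOW) # partitions of recent instructions

        def on_instruction(self, opcode):
            self.recent.append(self.partition_of[opcode])
            majority, _ = Counter(self.recent).most_common(1)[0]
            if majority != self.current:
                self.reconfigure(majority)

        def reconfigure(self, partition):
            # Placeholder for loading the circuit configuration for `partition`
            # into the reconfigurable hardware.
            self.current = partition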
The drawback of using instruction set partitions (where only one partition is supported in hardware at a time) is that instructions not implemented in the current partition must be executed through emulation, which may be inefficient. However, the use of emulation allows any instruction in the instruction set to be performed without reconfiguring to the correct partition, bringing a sense of completeness to each partition if it can be used to emulate any instruction that it does not directly support in hardware. Instruction set partitions may prove to be a very powerful tool in designing new instruction sets, if such partitions can be discovered and ways of analyzing them can be developed.

Before techniques to create instruction set partitions are explored in detail, an initial study has been performed to determine if instruction set partitions exist in the instruction sets of today's microprocessors. This section presents an initial study of an existing instruction set to see if such partitions exist. The next subsection presents a technique that may be able to detect such partitions in existing instruction sets, along with preliminary experiments performed to verify this technique. Finally, the overall results of these experiments and areas for future work are presented in Subsection 5.3.3.

5.3.2. Detecting Instruction Set Partitions with Clustering

5.3.2.1. Overview

In this initial study, clustering was used to detect instruction set partitions for the Intel IA-32 instruction set [1]. Clustering is used because such techniques can determine if a set of data is weakly or strongly differentiated, indicating whether one or multiple classifications for the data exist [19]. This discovery of classifications of data (in our case, instructions) is the purpose of this initial study. For this study, the K-Means clustering technique was chosen because of its simplicity and ease of implementation.

In the K-Means clustering algorithm (shown in Figure 26), the data is grouped into a set of mutually exclusive clusters, each of which has a center. Before the first iteration of the algorithm, the first k data units are chosen as the centers of k clusters (the k centers can also be randomly chosen). Then, in step two, the remaining data is assigned to clusters depending on the cluster that the data is closest to (determined by the distance between the data and the center of each cluster). In step three, each cluster is examined to determine the data point that should be its center. Finally, the algorithm iterates steps two and three until convergence is reached [19]. In this study, convergence is determined by the distance between the old centers and the new centers; convergence is reached when this distance becomes stable or cyclic.

Figure 26. The K-Means clustering algorithm, derived from [19].
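A compact Python rendering of the procedure of Figure 26 is given below, written against an arbitrary distance function so that it can accept the metric defined in Subsection 5.3.2.2. It is a simplified sketch: the convergence test here (unchanged centers, with an iteration cap) stands in for the stable-or-cyclic test used in the experiments.

    def k_means(items, k, distance):
        # Step 1: take the first k items as the initial cluster centers.
        centers = list(items[:k])
        clusters = {}
        for _ in range(100):   # the experiments converged in under 100 iterations
            # Step 2: assign each item to the cluster with the nearest center.
            clusters = {c: [] for c in centers}
            for item in items:
                nearest = min(centers, key=lambda c: distance(item, c))
                clusters[nearest].append(item)
            # Step 3: make each cluster's new center the member that minimizes
            # the total distance to the cluster's other members.
            new_centers = [
                min(members, key=lambda m: sum(distance(m, o) for o in members))
                for members in clusters.values() if members
            ]
            # Step 4: stop when the centers no longer change (a simplification
            # of the stable-or-cyclic convergence test used in the study).
            if set(new_centers) == set(centers):
                break
            centers = new_centers
        return clusters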
The Intel IA-32 instruction set [1] was chosen for this study because it is possible to trace the execution of a program on an Intel IA-32 microprocessor (e.g., an Intel Pentium 4) [1] at the assembly-language level, one instruction at a time. This is possible using the ptrace system call [20] supported under the Linux operating system (e.g., Red Hat Linux 7.3). ptrace allows a program to exec (i.e., start) another program to be traced, attach to it, and control its execution [20]. As each instruction is executed, it can be read from memory through the ptrace interface and disassembled into an assembly-level operation. The use of ptrace allows a finer granularity of control than other methods, such as using debuggers. The GDB debugger [21] was also studied for this project; however, such debuggers only trace programs at the system-call level and then disassemble large portions of the program at a time [21], so it is impossible to know exactly which instructions in those portions were executed.

The rest of this section presents a series of three clustering experiments that were performed on the IA-32 instruction set as part of this study. In the next subsection, a description of the clustering experiments is presented, followed by a presentation of the results of the three experiments.

5.3.2.2. Clustering Experiments Performed

These experiments analyze the instructions executed by the POV-Ray (Persistence of Vision Raytracer) raytracing program [22]. This program was chosen because it uses sophisticated algorithms and can be successfully traced at the assembly level using the technique discussed in Subsection 5.3.2.1. In these experiments, POV-Ray was statically compiled for an Intel Pentium 4 system running the Red Hat Linux operating system (version 7.3) and was used to create the image shown in Figure 27. The input file, simple.pov, used for creating this image comes with the distribution packages of POV-Ray.

Figure 27. The image created using POV-Ray for the clustering experiments.

In order to perform the clustering, a distance metric is required to determine the cluster to which each instruction should be assigned and the instructions that become the resulting centers. This distance metric is crucial to the quality of the resulting clusters because it determines how the instructions are classified into different clusters. The distance metric developed for these experiments measures how close (in time) instructions are to other instructions in the instruction set for the program being executed. The closeness of two instructions is based on how many delays (instructions) apart the two instructions are in the execution trace of the program. This information is collected as the program is being executed and is stored in a three-dimensional array: the array element with indices (i, j, k) holds the number of times instructions i and j are separated by a delay of k.

A limit is placed on how far apart two instructions can be in the data collected. This limit serves two purposes: (1) it determines how close an instruction must be to the center of a cluster to be assigned to that cluster; and (2) it keeps the three-dimensional array stored in memory from growing so large that this becomes an inefficient mechanism for collecting data. The limit is referred to as the window size because it effectively places a sliding window over the program's execution trace for data collection (for these experiments, the size of the window is ten delays).
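The delay-data collection can be sketched in Python as follows: for each executed instruction, record how many instructions (delays) separate it from each instruction in the preceding window. The nested dictionaries stand in for the three-dimensional array; this is an illustration of the scheme, not the study's actual program.

    from collections import defaultdict

    WINDOW_SIZE = 10   # the window used in these experiments (ten delays)

    def collect_delays(trace):
        # counts[i][j][k] = number of times instruction j follows instruction i
        # at a delay of k instructions, for 1 <= k <= WINDOW_SIZE.
        counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        for pos, instr in enumerate(trace):
            for delay in range(1, WINDOW_SIZE + 1):
                if pos - delay < 0:
                    break
                earlier = trace[pos - delay]
                counts[earlier][instr][delay] += 1
        return counts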
Once the delays for all the instructions executed by the program are collected, the three-dimensional array is compressed into two dimensions by summing across the dimension that represents the different delays (from 1 delay up to the window size) and normalized by dividing each element of the resulting two-dimensional array by the sum of all of the elements. This results in a likelihood value that represents the probability of two instructions being encountered within the window used while executing the program. Finally, the distance between two instructions is calculated with the following distance metric: distance(x, y) = 1 − likelihood(x, y).

Once these values were calculated for an execution of POV-Ray, the K-Means clustering algorithm (of Figure 26) was run three times to cluster the instructions, with ten clusters (k = 10). The results of the first run are discussed in the next subsection, followed by the results of the second and third runs. In these experiments, the clustering algorithm was allowed to iterate until convergence was observed (this occurred in fewer than 100 iterations for each experiment).
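Turning the raw counts into this metric can be sketched as follows, building on the collect_delays() sketch above and assuming the same nested-dictionary layout (and a non-empty trace).

    def distance_table(counts):
        summed = {}
        total = 0
        for i, row in counts.items():
            for j, delays in row.items():
                s = sum(delays.values())   # sum across all delay values
                summed[(i, j)] = s
                total += s
        # Normalize to a likelihood, then apply distance = 1 - likelihood.
        return {pair: 1.0 - s / total for pair, s in summed.items()}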
5.3.2.3. First Clustering

The results of the first run of the clustering algorithm are shown in Figure 28. In these results, a distance-to-center value of 1 indicates that the pair consisting of the instruction and the cluster's center instruction was never encountered within the window used to collect delay data, and a distance-to-center value of -1 indicates that the instruction is the center of the cluster. Thus, a distance-to-center value close to 1 indicates that the instruction and the center of the cluster are far apart, and a value close to 0 indicates that they are close together.

In the results reported in Figure 28, the clusters found are not distinct, because the instructions that form each cluster are far away from the center of the cluster. Additionally, the centers of the clusters that were found have high execution frequencies and are spread throughout the execution of the program. This indicates that the centers that were found are not true centers of a temporal portion of executed instructions. This led to the next run of the clustering algorithm, in which the ten most frequently executed instructions were removed from consideration in the clustering process. This was done in an effort to let the clustering algorithm discover centers that are more reasonable than those found in the first run.

Figure 28. Results of the first run of the K-Means clustering algorithm (for each of the ten clusters: the center instruction and, for every member instruction, its distance to the center and its execution frequency).

5.3.2.4. Second Clustering

The results of the second run of the clustering algorithm are shown in Figure 29, and the instructions removed from consideration in the clustering process are listed in Figure 30. As in the first run, this run of the algorithm resulted in clusters that are not distinct (the instructions assigned to each cluster are far away from the center of the cluster). However, in these results the execution frequencies of the centers are lower and closer to those of the instructions found in the clusters than in the first run. Based on these results, one more run of the algorithm was performed, in which the fifty most frequently executed instructions were removed from consideration in the clustering process.
Figure 29. Results of the second run of the K-Means clustering algorithm (same layout as Figure 28).

mov push cmp pop fld test jnz add inc sub

Figure 30. The ten most frequently executed instructions.

5.3.2.5. Third Clustering

In the final run of the clustering algorithm, the fifty most frequently executed instructions (listed in Figure 31) were removed from consideration in the clustering process. The major result of this run (as shown in Figure 32) is the same as in the first and second runs: the clusters that were found are not distinctive. The instructions assigned to each cluster are still far away from the centers of the clusters. Additionally, in this run, the centers of the clusters have execution frequencies that are close to those of the other instructions in the clusters.
mov push cmp pop fld test jnz add inc sub movzx lea jz fstp fxch fmul ret call and fstsw jmp jle fldcw dec faddp xor fucompp fadd jbe jc fstcw fst jns fldz fld1 fucom shl ja fistp cld nop fsubr jnc jg fucomp leave jl or fabs frndint

Figure 31. The fifty most frequently executed instructions.

5.3.3. Overall Results of the Experiments and Future Work

The three experiments conducted are inconclusive because the clusters found in all three runs of the clustering process (discussed in Subsections 5.3.2.3, 5.3.2.4, and 5.3.2.5) are not distinctive from each other. There are several possible reasons why this occurred: (1) using a clustering technique to discover instruction set partitions may be inappropriate; (2) the K-Means clustering technique may not be appropriate for this application of cluster analysis; (3) the distance metric used may be poorly formulated or may need to be revised; and/or (4) the clustering program used in these experiments may contain errors. These issues must be addressed before more work is done in this area. Thus, more research into clustering and instruction set partitioning needs to be performed, and other clustering techniques besides the K-Means algorithm (Figure 26) need to be investigated. Additionally, the distance metric needs to be reviewed and tested, and the correctness of the clustering program needs to be verified. One possible way to test the distance metric and verify the correctness of the clustering program is to run them on an assembly language program that can be analyzed by hand and to check that the results match; a sketch of such a hand-checkable validation follows Figure 32.

Figure 32. Results of the third run of the K-Means clustering algorithm. [Table: same layout as Figure 28; even with the fifty most frequent instructions removed, nearly all member distances remain at or near 1.0.]
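As an illustration of that validation idea, the sketch below runs textbook K-Means (Lloyd's algorithm) on a tiny synthetic data set whose correct grouping is obvious by inspection, so the output can be verified by hand. It uses squared Euclidean distance over made-up two-dimensional feature vectors; the distance metric and instruction features used in the experiments differ, so this is a stand-in for the actual clustering program, not a reproduction of it.

    import random

    def kmeans(points, k, iterations=100, seed=0):
        # Textbook K-Means (Lloyd's algorithm) over 2-D points.
        rng = random.Random(seed)
        centers = rng.sample(points, k)  # initial centers drawn from the data
        for _ in range(iterations):
            # Assignment step: attach each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for (x, y) in points:
                dists = [(x - cx) ** 2 + (y - cy) ** 2 for (cx, cy) in centers]
                clusters[dists.index(min(dists))].append((x, y))
            # Update step: move each center to the mean of its cluster.
            new_centers = []
            for i, cluster in enumerate(clusters):
                if cluster:
                    new_centers.append((sum(x for x, _ in cluster) / len(cluster),
                                        sum(y for _, y in cluster) / len(cluster)))
                else:
                    new_centers.append(centers[i])  # keep an empty cluster's center
            if new_centers == centers:  # converged
                break
            centers = new_centers
        return centers, clusters

    # Hand-checkable input: two obvious groups, near (0, 0) and near (10, 10).
    points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
    centers, clusters = kmeans(points, k=2)
    print(centers)   # expect roughly (0.33, 0.33) and (10.33, 10.33)
    print(clusters)  # expect the two groups recovered exactly

If the clustering program reproduces such an obvious grouping, attention can then turn to whether the distance metric itself behaves sensibly on real instruction data.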
In summary, the results presented in this study are inconclusive, and the experiments that were performed need to be scrutinized. Additionally, more work needs to be performed in the area of instruction set analysis, and different ways of evaluating instruction sets need to be formulated.

5.4. Summary

Two ways to improve the design of microprocessors have been presented in this section. The first combines the dynamic translation process of Figure 3 with reconfigurable hardware to create a reconfigurable microprocessor that uses dynamic translation to execute programs. More research needs to be performed on this concept; it is only in the initial stages of development and has not been pursued further than the ideas proposed in Subsection 5.2. The second is to analyze the instruction sets used in today's microprocessors and to develop a methodology for evaluating the design and use of instruction sets. The initial step in such an analysis, based on instruction set partitioning, has been presented in this section. This study needs to be expanded, and its methods need to be reviewed at a deeper level.

6. Conclusions

This report has introduced a microprocessor taxonomy that classifies microprocessors based on the technology used to implement them (static or reconfigurable), the process they use to translate machine code and execute instructions, and whether that process is performed in software or hardware. The design and operation of two different static microprocessors that perform dynamic translation of machine code have been presented and compared. At the end of the report, two possible research directions were introduced: (1) reconfigurable computing combined with dynamic translation, and (2) instruction set partitioning analysis.

The two microprocessors reviewed in this report are the IBM DAISY and the Transmeta Crusoe. These microprocessors use dynamic translation to execute machine code originally compiled for the PowerPC and Intel X86 microprocessors, respectively. Their designs, and the ways in which they perform dynamic translation, differ greatly. DAISY is based on a sophisticated VLIW processor core, while the Crusoe uses a simplified VLIW core with extra hardware support for quickly rolling back the state of the emulated microprocessor when an exception occurs. The re-translation, optimization, and scheduling processes also differ: DAISY takes a generic approach, while the Crusoe is Intel X86 specific and performs specialized optimizations that may apply only to Intel X86 machine code.

The DAISY and Crusoe microprocessors both represent a new direction for microprocessor design. They embody the reality that, for a new microprocessor to succeed in today's market, it should be compatible with the instruction set of a microprocessor that has already been successful, because of the vast amount of legacy software and hardware that dominates the market.

At the end of the report, two concepts for improving on today's microprocessors were presented. The first is to combine the dynamic translation process with a reconfigurable microprocessor. Such a microprocessor may be able to outperform a static counterpart because it implements the dynamic translation process in hardware, and the optimization process has the option of synthesizing an instruction or segment of code into circuits implemented in the reconfigurable hardware. Implementing instructions in hardware in this way could speed up execution of the associated operations.
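As a purely conceptual sketch of that idea (nothing here reflects an existing implementation; every name and threshold is hypothetical), such a translator might emulate a code segment in software until it proves hot, synthesize the hot segment into a fabric circuit, and dispatch subsequent executions to the hardware version:

    # Conceptual sketch only: models the dispatch policy of a dynamic
    # translator backed by reconfigurable hardware. All names and
    # thresholds are hypothetical illustrations, not the report's design.

    HOT_THRESHOLD = 50  # executions before a segment is worth synthesizing

    class ReconfigurableTranslator:
        def __init__(self, synthesize, emulate):
            self.synthesize = synthesize   # maps a code segment onto the fabric
            self.emulate = emulate         # software/VLIW execution fallback
            self.exec_counts = {}          # address -> times executed
            self.fabric_circuits = {}      # address -> synthesized circuit handle

        def execute(self, address, segment):
            # Prefer a circuit already configured in the reconfigurable fabric.
            circuit = self.fabric_circuits.get(address)
            if circuit is not None:
                return circuit()
            # Otherwise emulate, counting executions to find hot segments.
            self.exec_counts[address] = self.exec_counts.get(address, 0) + 1
            if self.exec_counts[address] >= HOT_THRESHOLD:
                self.fabric_circuits[address] = self.synthesize(segment)
            return self.emulate(segment)

The interesting design questions, untouched by this sketch, are how the synthesis step maps code to circuits and when the cost of reconfiguration outweighs the speedup.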
The second idea presented at the end of this report is instruction set partitioning, an initial study of the properties a good instruction set possesses. The ideas presented in this part of the report are pursued in hopes of formalizing the design, use, and evaluation of instruction sets. However, the results of this study are inconclusive.

More research needs to be performed in the areas of dynamic translation, reconfigurable computing, and the design of microprocessors and the instruction sets they implement. These areas represent vast opportunities for improving the microprocessors currently being developed and produced. As the feature size of the technologies used to implement microprocessors shrinks, they will become faster, enabling more processing of machine code to be performed with less execution latency. Additionally, reconfigurable technology will be able to implement larger circuits and to reconfigure them in less time. This will continue to open more opportunities for deeper research as these technological improvements are realized.

References

[1] IA-32 Intel Architecture Software Developer's Manual, Intel Corporation, http://developer.intel.com/design/pentium4/manuals/index2.htm, 2002.

[2] PowerPC Microprocessor Family: The Programming Environment for 32-Bit Microprocessors, International Business Machines Corporation, http://www3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2, 2002.

[3] J. Gosling and H. McGilton, "The Java Language Environment: A White Paper," Sun Microsystems Inc., Mountain View, California, ftp://ftp.javasoft.com/docs/papers/langenviron-pdf.zip, May 1996.

[4] E.R. Altman, K. Ebcioğlu, M. Gschwind, and S. Sathaye, "Advances and Future Challenges in Binary Translation and Optimization," Proceedings of the IEEE, Vol. 89, No. 11, November 2001, pp. 1710-1722.

[5] The Java HotSpot Virtual Machine Technical White Paper, Sun Microsystems Inc., http://wwws.sun.com/software/solaris/java/wp-hotspot/, 2001.

[6] K. Ebcioğlu, E.R. Altman, M. Gschwind, and S. Sathaye, "Dynamic Binary Translation and Optimization," IEEE Transactions on Computers, Vol. 50, No. 6, June 2001, pp. 529-548.

[7] R.F. Cmelik, D.R. Ditzel, E.J. Kelly, C.B. Hunter, D.A. Laird, M.J. Wing, and G.B. Zyner, "Combining Hardware and Software to Provide an Improved Microprocessor," US Patent 6,031,992, February 2000.

[8] C. Iseli and E. Sanchez, "Beyond Superscalar Using FPGAs," Proceedings of the 1993 IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1993, pp. 486-490.

[9] D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers Inc., San Francisco, California, 1996.

[10] K. Ebcioğlu, J. Fritts, S. Kosonocky, M. Gschwind, E.R. Altman, K. Kailas, and T. Bright, "An Eight-Issue Tree-VLIW Processor for Dynamic Binary Translation," Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998, pp. 488-495.
[11] A.V. Aho, R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Massachusetts, 1988.

[12] K. Ebcioğlu, E.R. Altman, S. Sathaye, and M. Gschwind, "Optimizations and Oracle Parallelism with Dynamic Translation," Proceedings of the 32nd Annual International Symposium on Microarchitecture, 1999, pp. 284-295.

[13] D.A. Patterson and J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Second Edition, Morgan Kaufmann Publishers Inc., San Francisco, California, 1998.

[14] A. Klaiber, "The Technology Behind Crusoe Processors: Low-Power X86-Compatible Processors Implemented with Code Morphing Software," Transmeta Corporation, Santa Clara, California, http://www.transmeta.com/about/press/white_papers.html, January 2000.

[15] J.R. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 1997, pp. 12-21.

[16] J.M. Arnold, D.A. Buell, and E.G. Davis, "Splash 2," Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 316-322.

[17] J. Villasenor and B. Hutchings, "The Flexibility of Configurable Computing," IEEE Signal Processing Magazine, Vol. 15, No. 5, September 1998, pp. 67-84.

[18] M.J. Wirthlin and B.L. Hutchings, "A Dynamic Instruction Set Computer," Proceedings of the 1995 IEEE Symposium on FPGAs for Custom Computing Machines, 1995, pp. 99-107.

[19] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, New York, 1973.

[20] Red Hat Documentation: Linux Programmer's Manual, PTRACE, Red Hat, Inc., http://www.europe.redhat.com/documentation/man-pages/man2/ptrace.2.php3, March 2000.

[21] The GNU Project Debugger: Documentation for GDB version 5.2.1, GDB Internals, Free Software Foundation, Inc., http://sources.redhat.com/gdb/download/onlinedocs/gdbint.html, April 2002.

[22] POV-Ray 3.5 Documentation, Hallam Oaks Pty. Ltd., http://www.povray.org/documentation/, April 2002.