Embedded and Reconfigurable Systems Research at
DemoFest’09
Zhimin Chen, Ken Eguro, Alessandro Forin, Ruirui Gu, Zhanpeng Jin, Paul Larson, Wenchao Li,
Weiqin Ma, Rene Müller, Neil Pittman
Microsoft Research
William Bengston, Meg Davis, Drew Fisher, Larry Laugesen, Steve Liu, Grant Marvin,
Jon Moeller, Brandon Nance, William Somers, Jillian Weise
Texas A&M University
July 2009
Technical Report
MSR-TR-2009-187
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
Embedded and Reconfigurable Systems Research at
DemoFest’09
Zhimin Chen, Ken Eguro, Alessandro Forin, Ruirui Gu, Zhanpeng Jin, Paul Larson, Wenchao Li,
Weiqin Ma, Rene Mueller, Neil Pittman
Microsoft Research
William Bengston, Meg Davis, Drew Fisher, Larry Laugesen, Steve Liu, Grant Marvin,
Jon Moeller, Brandon Nance, William Somers, Jillian Weise
Texas A&M University
Abstract
The Embedded and Reconfigurable Systems Group at Microsoft Research has been engaged in joint academic
collaborations spanning both teaching and research activities. The results of these collaborations are showcased
annually at the Faculty Summit at Microsoft headquarters in Redmond, WA, during the DemoFest event. This is also
a good opportunity to review some of the other research projects that the group is engaged in, with special
consideration to the research performed as part of the graduate internships. This report presents the demonstrations
that took place during the 2009 DemoFest. We presented two undergraduate student projects from the Real Time
Distributed System group at Texas A&M. One is a touch screen prototype that uses light diffraction rather than
pressure-sensing to realize a multi-touch 2D input device. The second is a LED-input based dance pad that
overcomes the wear and tear problems of traditional dance pads by using light sensing, and connecting LEDs as
input rather than output devices.
Members of the ERSG group presented a number of research projects and demos. A set of APIs simplifies the
communication between PCs and FPGA boards and when using Gigabit Ethernet achieves full-bandwidth speed.
FPGAs are also used to accelerate the processing of networking protocols in a database system, with automatic
generation of the circuits directly from the protocol specification. A new CPU model for the Giano full-system
simulator supports the x86 ISA, and additionally realizes mixed concrete and symbolic execution of binary codes to
detect data races in multi-threaded programs. A novel system for mining specifications deduces timing constraints
in timed traces for digital circuits, embedded software, and network protocols. The system accurately pinpoints the
source of errors in a faulty eMIPS micro-processor design. IEEE compliant Floating-Point execution units are fully
optimized on a per-application basis and dynamically un/loaded in the reconfigurable logic portion of the microprocessor. The M2V compiler can now handle multi-basic blocks of MIPS binary code to automatically generate
application accelerator circuits. A dual-core version of the eMIPS system demonstrated near-perfect speedup on the
Montgomery modulo multiplication of large integers. The NetBSD Operating System runs in multi-user mode on the
eMIPS system on two FPGA platforms. It uses a new, online scheduler to allocate the available accelerators slots to
competing software applications.
1
Introduction
The Microsoft Faculty Summit is a workshop that attracts highly qualified participants from universities across
the globe. One of the most favorably received events at the Summit is the DemoFest event, when a number of
research projects and related activities are presented in a fair-like environment over a very short period of time (just
about three hours!). The Embedded and Reconfigurable Systems Group has been present at the Summit since its
inception, demonstrating both the results of its own research and the results of joint collaborations with academic
partners and other researchers. The goal of this document is to attempt to communicate the vitality and excitement of
the DemoFest event for those who could not attend it. While the atmosphere of free-flow communications and
discussions is clearly impossible to reproduce, we can at least recapture and recount the artifacts and some of the
practical demonstrations that took place in our small section of the event. Each of the document’s sections is
dedicated to one of the demonstrations that occurred at the event. Each section has a corresponding poster that was
displayed at the booth and which is reproduced in Appendix A. Short movies and memorable moments are recorded
in Appendix B.
It is likewise impossible to describe the atmosphere of the days that immediately preceded the event, when the
demos where finalized sometime very late into the night and very many and very busy people crammed into a
couple of offices, interacting, arguing, feeding, competing, laughing, sleeping and generally helping each other
reach a common goal of complete and amazing success. Our heartfelt thanks go to the administrative and support
people for their help with the ensuing bedlam, and to all our significant others for their huge patience and tolerance.
Broadening the Impact of the Goal Driven, Self-Propelling
Process Learning in Embedded Systems
Steve Liu
Texas A&M University
Abstract
As we continue our efforts to enhance the learning experiences of the undergraduate computer engineering
program, we have passed the experimental stage of the new curriculum, and made it a main stream approach in
teaching embedded systems design. We are confident to say that giving students fun and freedom in exploring
technical issues along directions of their own choices provided an excellent platform for students to translate their
theoretical knowledge practices. Technologies evolve with time rapidly, yet the old-fashioned method of teaching
students how to formulate and attack difficult problems, clearly can stand the test of time. It is the most effective
ways for teachers to guide engineering students to gain real-world experiences, after they acquired their basic
engineering theoretical foundation. Using data sheets, live codes, and design tools as the primary technical
materials brings students to the real world engineering R&D environments. Using wiki pages to report their
projects regularly proved to be much more interesting and productive than traditional written assignments and
quizzes.
Both the Computer Science and Engineering department and the Electrical computer Engineering department
have adopted the curriculum as the standard, pre-capstone design class to better prepare students before they took
on the capstone design classes. The department is also planning for a new class of software studio prior to this
embedded systems class, to enhance students’ learning experiences through hands-on experiences. The author
hosted a series of lab tours for attendees of the Summer Honors Invitational Program (SHIP) using class projects.
Per the coordinator of event, “The students raved about the projects they saw”.
Past few years of experiments helped us to understand the importance of adequate communications prior
beginning of semesters and also broad participation of class evaluation from students. The lab-based course is
expected to be more expensive than purely lecture based classes, but the value of the outcomes clearly justifies the
investment, not to mention its powerful impact to the students’ competence in meeting the industrial demands for
higher, more advanced problem solving skills to face the ever higher global challenges.
The Curriculum
The design-centric curriculum aims to create a goal-driven, self-propelled learning process for the teacher and
students work together to reach a common goal:
“At end of a semester, students can realize their new design concepts on a working prototype using basic
hardware and software components.”
With system design and creativity exploration as the class goals, students have one semester to adjust to the
“learning by doing” learning process. The technical aspects of the new curriculum are conceptualized in figure 1. It
was implemented in the CPSC 462 Microcomputer System class in the Computer Science Department of Texas
A&M University. The changes cover much more than the technical foci. We took a clean slate approach in creating
and revising the course contents; literally everything related to the class―teaching and learning objectives, technical
goals, technical materials, labs, assignments, technology, even the lab physical layout itself―are new. Instead of
covering myriad of technical details, many of which are broadly available on numerous web sites, we focused on the
design and development processes of embedded systems. The goal is to bootstrap students’ self-learning process
using a “learning by doing” approach, so that eventually students can take charge of their own learning
responsibility and deliver their final projects.
Figure 1 A design-centric curriculum for the embedded systems class.
Open Literature vs. Textbooks
A critical decision of this new curriculum was to eliminate the textbook in favor of microelectronic datasheets
(published by their vendors), source codes and lecture notes. “Time to market” is an industry expression that
reflects the competitive nature to deliver a product to the consumer in the shortest time possible. Similarly, after
students have finished their foundational courses, we might define “time to classroom” as the need to bring
contemporary materials to students so that they will be much better prepared for the job market. While textbook
will continue to play its critical role in many courses, it is not necessarily the most effective tool to serve our
objective in building the bound between basic knowledge and design skills. Knowing that professional engineers
must rely on datasheets to carry out their designs, selective use of the datasheets help students make the transition
from college textbooks to the professional “textbooks”. To make the material more manageable, selection of
datasheets was based on their readability, prevalence of the technology, and friendliness of the vendor web site.
Using the industry datasheets instead of the textbook can be an intimidating experience for students, especially in
preparation of tests. In our approach, we limited the discussion to the major chips on the evaluation boards used in
the lab exercises. Only the architectural and functional portions of datasheets were included in the two written tests.
The system platform
We chose both microcontrollers and field programmable gate array (FPGA) as the hardware platforms for the
class. The technical discussions of the class started with introductions of their chip-board level architectures, basic
programming issues (in C and Verilog), and concluded with the interface and integration of the two types of
technology at the bus level. Introduction of applications, operating system, sensors and actuators was synchronized
to lab exercises for students to understand the relationship between high level languages, executable binary codes,
and hardware. Typically, students were directly guided through their first lab programming assignment, after a
similar lecture on the development steps. After two assignments, the discussion was shifted to the FPGA, with
similar lab arrangements. Then, students were asked to build a simple bus interface of the two hardware families. A
sensor/motor exercise wrapped up the lab exercises. Basic system design issues, i.e., bus architecture, timing, and
handshaking protocols dominated the lecture time, with working examples presented by the TA. As of now, we
continue to use the Atmel eb63 evaluation board (equipped with the AT91M63200 chip with ARM7TDMI core),
and Xilinx Spartan III evaluation board for the learning phase of the class. We also continue to use the Microsoft
Invisible Computing (MIC), (http://research.microsoft.com/invisible/, aka MMLite) as a reference operating system
for this kind of small hardware platforms.
As the new generation of small microcontrollers with on-chip A/D converters begin to emerge on the market,
together with highly compact wireless radios, we begin to adopt those microcontrollers for the student projects. We
also began to incorporate soft-core based FPGA technology, i.e., eMIPS, for students to gain an understanding of the
FPGA based systems.
Hands-on and Fun
Fun and curiosity are the key factors that stimulate the students’ willingness to tackle advanced ideas and
transform them into working systems. The value of building an artifact is lost on the students when it cannot be
shown to friends, or if they do not find it interesting. This is a particularly important issue for embedded computing
systems, because of the tight integration of hardware, software, algorithms and even mechanical design into any
complete, working system. It makes the difficult process of building small computing systems under the constraints
of power, sizes, and mechanical structures, etc., much more interesting. The author took a challenge-response
process to mentor students through this process. Students are challenged at every stage about the scope and progress
of their projects, but there is no definite correlation between their adoption of the instructors’ inputs and the final
project outcomes. In fact, some of the best projects were realized in total defiance of the instructors’ suggestions.
The process clearly stimulates the students to think out of the box and to take on more interesting projects.
Students were granted a lot of freedom in their projects, but they are also held accountable for their
decisions. They can use any information, including existing designs in the projects, provided that they can add new
ideas to the project. They can regroup, both scale up or scale down, as their projects progresses with time. They can
even redirect the project goals anytime. That said, students are expected to justify their decisions and choices
throughout the process to minimize waste of time and resources. Safe or “canned” projects are categorically
discouraged, and students took this route usually received the lowest and possibly failing grades.
From the teacher’s perspective, the objective of hands-on and fun is not the artifact itself, but rather the
stimulation of the students’ creativity, knowledge usage and teamwork. The result of the final project is only a
portion of the final grade; even a failed project can receive a good grade if the team can demonstrate high quality in
their project development process. The level of challenges, creativity, and teamwork vs. accountability are all
important decision factors. Students are encouraged to follow and to expand upon the successful projects from
previous semesters. Admittedly, some projects did resemble each other but every semester has produced at least one
very interesting project that is worth preserving and that will be cited as example in the next semester. Outstanding
projects are submitted to the DemoFest fair to share our experiences with participating faculty. The history of
exceptional projects is maintained in a website, and together with the exposure at DemoFest it creates further
motivation for the new students.
Industry Collaboration
Experience tells us that there is a strong correlation between the degree of interaction between industry and
university and these educational outcomes. Ever since the inception of our collaboration, our MSR partners have
played a critical role in the progress that we have been able to make. Our experiences consistently pointed out the
effectiveness of industry participation of hands-on design classes. In addition to the on-going, strong collaboration
with Microsoft researchers, local industries also attended the class project review activities. The feedback from them
is loud and clear: we need to train students for problem solving skills, so that they can handle the ever changing
technical problems on a daily basis. The unique opportunity of being able to attend the DemoFest is particularly
invaluable to enhance the class outcomes as a whole. Students realized that their learning outcomes matter, and they
become highly motivated to prove their competence.
Summary
In summary, we believe we have identified a sustainable, productive educational model to boost students’
interests in the computer engineering program, and likely the whole engineering discipline as a whole. Students are
able to complete far more advanced projects in their capstone design courses, and they demonstrate greater
confidence in their job interviews. We realized that all the artifacts created by students serve as an excellent live
teaching material for students to emerge in an approachable, yet challenging environment to advance their
professional preparedness. It is a great investment that cannot go wrong.
Multi-touch Screen
Meg Davis, Larry Laugesen, Grant Marvin, Jon Moeller
Brandon Nance, Jillian Weise
Department of Computer Science and Engineering
Texas A&M University
Our group decided to make a multi-touch screen using infrared diodes and sensors. This method,
to the best of our knowledge has never been used before in practice but in a very simple test we
have determined that theoretically our design could work. Assuming this design works, we
would like to build some kind of application on top of it that allows for a multi-touch interface.
The following paragraphs outline the design proposal and plan of action that we thus far think
should be necessary to complete the assignment to the best of our ability.
Scope & Expected Outcomes:
Depending on the success or failure of different components and steps in the system design
process, the scope of our project is not as concrete as it would be if the project were to follow a
strict set of implementation details. At the highest level however, our scope encompasses two
main areas: screen design and application design. The screen design phase must be done first
since no applications can be written until the data coming out of the sensor array is better
defined. After the screen design is at a prototype stage, work may begin to progress on the
application design, but development of both of these will most likely occur in tandem as each
relies on the other in very crucial ways.
Action Plan:
Because of the short duration of this final project, we plan to work through the agile
development methodology. This iterative methodology calls for short sprints of work and allows
the project development to be flexible, quickly accounting for any changes in design. We plan to
have sprints of one week in which we will strive to create complete and working modules of
design for the final product. The agile development methodology also places high emphasis on
frequent communication and teamwork. Not only we will have frequent meetings during the
week, but we have also established a Google Group in order to effectively communicate ideas to
the whole group.
We will first focus on prototyping solutions by experimenting with different sensors and
different materials for the interface. Once we have decided on a direction in which to design, we
will first start experimenting with the microcontrollers and the algorithms to read the sensors.
Our brainstorming for how to do this involves polling by seeing how one touch on the board is
read by each sensor. Then we will shift our focus to signal processing and filtering out the noise
from the signals. Application development will also begin at this point. If at any point in the
design, we need to change our approach, the agile methodology will give us the flexibility to do
so.
Our action plan relies heavily on communication in order to work together effectively to create a
product over this short period of time.
Design:
The system is based on a simple optical phenomenon called Total Internal Reflectance, which
relies on light being shone into glass at such a large angle that it cannot escape. This effectively
traps the light inside the glass. When the glass is touched by something (such as a finger) the
total internal reflectance becomes "frustrated", and some of that light will escape the glass.
Image by Jeff Han ©2006 http://cs.nyu.edu/~jhan/ftirsense/
In our setup, we have photo detectors along the edges opposite the LED light sources. The goal
is to detect the decrease in the amount of light hitting the sensor caused by a finger touching the
surface. Touching the screen should cause an intensity dip in both the x and y axes, allowing us
to locate the point of the touch.
There are a few things to note about this diagram. First, the actual design would use more than 4
photo detectors per side. It would be possible to fit as many as 40 sensors along 1 edge of a
typical flat screen monitor, though it is unlikely we would need that many. The other thing of
note is that the width of each beam spreads, so that a single LED might hit multiple sensors.
Because of this, if the user touches a spot not directly across from a sensor, several sensors will
be affected anyway. An algorithm can then use this information to determine the actual point of
contact. In this way, we can have a resolution higher than the number of sensors along the edge.
Since such an approach has not been taken before (to our knowledge) the nature of the algorithm
is not clear yet. However, by showing a graph of each sensors output under different types of
touches would provide an excellent starting point for understanding the exact nature of the
interface.
Eb63 Controls
The eb63 board is utilized in our design to act as a mediator between the output from the A/D converter
and the processing software. The following files contain the primary control mechanisms used to carry
out this function.
common.h
:
Contains functions used to effectively manage the communication between the A/D converter and
processing. OpenCom2 establishes a connection between the eb63 and the com2 serial port, allowing
for data transmission from the A/D converter to the eb63. The ReadOneInteger and WriteOneInteger
functions act as the data pipeline. ReadOneInteger reads the A/D converter output at a specific pint and
returns the value to the eb63, while the WriteOneInteger function passes the value to the processing
software.
ADCDriver.c:
Sets up many of the background utilities that allow for the actions of common.h to execute properly.
StartSystemClock instantiates a 4 MHz clock to run the system, accomplished by dividing the internal
clock of the eb63 to down to the desired rate. ConfigureSPI prepares the SPI protocol used to
communicate with the A/D converter, allowing the eb63 to read from two separate 8-bit A/D converters.
NextRead sets up which of the pins (and consequently the photo-transistor) is currently being read from,
allowing
the
ReadOneInteger
function
to
extract
samples
from
the
proper
source.
VoltTest.cpp:
Runs the processes that manage how and when each pin is checked. The ScanVoltages function cycles
through each of the photo-transistor pins and uses NextRead to prepare the pins for data extraction. After
reading from the A/D converter, the function passes on the sensor number, the value received, and a 255
value to signal that the data set is complete for that transmission.
The SelfTest function was used to check that the A/D converter was functioning
correctly, as this hardware was paramount to the operation of ScanVoltages. All these processes are ran
In the main function at the end of the file.
Processing
Here is a list of explanations of our processing applications.
VoltageGraph
This is our flagship application that we have used nearly exclusively in the testing and refining of our
multi-touch screen. In short, this application takes values given to it from the EB63 microcontroller over
the serial port and maps those values in a simple bar graph. Each bar corresponds to a photo-transistor
on our prototype. The graph that you see is showing the current reading of each PT, with a very short bar
being a PT with a very small reading whereas a tall bar corresponds to a PT with a large reading. When
you press down on the acrylic, which is termed as a "touch," the PT's that are affected by your touch will
show a dip in height. To understand the physics behind why it reads less, refer to our project proposal for
a detailed explanation. Our program has quite a few modes of operation, which are detailed below:
1. Normal Mode: In this mode, you see two bar graphs: one green and one blue. The green bars
correspond to the right side PT's, and the blue bars correspond to the left PT's. This mode is the main
mode we use to calibrate the individual potentiometers for each PT since we can try and line up each bar
with all the other ones on that side. As expected, the PT's on each end are significantly less than the
middle ones because less LED's are affecting those PT's. If you touch the screen in the middle of the
acrylic, you should see mostly teal (green and blue combined). If you touch to either side, you will begin
to see the opposite color since the dip is greater when you are closer to one side or the other. The first
graph is uncalibrated (what we use to calibrate the hardware pots) and the second graph is calibrated and
shows a touch on the right side.
Results
Hardware Schematics
LED DDR Pad
William Somers, Drew Fisher, and William Bengston
Department of Computer Science and Engineering
Texas A&M University
Project overview
This project was created for Dr. Steve Liu's CPSC462 (Microcomputer Systems) class at Texas A&M
University.
Initially, we saw a neat paper by Mitsubishi Electric Research Laboratories on using LEDs as light
sensors (Dietz, P.H.; Yerazunis, W.S.; Leigh, D.L., "Very Low Cost Sensing and Communication Using
Bidirectional LEDS", International Conference on Ubiquitous Computing (UbiComp), October 2003). We
decided we wanted to use this concept to implement something cool. In this project, we designed,
implemented, and constructed a DDR pad. A problem with most industrial designs is that over time,
moving parts wear out and the pad no longer recognizes steps on a particular arrow. We designed our
pad to have no moving parts. Instead, we chose to measure light reflectance off the user's feet to
determine if one of the arrows was being stepped on or not. Since a design with no moving parts offers
little to no tactile feedback, we wanted to make sure the pad offered some other form of feedback, so the
user would know when they had successfully stepped on the arrow. To this end, we chose to embed
LEDs in the arrows themselves, and to have them illuminate when the arrow was pressed. We also
decided we'd like to have these LEDs flash in time with the music.
Design overview
The design consists of one master controller and four slave controllers. Each slave controls a set of 15
LEDs. Thirteen LEDs are placed around the border of the arrow and flash in rhythm with the music. The
remaining two LEDs are used as an illumination and sensing pair - the LED pointing straight up acts as a
permanent flashlight, and the LED at an angle serves to sense reflection of the previous LED off the foot
of the player.
Readings are taken with the technique described in the MERL paper, summarized here. We place the
interior sensing LED in reverse bias mode to allow the LED to charge up a capacitance in the LED
junction itself. When a reading is to be taken, we set the cathode to input mode, disable the internal pullup resistor, and time how long it takes the capacitance to discharge. Greater light input results in a
shorter discharge time. As light from the illumination LED reflects off the foot of the player, the time the
LED takes to discharge through the microcontroller decreases. By comparing this time to a threshold,
the slaves determine if the arrow is pressed, and will raise or lower a line to the master accordingly.
The slaves can also read the voltage of a line from the master. When this bit is high, the slaves
illuminate the arrow LEDs.
The master runs the main DDR pad controller program. It determines when the arrow LEDs should be
on by and receives constant readings from the slaves. The master passes the readings from the slaves
on to the Xbox 360 controller. The master can receive lighting commands from Stepmania running on a
computer through a USB serial port, or it can just leech power from an Xbox 360.
Hardware design
The frame of the DDR pad was built using plywood and 2x4 pieces of wood to divide the pad into 9
equal squares in a 3x3 fashion. The bottom of the pad is a piece of plywood cut to 33"x33". The left and
right edges were constructed using 33" pieces of 2x4 laid on their narrow side. The upper and bottom
edges and the two horizontal dividing pieces were constructed with 30" pieces of 2x4 on their thin sides,
so that they fit within the left and right edges. The vertical dividing pieces were constructed as 6 10"
pieces of 2x4 on their thin sides to fit within the horizontal dividers. The 2x4 pieces of wood are screwed
together at every meeting point and screwed into the plywood from the bottom. The four pieces that
create the middle square had holes drilled in them before being screwed in, so that wires could pass
underneath the 2x4. Similar treatment was given to the two horizontal dividing pieces of the top left and
top right squares. 1" circular holes were drilled in the upper edge in the center of the top left and top right
squares so communication and power cables could be run to the pad. After all holes were drilled and all
the pieces were screwed together, the frame as a whole was spray painted with one coat of primer and
two coats of black paint.
Frame of the wood base.
The top of the DDR pad was constructed using a 1/4" thick 33"x33" sheet of Lexan screwed to the four
edges of the frame with twelve screws. A large heavy duty black handle was attached to center of the
top edge.
Adding the handle.
The four arrows of the DDR pad were constructed out of the same plywood as the base of the pad. They
were cut into the shape of arrows about 8" long and 4" wide at the base. 13 holes were drilled around
the edge of the arrow for the LED leads to fit through, and a 1" circular hole was drilled in the center for
all wires of the slave board to pass through.
The underside of an arrow, before connecting leads.
The supporting pieces of the arrows were constructed out of four 2"x2" plywood squares; two stacked
together each for the point and for the base of the arrow. The arrows were attached to the supporting
posts with one screw from the top of the arrow and the posts were attached to the bottom of the frame
with one screw from the bottom of the frame (through the plywood base). Before the pieces were
assembled, they were painted. We applied one coat of primer to all the pieces, and then painted the
arrow with two coats of granite colored paint and the supporting square posts with two coats of black
paint.
Adjusting wires. Note that each arrow has four connectors: power, ground, data from master, and data to
master.
The Xbox 360 controller board was mounted in the top left square, along with the master chip (in the
Arduino development board), and the master board (with hardware used to interface the master chip and
Xbox 360 controller board). The Xbox 360 controller was mounted on two 2"x2" pieces of 2x4 spray
painted black and screwed into the bottom of the frame. The master chip was screwed onto 1" metal
risers and hot glued to the bottom of the DDR frame. The master breakout was screwed onto 1/2" risers
and hot glued to the bottom of the frame. A Dell desktop computer power supply was mounted in the top
right square by screwing it to the right edge of the frame. All four slave boards were screwed onto 1/2"
risers and hot glued to the top of the arrow above the 1" circular hole. All wires throughout the pad were
either hot glued out of sight or hot glued to an edge and spray painted black.
The board with power supply, master board, Xbox 360 controller, and cables run inside.
Electronics design
Our electronics design consists of 4 slave boards located on each arrow and 1 master board that serves
as the interface between the slave boards, the computer, and the Xbox360 controller. These slave
boards each held an ATmega328P microcontroller, one 20 MHz resonator with internal capacitors, two
ULN2004A LED drivers and 15 LEDs. A main design was used for all the slave boards. The only
difference between the two slave board designs (one for blue LEDs, the other for red LEDs) are the
resistors used in series with the LEDs to prevent them from passing too much current.
We calculated that, to keep current through the LEDs below 20mA (well within their maximum rating), we
would need 50 Ohm resistors for the blue LEDs and 115 Ohm resistors for the red LEDs. Each slave
board used two blue LEDs for sensing: one LED is kept constantly lit while the other is polled to detect
light reflection.
Close-up of the slave board.
Each driver is connected to VCC and GND. The drivers act as an open circuit when the pin from the
ATmega328 is low. When said pin goes high, the circuit closes which allows the LED to turn on. Each
LED, excluding the sensing LED, has the cathode tied to VCC and the anode tied to the LED driver.
The blue slave board schematic shows the pin configurations on the ATmega328P microcontroller, their
connections to the LED drivers and the use of the 50 Ohm resistors.
The red slave board schematic shows the pin configurations on the ATmega328P microcontroller, their connections
to the LED drivers and the use of the 115 Ohm resistors.
The master board schematic shows the pin configurations on the ATmega328P microcontroller, their
communications to the slaves, and the communication to the Xbox 360 controller. Ordinarily, when a user presses a
button on the Xbox 360 controller, it completes a circuit through a resistive pad. To emulate this, we have a
transistor tied to the two sides of the resistive pads and connect the base to a pin from the master board. We use
potentiometers as voltage dividers to adjust the voltage applied at the base to prevent over-volting the Xbox 360
controller. We raise a pin on the master board, the transistor activates, current flows, and the Xbox 360 controller
thinks the button is pressed.
In parallel with this circuit we have an LED and current-limiting resistor to indicate to the user that their foot step is
being registered. The whole Xbox 360 button control circuit is seen four times on our master board to accommodate
the four arrows: up, down, right, and left.
Software design
The software portion of this project consists of three important parts:
1. "Master" microcontroller code
2. "Slave" microcontroller code
3. Computer game code
Master microcontroller code
Part 1 is implemented as a simple infinite loop that samples the four input lines from the four slaves and
writes matching values to the output lines to the Xbox360 controller. It also listens for serial commands when it receives a '1' character, it will raise the four output lines to the four slaves to a digital HIGH, and
it will drop them upon receipt of a '0'. The slaves, in turn, will see the line go high and illuminate the
LEDs in the arrow.
Slave microcontroller code
Part 2 is a bit more complex. Each slave arrow has fourteen separately controllable LEDs on 14 different
pins, and one LED that acts as a sensor taking up two pins.
Sensing involves applying a reverse bias to the sensing LED to charge the LED's internal capacitance,
then swapping the anode to input mode and disables the internal pull-up resistor. It then counts the
number of times it goes through a busy loop before the anode reads a logical low again. The more light
the LED receives, the lower the number of loops.
To ensure that the arrow responds promptly to changes in the line from the master indicating light data,
each iteration through the busy loop checks the status of that pin, and updates the LEDs accordingly.
We experimented with different loop counts to determine a threshold below which our code assumes
that a foot is present and above which our code assumes the foot is absent. We thus go through the
busy loop this threshold number of times - if the anode input hasn't gone low yet, we assume no foot is
present and begin a new sample. This allows us to keep a fast sample rate, which is important for a
rhythm game like DDR.
Our current code will illuminate all the LEDs in the arrow when:
The last sample of the sensing LED went low before threshold iterations (the button is "pressed"),
or
The master has indicated that the lights should be illuminated by raising that pin to logical high.
Since each LED can be individually controlled, it is possible to reprogram the atmega328 chips to display
various patterns. Getting consistent timings may be difficult, due to the nature of the variable speed at
which the slaves sample (since reading a low value takes less time than reading a high one). Such code
has not been implemented yet, but would require no hardware changes to effect.
Computer game code
Part 3 consists of modifications to an existing open-source program, Stepmania to send light data to our
master microcontroller. Stepmania already has support for the use of standard joysticks as input
devices, so by using a USB HID device (like the Xbox360 controller), we did not have to modify any code
to make Stepmania recognize our pad as an input device.
Our software development was done on Linux machines, but we also wanted to ensure that the DDR
pad could be used on Windows machines as well. To this end, we implemented the serial-port lighting
control for both platforms.
Since we were unsure what COM port the master board would appear as to the Windows systems, we
also added a command line switch to allow the user to specify which serial port Stepmania should use to
send lighting data. Invoking Stepmania in the following manner will tell the program to send light data to
COM10:
Stepmania.exe --serialport=COM10
If Stepmania is unable to open the appropriate serial port, it will simply continue running
without sending any lighting data to the pad, and will continue to function with the pad as an
input device.
As Stepmania is open-source and distributed under the GNU GPL, we have provided our
patches which apply against SVN r28063 at the time of writing (2009-05-10).
Challenges faced
While the project has come together into a successful product, we experienced many trials along the way.
This document serves to exhibit some of our failures, how we discovered them, and how we resolved
them.
Challenge 1
We discovered that the use of the LEDs as sensors was sensitive to wire length. Specifically, we found
that six inches of 22-gauge wire had substantially more capacitance to charge and discharge than that of
the LEDs that we were trying to measure. The signal-to-noise ratio, as measured with a multimeter, was
about 1 to 20. This meant that we would have to use minimal wire to connect the sensing LEDs, which
meant we'd need a microcontroller next to each sensing LED. As such, it became infeasible to have as
many sensing LEDs as we had originally hoped for.
While we toyed with the thought of calibrating the wire lengths to serve as antennae, which would
change capacitance depending on proximity of the player, the difficulty in implementing such an idea put
it out of our consideration.
Challenge 2
We were trying to create a breakout board on which to test the ATmega328 chips, as we would need them for the
slave boards (which we needed because any significant length of wire would ruin our light readings). We failed to
realize that the RESET pin (Pin 1) needed to be tied to 5V for the chip to operate properly. As we left that pin
floating, the chip remained in a constant state of RESET, which meant that it wasn't doing much useful.
We also failed to attach a crystal resonator to the ATmega328s, which, compounded with the aforementioned
problem, rendered the breakout boards fully nonresponsive. After rereading the spec sheets and reviewing the
Arduino board schematics, we discovered our errors, and purchased the appropriate crystals and connected Pin 1
to power. Now our breakout boards would execute code and blink an LED.
Challenge 3
We tried communicating with the slave boards via a serial port with a RS232 shifter to convert TTL voltage levels to
those of RS232. Unfortunately, the data came through all wrong - an ASCII 01000001 came out as an ASCII
11000001. It seemed that the timings were close, but slightly off. We realized that this was because the software
was written to have delays for a 16MHz crystal, rather than the 20MHz crystals we were using. A one-line change in
the Arduino dev environment configuration, and we had correctly-functioning serial communications again.
Challenge 4
We purchased a sheet of 1/8th inch polycarbonate. It proved too flimsy, so we wound up having to purchase a 1/4th
inch sheet, to the tune of 125 dollars. Now it's pretty, although polycarbonate scratches rather easily. We decided
that players will be required to wear socks while using our pad, to ensure its longevity.
Challenge 5
After we finished constructing all of the slave boards, we connected them all to the master, and tried to power them
up. Some of the boards wouldn't power on. It turns out that we were trying to pull too much current from the power
supply on the same pins, and the PSU couldn't handle that kind of sudden draw. By rewiring our connections to use
different PSU pins, we got all the slaves to power on properly together.
Challenge 6
Once we had all the arrows connected, and the slaves communicating successfully with the master, we discovered
that our slaves sampled somewhat slowly and would not blink in time together. This was because a single sample of
the sensing LEDs could take a quarter of a second if said LED had little illumination. Since the software only
updated the arrow LEDs in between samples, if the state of the master TX pin changed during a sample, that slave
would not update the lights until the end of that sample. This was resolved by having the slaves update the lights
constantly during the sample-collection period, and adjusting the threshold constants accordingly. This resolved our
last major problem.
Future Improvements
While this project did successfully implement our original goals, there is still room for further development.
One improvement would be to fabricate Printed Circuit Boards (PCBs) in the shape of arrows to allow for more
uniform design and additional sensing LEDs within each square. This would provide greater reliability and a larger
sensing surface.
A second improvement we might make would be to improve the light sensing routine to take more uniform time, or
perhaps even drop the LED-as-sensor idea and just use a normal photo resistor or phototransistor for greater
sensitivity. We could also make improvements to the algorithm to make the sensor more robust to different light
conditions and footwear.
Another improvement would be to have more complicated blinking patterns of the arrow LEDS. They could show a
circular pattern, a pattern that turned the LEDs on in order from the base to the head of the arrow (traffic diversion),
and other random patterns.
High-Speed Communication API for FPGAs
Ken Eguro
Microsoft Research
Abstract
FPGAs can be used to speed up many applications by several orders of magnitude. Most of these computations
require both a software and hardware component. Unfortunately, setting up the communication between software
running on a host PC and a FPGA-based accelerator can often present a problem. Not only does this interface
generally require laborious, custom low-level software and hardware development, it is often a critical performance
bottleneck for the system as a whole. The lack of high-level support for fast and reliable communication
discourages programmers from using FPGAs for their applications. This project simplifies the process of building
FPGA-based hardware accelerators by providing a simple and high-performance software/hardware API
infrastructure.
1
Introduction
FPGAs can exploit massive parallelism to accelerate a wide range of different applications. However, despite
many successful academic and industrial research projects, FPGAs have not really gained widespread popularity.
Part of the reason for this is that FPGAs are notoriously difficult to use. This project focuses on solving one aspect
of this accessibility problem: simplifying the communication between software running on a host PC and accelerator
hardware mapped to an FPGA.
The communication between software and hardware is an important consideration for most potential FPGA
applications. This is because real-world computations are generally built with multiple phases of execution, each
with their own characteristics. For example, while the central loop of an application may be computationallyintensive and naturally parallel, the data setup/teardown before and after the central loop may be control-heavy and
highly sequential. In this case, it makes sense to map the computational core of the application to an FPGA in order
to make best use of the available resources. The rest of the application is likely better suited to run on a
conventional desktop processor. This intrinsic division of labor makes the communication scheme between the host
PC and FPGA extremely important to the fundamental operation of the system.
Although it is a necessary part of most FPGA-accelerated applications, the existing level of support for
building a software/hardware communication interface is relatively poor. Today’s systems present two major issues
for application developers. First, they require programming and debugging at a very low level of abstraction. This
is problematic because relatively few developers have sufficient knowledge and experience to implement fast and
reliable communication. Second, each application that a user would like to map to an FPGA requires device and/or
protocol-specific code. This makes development time-consuming and prevents user code from being portable across
different FPGAs or communication technologies.
The work presented here provides programmers with a simple and reusable communication API. Since the
interface contains only a small set of easy-to-understand communication commands, it allows programmers to focus
their development effort on their own software and hardware kernels. Furthermore, since it completely abstracts
away any device or communication protocol-specific details, all of the user’s code is completely portable to any
system that supports the API. The standard is completely open, allowing the community at large to support new
devices and communication protocols looking into the future.
2
Software and Hardware APIs
As seen in Figures 1 and 2, the software API to the user’s C++ code consists of an object class that provides a
small set of control and communication functions for the FPGA. These functions allow the user to configure the
FPGA, send/receive data to or from the device, and control/monitor the execution of their hardware logic circuit.
The software API translates these requests into protocol-specific commands, taking care of any necessary
negotiation, packetization, or error checking needed to use the desired data transport medium. As seen in Figures 1
and 3, the hardware API to the user’s Verilog-based circuit consists of a set of I/O memories and control signals.
The user is able to send and receive bulk data through separate input and output buffers while smaller control or
Software API
User Verilog
Configure/
SendRead/
SendWrite
I/O Mems &
Param Regs
ProtocolSpecific
Driver
Physical
Connection
ProtocolSpecific
Controller
FPGA Accelerator
User C++
Hardware API
Host PC
status values are exchanged through a set of parameter registers. The API negotiates execution of the user’s logic
through a simple “start” and “done” signaling system. Similar to what is implemented internally within the software
API, a controller instantiated within the hardware API’s logic handles all of the protocol-specific transaction details.
Looking at Figure 2 in greater detail, the user’s C++ code has access to the FPGA through a set of nine
functions. The first function is simply the API class constructor. When a user would like to map some part of their
computation to an FPGA, they simply create an instance of the API class object for a given communication protocol.
A simplified example of using the API for spam filtering is shown in Figure 4. Our system currently supports
communication over gigabit Ethernet and will soon be extended to include PCI-Express. The user can then
configure the FPGA by calling the object’s configure function. Our system currently pulls FPGA configuration
bitstreams from a CompactFlash card attached to the supported FPGA development board (the Digilent XUP-V5).
The Xilinx SystemACE chip that controls this configuration is capable of sending one of eight bitstreams to the
FPGA. In the future, we plan to overload the configure function to allow the user to directly send the bitstream
binary to the system. After the FPGA has been configured, the user can then send input data for their circuit using
the sendWrite and sendParamRegWrite functions. The user notifies their accelerator logic that it can execute with
the sendRun function. The existing system is build around the concept of batched processing, so when the user
subsequently calls the waitDone function, it will spin until the user’s circuit has indicated that it has completed
execution. At this point, the results can be retrieved from the FPGA with the sendRead and sendParamRegRead
functions. The throughput of the system for some kinds of computations may be improved by using stream-based
transactions, so this will be included in future work. The final function of the API class is abort. As will be
described later, after the user executes the sendRun function but before the user’s circuit has indicated that is it done
computing, control of the I/O buffers and parameter registers is transferred to the user’s circuit.
Figure 2: Software and hardware API architecture
class fpgaAPI{
public:
fpgaAPI(protocolType type);
module fpgaAPI(
input userClk;
output reset;
//Command to configure FPGA from SystemACE
bool configure(int configNum);
input inputMemReadAdd;
//Input memory
output inputMemReadData;
input outputMemWriteAdd; //Output Memory
input outputMemWriteData;
input outputMemWriteEn;
input regAddress;
//Parameter Registers
output regReadData;
input regWriteData;
input regWriteEn
//Memory I/O commands
bool sendRead(int startAddress, int length, byte* outputBuffer);
bool sendWrite(int startAddress, int length, byte* inputBuffer);
//Parameter Register I/O commands
bool sendParamRegRead(int regNumber, int *value);
bool sendParamRegWrite(int regNumber, int value);
//Execution commands
bool sendRun();
bool waitDone():
bool abort();
//Clock for user I/O
//Reset for user logic
output runSignal;
input resetRunSignal;
//Should user logic run?
//Computation complete?
);
}
Figure 3: User-accessible software API functions
Figure 4: Hardware API module interface
processMessage(message *input, buffer *results){
fpgaAPI *apiP = new (fpgaAPI(gigaEth));
apiP->configure(0);
//Create API object of type gigabit ethernet
//Configure FPGA with config #0 from SystemACE
//Setup computation
apiP->sendWrite(0, input->length, input->data);
apiP->sendParamRegWrite(0, input->length);
//Send input data to FPGA
//Set register #0 to indicate message length
//Process message
apiP->sendRun();
apiP->waitDone();
//Activate user circuit
//Wait until user circuit resets run signal
//Retrieve results
apiP->sendParamRegRead(1, &(results->length));
apiP->sendRead(0, results->length, results->data);
//Read register #1 that indicates result length
//Read back output data
}
Figure 5: Pseudo-code for host PC process of e-mail spam filtering
module processMessage(input clock);
wire reset, [7:0] inputByte, [31:0] regRData, runSignal;
reg [31:0] inputAdd, [31:0] outputAdd, [7:0] outputByte, outputWE;
reg [7:0] regAdd, [31:0] regWData, regWE, computationDone;
fpgaAPI api(clock, reset, inputAdd, inputByte, outputAdd,
//Create instance of API I/O buffers, parameter registers and controller
outputByte, outputWE, regAdd, regRData,
regWData, regWE, runSignal, computationDone);
initial begin
currState = IDLE;
end
always @(posedge clock) begin
if(reset) begin
currState <= IDLE;
//Begin so that the API controller has control of the I/O buffers
end
case(currState)
IDLE: begin
if(runSignal)
//Wait until the run signal goes high
currState <= RUNNING;
//Control over the I/O buffers has been given to the user circuit, start
execution
end
RUNNING: begin
if(computationDone)
//Wait until the user logic is done
currState <=IDLE;
//The API controller regains control of the I/O buffers
else if(!runSignal)
//Stop execution if the host PC aborts
currState <=IDLE;
else begin
user logic reads message length from param reg #0, reads N bytes of message data from the input buffer, processes the message,
writes M bytes of result data into output buffer, puts M into param reg #1 and raises the computationDone flag
end
end
endcase
end
endmodule
Figure 6: Pseudo-code for user logic for e-mail spam filtering
If something goes wrong, or if the user simply wants to cancel execution, they can call the abort function to
regain control of the I/O buffers and registers.
Looking at Figure 3 in more detail, the user’s accelerator Verilog code connects to the hardware API I/O
buffers, parameter registers and control mechanism via 13 signals. The first signal is a user domain clock. The user
presents a clock to the API controller that synchronizes communication between the user’s logic and the API I/O
buffers and parameter registers. This is independent of the clock used internally within the controller for the actual
physical communication interface to the host PC. The second signal is a system reset. When the system is initially
powered on (and potentially other times during operation), the entire system is reset before beginning normal
operation. If the user’s circuit requires this kind of reset signal, it can pull it from the hardware API. The next two
signals connect the user’s circuit to the input memory buffer. The API supports up to 32-bit byte-wise addressing
(4GB), although the input buffer in the current implementation of only contains 256KB (18-bit byte-wise
addressing). The next three signals are used to connect the user’s circuit to the output memory buffer. Similar to the
input buffer, the API supports up to 4GB of byte-addressable output memory, although the current implementation
only contains 8KB (13-bit byte-wise addressing). The next four signals can be used to read and write 255 32-bit
parameter registers. The last two signals for the API are used to negotiate execution of the user logic and write
control over the I/O buffers and parameter registers.
Figure 5 shows simplified Verilog pseudo-code of how a user might integrate the communication API with
their own logic. Essentially, the user’s circuit should wait until runSignal goes high. While runSignal is low, the
input buffer and parameter registers will only accept write commands from the host PC. During this time, any
writes attempted by the user logic to the output buffer or parameter registers will be ignored. When the user’s
software application calls sendRun, the API controller will raise runSignal. After runSignal goes high, the user’s
logic can process the input data and write to the output buffer and parameter registers. While runSignal is high, any
attempts by the user’s software to write to the input buffer or parameter registers will fail. When the user’s circuit
has completed computation, it will raise resetRunSignal. In response, the API controller will lower runSignal and
return write control of the input buffer and parameter registers to the host PC. The user’s circuit should also monitor
runSignal while it is running in case the user’s software application aborts execution.
3
Lessons Learned
Taking a step back for a moment, the overall goal of this project is to make FPGAs more accessible. We
believe that this will encourage more developers to incorporate hardware accelerators into their applications. As
part of this, we hope that this API will be extended to run on more FPGA platforms and with more communication
protocols. Towards this end, we would like to share some of the lessons learned during the development of our
initial prototype communicating over gigabit Ethernet. Perhaps most importantly, while it is relatively
straightforward for the hardware on the FPGA to send or receive data at full bandwidth, the same cannot be said for
the host PC. 1 gigabit per second only corresponds to the FPGA producing or receiving one byte per clock cycle at
125 MHz. Designs mapped to modern FPGAs can run at four times that clock rate, so it is relatively easy to build
circuits to meet this timing requirement. On the other hand, even when transferring maximum sized Ethernet
frames, 1 gigabit per second requires the host PC to produce or receive over 80,000 packets per second.
We took four main steps to minimize the CPU load and maximize the performance. First, it is essential to use
I/O completion ports so that the OS can handle the transfers asynchronously. Without I/O completion ports, the API
simply spends too much time spinning. Second, interrupts should be moderated to reduce the number of system
disruptions. Rather than raising an interrupt each time a packet is received, the system can bundle multiple requests
together. On the other hand, the high throughput that asynchronous I/O and moderated interrupts provide makes
buffering very important. Through empirical testing, we found that it was necessary to provide space for at least 500
incoming packets to avoid dropping packets when running near 1 Gbps. This additional space is necessary to handle
messages that arrive while the API code is context switched out and not actively running. Lastly, it is essential to
find drivers that are very efficient. Our current implementation for gigabit Ethernet piggybacks on the Virtual
Machine Network Services driver built into Virtual PC.
4
Conclusions
FPGAs are capable of exploiting such massive parallelism that they can be a disruptive technology. One
FPGA board may be capable of replacing racks and racks of conventional processor-based machines. Not only
might an FPGA-based implementation be faster than hundreds of processors, such a system would be easier to
maintain and only require a small fraction of the power. However, for FPGAs to be used in real-world deployable
systems, they need to be more accessible to a wider range of programmers. Our hope is that with better interfacing
and circuit development support, FPGA-based systems will open a new world of possibilities.
FPGA-Accelerated Processing of Network Data
Rene Mueller
ETH Zurich
Ken Eguro, Paul Larson
Microsoft Research
Abstract
Emerging large scale multicore architectures provide
abundant resources for parallel computation. In
practice, however, the speedup gained by
parallelization is limited by the fraction of code that
inherently needs to be executed sequentially (Amdahl’s
Law). Commonly encountered examples are I/O
operations such as network or disk access and object
serialization, e.g., marshalling of arguments in a
remote procedure call. In this work, we study
acceleration by offloading sequential processing to a
custom hardware circuit in an FPGA. The FPGA is
placed in the data path, i.e., between the network
interface and the CPU.
As a use case we investigate acceleration of the
processing of network data that is exchanged between
the Microsoft SQL Server and its clients. Frequently
occurring requests are offloaded to a hardware
accelerator whereas all other requests bypass the
accelerator and are handled in the existing software
stack. The object structures resulting from parsing and
unmarshalling of the requests are copied into the
memory space of the host system where they can be
directly accessed by the DB engine.
The research problem is the automatic generation of
hardware logic for a given network or serialization
protocol. This involves defining a specification
language for the protocol itself and a language for the
semantic actions. The latter is used to define how the
memory objects have to be created.
performance improvements by just using multicore
architectures alone is difficult [1]. In particular, the
speedup that can be obtained is limited by the
inherently sequential fraction of a program. This is
known as Amdahl’s Law [2]. Its statement is the
following: If by optimization (multicore, etc.) the
parallel fraction f of a program experiences a speedup
of S the speedup of the overall program is
Speedup =
Introduction
Indubitably, future system architectures will consist of
multiple cores. The big problem that needs to be
addressed by the software community is how to
efficiently make use of the additional resources. Most
research is focused on parallel algorithms and cacheconscious implementations. However, obtaining large
1−� +
�
.
�
Clearly, for � → ∞ the speedup is bound to 1 1 − � by
the sequential fraction 1 − �.
In practice, the sequential fraction of a program
involves I/O operations such as disk and network
access, i.e., serialization of data. In this work, one
particular type of serialization and deserialization is
considered; object marshalling and unmarshalling in
Remote Procedure Calls (RPCs). RPCs play an
important role in modern distributed and networked
systems. To that extend, minimizing overhead and
communication cost is crucial. Our approach uses an
FPGA that is placed in the data path between the
network interface and the host interface as illustrated in
Figure 1. The sequential protocol handling is offloaded
to an FPGA, hence, reducing the work to be performed
by the CPU cores.
FPGA
Network
1
1
RAM
CPU
Core
CPU
Core
Figure 1: FPGA in data path between network and CPU
Multiple protocol handling engines can be
instantiated on the FPGA allowing concurrent
processing of multiple user requests. The unmarshalled
objects are written back into main memory via DMA
transfers. The data structures then can be directly used
by the parallelized application program.
2.2 Hardware Acceleration using FPGAs
By inserting an FPGA into the data path the parsing of
the data can be offloaded. A possible system
architecture is depicted in Figure 3.
In the next sections we illustrate the use case our
work is based upon. In Section 3 We describe our
approach to automatically generate hardware circuits
out of protocol specifications before we conclude in
Section Error! Reference source not found.
DB Client
App
DB Engine
pinned page(s)
SQL Driver
2 Use Case: TDS Processing in Microsoft
SQL Server
Transport
Message content
directly copied into
main-memory data
structures (DMA)
Like other database management systems Microsoft
SQL Server uses a complex network protocol for
communicating with the database clients. In SQL
Server the Tabular Data Stream (TDS) protocol is used.
One particularly important subset of the protocol is
messages for RPCs. These RPC occur when a database
client invokes a Stored Procedure on the server through
an ODBC connection.
e.g., PCIe
FPGA
TCP/IP
Client Socket
NIC (TCP/IP
offloading)
Client
Server
may be
located on
same chip
Figure 3: Architecture with protocol offloading to an
FPGA
For example, consider the following invocation of a
Stored Procedure with an integer and a string argument:
The low-level Ethernet protocol and the TCP/IP stack
are also handled on the FPGA. The Ethernet MAC
hard-IP cores found on modern FPGAs such as the
Virtex-5 can be directly used. For the TCP/IP stack,
existing soft-IP cores can be added to the design. The
protocol engine is implemented as a custom circuit on
the FPGA. In that circuit the serial data stream is parsed
and the corresponding data structures are created in the
on-chip memory (BRAM). The FPGA itself is placed
on a PCI Express board that is inserted into a traditional
database server system. The created data structures are
then copied using DMA transfers to the main memory
of the host system and the database application notified
EXEC Broker_Volume
@topk = 2,
@sector_name = 'Zurich'
This call is translated in a binary TDS message by the
client-side driver. The message (Figure 2) is then sent to
the server. The message contains next to the arguments
also the name of the invoked procedure as a variable
length string.
2.1 Traditional Software Approach
procedure name
0000
0010
0020
0030
0040
0050
0060
03
02
42
6F
04
00
20
01
00
00
00
02
5A
20
00
00
72
6C
00
75
20
6F
00
00
00
00
72
20
00
00
6F
75
00
69
20
00
00
00
00
00
63
20
01
00
6B
6D
00
68
20
00
00
00
00
AF
20
20
16
00
65
65
1E
20
20
00
00
00
00
00
20
20
00
01
72
00
09
20
20
00
00
00
00
04
20
20
first argument
2nd argument
@sector_name
12
00
5F
00
D0
20
20
00
00
00
00
00
20
20
00
0D
56
26
34
20
20
00
00
00
04
1E
20
.
.
B
o
.
.
.
.
.
.
.
Z
.
.
r
l
.
u
o
.
.
.
.
r
.
.
o
u
.
i
.
.
.
.
.
c
.
.
k
m
.
h
.
.
.
.
.
.
.
e
e
.
.
.
.
.
.
.
.
r
.
.
.
.
.
.
.
.
.
_
.
.
.
.
.
.
.
.
.
V
&
4
.
.
.
.
.
@topk (Int32)
(CHAR(30))
Figure 2: Structure of a TDS message sent to the SQL Server
Parsing this message is not difficult but, nevertheless,
consumes CPU resources that could otherwise be spent
for the actual query processing.
that new data has arrived. The communication through
PCI Express also requires a driver stack on the server
side, in particular, memory pages must be pinned in
order to allow DMA transfers. This results in changes
of the memory system of the database server
application.
2.3 Implementation of the Protocol Engine in
FPGA Hardware
The protocol hardware is essentially a pattern matching
engine of known RPC message types. For example, the
FPGA can implement a simple finite state automaton
for each RPC request it can handle. Because the
message format for the underlying RPC data is small
these state automatons are very compact. For efficiency
and space reasons it makes sense not to handle every
possible RPC request type on the FPGA. Instead, only
the most frequent or most time-consuming requests (for
parsing), i.e., the heavy hitters, are offloaded to the
FPGA. The handled RPCs depends on the given query
load. All remaining RPC requests bypass the offloading
engine and are forwarded to the server where they are
processed in the conventional software stack of SQL
Server. This allows us to trade complexity/chip area vs.
functionality.
The protocol handling engine that is able to process a
certain set of calls can further be replicated on the chip
multiple times. This allows processing multiple requests
sent by several users concurrently.
3 Automatic Generation of Circuits out of
Protocol Specifications
The research goal is to automate the generation of
hardware circuits from given protocol specifications.
Similar to existing generator tools for parsers (Yacc) or
lexical scanners (Lex) a tool is currently being
developed that takes a protocol specification that is
annotated with semantic actions as input and produces
either VHDL or Verilog code. The key problems to be
addressed are the expressiveness of the protocol
language and the language of the semantic actions.
3.1 Overview
The generator tool generates a hardware circuit that
accelerates the protocol handling while at the same time
can be easily integrated into the existing software stack.
In this work, we study the integration into C/C++
software, i.e., data structures must be aligned in the
FPGA such that they remain compatible with the layout
of C types and C++ classes. The tool workflow is
illustrated below:
The user specifies the network protocol and the
corresponding semantic actions that are necessary to
create the object structures that are later used in
software. Thus, in order to generate the memory layout
of the objects the tool needs to know about the high-
level language type information, i.e., the C++ classes,
which are also have to provided to the tool (Figure 4).
Protocol Specification
with Semantic Actions
HDL Code
(Verilog/VHDL)
Generator
Tool
FPGA
Synthesis
Type
Information
FPGA Circuit
Figure 4: FPGA Circuit created out of a protocol
specification and type information.
3.2 Protocol Specification Language
In simple scenarios, protocol specifications can
correspond to regular languages. Hence, the
specification then represents a regular expression. The
implementation on the FPGA therefore is a finite state
automaton. More complex protocols that use some form
of nesting are no longer regular; they are context-free
languages. In this case, the language can be expressed
in Backus-Naur Form (BNF) and can be recognized by
a parser on the FPGA.
The expression language for the semantic actions has
to be expressive enough to create and manipulate data
structures in memory. For space reasons on the chip and
efficient implementation it should not be over
expressive. For example, there is no need to support
recursion or iteration.
3.3 Example
Although the work on both the specification and action
languages is still ongoing we provide an example of a
protocol specification for illustration purposes below.
The example shows the pattern expression and the
corresponding action code in {: :} for the
Broker_Value procedure shown in Section Error!
Reference source not found.. The language is similar
on the Yacc grammar. RPCCall and Int32Argument
and StringArgument correspond to C++ types. The
Yacc language is extended, for example, to capture
variable length arrays such as strings (the second
argument).
From this specification and the C++ type information
the generator tool produces a hardware circuit in
VHDL/Verilog that generates the data structures from a
data stream received over the network. This memory
block is then copied to the host memory via DMA
where it can be accessed by the CPU.
Int32Argument and StringArgument have virtual
methods and, hence, have pointers to their vtables.
arg[1]
BrokerVolumeRPC ::= BVHeader
ArgTopK:a1 BVBlock ArgName:a2
{: %% = new RPCCall(2);
%%.id
= 7;
%%.arg[0]
= a1;
%%.arg[1]
= a2;
:}
id
a1 a2
arg[0]
vtptr value
vtable
Int32Argument
ArgTopK ::= INT32:i
{:
%% = new Int32Argument();
%%.value = i;
:}
ArgName ::= UINT16:len STRING(len):str
{:
%% = new StringArgument();
%%.length = len;
%%.string = str;
:}
// Terminals
terminal signed (0 to 31) INT32;
terminal unsigned (0 to 31) UINT32;
terminal STRING(len) (0 to len) CHAR;
3.4 Hardware-Software Interface
The object layout on the FPGA has to be chosen such
that it is consistent once copied to the host memory. For
efficiency reasons no pointer relocation performed by
the host CPU after that memory block is copied into the
host memory. Thus, all object pointers must be setup
correctly in the FPGA. Furthermore, for object
allocation and deallocation the heap management on the
host system and the FPGA need to be kept
synchronized. Additional dynamic information is
needed on the FPGA for heap management and setup of
pointers to vtables (virtual method table). This
information is provided by the CPU and updated when
necessarily through mapped registers.
3.5 Example continued
Figure 5 shows the object diagram of the resulting object
structure. In the following it is assumed that the classes
vtptr
length
string
vtable
StringArgument
Figure 5: Object structure created on FPGA and copied to
host memory
The object structure is aligned using conventional
alignment rules. The FPGA circuit uses a set of
registers that hold the pointer location of the vtables and
the destination address of the top-level object in on the
heap in the host memory. This information is used by
the FPGA circuit to properly align the object structures.
4
BVHeader ::= 0x03 0x01 0x6F ...
BVBlock ::= 0x00 0x00 0xAF ...
Object Memory (to be copied)
Conclusions
The proposed solution is a non-invasive attempt to
offload the processing of I/O for marshalling and object
serialization to a custom FPGA circuit. A generator
automatically produces the digital logic out of a
protocol specification and the corresponding high-level
language types.
The generator tool is currently being implemented.
Both languages for protocol and semantic actions are
currently investigated for the necessary expressiveness
that they can be used in real application scenario such
as in the Microsoft SQL Server. Next, the resulting
circuits have to be evaluated and the resulting speedup
compared to a traditional CPU-based implementation
measured.
5
Bibliography
1 Laurus, James. Spending Moore's Dividend.
Communications of the ACM 5(52) (2009), 62-69.
2 Amdahl, Gene. Validity of the Single Processor
Approach to Achieving Large-Scale Computing
Capabilities. In AFIPS Conference Proceedings (
1967), 483-485.
A Full-System Concolic Simulator for Real-time System
Weiqin Ma
Texas A&M University
Alessandro Forin
Microsoft Research
Abstract
In this project, we add an x86 CPU model to the Giano full-system simulator which can run x86 programs in real-time.
We also develop a concolic execution engine to Giano, to run programs in a mixed symbolic-concrete (concolic) way.
Based on the concolic engine, we perform a case study to detect data races in a multi-threaded program.
1
Introduction
The first goal of this project is the real-time simulation of the x86 instruction set. The second goal is to create a concolic
module for a full-system simulator such as Giano, to execute applications in a symbolic way but guided by concrete
inputs.
Most existing monitoring techniques are based on direct execution or dynamic binary translation for pursue of
performance. Such techniques create inconsistencies in the timing behavior of the program. The Giano simulator
provides a facility to monitor and adjust the program execution speed, to maintain real-time consistency. As a result, the
executable of the software program does not change its temporal behavior.
Existing monitors also lack a systematic testing approach and tools to verify the correctness of the simulator. We test the
correctness of our CPU model by comparing the execution results of a large, comprehensive test set on an “oracle”
machine against the results on the simulator. We especially test the boundary values for the instructions. We plan to
automatically generate the tests using formal instruction specifications.
The first step is to add the x86 CPU model to Giano, as indicated on the left side of Figure 1. The second step is to build
a concolic engine to run the program in symbolic manner. The resulting simulator architecture is shown in Figure 1.
Figure 1: Simulator Architecture
The concolic engine includes symbolic execution using symbolic values and concrete execution using concrete values.
The concolic engine needs to generate path conditions and to solve the path conditions using a constraint solver, as
shown in Figure 2. We generate the path conditions based on the control flow graph (CFG), and use the data flow graph
(DFG) to prune the irrelevant paths and reduce the path explosion.
We use Z3 as the constraint solver to check the satisfiability of the path conditions and to generate the set of input values.
The input values are then used to concretely execute the program.
In the case study, we trace and analyze a multi-threaded application in a real-time embedded system. Our motivation for
operating at the binary level rather than at the source level is that programs are actually changed by the compiler
optimizations and by the out of order instruction execution in modern architectures. Our goal is to trace the binary
program at the machine instruction level and find the concurrency errors using concolic execution.
Data races are one type of Heisenbugs which include data race (race condition), live lock, dead lock etc. If two memory
accesses conflict concurrently, we have a data race, as show in Figure 3. A data race has three conditions: the instructions
must target the same location (a shared variable), the instructions must not be both reads, and the instructions cannot both
be synchronization operations.
We use the following scheme for data race detection. First, we can infer the sequential program execution of a concurrent
program by getting the [Min,Max] program sequence. We can eliminate equivalence traced statements using existing
reduction theories. Finally, we can compare the interleaved concurrent program execution with the sequential program
execution to find whether the result is the same.
Even though we operate at the machine level we do not need to lose track of all software abstractions. We can use
instruction introspection to recover those abstractions that are still relevant. For instance, on the x86 we use the value of
the CR3 register to identify different processes. We use the ESP register to identify different stacks which means
different threads. Also we record and trace the shared variables by tracing the memory locations which are accessed by
more than one thread.
2
Figure 2: Architecture of Concolic Engine
Project Demo
Figure 3: Data Race Example
We demonstrate (a) that the Giano simulator is real-time and (b) how we test the instruction set against the oracle
machine. We use the Doom video game, with audio, to show the real-time property. This demo currently uses an existing
ARM CPU model; we will eventually replace it with our x86 CPU model. We also show the xml-based configuration of
the Giano simulator. This makes it easy for users to compose a simulator using existing different modules.
We show the testing of the simulator using tests generated by an oracle. As show in Figure 4, the test engine (on the
oracle) sends different instructions with boundary values to the Giano simulator. The simulator runs the instruction and
shows any discrepancy between the results on the oracle test machine and the simulator, as shown in Figure 5.
Figure 4: The test engine sends different tests, one per line in the log above
Figure 5: The results of the test are compared by the simulator against the oracle’s results
Specification Mining in Real-time Embedded Systems
Wenchao Li
University of California at Berkeley
Alessandro Forin
Microsoft Research
Abstract
Software and hardware systems are often built without detailed documentations. The correctness of these
systems can only be verified as well as the specifications are written. The lack of sufficient specifications often leads to
misses of critical bugs, design re-spins, and time-to-market slips. In this project, we address this problem by mining
specification dynamically from simulation traces. Building on algorithms for pattern mining, we propose a novel
technique that mines specifications with timing constraints and we apply it to a number of practical cases. Timing
constraints are expressed as either inequalities or distributions. The technique applies to both time-labeled and
unlabeled traces. Specifications mined from unlabeled traces can be automatically synthesized using our PSL-to-Verilog
compiler to achieve zero-overhead runtime monitoring. Specifications mined from labeled traces can be used to pinpoint
sources of error. In this work, we focus on embedded software, digital circuits, and network protocols, but any ordered
trace of events is amenable to this analysis.
Introduction
We try to answer two common challenges in verification – “Did I miss any specification in my verification
process?” and “Where should I look in my error trace?” The first question is closely related to assertion coverage. In
assertion coverage, we check whether the verification test suite has exercised some specific functionalities of the design.
However, assertions are still supplied manually. As a result, engineers often face the question of when they can stop
writing assertions. We address this problem partially by dynamically mining recurring patterns from existing simulation
traces. These patterns can then be examined by the engineer to see whether they match the designer’s intent and check
with further verification. The intuition is that frequent patterns are likely to be true. Hence, in the context of mining for
verification, our tool takes a trace and optionally a user-defined event definition as input, and generates a set of
behavioral patterns that are almost always true in the trace as output. A trace is a sequence of events ordered by the time
of occurrences. Events in this case here can be the valuation of a set of signals in digital circuits, signatures of function
calls, or network packets. Given the trace, we match it to a library of parametric patterns. The matching algorithm will be
discussed in more details in the following section. Once the parametric patterns are instantiated, we rank them according
to some relevance metrics that we found useful empirically. The second question that we are trying to answer is
essentially trace diagnosis. Given a normal trace and an error trace, the goal is to first understand what goes wrong in the
error trace and then locate the source of error. We use our specification mining algorithm as a subroutine and look for
difference patterns between two traces. These are patterns that exist in one trace but not in the other, or patterns that exist
in both traces but with different timing bounds. After the difference patterns are found, a localization procedure is
applied to pinpoint the potential source of error.
Specification Mining
Our main contribution is the inclusion of time in specification mining. Timing constraints are ubiquitous in an
embedded environment. For example, in an x-by-wire automotive system, every task has an associated deadline. A task
can be a signal for turning off the air conditioner, or a signal for breaking. Missing the deadlines for some of these tasks
can have catastrophic effects. On the other hand, specifications are sometimes written in a way to just ensure logical
correctness, but not timing correctness. For example, the Linear Temporal Logic (LTL) formula “always (request →
eventually grant)” says every “request” will be eventually followed by a “grant”. However, depending on the
environment in which the design is deployed, the latency for when the grant signal is received can vary, and some of
them may be unacceptable. This mirrors the fact that in software, it may be fine as long as every “lock” is followed by an
“unlock”. But the same alternating pattern is not sufficient for mining useful specifications in a (real-time) embedded
system.
We extend the algorithm for mining alternating patterns to include timing constraints on the events in these
patterns. An alternating pattern can be expressed as a parametric regular expression (ab)*, where a and b are parameters
that can be instantiated with actual events. Without timing constraints, this alternating pattern can be used to express
specification such as “every request is followed by a response.” With timing constraints (timing bounds for example), we
can express richer specifications, such as “every request is followed by a response within 3 cycles, and two requests are
separated by at least 5 cycles.” As a feasibility study, we implement the algorithm in the naïve way by maintaining a 2D
table for all possible instantiations along with counters for each combination and timing bound recorders. Each time a
new cycle is parsed, we iterate over the symbols and for each symbol s, the corresponding row (s,*) and column (*,s) are
checked to see if these patterns are still true. If they are true, the counts and time bounds are updated. The online aspect
of operating on cycles as they come in makes mining this simple pattern appealing. The algorithm has a runtime of
O(nkl) where n is the total number of events, k is the maximum number of events at a cycle, and l is the length of the
trace. We further address the scalability issue by clustering traces according to their modules so that we can reduce both
the storage requirement (significantly) and runtime. However, inference rules may be required to compose local patterns
to form end-to-end specifications. In addition, we allow various degrees of imperfectness in the trace. For example, the
pattern is always true except for the last occurrence. Currently, we mine only alternating patterns with timing bounds.
We plan to extend both the pattern library to include more complex patterns and the timing constraints to include richer
constraints.
Experimental Results
A prototype of the algorithm is written in Perl. We apply the algorithm for learning difference patterns to a full
MIPS core that has approximately 18000 signals (wires and registers). We trace only the control signals, which results in
approximately 1500 events. In the absence of event definitions, we treat the value of a signal as an event. For example, if
a signal A changes value from 0 to 1 at time t1, we record the event A with tag t1. If A changes from 1 to 0 at a later time
t2, we record the event ~A with tag t2. We obtain two simulation traces, a correct one from the current design ran for
about 6.7 million clock cycles, and an error trace from a previous version with a known design bug and ran for about
2.35 million clock cycles. The bug can be described as the signal TLB_ERR going low too soon for exception handling
to finish. The objective of the experiment is to evaluate whether the difference patterns mined are useful to localize the
error.
The tool mines about 200 difference patterns with half of them due to differences in timing bounds. We rank the
two sets of difference patterns separately. For the set containing patterns that exist in one trace but not the other, we rank
them by the number of occurrences. For the set containing patterns due to differences in timing bounds, we rank them
first by time of first divergence and then tie break by occurrences in the normal trace. The top candidates in the
“untimed” set capture the effects of bug, such as constant flagging of exceptions. The top candidates in the “timed” set
localize to signals that are very close to TLB_ERR. The tool does not find TLB_ERR directly because the original bug is
triggered by a specific combination of values across a few cycles. This is not considered in the current implementation
when event definition is not available. We are currently looking at the possibility of synthesizing such combination by
leveraging well-studied techniques in the domain of sequential pattern mining.
Ongoing work
We are currently developing algorithms that can efficiently mine all chain patterns (a +b+c+…)*. Our technique
may also be useful in anomaly detection. One possible direction is to combine machine learning techniques such as
Principal Component Analysis (PCA) and classification to detect abnormal behaviors. The patterns mined can be treated
as features and used as input to the aforementioned techniques. We are also applying our pattern mining tool to
embedded software, with the aid of the Giano simulator.
Custom Floating-Point Units
Zhanpeng Jin
University of Pittsburgh
Richard Neil Pittman
Microsoft Research
Abstract
Multimedia and communication algorithms in the embedded system domain often make extensive use of floating-point
arithmetic. Due to the complexity and expense of the floating-point hardware, final implementation of these algorithms
are usually carried out using floating-point emulation in software, or by conversion of the floating-point operations to
fixed point operations. This study presents the design and implementation of custom floating-point units making use of
the flexibility and reconfigurability of FPGAs. In the eMIPS architecture, such custom floating-point units can be
dynamically configured, loaded, and executed when needed by software applications. We investigate the optimization
strategies for area, power, speed, and duty cycle of the custom units. We show how to construct a set of functional
modules that are optimized on a per-application basis. According to the analysis of the program characteristics and
design specification, the system is able to dynamically configure its own customized “optimal” floating-point units.
1
Introduction
Most of the current available microprocessors are implemented using fixed instruction sets, no matter what
instruction set architecture (ISA) they use, either Reduced Instruction Set Computer (RISC) or Complex Instruction Set
Computer (CISC). When designing instruction sets, computer architects attempt to capture all the instructions which are
necessary to cover the largest domain of potential applications, while such way might bring remarkable overhead in size,
cost, and power. Despite all these efforts, the goal of implementing an “optimal fixed instruction set architecture” is
theoretically impossible because the design space of applications to which the designers apply general purpose
processors evolves constantly [1]. In this case, especially for the emerging embedded market, such an “optimal general
purpose” microprocessor is inefficient and underutilized when the majority of applications never use a large subset of the
capabilities it provides. Thus, a popular solution is to use custom microprocessors with reduced instruction sets but with
customized instructions added specifically for intended application space at some certain execution phases.
Our current work in the eMIPS project [1], addresses such reconfigurable computing challenges and depicts a
promising vision for future computation-efficient embedded systems by using “dynamically extensible processors”.
eMIPS offers the infrastructure to allow for the kind of flexibility and extensibility possible through the use of Field
Programmable Gate Arrays (FPGAs). The FPGA is partitioned into sections containing a standard fixed logic processor
core with interconnects to reconfigurable regions termed “Extensions” that contain customized instructions and
functionality that loads, modifies and enables while the fixed counterpart continues to execute without any interruption.
In this way, the dynamically extensible processor, using a set of Extensions from which it can draw, adapts to the
changing application needs in the field [1]. The Extensions are able to take the form of any optimized or even new
instructions developed to meet certain application needs of the market. For more details about eMIPS architecture, please
refer to [2].
Floating-point (F-P) arithmetic, although extremely common in the general purpose computing market, was rarely
used in embedded systems world until recently. A number of communication and multimedia algorithms are designed
and simulated using floating-point arithmetic, but the implementation platforms for such algorithms often leave out any
hardware floating-point unit in favor of software emulation or float to fixed point conversion [3]. A variety of research
efforts are considering field-programmable gate arrays (FPGAs) as a means to accelerate floating-point computations
using their well-proved flexibility and reconfigurability [4] [5] [6]. Thus, it seems appealing to deploy floating-point
arithmetic into the eMIPS architecture, as a mean to advance floating-point intensive applications in the embedded
market. For instance, one clear advantage is that the reconfigurable floating-point extensions can be loaded only if and
when software applications use them. The architectural diagram of the resulting design is shown in Figure 1. The basic
Figure 1: Reconfigurable Processor with Configurable Floating-Point Extension
Figure 2: Proposed Design Strategy for Custom Floating-Point Units
idea is simple, but the implementation of an efficient and correct FPU is an extremely difficult, involved and time
consuming task. In addition, mapping difficulties occur due to the inherent complexity of floating-point arithmetic [7].
The increasing demand of application-specific floating-point arithmetic data paths presents the challenge of
accommodating several floating-point functional modules in the limited resources available [8]. This makes
considerations for cost-effectiveness a priority. Many recent studies have explored opportunities to improve the floatingpoint performance on FPGAs by optimizing the device architecture. Beauchamp et al. [9] present three architectural
modifications that make floating-point operations more efficient on FPGAs, including an embedded floating-point
multiply-add units and variable length shifters. Chong et al. [10] propose multi-mode embedded FPUs implemented on a
single FPGA, configuring each unit to either perform a different task or to collectively build massively parallel circuits.
The main contributions of this project are three-folds, as depicted in Figure 2:
1) Implementation of IEEE-754 compliant, modular floating-point functional units (e.g., fadd, fsub, fmul, fdiv,
fsqrt, etc.) using the standard MIPS floating point ISA.
2) Investigation of the floating-point optimization strategies for area, power, speed, and duty cycle, and delivering
of a set of optimized implementation solutions.
3) Procedures for analyzing an application characteristics, identifying corresponding performance requirements,
and then dynamically constructing and configuring the optimal floating-point functional modules and
deploying them as eMIPS extensible instructions.
Table 1: Synthesis Results of Floating-Point Functional Units
eMIPS-FPU
-fadd
-fmul
-fdiv
Available
Registers
2000 (2%)
844
833
688
69120
1
LUTs
3727 (5%)
1341
1534
1127
69120
LUT-FF pairs
1774 (44%)
711
1683
1269
3953
8
IOBs
113 (17%)
113
113
113
640
DSP48E
18 (28%)
0
18
0
64
23
(a) Single-precision
1
11
52
(b) Double-precision
Figure 3: IEEE Floating-Point Numbers
2
Floating-Point Format Representation
Floating-point numbers have the advantage of being able to cover a much larger dynamic range compared to fixedpoint numbers. However, they also bring much more complexity for the implementation in hardware.
The IEEE-754 standard [11] [12] specifies a representation for single and double precision floating-point numbers.
It is currently the standard that is used for real numbers on most computing platforms. Floating-point numbers consist of
three parts: sign bit, mantissa, and exponent. In the IEEE-754 format, the mantissa is stored as a fraction (f), which is
combined with an implied one to form a mantissa (1.f) such that the mantissa is multiplied by the base number (two) to
an exponent e, as shown in equation (1) and (2), single and double precision, respectively [9] [13]
� = (−1) � ∙ 1 ∙ � ∙ 2�−127
� = (−1) � ∙ 1 ∙ � ∙ 2�−1023
(1)
(2)
The IEEE standard specifies a sign bit, an 8-bit exponent, and a 23-bit mantissa for a single precision floating-point
number, as shown in Figure 3(a). A double precision floating-point number has a sign bit, an 11-bit exponent and 52-bit
mantissa, as shown in Figure 3(b). Since the mantissa is normalized to the range [1, 2) there will always be a leading one
in the mantissa. By implying the leading one instead of explicitly specifying it, a single bit of storage could be saved, but
it does raise the complexity of floating-point implementations.
Table 2: Floating-Point ADD Extension Definition
ENTRY(fadd_test)
nop
l.s
$f0,offset($a0)
l.s
$f1,offset($a1)
add.s
$f2,$f1,$f0
s.s
$f2,offset($a2)
jr
$ra
nop
END(fadd_test)
3
// void fadd_test(UINT32 *a0, UINT32 *a1, UINT32 *a2);
// The contents of the word in memory is loaded into F-P register f0
// The contents of the word in memory is loaded into F-P register f1
// The contents in f0 and f1 are arithmetically added
// The contents of the word in F-P register f0 is stored back into the memory
Implementation
In the current design, following the state-of-the-art algorithmic design method, we implement four basic floatingpoint operations using Verilog HDL: floating-point addition, subtraction, multiplication, and division. Figure 4 shows the
data path for a floating-point addition, which typically consists of five stages – exponent difference, pre-alignment,
addition, normalization and rounding [14]. Figure 5 shows the data path for a floating-point multiplier and the core
Radix-4 Modified Booth Encoded (MBE) Wallace multiplier was used in our design (shown in Figure 6). For more
details on floating-point arithmetic algorithms, please refer to [15] [16] [17]. The first version of the basic floating-point
arithmetic units was implemented and synthesized in Xilinx ISE 10.1 targeting a Vertex-5 FPGA. The preliminary
synthesis data is shown in Table 1.
We implement the floating-point unit (FPU) as a eMIPS extension, that is, the FPU can be loaded and executed by
the eMIPS system as a reconfigurable module. A semantic definition for one such type of block is shown in Table 2.
Figure 4: Floating-Point Adder Datapath
Figure 5: Floating-Point Multiplier Datapath
4
Figure 6: Booth Wallace Multiplier Structure
Testing and Verification
Testing and verification of the floating-point unit have long presented a unique challenge in the field of processor
verification, due to F-P unit inherent much complicated operations, such as rounding and normalization. The particular
complexity of this area also stems from the vast test space, which includes many corner cases that need to be targeted,
and from the intricacies of the implement of floating-point operations.
Main stream test generation tools, such as Genesys [18] and AVPGEN [19], offer some control for F-P test
generation. However, their lack of focus and internal knowledge of the F-P domain render them inadequate for providing
a full solution to the F-P verification problem. Test generation is supplemented by static legacy tests and by large
quantities of purely random testing.
In this study, we use the SoftFloat, a free, high-quality software implementation of the IEC/IEEE Standard for
Binary Floating-point arithmetic [20]. All functions dictated by the IEEE-754 Standard are supported except for
conversions to and from decimal. SoftFloat fully implements single-precision (32 bits) and double-precision (64 bits)
floating-point formats as well as the four most common rounding modes: round to nearest even, round up, round down,
and round toward zero.
A sample piece of generated test cases is shown in Table 3.
Table 3: Floating-Point Test Cases by SoftFloat
537bffbe
4e6c6b5c
000
00
537c3ad9
Floating-Point Operand 1
Floating-Point Operand 2
Floating-Point Operations
– “000”: Add
– “001”: Subtract
– “010”: Multiply
– “011”: Divide
– “100”: Square Root
Rounding Modes
– “00”: Round to nearest even
– “01”: Round up
– “10”: Round down
– “11”: Round toward zero
Expected Floating-Point Result
The aforementioned floating-point ADD extension (Table 3) including FADD, FLWC1 (load word), FSWC1 (store
word) instructions, has been fully tested using such type of test cases generated by SoftFloat. The simulation results
using ModelSim and Giano are printed on the console screen, as shown in Figure 7 and Figure 8.
5
Bibliography
1 Pittman, Richard Neil, Lynch, Nathaniel Lee, and Forin, Alessandro. eMIPS, A Dynamically Extensible Processor.
MSR-TR-2006-143, Microsoft Research, Redmond, WA, 2006.
2 Forin, Alessandro and Pittman, Richard Neil. eMIPS. http://research.microsoft.com/en-us/projects/emips/default.aspx.
3 Karuri, Kingshuk, Leupers, Rainer, and Kedia, Monu. Design and Implementation of a Modular and Portable IEEE
754 Compliant Floating-Point Unit. In Proceedings of Design, Automation and Test in Europe (Munich, Germany
2006), 1-6.
4 Ho, Chun Hok, Yu, Chi Wai, Leong, Philip, Luk, Wayne, and Wilton, Steven J. E. Floating-Point FPGA:
Architecture and Modeling. to appear in IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2009).
5 Karlstrom, Per, Ehliar, Andreas, and Liu, Dake. High Performance Low Latency FPGA based Floating Point Adder
and Multiplier Units in a Virtex 4. In Proceedings of the 24th Norchip Conference (Linkoping, Sweden 2006), 31-34.
6 Sahin, Suhap, Kavak, Adnan, Becerikli, Yasar, and Demiray, H. Engin. Implementation of Floating-Point Arithmetics
using an FPGA. Mathematical Methods in Engineering (2007), 445-453.
7 Shirazi, Nabeel, Walters, Al, and Athanas, Peter. Quantitative Analysis of Floating Point Arithmetic on FPGA Based
Custom Computing Machines. In Proceeding of the IEEE Symposium on FPGAs for Custom Computing Machines
(Napa Valley, CA 1995), 155-162.
8 Krueger, Steven D. and Seidel, Peter-Michael. Design of an On-Line IEEE Floating-Point Addition Unit for FPGAs.
In Proceedings of the 12th Annual Symposium on Field-Programmable Custom Computing Machines (FCCM) (Napa,
CA 2004), 239-246.
9 Beauchamp, Michael J., Hauck, Scott, Underwood, Keith D., and Hemmert, Scott. Architectural Modifications to
Enhance the Floating-Point Performance of FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 16, 2 (2008), 177-187.
10 Chong, Yee Jern and Parameswaran, Sri. Flexible Multi-Mode Embedded Floating-Point Unit for Field
Programmable Gate Arrays. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable
Gate Arrays (FPGA) (Monterey, CA 2009), 171-180.
11 IEEE STD 754-1985. IEEE Standard for Binary Floating-Point Arithmetic. IEEE Computer Society. 1985.
12 IEEE STD 754-2008. IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society. 2008.
13 Koren, Israel. Computer Arithmetic Algorithms. A. K. Peter, Natick, MA, 2002.
14 Hennessy, John L. and Patterson, David A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann,
San Francisco, CA, 2006.
15 Overton, Michael L. Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 2001.
16 Flynn, Michael J. and Oberman, Stuart F. Advanced Computer Arithmetic Design. Wiley-Interscience, Malden, MA,
2001.
17 Ercegovac, Milos D. and Lang, Tomas. Digital Arithmetic. Morgan Kaufmann, San Francisco, CA, 2004.
18 Behm, M., Ludden, J., Lichtenstein, Y., Rimon, M., and Vinov, M. Industrial experience with test generation
languages for processor verification. In Proceedings of the 41st ACM/EDAC/IEEE Design Automation Conference
(San Diego, CA 2004), 36-40.
19 Chandra, A., Geist, D, Wolfsthal, Y. et al. AVPGEN - A test generator for architecture verification. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 3 (1995), 188-200.
20 Hauser, J. SoftFloat. http://www.jhauser.us/arithmetic/SoftFloat.html, 2002.
21 Underwood, Keith. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. In Proceedings of the 12th
ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA) (Monterey, CA 2004), 171-180.
Figure 7: Floating-Point Add Extension Test Start
Figure 8: Floating-Point Add Extension Test End
M2V – Automatic Hardware Generation from Software Binaries
Ruirui Gu
University of Maryland at College Park
Alessandro Forin
Microsoft Research
Abstract
The MIPS-to-Verilog (M2V) compiler translates blocks of MIPS machine code into a hardware design represented
in Verilog. The design constitutes an Extension for the eMIPS processor, a dynamically extensible processor realized on
the Xilinx Virtex-4 and Virtex-5 FPGAs. The Extension interacts closely with the basic pipeline of the microprocessor
and recognizes special extended instructions, instructions that are not part of the basic MIPS ISA. Each instruction is
semantically equivalent to one or more blocks of MIPS code. The tool-chain involving M2V automatically executes,
profiles, and patches the original binary executable to taek advantage of hardware acceleration platforms.
We are planning the first source-level release of the M2V compiler. The previous M2V version can accelerate single
block cases with the supports of load and stores, interrupts, and the automatic encoding of extended instructions. At the
summit we demonstrated the further development of the compiler to support self-looped basic blocks, which takes
advantage of both dataflow graphs and control flow structures. The released M2V will support multiple basic blocks with
the four basic control patterns and their combinations thereof.
1
Introduction
The goal of the project is to automatically generate hardware accelerators from software binaries. Figure 9 shows
the complete tool chain to automate the generation of hardware accelerators, and the role M2V plays in it. Other tools
(not shown) synthesize the Verilog file, generate the configuration bitfile, and merge it with the patched binary.
BB Tools
*.EXE Compiled
Binary
Executable
GIANO
Simulator
*.BBW
Extension
Basic Block
File
M2V
(MIPS to
Verilog
Compiler)
*.V
Synthesizable
Verilog
*.EXE
Patched
Binary
Figure 9: Tool-chain for automatically accelerating executable binary files. GIANO is a full-system simulator that
executes the application and extracts basic block profiles. The BB Tools are a series of tools to select the basic
blocks to accelerate and patch the binary image with the special instructions for the accelerator.
The input to the tool chain is executable binary files, which are generated by off-the-shelf compilers for the given ISA.
This tool-chain restricts the code selection problem to the set of most-frequently executed basic blocks in the application.
Each basic block is a directed acyclic graph (DAG), which is a set of machine instructions that do not contain branches
and are branched-to only at the very first instruction. The best candidate blocks are those blocks which require a lot of
computation and occur at high frequency in the application. These candidates are extracted by executing the application
using the Giano full-system simulator, in concert with the data obtained via static analysis of the application binary. The
profile directs the BBTools in selecting the candidate basic blocks, and in patching the binary image with the special
instructions for the accelerator. M2V automatically generates the design for the hardware accelerator, which is then
synthesized onto programmable logic such as FPGA boards, using the manufacturer’s tools (e.g. Xilinx ISE).
This tool-chain applies to any programmable logic attached to a tightly coupled pipeline, since the tight coupling creates
minimal latency between the accelerator and the RISC pipeline. The extensible MIPS (eMIPS) processor is such a
platform that is being developed at Microsoft Research as an example of a RISC processor integrated with programmable
logic. The eMIPS platform consists of a standard MIPS pipeline and an extension unit (EU). The EU contains
programmable logic that is used for extensions to the MIPS instruction set. These extensions are used to accelerate the
execution of an application. The machine code for the extended instruction is inserted before the accelerated basic block
in the MIPS binary. When the extended instruction completes, program execution will proceed at the address following
the basic block or at the address of a branch target. See Figure 7 for a simple example.
The objective of the M2V compiler is to automatically create the logic for eMIPS extensions using a .bbw file as the
hardware specification. The M2V compiler generates synthesizable Verilog which is synthesized using the standard
Xilinx place and route tools to create a bit file that can be loaded onto the eMIPS platform. In an embedded platform, the
extension can be loaded at power-up. In a more general purpose system, the extension can be dynamically loaded when
a binary image is loaded. Dynamic loading of the extension requires partial reconfiguration of the programmable logic.
By dynamically loading and unloading accelerators, the area of the programmable hardware can be used more efficiently.
It is worth to note that, the original code is preserved so that execution can fall back to software when necessary, which
ensures the reliability of the system and provides software more flexibility in scheduling the accelerator units.
2
M2V Compiler Architecture
The M2V compiler is a four-pass compiler, as shown in Figure 10.
Build controlflow graph
Build dataflow
circuit graph
Static
scheduling
Emit Verilog
code
Figure 10: Four-pass MIPS instruction set to Verilog Compiler
The first pass builds the control flow graph based on the relationship between basic blocks. Besides single block cases,
we are dealing with four kinds of basic control patterns among basic blocks: Sequential, Self-loop, Branch and Join, as
shown in Figure 11. It is easy to see that these four patterns cover all possible control graphs.
B1
B1
B1
B2
B1
B2
B2
Sequential
Self-loop
B3
Branch
B3
Join
Figure 11: An application’s control flow graph is built out of these four kinds of basic control patterns.
The Sequential pattern is the case where at the end of block B1 there is an unconditional branch pointing to block B2. We
deal with this case by combining both blocks together and re-cannibalizing the global register set. The Self-loop pattern
is the case where at the end of the basic block B1 there is a conditional branch back to the beginning of the block. A
simple “FOR” loop in software will generate this pattern. The Branch pattern is the case where at the end of B1 there is a
conditional branch targeting either B2 or B3. The Join pattern is the case where two entry blocks B1 and B2 both
unconditionally jump to the same block B3; note that only one of either B1 or B2 can be active at any given time. After
analyzing the control flow between basic blocks, we conduct different strategies for optimized implementations. At the
summit, we demonstrated how M2V recognizes a Self-loop pattern, and how to compile Self-loops efficiently.
The second pass in M2V semantically analyzes the MIPS instructions within one block, and builds and connects
nodes in a dataflow graph revealing their data dependencies. There are two types of nodes in the graph: register nodes
and instruction nodes. The semantic analysis provides the cost for each instruction and the function of register
dependencies. The register nodes represent a register access which could go to the register file or to a temporary storage
location in the EU. The register table tracks whether a register has already been read from the register file, where the last
update to the register is locally held, and whether the register needs to be written back to the register file. When multiple
instructions read the same unchanged register value, the register table provides the information so only a single register
node is created. Register nodes may or may not result in an actual clocked hardware register depending on specific
control patterns. The final schedule for the extension determines when pipeline stages are added and whether a register
node will result in a hardware register.
A major challenge for the M2V compiler is to constrain the EU such that it does not interfere with instructions flowing
through the eMIPS pipeline before and after the extended instruction executes. eMIPS uses a standard five stage RISC
pipeline, with IF, ID, EX, MA, and WB stages. This pipeline is tightly integrated with the EU. An extended instruction
will take multiple cycles to execute since it is semantically equivalent to all of the MIPS instructions in a basic block.
During ID, the extension will snoop the register reads that are visible to the primary eMIPS pipeline. If the instruction is
an extended instruction, the EU will claim the instruction and stall the instructions behind it while it executes.
Instructions before the extended instruction complete normally and must have access to the same resources that they
would normally use. Figure 12 shows how the instructions proceed through the eMIPS pipeline, where instruction m is
the extended instruction executed on EU. Here m is a single instruction that replaces one or more basic blocks.
Cycle Number
0
1
2
3
4
Instruction m-2
IF
ID
EX
MA
WB
IF
ID
EX
MA
WB
IF
ID
EX1
EX2
Instruction m-1
Extended Instruction m
Instruction m+1
Instruction j
Instruction j+1
5
...
n+2
n+3
n+4
n+5
n+6
n+7
...
Exn-2
EXn-1
EXn
MA
WB
IF
ID
EX
MA
WB
IF
ID
EX
MA
n+8
IF
WB
Figure 12: eMIPS pipeline with the extended instruction on EU. Extended instruction m represents the basic
block executed on EU.
During cycle 3 in Figure 12, the EU will decode and claim the extended instruction, snoop the reads from the register
file, store the register reads, and prepare to stall the trailing instructions in cycle 4. The instruction fetch in cycle 3 does
not perform useful work since this instruction is the first instruction of the accelerated basic block. During cycle 4,
instruction m-2 has control of the register write-back logic, instruction m-1 has control of the memory access logic, and
the extended instruction begins stage EX1. In EX1, reads to the register file are controlled by the EU since future
instructions are stalled. In EX2, the EU can read from the register file and access memory through the main memory
unit. The EU is in steady-state from EX3 until EXn-2 and it can control all ports on the register file and access to the
memory logic. In stage EXn-1, instruction j must be fetched and so any branch conditions and branch addresses must be
resolved by this stage. In cycle n+4, the EU must relinquish control of the register read ports to instruction j which is in
the ID stage. In cycle n+5, the EU performs its last memory access and can also write to the register file. In cycle n+6,
the EU performs its last write to the register file and the extended instruction is complete.
The third pass of the M2V compiler creates the schedule for register and memory accesses by doing a
constrained depth-first traversal of the dependency graph created in the second pass of compilation. The traversal begins
at the register nodes and continues until a dependency cannot be met. When the node cannot be completed, it is placed
on a queue to be traversed in the next cycle. The nodes with unmet dependencies at the end of a cycle mark where
pipeline stages will be inserted. Constraints on the schedule are the register and memory resources available in a given
cycle and the delay through a sequential stream of operations. The register-to-register delay is estimated from the
complexity of the instruction and the fan-out of the register nodes. The semantic analyzer provides the complexity of the
instruction and the dependency graph yields the fan-out from each node. When the delay exceeds the cycle-time
threshold, a pipeline stage is added. In order to distinguish with traditional meaning of “pipeline stage”, we denote state
as one intermediate stage in which a set of instructions are executed in a parallel way. Each state may contain several
cycles if the cost of some instruction (e.g. memory loads) is more than one cycle. During static scheduling of the
dataflow graph, one important factor is the determination of the two registers, rs and rt, to be encoded in the extension
instruction. These two registers are available directly from the decode stage of the MIPS pipeline, without further access
penalty. The selection of these two registers determines the roots of the scheduling tree, and therefore affects the
execution time of the extension. In M2V we determine these two registers based on the parameters of fan-out and depth.
Based on the dependency graph, the fourth pass generates the synthesizable Verilog, which is executed on EU.
3
Self-loop Basic Blocks
In the previous version of M2V, one basic block is executed only once, and then the EU resource is returned back to the
processor. In the case of a self-loop, this results in a pipeline-constrained system, as shown in Figure 13. Every time one
loop is finished, the resource and program counter (PC) are returned back to the processor’s main pipeline. When
looping, the processor simply executes the extended instruction again. This is not efficient for the processor has to start
another pipeline to load and execute the extension again. Although we use the same programmable logic in the same
extension, the time between executions of two loops is not efficient for hardware accelerators.
RF
Read
R1
RF
Read
R2
RF
Read
R4
[4]
SLL
[8]
SRL
[10]
SLL
[14]
SRL
[1c]
SLL
[20]
SRL
[30]
SLL
Temp
Write/
Read
R1
Temp
Write/
Read
R3
Temp
Write/
Read
R2
Temp
Write/
Read
R3
Temp
Write/
Read
R4
Temp
Write/
Read
R3
RF
Write
R5
[c]
OR
RF
Read
R5
[18]
OR
[24]
OR
State 1
State 2
RF
Write/
Read
R1
RF
Read
R6
RF
Write
R2
RF
Read
R0
RF
Write
R4
State 3
State 4
State 5
[28]
SLTU
Temp
Write/
Read
R3
[2c]
BEQ
Back to
pipeline
Figure 13: Circuit graph: Pipeline-based implementation of self-loop basic block. RF represents register file. The
number in [] is the byte-index of the corresponding MIPS instruction. It starts from 4 because 0 corresponds to
the extended instruction. Different colors represent different states, and each state may execute for one or more
than one cycle. “Back to pipeline” represents the resource being returned to the processor.
RF
Read
R1
RF
Read
R2
RF
Read
R4
[4]
SLL
[8]
SRL
[10]
SLL
[14]
SRL
[1c]
SLL
[20]
SRL
[30]
SLL
Temp
Write/
Read
R1
Temp
Write/
Read
R3
Temp
Write/
Read
R2
Temp
Write/
Read
R3
Temp
Write/
Read
R4
Temp
Write/
Read
R3
RF
Write
R5
[c]
OR
RF
Read
R5
[18]
OR
[24]
OR
State 1
State 2
State 3
RF
Write/
Read
R1
RF
Read
R6
RF
Write
R2
RF
Read
R0
RF
Write
R4
State 4
State 5
State 6
[28]
SLTU
Temp
Write/
Read
R3
[2c]
BEQ
Figure 14: Circuit graph: Control-flow-based implementation of self-loop basic block. The blue arrow represents
the transition from state 4 to state 1.
We propose the control-flow based routine to implement the self-loop basic block shown in Figure 14. In this
implementation, the extension executes all the loops at one time before returning the resource back to the processor. At
the end of each loop, the branch condition is calculated. If the target PC points to the beginning of the block, the
extension will execute the same logic again. The processor is not involved into the execution on the extension, so all the
computation of self-loop is finished in one pipeline with more cycles assigned to EX stage.
4
Demo and results
At the summit, we showed the demo of M2V, including the following parts:
A.
B.
C.
D.
Input .BBW file to M2V
Generated circuit graph in M2V
Generated Verilog coding from M2V
Simulation results in ModelSim
The example basic block implements part of a 64-bit division, which turns out to be frequently used in real-time
scheduling and in video games. The disassembly of the block is given in Figure 15:
[0] ext0
r4,r2,offset0
[4] sll
r1,r1,1
[8] srl
r3,r2,31
[c] or
r1,r1,r3
[10] sll
r2,r2,1
[14] srl
r3,r4,31
[18] or
r2,r2,r3
[1c] sll
r4,r4,1
[20] srl
r3,r5,31
[24] or
r4,r4,r3
[28] sltu
r3,r1,r6
[2c] beq
r0,r3,offset0
[30] sll
r5,r5,1
Figure 15: Example basic block
The automatically generated circuit graph is shown in
Figure 16. NB: The index of instructions here starts from 0.
Figure 16: M2V automatically generated this circuit graph for the example basic block in Figure 15.
A portion of the ModelSim-based simulation of eMIPS executing the Self-loop example is shown in Figure 17. The
highlighted signal of state_r corresponds to the state shown in Figure 14. The area confined by two yellow lines shows
two loops of this block. It is easy to find that the states in each loop are just repetition of some fixed patterns.
Figure 17: Wave form in the execution of the example Self-looped basic block. The signal of state_r represents the
corresponding states in Figure 14.
5
Status of M2V Development
The current version of M2V supports the implementation of single basic block on the extension. It also supports memory
load and store instructions, external interrupts, and TLB misses. We have already extended M2V to support the efficient
implementation of the self-loop basic block pattern.
We are going to release the first source version of M2V in August 2009. The M2V release will support multiple basic
blocks with the four basic control patterns and their combinations. By using the M2V involved tool chain described in
Figure 9, one can enjoy hardware acceleration from easily obtained executable binary files.
Dual-Core eMIPS
Zhimin Chen
Virginia Tech
Richard Neil Pittman
Microsoft Research
Abstract
The Dual-Core eMIPS research platform shown at the Faculty Summit integrates two eMIPS cores in a Xilinx
Virtex-5 C5VLX110T FPGA on the BEE3 board. The platform provides a shared-memory architecture for interprocessor communication, barrier synchronization support, and a number of peripherals. The first demo is a software
self-test process for the critical modules in the dual-core system. The second is a parallelized Montgomery modular
multiplication of large integers, with a speedup factor of 1.9x over the sequential version.
1
Introduction
As a part of the Multi-Core eMIPS platform, the Dual-Core eMIPS platform contains two eMIPS cores in one Xilinx
Virtex-5 C5VLX110T FPGA on the BEE3 board. It implements an on-chip shared-memory architecture with both local
and shared memories as well as shared I/O peripherals. Figure 1 is a block diagram of the system.
eMIPS
Core 0
eMIPS
Core 1
MRU
LMP
LMP
BR
DDR2 RAM
Shared
Message
Router
RINGs
SMP
RINGs
Figure 1: Overview of the Dual-Core eMIPS platform
In Figure 1, MRU represents the Memory Reservation Unit, which supports the LoadLink and StoreConditional
standard instructions in the MIPS-2 ISA. LMP is the Local Memory Peripherals, including local BlockRAM and local
timer. SMP is the Shared Memory Peripherals which contains the shared DDR2 SDRAM, shared BlockRAM, shared
interrupt controller, shared USART module, shared GPIO module and all other (shared) peripherals. Inside SMP, BR is
the bridge connecting the local memory bus to the shared peripherals. Every memory module or I/O module is connected
to the two eMIPS cores by means of this bridge. Shared Message Router is the module that handles inter-FPGA
communication through the RING connections among FPGAs on the BEE3 board.
Besides the MRU, the platform integrates another simple but efficient mechanism to support barrier synchronization.
Two shared message boxes (32-bit registers) are connected to both eMIPS cores through a BR. After a system boot-up
reset, both boxes contain 0x0. With the following code, operations in two eMIPS cores can be synchronized at every pair
of barrier functions.
eMIPS Core 0
eMIPS Core 1
void barrier(void){
void barrier(void){
volatile UINT32 * mb0 = MBADDR0;
volatile UINT32 * mb0 = MBADDR0;
volatile UINT32 * mb1 = MBADDR1;
volatile UINT32 * mb1 = MBADDR1;
*mb0 = 0x5555aaaa;
*mb1 = 0xaaaa5555;
while(*mb1 != 0xaaaa5555);
while(*mb0 != 0x5555aaaa);
*mb1 = 0x0;
*mb0 = 0x0;
}
}
This platform is easy to use to explore parallel programming and/or scheduling. In our demonstrations, programs are
cross-compiled on a PC and the user interacts with them through a USART console.
2
Demonstration
At the Faculty Summit, we present two demonstrations. The first is a self-test process of the critical modules in the
dual-core system. The second is a Montgomery modular multiplication of large integers, including both a sequential and
a parallel version. The parallel version shows a speedup over the sequential one of 1.9x.
2.1
Demonstration 1: test process of critical modules
In the first demonstration, modules under test include the shared BlockRAM, the shared DDR2 SDRAM, the
Memory Reservation Unit, and the Processor ID module.
To test the shared BlockRAM, we go through the following steps.
1) eMIPS Core 0 writes 0xFFFFFFFF to every address of the shared BlockRAM, then eMIPS Core 1 reads every
address of the shared BlockRAM to check the value;
2) Change the value to 0x00000000, 0xAAAAAAAA, and 0x55555555, and go through step 1 another 3 times;
3) eMIPS Core 1 writes 0xFFFFFFFF to every address of the shared BlockRAM, then eMIPS Core 0 reads every
address of the shared BlockRAM to check the value;
4) Change the value to 0x00000000, 0xAAAAAAAA, and 0x55555555, and go through step 3 another 3 times;
We use the same method to test the shared DDR2 SDRAM. Only 1K space of the DDR2 SDRAM is under test. If
the test passes, we consider the shared DDR2 memory works correctly.
The third test is on the Memory Reservation Unit, which consists of the following steps. For convenience, we use
LL0, SC0, LL1, SC1 to represent the LoadLink and StoreConditional operations performed by eMIPS Core 0 and eMIPS
Core 1.
1) In sequence, LL0(datamem1), SC0(datamem1, 0x11110000), LL0(datamem2), SC0(datamem2, 0x11110000)
are performed. data1 and data2 are used to indicate whether SC0 operations are successful or not. If the MRU
works correctly, after the operations, both datamem1 and datamem2 should be 0x11110000; both data1 and
data2 should be 0x1.
2) In sequence, LL0(datamem1), LL0(datamem2), SC0(datamem1, 0x55555555), SC0(datamem2, 0xaaaaaaaa)
are performed. data1 and data2 are used to indicate whether SC0 operations are successful or not. If the MRU
works correctly, after the operations, both datamem1 and datamem2 should remain 0x11110000; both data1
and data2 should be 0x0.
3) In sequence, LL0(datamem1), LL0(datamem2), SC0(datamem2, 0x55555555), SC0(datamem1, 0xaaaaaaaa)
are performed. data1 and data2 are used to indicate whether SC0 operations are successful or not. If the MRU
works correctly, after the operations, datamem1 should remain 0x11110000, datamem2 should be 0x55555555,
data1 should be 0x0 while data2 should be 0x1.
4) eMIPS Core 1 performs the same operations as shown from step 1 to step 3.
5) In sequence, LL0(datasmem), LL1(datasmem), SC0(datasmem, 0x11111111), SC1(datasmem, 0x55555555)
are performed. Both cores use their own data1 to indicate whether the SC operations are successful. If the MRU
works correctly, after the operations, datasmem should be unchanged; both cores have data1 as 0x0.
6) In sequence, LL0(datasmem), LL1(datasmem), SC1(datasmem, 0x11111111), SC0(datasmem, 0x55555555)
are performed. Both cores use their own data1 to indicate whether the SC operations are successful. If the MRU
works correctly, after the operations, datasmem should be 0x11111111; Core 0 has data1 as 0x0 while Core 1
has data1 as 0x1.
Finally, we test the Processor ID module. Only Core 0 always has its PID valid (0x20). Therefore, after reset, Core 0
gets its PID 0x20 while Core 1 gets its PID 0x00. As the primary core, Core 0 can configure the PID of Core 1 to a valid
PID (e.g. 0x21). After configuration, Core 1 gets its own PID 0x21. Except for Core 0, all the other cores are not able to
access the PID of other processors. So even after configuration, when Core 1 tries to read others’ ID, it always gets 0x00.
Figure 2 shows the results of the above 4 tests, which proves that the parallel platform works correctly.
Figure 2: Results of demonstration 1.
Figure 3: Results of demonstration 2.
2.2
Demonstration 2: parallelized Montgomery modular multiplication on large integers
The second demonstration gives an example of parallel programming based on the dual-core platform. The
application is a Montgomery modular multiplication without the final subtraction adjustment. We tested both the
sequential implementation as well as the parallel one. The speedup of the parallel one over the sequential one is up to 1.9.
The results are presented in Figure 3.
From the results, we found that, through proper parallel programming, high speed up can be achieved based on the
current platform.
NetBSD on eMIPS
David Sheldon, Alessandro Forin
Microsoft Research
Abstract
We demonstrate an online scheduling algorithm for hardware accelerators and its implementation on the NetBSD
operating system. The scheduler uses the current performance characteristics of the accelerators to select which
accelerators to load or unload. The evaluation on a number of workloads shows that the scheduler is typically within
20% of the optimal schedule computed offline. The hardware support consists of simple cost-benefit indicators, usable
for any online scheduling algorithm. The NetBSD modifications consist primarily in loadable kernel modules, with
minimal changes to the operating system itself. The system was demonstrated running multi-user on an ML402 board
and running diskless on a BEE3 board.
1
Introduction
eMIPS is a dynamically extensible processor that includes a standard MIPS trusted ISA tightly connected to
reconfigurable hardware. The programmable logic is divided in extension slots that plug into the main pipeline stages
during the execution of a program, as depicted in Figure. At DemoFest, we present a scheduling algorithm for allocating
the extension slots to competing applications, under a general-purpose operating system such as NetBSD.
Figure 1: The scheduler supports tightly-coupled micro-processor architecture with a number of hardware
extension slots usable for accelerating software applications.
2
Hardware and Software Support
Hardware support for accelerator scheduling is based on a pair of performance counters shown in Figure 2, which
identify the costs and benefits in using the accelerators. The choice is intentionally general enough that software has
ample freedom to schedule the resources as desired. The scheduler we implemented is independent of thread scheduling,
which we considered an orthogonal problem. The scheduler is realized as a loadable kernel module, thereby eliminating
all fixed overheads (e.g. in case it is not used) and allowing for easy selection of alternate implementations. Additional
software support includes a new image format for accelerators and related utilities.
Figure 2: Hardware support for scheduling includes hit and miss counters. Each time an Extension slot recognizes
an extended instruction the corresponding hit counter is incremented. Each time an extended opcode is not
recognized by any of the slots the corresponding miss counter is incremented. The mapping from opcodes to
accelerators is provided by software.
3
The demos
The first demo shows the system running in multi-user mode on the ML402 board. On this system, we used a 2GB
compact flash card as disk storage, taking advantage of the SystemACE component. The system can actually operate in
much less disk storage, but about 400MB of disk space are usually recommended for a fully-featured NetBSD system.
This system has been operational for some time and is very stable. A fan-sink solves the excessive heat generated at the
DDR memory interface and the system can therefore be left running indefinitely. There is no network on this system.
Figure 3: The NetBSD system on the ML402 board is complete and fairly stable.
The second demo shows the system booting diskless on the BEE3 board, using the just-completed eNIC Gigabit Ethernet
peripheral. We use a VirtualPC on a portable PC to run the DHCP, TFTP and NFS servers used to assign an IP and
configuration data to the BEE3, to provide the kernel image, and to provide the disk storage services, respectively. Figure
4 shows the complete output on the serial line, from system reset to single-user prompt. In this system, we incorporate
the boot loader in the BRAM image to work around the very poor performance of the USB serial interface. We ask the
loader (typing the “bbbb” command) to extract itself into DDRAM and to fetch the BSD kernel image via the eNIC. The
image we select is the default “nfsnetbsd”. Once fetched and started, the NetBSD kernel initializes itself, then repeats the
DHCP inquiry to find its NFS file server and eventually gives us the single-user shell prompt. The whole process takes
130 seconds, from start to finish.
Figure 18: Boot sequence of NetBSD on the BEE3 board, from reset to single-user prompt.
Appendix A: Posters
The following posters were presented at DemoFest:
Education
Multi-Touch
Dance Pad
Gigabit
SQL
Concolic
Mining specs
FPU
eMIPS overview
Vikram
M2V
Multicore
NetBSD
Appendix B: Movies and Pictures
The following movie demonstrates something. If you are looking at the Microsoft Word version of this document you
can double-click on the icons to watch the videos.
Here are a few pictures from the DemoFest booth.