Visualizing CPU Microarchitecture
24 (2015): 197–210
doi: 10.4467/20838476SI.16.017.4358
Tomasz Wojtowicz
Department of Computer Sciences and Computer Methods, Pedagogical University
ul. Podchorążych 2, 30-084 Kraków, Poland
e-mail: [email protected]
1. Introduction
There are two interesting trends emerging recently in computing. One of these trends is related to the fact that the computer science industry has reached a certain level of maturity and acknowledges that over the decades it has created a vast pool of works (software and applications in general). It is becoming a common understanding that all this software represents a certain value (technical and cultural), and efforts have started to preserve it for future generations [1, 2]. This means there is a need to be able to run that software in environments that, hardware- and operating-system-wise, are very far from the original platforms the software was written for. While many ISA-level emulators have been created over the years, engineers are now in pursuit of 'high fidelity' emulation platforms [3, 4].
One of the means to achieve such high accuracy of emulation is to build a low-level simulator for often forgotten and still proprietary legacy platforms. To reach that goal the legacy platforms are analysed at a very low level, for example by decapping integrated circuits [5–7] and, based on their physical layout, creating models that mimic almost exactly the way those machines worked.
Another trend is the rise of ubiquitous computing [8], which involves wearable computers and low-power, low-complexity embedded sensors and microcontrollers. These devices are far less complex than the desktop PCs we use, and to get the most out of them, code needs to be optimised to a great degree. From the programmer's perspective this resembles the rise of microcomputers in the late 1980s [9–11], and in this sense it links directly to the above trend of software preservation.
Both trends raise the need for a better understanding of how a given integrated circuit (a legacy CPU, a dedicated chip or a small sensor) works internally. Knowing the microarchitecture in detail enables programmers to write more efficient software, or a more accurate simulator, for a given platform. An additional driver for effective visualisation of a chip at the low levels is that simulation techniques running at the transistor or gate level, while accurate, are relatively slow compared to ISA-level emulation. Insight into how an integrated circuit works internally, augmented with tools measuring the statistics of the simulation model's performance, will aid in improving simulation speed.
In this paper, we propose an enhanced CPU visualisation framework built on top of a simulator running at the transistor level. We present the framework architecture, describing how its functionality differs from previous approaches. We show in detail certain aspects of the visualisation that may aid engineers both in reverse engineering legacy software and in writing new software. We also highlight some metrics that can be gathered on top of existing simulation models to speed them up in the future.
2. Previous works
The presentation of CPU microarchitecture has been studied since the beginnings of computing. In the early days the number of computing elements (relays, switches and later transistors) was relatively low, so a design could often fit, down to the smallest details, on a single blueprint. Since the introduction of transistors their number has grown rapidly over time, following Moore's law. To present designs effectively, abstraction levels were introduced. A CPU is cut into functional components, like the ALU, operation decoders, internal registers, internal buses, etc. This approach is widely present in the literature [12, 13]. What is described are usually abstract models, and as such one cannot really write programs for them nor observe how they work in practice. There are works that focus on specific aspects, like the memory subsystem and cache-miss simulation [14, 15]. On the other hand, there are efforts in the literature that build computers bottom-up, from the level of single transistors and logic gates [16], but again these computers are just theoretical. Full simulators of real CPUs with visualisation, able to run real software, are emerging but still rare [5, 7].
3. Proposed approach
Figure 1. Framework architecture (overlays definition, CPU node names, visualisation engine).
The simulation core is loaded with chip information (through the input files) that contains segments (nodes), transistors and named segments (for semantic analysis). It is worth noting that this information is at a very basic level and consists only of interconnection information about these elements. Another key component is the visualisation module. The original works rendered the CPU in 2D only and, being non-accelerated, slowed down the execution quite significantly. In our implementation we opted for an OpenGL-accelerated view (using Java bindings via JOGL). As a result, the host CPU is occupied mostly by the actual simulation, while the visualisation is handled by the dedicated GPU. Moreover, OpenGL allows us to easily introduce all sorts of overlays, customised blending and views from various perspectives.
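The chip information loaded by the core can be sketched as a simple netlist structure: segments identified by node numbers, transistors given as (gate, c1, c2) node triples, and optional names for selected segments. The class and field names below are illustrative, not the framework's actual API:

```java
import java.util.*;

// Minimal sketch of the chip definition the simulation core consumes.
// Names and the node numbers used in examples are illustrative only.
class Netlist {
    // segment id -> pull-up flag (true if the segment is pulled towards VCC)
    final Map<Integer, Boolean> segments = new HashMap<>();
    // each transistor connects segments c1 and c2 when its gate segment is high
    final List<int[]> transistors = new ArrayList<>(); // {gate, c1, c2}
    // semantic names for selected segments, e.g. "ab0" for address-bus bit 0
    final Map<String, Integer> names = new HashMap<>();

    void addSegment(int id, boolean pullUp)      { segments.put(id, pullUp); }
    void addTransistor(int gate, int c1, int c2) { transistors.add(new int[]{gate, c1, c2}); }
    void nameSegment(String name, int id)        { names.put(name, id); }
}
```

Only the named segments carry semantics; everything else is pure interconnection data, which is why overlays (described below) are defined separately.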
Spanning both the core and the visualisation modules is the simulation control and UI component. It is instrumented with overlay information (functional components of the CPU, like specific registers, internal buses and CPU pads, used during visualisation), a test binary program and a disassembly map (of the binary code into assembly language commands).
Typical visualisation of whole CPUs has so far used mostly the physical aspects of the design. The CPU was rendered by layers, with different colours corresponding to different types of material. There was little indication of the current activity inside the processor.
While the simulator itself is loaded with a chip definition containing only basic connectivity between segments and transistors, the framework can also be provided with component overlays. Defining those requires some semantic information about the chip, but gives additional value to the engineer, showing where particular functional units are on the chip floor plan and how the data flows through them as the program is executed. Most functional units on the CPU view are augmented with their current value in the cycle. This is equivalent to the CPU and memory status view (log), but in a more visual form. As there are multiple components inside the CPU, we narrowed the overlay to selected ones only.
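An overlay definition of the kind described above can be sketched as a record tying a functional unit to a bounding box on the floor plan and to the named segments holding its bits, so the view can draw the box and print the unit's current value each cycle. The field names and node names here are assumptions for illustration, not the framework's actual format:

```java
import java.util.*;

// Illustrative overlay record: one functional unit (e.g. the accumulator),
// its bounding box in chip-layout coordinates, and the named segments that
// hold its bits, most significant bit first. Names are ours, not the
// framework's actual API.
class Overlay {
    final String unit;        // e.g. "A", "PC", "IR"
    final int x, y, w, h;     // bounding box on the chip floor plan
    final String[] bitNodes;  // segment names, MSB first

    Overlay(String unit, int x, int y, int w, int h, String[] bitNodes) {
        this.unit = unit; this.x = x; this.y = y; this.w = w; this.h = h;
        this.bitNodes = bitNodes;
    }

    // Compose the value displayed next to the unit from the current node states.
    int value(Map<String, Boolean> nodeState) {
        int v = 0;
        for (String n : bitNodes)
            v = (v << 1) | (nodeState.getOrDefault(n, false) ? 1 : 0);
        return v;
    }
}
```

With such a record, drawing the overlay is a matter of rendering the box and formatting `value(...)` as hex once per half-cycle.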
Figure 2. Typical visualisation of the real CPU. Various colours denote various
types of layers, like metal, polysilicon, powered diffusion, grounded diffusion.
Another enhancement to the CPU floor plan view is highlighting the internal buses of the processor. Through these buses the data flows between the functional components highlighted in the previous section. This time, to emphasise the directional and transport aspects of the buses, they are visualised by accenting the segments they are built of. As these buses are multiple bits wide (in the case of the 6502 the data bus is 8 bits while the address bus is 16 bits), only subsets of segments are used. Again, to give a better view of the data flows inside the CPU, data paths and buses are presented with their current values in the cycle.
In practice, what works best is a combination of functional bounding boxes and data paths between these functional areas. Therefore, in our visualisation framework this is all configurable by the user.
To be able to trace how the data goes through the CPU, when one operation concludes and the next starts, it is helpful to see the CPU state in sequence, cycle after cycle, all in one view. Therefore, next to the graphical view of the CPU floor plan with all the important functional areas, data paths and buses, we also provide a more traditional view. Each half-cycle we log the state of all the above components and present the outline in tabular form to the user. We have added additional colouring here to give an at-a-glance view of where the barrier between commands lies and how the individual commands are structured (especially the ones that fetch or store data from memory at the address provided in the operands). With this colouring one can easily spot how the CPU internally passes the address vectors and the actual data.
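The tabular view described above can be sketched as a log that snapshots selected component values once per half-cycle. One way to place the command barrier, used here for illustration, relies on the 6502's SYNC pad, which is high during the opcode-fetch cycle; column names and the class shape are assumptions, not the framework's actual API:

```java
import java.util.*;

// Sketch of the per-half-cycle state table. Column names ("IR", "AB", "DB",
// "SYNC") and the class layout are illustrative.
class StateLog {
    final List<String> columns;
    final List<Map<String, Integer>> rows = new ArrayList<>();

    StateLog(List<String> columns) { this.columns = columns; }

    // Snapshot the selected component values into one table row.
    void snapshot(Map<String, Integer> values) {
        Map<String, Integer> row = new LinkedHashMap<>();
        for (String c : columns) row.put(c, values.get(c));
        rows.add(row);
    }

    // Row indices where a new command starts: on the 6502 the SYNC pad is
    // high while the opcode is being fetched, marking the command boundary.
    List<Integer> commandBoundaries(String syncColumn) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++)
            if (Integer.valueOf(1).equals(rows.get(i).get(syncColumn))) out.add(i);
        return out;
    }
}
```

The boundary indices drive the colouring: rows between two boundaries belong to one command, so the fetch-operand and fetch-data phases can be shaded consistently.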
Such a view allows one to spot early forms of pipelining in the 6502 CPU. Note in Table 3 that in the 1st cycle the data bus already contains the next command to be executed (LDA), while the internal address bus is already provided with the address $0001 to fetch the older byte of the LDA operand (0xFE). In the very next cycle (2) the data bus already fetches the operand, while IR is just set to LDA. At the same time the internal address bus is set to fetch the younger byte of the LDA operand (from address 02). This address is exposed on AB in cycle 3, and the actual data (0x00) is fetched in cycle 4. Such detailed analysis can also be augmented with visual indicators that highlight which parts of the chip are active at which stages, as will be shown in 3.6.
Figure 4. Proposed visualisation approach with internal buses and datapaths shown, combined with an overlay for selected functional components.
One of the novel features we introduced to the visualisation is showing the segment activity over a clock cycle. We track which segments are changing state and highlight them on the chip layout. When combined with functional component layouts (like the ALU or the instruction decoder (PLA)) it can show the engineer quite well which parts of the CPU are involved in which exact activities in each cycle.
Table 2. Short 6502 program excerpt (including PC, binary and assembly source code).
Figure 5. Internal CPU activity, combined with functional component overlays. Red segments denote segments that changed state compared to the previous cycle, in two consecutive cycles (a) and (b). Note that on the left most activity is in the ALU area and external data pads, while on the right most activity is related to internal CPU data transfers.
This, combined with the continuous CPU state information available to the user, can quite effectively explain why certain operations take just 2 cycles (4 half-cycles) while some others require up to 6 cycles (12 half-cycles).
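The per-cycle activity map shown in Figure 5 amounts to a diff of segment states between two consecutive cycles. A minimal sketch (the class and method names are ours, not the framework's API):

```java
import java.util.*;

// Sketch of the segment-activity computation: compare the segment states of
// two consecutive cycles and report which segments toggled (the segments
// drawn in red on the chip layout).
class ActivityMap {
    static Set<Integer> changed(Map<Integer, Boolean> prev, Map<Integer, Boolean> curr) {
        Set<Integer> diff = new HashSet<>();
        for (Map.Entry<Integer, Boolean> e : curr.entrySet())
            if (!Objects.equals(prev.get(e.getKey()), e.getValue()))
                diff.add(e.getKey());
        return diff;
    }
}
```

Intersecting the changed set with an overlay's segment list then tells which functional units were active in that cycle.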
Table 3. Internal CPU states during execution of the LDA $00FE command followed by the STA $0000 command (part of the program from Table 2). Note how the STA command is interleaved with the 2nd half of the LDA command. The colouring highlights the data flow within command execution (fetch command, fetch operands, fetch data if necessary, perform the command, fetch new command).
Below is the CPU activity during the execution of the LDA $00FE command and the subsequent loading of the STA $0000 command (based on the example from 3.4).
The original 6502 CPU usually ran at between 1 and 1.5 MHz. The initial version of the Visual6502 simulator [7] ran at approximately 25 Hz. Thanks to recent years' advancements in JavaScript runtimes in web browsers, the simulator can now reach around 250 Hz (with no visualisation enabled). A more optimised C port of the simulator can run at approximately 1 kHz on a modern PC.
The fastest software simulation to date is based on a reverse-engineered VHDL model of the original chip schematics and can run as fast as 4 kHz (using the Verilator software). There are of course FPGA-based implementations developed since the publication of [7] that can run at 1 MHz and more (and can even be integrated into a working Atari computer, with all the other dedicated chips working just fine with the FPGA implementation of the 6502 CPU). Our model implementation, written in Java, can run at around 500–700 cycles per second with visualisation enabled.
Figure 6. Activity map of the LDA $00FE command (which takes 4 cycles) followed by the STA $0000 command. In panels (g) and (h) the value from $00FE is loaded into A.
The challenge with a transistor-level simulator is that, while it is quite accurate between the cycles, it requires multiple iterations to reach a stable state within a single cycle. On real-world programs this can be as high as 30 iterations (but also as low as 2–5). The proposed framework supports gathering such statistics.
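The iteration count gathered per cycle comes from a relaxation loop of the following shape: re-evaluate the segment states until a fixed point is reached, counting the passes. This is a sketch of the general technique, with `evalOnce` standing in for one pass of the transistor-level evaluation; it is not the framework's actual code:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Sketch of the per-(half-)cycle stabilisation loop: re-evaluate segment
// states until nothing changes, returning the number of iterations taken
// (the statistic plotted in Figure 7). maxIter guards against oscillation.
class Stabiliser {
    static int stabilise(Map<Integer, Boolean> state,
                         UnaryOperator<Map<Integer, Boolean>> evalOnce,
                         int maxIter) {
        int iterations = 0;
        while (iterations < maxIter) {
            Map<Integer, Boolean> next = evalOnce.apply(state);
            iterations++;
            if (next.equals(state)) break;  // fixed point reached
            state.clear();
            state.putAll(next);
        }
        return iterations;
    }
}
```

Since most segments settle quickly, restricting `evalOnce` to the segments touched in the previous pass is the obvious optimisation suggested by the statistics below.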
Figure 7. Statistics gathered from the simulator: how many iterations were required to stabilise the CPU state in each cycle.
Figure 8. Statistics gathered from the simulator: how many segments had to be evaluated in each cycle, against the number of segments that actually changed state (active). The total number of segments in the 6502 CPU is 1704.
The statistics collected so far indicate that the activity factor for the 6502 is around 10%. This means that on average one tenth of the segments change state in each clock cycle. Another metric shows that the current version of the simulator evaluates approximately 5 times more segments than actually change state in a cycle. Taking this, together with the average number of iterations currently needed to stabilise the state of the CPU, we believe there is significant potential to speed up the simulation by a factor of two.
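The activity factor quoted above is a straightforward average of changed segments over total segments per cycle; a sketch of the metric, with illustrative numbers:

```java
// Sketch of the activity-factor metric: the per-cycle fraction of segments
// that changed state, averaged over a run. For the 6502, totalSegments is
// 1704; the sample counts below are made up for illustration.
class ActivityStats {
    static double activityFactor(int[] changedPerCycle, int totalSegments) {
        double sum = 0;
        for (int c : changedPerCycle)
            sum += (double) c / totalSegments;
        return sum / changedPerCycle.length;
    }
}
```

An analogous ratio of evaluated to changed segments per cycle yields the "5 times more" figure mentioned above.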
4. Conclusions
The proposed framework addresses quite well the need for a straightforward, interactive and dynamic simulation framework for architecture-learning purposes. It provides the engineer with many capabilities not available so far without real equipment. It gives an easy and efficient insight into how the microprocessor works.
One of the future work directions is definitely optimising the simulation core itself in terms of performance. The framework is developed in such a way that it is possible to swap the simulation core with an alternative and test it using asserts and traces at the single-instruction level. The collected measurements indicate there is still a lot of potential for improvement in the way the current model works. Moreover, so far no spatial characteristics have been leveraged (the fixed topology of the CPU), nor has speculative execution been exercised (predicting the next state based on data gathered from previous executions). Another direction of work is expanding the framework to support any CPU core (like the original Motorola 6800, Intel 4004/8008 or others), provided the basic netlist information is available.
5. References
[1] Matthews B., Shaon A., Bicarregui J., Jones C., A framework for software preser-
vation. The International Journal of Digital Curation, 2010, 5(1), pp. 91–105.
[2] Owens T., Preserving.exe: Towards a National Strategy for Software Preser-
vation. NDIIPP. http://www.digitalpreservation.gov/multimedia/documents
/PreservingEXE report final101813.pdf, 2013 [Accessed 7-July-2014].
[4] Byuu, Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator. http://arstechnica.com/gaming/2011/08/accuracy-takes-power-one-mans-3ghz-quest-to-build-a-perfect-snes-emulator/, 2011 [Accessed 5-July-2014].
[5] Aspray W., The Intel 4004 microprocessor: What constituted invention? IEEE Annals of the History of Computing, 1997, 19(3), pp. 4–15.
[6] Faggin F., Intel 4004 – 35th Anniversary Project. http://www.4004.com/, 2005 [Accessed 5-July-2014].
[7] James G., Silverman B., Silverman B., Visualizing a classic CPU in action: the 6502. In: SIGGRAPH Talks, ACM, 2010.
[8] Zhou Y., Xu T., David B., Chalon R., Innovative wearable interfaces: an ex-
ploratory analysis of paper-based interfaces with camera-glasses device unit. Per-
sonal and Ubiquitous Computing, 2014, 18(4), pp. 835–849.
[9] Bagnall B., Commodore: A Company on the Edge, 2010.
[10] Maher J., The Future Was Here: The Commodore Amiga (Platform Studies).
The MIT Press, Cambridge, 2012.
[11] Vendel C., Goldberg M., Atari Inc.: Business is Fun. Syzygy Press, New York, 2012.
[12] Hanson D.F., A VHDL conversion tool for logic equations with embedded D latches.
In: Proceedings of the 1995 Workshop on Computer Architecture Education.
WCAE-1 ’95, New York, ACM, 1995.
[13] Tanenbaum A.S., Structured Computer Organization (5th Edition). Prentice-
Hall, Inc., Upper Saddle River, NJ, USA, 2005.
[14] Drepper U., What every programmer should know about memory.
http://lwn.net/Articles/250967/, 2007 [Accessed 5-July-2014].
[15] Inc. O., Missing cache visualization. http://www.overbyte.com.au/misc/Lesson3
/CacheFun.html, 2010 [Accessed 5-July-2014].
[16] Stokes J., Inside the Machine: An Illustrated Introduction to Microprocessors
and Computer Architecture. ArsTechnica Library, San Francisco, 2006.
[17] James G., Silverman B., Silverman B., Visual 6502 CPU simulator.
http://www.visual6502.com/, 2011 [Accessed 5-July-2014].