Dodmain
Richard Fabian
1 Data-Oriented Design
  1.1 It's all about the data
  1.2 Data is not the problem domain
  1.3 Data and Statistics
  1.4 Data changes
  1.5 How is data formed?
  1.6 The framework
5 Condition Tables
  5.1 Building Questions
  5.2 Flip it on its head
  5.3 Logging decisions for debug
  5.4 Branching without branching
7 Searching
  7.1 Indexes
  7.2 Data-oriented Lookup
  7.3 Finding low and high
  7.4 Finding random
8 Sorting
  8.1 Do you need to?
  8.2 Maintain by insertion sort or parallel merge sort
  8.3 Sorting for your platform
9 Relational Databases
  9.1 Normalisation
  9.2 Implicit Entities
  9.3 Components imply entities
  9.4 Cosmic Hierarchies
  9.5 Structs of Arrays
10 Optimisations
  10.1 Tables
  10.2 Transforms
  10.3 Spatial sets for collisions
  10.4 Lazy Evaluation for the masses
  10.5 Not getting what you didn't ask for
  10.6 Varying Length Sets
  10.7 Bit twiddling decision tables
  10.8 Joins as intersections
  10.9 Data driven techniques
11 Concurrency
  11.1 Thread-safe
  11.2 Inherently concurrent operations
  11.3 Gateways
12 In Practice
  12.1 Data-manipulation
  12.2 Game entities
16 Hardware
  16.1 Sequential data
  16.2 Deep Pipes
  16.3 Microcode
  16.4 Single Instruction Multiple Data
  16.5 Predictable instructions
17 Future
  17.1 Parallel processing
  17.2 Distributed computing
  17.3 Runtime hardware configuration
  17.4 Data centric development in hardware design
Introduction
It took a lot longer than expected to be able to write this book, but
it’s been all the better for how long the ideas have been fermenting
in my head and how many of the concepts have been tested out in
practice. In many cases, the following chapters are ideas that started
out as simple observations and slowly evolved into solid frameworks
for building software in a data-oriented manner. Some sections have
remained largely unchanged from their first drafts, such as the sec-
tions on existence based processing or the arguments against the
object-oriented approach, but others have been re-written a number
of times, trying to distill just the right kind of information at the
right pace.
Most people come at data-oriented design from object-oriented
design, and have heard of it only because of people claiming that
object-oriented design is bad, or wrong, or simply not the only way
of doing things. This may be the case for large scale software, and
though object-oriented code does have its place (as we shall discuss
in chapter 15), it has been the cause1 of much wasted time and effort
during its relatively short life in our passionate industry of software
development and games development in particular.
My own journey through imperative procedural programming, then object-oriented programming, then finally finding, embracing,
1 Large Scale C++ is a book almost entirely dedicated to showing how object
Here's to opening minds, but not so much that your brains fall out.
Chapter 1
Data-Oriented Design
ing ineffective programming, but the games being designed are less
demanding. As we move towards ubiquity of multi-core machines, not just on the desktop, but in our phones too, and towards a world of programming for massively parallel processing, as with the compute cores on our graphics cards and in our consoles, the need for
data-oriented design will only grow. It will grow because abstrac-
tions and serial thinking will be the bottleneck of your competitors,
and those that embrace the data-oriented approach will thrive.
cess. This means that game development can be tuned to ensure the
release date will coincide with marketing dates. There will always
be an element of randomness in high profile games development be-
cause there will always be an element of innovation that virtually
guarantees that you will not be able to predict how long the project,
or at least one part of the project, will take.
Part of the difficulty in adding new and innovative features to a
game is the data layout. If you need to change the data layout for a
game, it will need objects to be redesigned or extended in order to
work within the existing framework. If there is no new data, then a
feature might require that previously separate systems suddenly be
able to talk to each other quite intimately. This coupling can often
cause system wide confusion with additional temporal coupling and
corner cases so obscure they can only be found one time in a million.
This sounds fine to some developers, but if you're expecting to sell five to fifty million copies of your game, that's five to fifty people who can take a video of your game, post it on the internet, and call your company rubbish because you hadn't fixed an obvious bug.
Big iron developers had these same concerns back in the 1970’s.
Their software had to be built to high standards because their pro-
grams would frequently be working on data that was concerned with
real money transactions. They needed to write business logic that
operated on the data, but most important of all, they had to make
sure the data was updated through a provably careful set of oper-
ations in order to maintain its integrity. Database technology grew
from the need to process stored data, to do complex analysis on it,
to store and update it, and be able to guarantee that it was valid
at all times. To do this, the ACID test was used to ensure atomicity, consistency, isolation, and durability. Atomicity was the test to
ensure that all transactions would either complete, or do nothing.
It could be very bad for a database to update only one account in
a financial transaction. There could be money lost, or created if a
transaction was not atomic. Consistency was added to ensure that
all the resultant state changes that should happen during a transac-
tion do happen, that is, all triggers that should fire, do, even if the
Existence Based
Processing
If you saw that there weren’t any apples in stock, would you still
haggle over their price?
sure that the data exists, or is making sure bounds are observed.
In real world cases, the most common use of an explicit flow con-
trol statement is in defensive programming. Most of our flow con-
trol statements are just to stop crashes, to check bounds, pointers
being null, or any other exceptional cases that would bring the program to a halt. The second most common is loop control; though these are numerous, most CPUs have hardware optimisations for them, and most compilers do a very good job of removing condition checks that aren't necessary. The third most common flow control
comes from polymorphic calls, which can be helpful in implement-
ing some of the gameplay logic, but mostly are there to entertain
the do-more-with-less-code development model partially enforced in
the object-oriented approach to writing games. Actual visible game
design originating flow control comes a distant fourth place in these
causes of branching, leading to an under-appreciation of the effect
each conditional has on the performance of the software. That is,
when the rest of your code-base is slow, it’s hard to validate writing
fast code for any one task.
If we try to keep our working set of data as collections of arrays,
we can guarantee that all our data is not null. That one step alone
will eliminate most of our flow control statements. The third most common set of flow control statements was the inherent flow control in a virtual call; these are covered later in section 2.5, but simply put,
if you don’t have an explicit type, you don’t need to switch on it,
so those flow control statements go away as well. Finally, we get
to gameplay logic, which there is no simple way to eradicate. We
can get most of the way there with condition tables which will be
covered in chapter 5, but for now we will assume they are allowed
to stay.
Reducing the amount of conditions, and thus reducing the cyclo-
matic complexity on such a scale is an amazing benefit that cannot
be overlooked, but it is one that comes with a cost. The reason we
are able to get rid of the check for null is that we now have our data
in a format that doesn’t allow for null. This inflexibility will prove
to be a benefit, but it requires a new way of processing our entities.
Where we once had rooms, and we looked in the rooms to find out
if there were any doors on the walls we bumped into (in order to
either pass through, or instead do a collision response,) we now look
in the table of doors to see if there are any that match our roomid.
This reversal of ownership can be a massive benefit in debugging,
but sometimes can appear backwards when all you want to do is find
out what doors you can use to get out of a room.
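This reversal can be sketched as a scan over a homogeneous door table (the struct layout and names below are illustrative, not taken from the book's own listings):

```cpp
#include <cstdint>
#include <vector>

// The doors own the relationship: each door knows which room it is in,
// rather than each room owning a list of its doors.
struct Door {
    uint32_t roomid;   // the room this door belongs to
    uint32_t targetid; // the room the door leads to
};

// Find every door that lets you leave the given room by scanning the
// door table for matching roomids.
std::vector<Door> doors_for_room(const std::vector<Door>& doors, uint32_t roomid) {
    std::vector<Door> out;
    for (const Door& d : doors)
        if (d.roomid == roomid)
            out.push_back(d);
    return out;
}
```

The per-room query is now a filter over one table, which is why it debugs so well: there is exactly one place where room-to-door relationships exist.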
If you’ve ever worked with shopping lists, or todo lists, you’ll
know how much more efficient you are when you have a definite list
of things to get or get done. It’s very easy to make a list, and easy to
add to it too. If you’re going shopping, it’s very hard to think what
might be missing from your house in order to get what you need.
If you’re the type that tries to plan meals, then a list is nigh on
essential as you figure out ingredients and then tally up the number
of tins of tomatoes, or how many different meats or vegetables you
need to last through all the meals you have planned. If you have a
todo list and a calendar, you know who is coming and what needs to
be done to prepare for them, how many extra mouths need feeding,
how many extra bottles of beer to buy, or how much washing you
need to do to make enough beds for the visitors. Todo lists are
great because you can set an end goal and then add in sub tasks
that make a large and long distant goal seem more doable, and also
provide a little urgency that is usually missing when the deadline is
so far away.
When your program is running, if you don’t give it lists to work
with, and instead let it do whatever comes up next, it will be ineffi-
cient, slow, and probably have irregular frame timings. Inefficiency
comes from not knowing what kind of processing is coming up next.
In the case of large arrays of pointers to heterogeneous classes all be-
ing called with an update() function, you can get very high counts
of cache misses both in data and instruction cache. If the code tries
to do a lot with this pointer, which it usually does because one of
the major beliefs of object-oriented programmers is that virtual calls
are only for higher level operations, not small, low level ones, then
you can virtually guarantee that not only will the instruction and
data cache be thrashed by each call, but most branch predictor slots
could be too dirty to offer any benefit when the next update() runs.
Assuming that virtual calls don’t add up because they are called on
high level code is fine until they become the go-to programming style
and you stop thinking about how they affect your application when
there are millions of virtual calls per second. All those inefficient
calls are going to add up somewhere, but luckily for the object ori-
ented zealot, they never appear on any profiles. They always appear
somewhere in the code that is being called. Slowness also comes
from not being able to see how much work needs to be done, and
therefore not being able to scale the work to fit what is possible in
the time-frame given. Without a todo list, and an ability to estimate
the amount of time that each task will take, it is impossible to decide
the best course of action to take in order to reduce overhead while
maintaining feedback to the user. Irregular frame timings can also be blamed on not being able to act on distant goals ahead of time.
If you know you have to load an area because a player has ventured
into a position where it is possible they will soon be entering an un-
loaded area, the streaming system can be told to drag in any data
necessary. In most games this happens with explicit triggers, but
there is no such system for many other game elements. It’s unheard
of for an AI to pathfind to some goal because there might soon be a
need to head that way. It's not commonplace to find a physics system doing look-ahead to see if a collision is about to happen, in order to start a more complex breakup simulation early. But, if
you let your game generate todo lists, shopping lists, distant goals,
and allow for preventative measures by forward thinking, then you
can simplify your task as a coder into prioritising goals and effects,
or writing code that generates priorities at runtime.
In addition to the obvious lists, metrics on your game are highly
important. If you find that you’ve been optimising your game for
a silky smooth frame rate and you think you have a really steady
30fps or 60fps, and yet your customers and testers keep coming back
with comments about nasty frame spikes and dropout, then you’re
not profiling the right thing. Sometimes you have to profile a game
while it is being played. Get yourself a profiler that runs all the time,
and can report the state of the game when the frame time goes over
budget. Sometimes you need the data from a number of frames
around when it happened to really figure out what is going on, but
in all cases, unless you’re letting real testers run your profiler, you’re
never going to get real world profiling data.
Existence-based-processing is when you process every element in
a homogeneous set of data. You run the same instructions for every
element in that set. There is no definite requirement for the output
in this specification, however, usually it is one of three types of
operation: filter, mutation, or emission. A mutation is a one to one
manipulation of the data, it takes incoming data and some constants
that are setup before the transform, and produces one element for
each input element. A filter takes incoming data, again with some
constants set up before the transform, and produces one element or
zero elements for each input element. An emission is a manipulation
on the incoming data that can produce multiple output elements.
Just like the other two transforms, an emission can use constants,
but there is no guaranteed size of the output table; it can produce
anywhere between zero and infinity elements.
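The three kinds of transform can be sketched as plain functions over arrays (the damage-themed examples here are invented for illustration):

```cpp
#include <vector>

// A mutation: exactly one output per input
// (here, scale every value by a constant set up before the transform).
std::vector<float> mutate(const std::vector<float>& in, float falloff) {
    std::vector<float> out;
    out.reserve(in.size());
    for (float v : in) out.push_back(v * falloff);
    return out;
}

// A filter: zero or one outputs per input
// (here, keep only values above a threshold).
std::vector<float> filter(const std::vector<float>& in, float threshold) {
    std::vector<float> out;
    for (float v : in)
        if (v > threshold) out.push_back(v);
    return out;
}

// An emission: zero to many outputs per input; the output size is not
// knowable ahead of time (here, one fragment per whole unit of the value).
std::vector<float> emit(const std::vector<float>& in) {
    std::vector<float> out;
    for (float v : in)
        for (int i = 0; i < static_cast<int>(v); ++i)
            out.push_back(v - i);
    return out;
}
```

Each function reads one homogeneous input set and some pre-transform constants, which is what makes them trivially batchable.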
Every CPU can efficiently handle running processing kernels over
homogeneous sets of data, that is, doing the same operation over
and over again over contiguous data. When there is no global state,
no accumulator, it is proven to be parallelisable. Examples can be drawn from existing technologies such as MapReduce and OpenCL as to how to go about building real-world applications within these restrictions. Stateless transforms also commit no crimes that prevent them from being used within distributed processing technologies.
Erlang relies on these as language features to enable not just thread
safe processing, not just inter process safe processing, but distributed
computing safe processing. Stateless transforms of stateful data are highly robust, and deeply parallelisable.
Within the processing of each element, that is for each datum
operated on by the transform kernel, it is fair to use control flow.
Almost all compilers should be able to reduce simple local value
Then we can run an update function over the table that might
look like this:
void updatehealth( entity *e ) {
    // time since, not time until: current time minus the time of last damage
    timetype timesincelastshot = currenttime - e->timeoflastdamage;
    bool ishurt = e->health < max_health && e->health > 0;
    bool regencanstart = timesincelastshot > time_before_regenerating;
    if( ishurt && regencanstart ) {
        e->health = min( max_health, e->health + ticktime * regenrate );
    }
}
Which will run for every entity in the game, every update.
We can make this better by looking at the flow control statement.
The function only needs to run if the health is less than full health,
and more than zero. The regenerate function only needs to run if it
has been long enough since the last damage dealt.
Let’s change the structures:
struct entity {
    // information about the entity position
    // ...
    // other entity information
};
struct entitydamage {
    float timeoflastdamage;
    float health;
};
list<entity> entities;
map<entityref, entitydamage> entitydamages;
We can now run the update function over the health table rather
than the entities.
void updatehealth() {
    foreach( eh in entitydamages ) {
        entityref entity = eh->first;
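The page break cuts the listing short. A sketch of how the loop might continue, in standard C++ rather than the book's pseudocode; the erase-when-healed step is this sketch's own assumption, following the existence-based idea that healthy entities should have no row at all:

```cpp
#include <algorithm>
#include <map>

using entityref = int;
struct entitydamage { float timeoflastdamage; float health; };

const float max_health = 100.0f;
const float time_before_regenerating = 3.0f;
const float regenrate = 5.0f;

void updatehealth(std::map<entityref, entitydamage>& entitydamages,
                  float currenttime, float ticktime) {
    for (auto it = entitydamages.begin(); it != entitydamages.end(); ) {
        entitydamage& eh = it->second;
        float timesincelastshot = currenttime - eh.timeoflastdamage;
        // Rows only exist while an entity is hurt, so the ishurt check
        // from the per-entity version disappears entirely.
        if (timesincelastshot > time_before_regenerating)
            eh.health = std::min(max_health, eh.health + ticktime * regenrate);
        // Once fully healed, the row leaves the table: existence based
        // processing means no work at all for healthy entities.
        if (eh.health >= max_health)
            it = entitydamages.erase(it);
        else
            ++it;
    }
}
```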
and any other internal state. C++ doesn’t implement this explicitly,
but if a class allows the use of an internal state variable or variables,
it can provide differing reactions based on that state as well as the
core language runtime virtual table lookup. Consider the following
code:
class shape {
public:
    shape() {}
    virtual ~shape() {}
    virtual float getarea() const = 0;
};
class circle : public shape {
public:
    circle( float diameter ) : d( diameter ) {}
    ~circle() {}
    float getarea() const { return d * d * pi / 4; }
    float d;
};
class square : public shape {
public:
    square( float across ) : width( across ) {}
    ~square() {}
    float getarea() const { return width * width; }
    float width;
};
void test() {
    circle circle( 2.5f );
    square square( 5.0f );
    shape *shape1 = &circle, *shape2 = &square;
    printf( "areas are %f and %f\n", shape1->getarea(), shape2->getarea() );
}
    switch( m_type ) {
    case circletype: return distanceacross * distanceacross * pi / 4;
    case squaretype: return distanceacross * distanceacross;
    }
}
void setnewtype( shapetype type ) {
    m_type = type;
}
shapetype m_type;
float distanceacross;
};
void testinternaltype() {
    mutableshape shape1( circletype, 5.0f );
    mutableshape shape2( circletype, 5.0f );
    shape2.setnewtype( squaretype );
    printf( "areas are %f and %f\n", shape1.getarea(), shape2.getarea() );
}
Though this works, all the pointers to the old class are now
invalid. Using handles would mitigate these worries, but add another
layer of indirection in most cases, dragging down performance even
more.
If you use existence-based-processing techniques, with your classes defined by the tables they belong to, then you can switch between tables at runtime. This allows you to change behaviour without any
tricks, without any overhead of a union to carry all the differing data
around for all the states that you need. If you compose your class
from different attributes and abilities then need to change them post
creation, you can. Looking at it from a hardware point of view, in
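Switching behaviour by table membership can be sketched like this (the table names and Position payload are illustrative assumptions, not the book's code):

```cpp
#include <unordered_map>

struct Position { float x, y; };
using EntityID = int;

// Behaviour is defined by which table an entity's row lives in: rows in
// `walking` get walk processing, rows in `swimming` get swim processing.
struct World {
    std::unordered_map<EntityID, Position> walking;
    std::unordered_map<EntityID, Position> swimming;
};

// Changing behaviour is just moving the row between tables: no virtual
// dispatch, no union carrying every state's data, no invalidated pointers
// to a repurposed class instance.
void enter_water(World& w, EntityID id) {
    auto it = w.walking.find(id);
    if (it != w.walking.end()) {
        w.swimming[id] = it->second;
        w.walking.erase(it);
    }
}
```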
Component Based
Objects
Component oriented design is a good head start for high level data-
oriented design. Components put your mind in the right frame for
not linking together concepts when they don’t need to be, and com-
ponent based objects can easily be processed by type, not by in-
stance, which leads to a smoother processing profile. Component
based entity systems are found in games development as a way to
provide a data-driven functionality system for entities, which can al-
low for designer control over what would normally be the realm of a
programmer. Not only are component based entities better for rapid
design changes, but they also reduce the chance of the components getting bogged down into monolithic objects, as most game designers will demand more components with new features rather than extensions to the scope of existing components. Most new designs need iterating, and extending an existing component by code doesn't allow design
and extending an existing component by code doesn’t allow design
to change back and forth, trying different things.
When people talk about how to create compound objects, some-
times they talk about using components to make up an object.
2 Gas Powered Games released a web article about their component based
First, we move the data out into separate structures so they can
be more easily combined into new classes.
struct PlayerPhysical {
    Vec pos, up, forward, right;
    Vec velocity;
};
struct PlayerGameplay {
    float health;
    int xp;
    int usedPowerups;
    int SPEED, JUMP, STRENGTH, DODGE;
    bool cheating;
};
struct EntityAnim {
    AnimID currentAnimGoal;
    AnimID currentAnim;
    SoundHandle playingSoundHandle; // null most of the time
};
struct PlayerControl {
    int controller;
    bool controllable;
};
struct EntityRender {
    bool visible;
    AssetID playerModel;
};
struct EntityInWorld {
    LocomotionType currentLocomotiveModel;
};
struct Inventory {
    Array<ItemType> inventory;
    int bulletCount;
};

SparseArray<PlayerPhysical> physicalArray;
SparseArray<PlayerGameplay> gameplayArray;
SparseArray<EntityAnim> animArray;
SparseArray<PlayerControl> controlArray;
SparseArray<EntityRender> renderArray;
SparseArray<EntityInWorld> inWorldArray;
SparseArray<Inventory> inventoryArray;

class Player {
public:
    Player();
    ~Player();
    // ...
    // ... the member functions
    // ...
private:
    int EntityID;
};
                    }
                }
            }
        }
    }
    foreach( { index, inv } in inventoryArray ) {
        if( inv.bulletCount > 0 ) {
            if( index in animArray ) {
                anim = animArray[ index ];
                if( anim.currentAnim == SHOOT_ONCE ) {
                    anim.currentAnim = SHOT;
                    inventoryArray[ index ].bulletCount -= 1;
                    anim.playingSoundHandle = PlaySound( GUNFIRE );
                }
            }
        }
    }
}
What happens when we let more than just the player use these
arrays? Normally we'd have some separate logic for handling player fire until we refactored the weapons to be generic, with NPCs using the same weapon code, probably by having a new weapon class that can be pointed to by the player or an NPC, but
instead what we have here is a way to split off the weapon firing
code in such a way as to allow the player and the NPC to share
firing code without inventing a new class to hold the firing. In fact,
what we’ve done is split the firing up into the different tasks that it
really contains.
Tasks are good for parallel processing, and with component based
objects we open up the opportunity to make most of our previously
class oriented processes into more generic tasks that can be dished
out to whatever CPU or co-processor can handle them.
        if( anim == SHOOT_ONCE ) {
            anim = SHOT;
            inventoryArray[ index ].bulletCount -= 1;
            playingSound = PlaySound( GUNFIRE );
        }
    }
}
int NewPlayer( int padID, Vec startPoint ) {
    int ID = newID();
    controllerID[ ID ] = padID;
    GetAsset( "PlayerModel", ID ); // adds a request to put the player model into modelArray[ ID ]
    orientationArray[ ID ] = Orientation( startPoint );
    velocityArray[ ID ] = VecZero();
    return ID;
}
Moving away from a player class means that many other classes
can be invented without adding much code. Allowing script to generate new classes by composition lets it dramatically increase the apparent complexity of the game without adding more code. What is also of major benefit is that all the different entities in the game now run the same code at the same time: everyone's physics is updated before they are rendered4, and everyone's controls (whether they be player or AI) are updated before they are animated. Having the managers control when the code is
executed is a large part of the leap towards fully parallelisable code.
Hierarchical
Level-of-Detail and
Implicit-state
the lineup over the whole race track2 , and we’ve literally combined
people together into one person instead of having loads of people on
screen at once3 . This kind of reduction of processing is common-
place. Now use it everywhere appropriate, not just when a player is
not looking.
4.2 Mementos
Reducing detail introduces an old problem, though. Changing level
of detail in game logic systems, AI and such, brings with it the loss
of high detail history. In this case we need a way to store what is
needed to maintain a highly cohesive player experience. If a high
detail squadron in front of the player goes out of sight and another
squadron takes their place, we still want any damage done to the
first group to reappear when they come into sight again. Imagine
if you had shot out the glass on all the ships and when they came
round again, it was all back the way it was when they first arrived.
A cosmetic effect, but one that is jarring and makes it harder to
suspend disbelief.
When a high detail entity drops to a lower level of detail, it
should store a memento, a small, well compressed nugget of data that
contains all the necessary information in order to rebuild the higher
detail entity from the lower detail one. When the squadron drops
out of sight, it stores a memento containing compressed information
about the amount of damage, where it was damaged, and rough
positions of all the ships in the squadron. When the squadron comes
into view once more, it can read this data and generate the high
detail entities back in the state they were in before. Lossy compression is fine for most things: it doesn't matter precisely which windows broke, or how they were cracked, maybe just that about 2/3 of the windows
were broken.
2 Ridge Racer was one of the worst for this
3 Populous did this
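A squadron memento might look something like the struct below. The field choices and quantisation scheme are this sketch's assumptions; the point is that a few dozen lossy bytes are enough to rebuild a convincing high detail squadron:

```cpp
#include <array>
#include <cstdint>

// A memento: a small, well compressed nugget of data from which the
// high detail squadron can be rebuilt when it comes back into view.
struct SquadronMemento {
    uint8_t shipCount;                 // how many ships survive
    uint8_t damageFraction;            // overall damage, 0-255 mapping to 0-100%
    std::array<uint8_t, 8> shipDamage; // per-ship damage, lossily quantised
    int16_t roughX, roughY, roughZ;    // rough squadron centre on a coarse grid
};

// Lossy compression step: precise float damage in [0,1] becomes one byte.
uint8_t quantise_damage(float damage01) {
    if (damage01 < 0.0f) damage01 = 0.0f;
    if (damage01 > 1.0f) damage01 = 1.0f;
    return static_cast<uint8_t>(damage01 * 255.0f + 0.5f);
}

// Rebuild step: close enough that the player never notices the loss.
float dequantise_damage(uint8_t q) {
    return q / 255.0f;
}
```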
many parts of our game already, but in this case, there is a passive
presentation response to the variable, or axis being monitored. The
presentation is usually some graphical, or logical level of detail, but
it could be something as important to the entity as its own existence.
How long until a player forgets about something that might oth-
erwise be important? This information can help reduce memory
usage as much as distance. If you have ever played Grand Theft Auto IV, you might have noticed that the cars can disappear just by you not looking at them. As you turn around a few times you might
notice that the cars seem to be different each time you face their
way. This is a stunning use of temporal level of detail. Cars that
have been bumped into or driven and parked by the player remain
where they were, because, in essence, the player put them there.
Because the player has interacted with them, they are likely to re-
member that they are there. However, ambient vehicles, whether
they are police cruisers, or civilian vehicles, are less important and
don’t normally get to keep any special status so can vanish when
the player looks away.
In addition to time-since-seen, some elements may base their level
of detail on how far a player has progressed in the game, or how
many of something a player has, or how many times they have done
it. For example, a typical bartering animation might be cut shorter and shorter as the game uses the number of recent barters as an axis to shrink the length of any non-interactive sections caused by the event. This can be done simply, and the player
will be thankful. It may even be possible to allow for multi-item
transactions after a certain number of transactions have happened.
In effect, you could set up gameplay elements, reactions to situations,
triggers for tutorials or extensions to gameplay options all through
these abstracted level of detail style axes.
This way of manipulating the present state of the game is safer
from transition errors. Errors that happen because going from one
state to another may have set something to true when transitioning
one direction, but not back to false when transitioning the other way.
You can think of the states as being implicit on the axis, not explicit,
Condition Tables
        xorMask |= BIT( i );
    if( column[ i ].condition == null )
        ignoreMask |= BIT( i );
}
for( int i = tableColumns; i < MAX_BITS; ++i ) {
    ignoreMask |= BIT( i );
}
Once you have your inputs transformed into the array of bit
fields, you can run them through the condition table. If they are
an array of structures, then it’s likely that caching the inputs into
a separate table isn’t going to make things faster, so this decision
pass can be run on the bit field directly. In this snippet we run one
bit field through one set of condition masks.
unsigned int conditionsMet = ( value ^ xorMask ) | ignoreMask;
unsigned int unmet = ~conditionsMet;
if( !unmet ) {
    addToTable( t );
}
The code first XORs the values so the false values become true
for all the cases where we want the value to be false. Then we OR
with the ignore so that any entries we don’t care about become true
and don’t falsely break the condition. Once we have this final value,
we can be certain that it will be all ones in case of a match, and
will have some zeros if it is a miss. An if using the binary-not of
the final value will suffice to prove the system works, but we can do
better.
Using the conditionsMet value, we can generate a final mask onto
an arbitrary value.
unsigned int czero = ( cvalue + 1 ); // one more than all the ones is zero.
unsigned int cmask = (unsigned int)( ~( (int)czero | -(int)czero ) >> 31 ); // only zero gives all ones; note this relies on an arithmetic shift of the signed value.
ID valueIn = valueOut & cmask; // on a match return the key, else return zero.
Now, we can use this value in many ways. We can use it on the
primary key, and add this key to the output table, and providing a
list table that can be reduced by ignoring nulls, we have, in effect, a
dead bucket. If we use it on the increment counter, then we can just
add an entry to the end of the output table, but not increase the size
of the output table. Either way, we can assume this is a branchless
implementation of an entry being added to a table. If this was
an event, then a multi-variable condition check is now branchlessly
putting out an event. If you let these loops unroll, then depending
on your platform, you are going to process a lot more data than a
branching model will.
Chapter 6
Finite State Machines
Finite state machines are a solid solution to many game logic prob-
lems. They are the building block of most medium level AI code, and
provide testable frameworks for user interfaces. Notoriously convo-
luted finite state machines rule the world of memory card operations
and network lobbies, while simple ones hold domain over point and
click adventure puzzle logic. Finite state machines are highly data
driven in that they react to an alphabet of signals, and change state
based on both the signal received, and their current state.
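A minimal sketch of such a machine, with the transition function held as data rather than code (the states, signals, and table contents below are invented for illustration):

```cpp
#include <cassert>
#include <map>
#include <utility>

// A data-driven finite state machine: transitions live in a table keyed
// on (current state, signal); unknown pairs leave the state unchanged.
enum State  { OnLand, InWater };
enum Signal { EnterWater, LeaveWater, Jump };

using TransitionTable = std::map<std::pair<State, Signal>, State>;

const TransitionTable kTransitions = {
    { { OnLand,  EnterWater }, InWater },
    { { InWater, LeaveWater }, OnLand  },
};

State step(const TransitionTable& table, State current, Signal s) {
    auto it = table.find({ current, s });
    return it != table.end() ? it->second : current;
}
```

onEnter and onExit hooks would fire at exactly the point where step returns a state different from the current one.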
the player has different controls when under water, the onEnter of
the inWater state table could change the player input mapping to
allow for floating up or sinking down. It would also be a good idea
to attach any oxygen-level gauge hooks here too. If there is meant
to be a renderable for the oxygen-level, it should be created by the
player entering the inWater state. It should also be hooked into the
inWater onExit transition table, as it will probably want to reset or
begin regenerating at that point.
One of the greatest benefits of being able to hook into transitions,
is being able to keep track of changes in game state for syncing. If
you are trying to keep a network game synchronised, knowing exactly
what happened and why can be the difference between a 5 minute
and 5 day debugging session.
Chapter 7
Searching
7.1 Indexes
Database management systems have long held the concept of in-
dexes. They are traditionally automatically added when a database
management system notices that a particular query has been run a
large number of times. We can use this idea and implement a just-
in-time indexing system in our games to provide the same kinds of
performance improvement. Every time you want to find out if an
element exists, or even just generate a subset, like when you need
to find all the entities in a certain range, you will have to build it
as a process to generate a row or table, but the process that causes
the row or table generation can be thought of as an object that
can transform itself dependent on the situation. Starting out as
a simple linear search query (if the data is not already sorted), the
process can find out that it’s being used quite often through internal
telemetry, and will probably be able to discover that it generally returns
a simple, tunable set of results, such as the first N items in a sorted
list, and can hook itself into the tables that it references. Hooking
into the insertion, modification, and deletion would allow the query
to update itself rather than run the full query again. This can be a
significant saving, and as the index is optional, it can also be garbage
collected when not in use.
If we build generalised back ends to handle building queries into
these tables, they can provide multiple benefits. Not only can we
provide garbage collection of indexes that aren’t in use, but they
can also make the programs in some way self documenting and self
profiling. If we study the logs of which tables requested index builds
for their queries, then we can quickly see data hotspots and
where there is room for improvement. The holy grail here is self
optimising code.
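A sketch of the idea under stated assumptions: the query object counts its own invocations and, past an invented threshold, builds a hash index over the column it keeps searching. The table shape and threshold are illustrative, not from the text.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Just-in-time index: the query starts life as a linear search and,
// after being run often enough, builds (and then uses) a lookup index.
struct NameTable {
    std::vector<std::string> names;                 // the underlying column
    std::unordered_map<std::string, size_t> index;  // built lazily
    int queryCount = 0;
    static constexpr int kBuildThreshold = 8;       // invented telemetry trigger

    int find(const std::string& key) {              // returns row, or -1
        if (!index.empty()) {
            auto it = index.find(key);
            return it == index.end() ? -1 : (int)it->second;
        }
        if (++queryCount >= kBuildThreshold)        // telemetry says: index it
            for (size_t i = 0; i < names.size(); ++i)
                index[names[i]] = i;
        for (size_t i = 0; i < names.size(); ++i)   // fall back to a linear scan
            if (names[i] == key) return (int)i;
        return -1;
    }
};
```

Garbage collection would be the mirror image: clear the index and reset the counter when the query goes quiet.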
We can speed up a search, but only by looking at the data. If the data is not
already sorted, then an index is a simple way to find the specific item
we need. If the data is already sorted, but needs even faster access,
then a search tree optimised for the cache-line size would help.
A binary search is one of the best search algorithms for using
the smallest number of instructions to find a key value. But if you
want the fastest algorithm, you must look at what takes time, and
a lot of the time, it's not instructions. Loading a whole cache-line
of information and doing as much as you can with that would be
a lot more helpful than using the smallest number of instructions.
Consider these two different data layouts for an animation key-finding
system:
struct Anim {
    float keys[num];
    VALUE_TYPE values[num];
};

struct KeyClump { float startTimes[32]; };  // 128 bytes
KeyClump root;
KeyClump subs[32];
// implicit first element that has the same value as root.startTimes[n],
// but last element is the same as root.startTimes[n+1]
VALUE_TYPE Values[32*32];
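A lookup over that clumped layout can be sketched as two linear scans, each touching a single 128-byte clump; this assumes the start times are ascending and that subs[i].startTimes[0] matches root.startTimes[i], as the comments above describe.

```cpp
#include <cassert>

struct KeyClump { float startTimes[32]; };  // one cache line of key times

// Pick the sub-clump by scanning the root clump (one cache line), then
// pick the key the same way inside that sub-clump. Returns the index
// into the flat Values[32*32] array.
int findKey(const KeyClump& root, const KeyClump subs[32], float t) {
    int r = 0;
    for (int i = 1; i < 32; ++i)           // first line: choose the sub-clump
        if (root.startTimes[i] <= t) r = i;
    int s = 0;
    for (int i = 1; i < 32; ++i)           // second line: choose the key
        if (subs[r].startTimes[i] <= t) s = i;
    return r * 32 + s;
}
```

The two scans are branch-predictable and touch exactly two cache lines, regardless of where the key falls.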
But most data isn’t this simple to optimise. A lot of the time, we
have to work with spatial data, but because we use objects, it’s hard
to strap on an efficient spatial reference container after the fact. It’s
virtually impossible to add one at runtime to an externally defined
class of objects.
Adding spatial partitioning when your data is in a simple data
format like rows allows us to generate spatial containers or lookup
systems that will be easy to maintain and optimise for each partic-
ular situation. Because of the inherent reusability in data-oriented
transforms, we can write some very highly optimised routines for the
high level programmers.
pop the lowest before adding the new element to the sorted best. If
there is a deletion, or a modification that makes one in the sorted set
a contender for elimination, a quick check of the rest of the elements
to find a new best set might be necessary. If you keep a larger than
necessary set of best values, however, then you might find that this
never happens.
Array<int> bigArray;
Array<int> bestValue;
const int LIMIT = 3;

void AddValue(int newValue) {
    bigArray.push(newValue);
    bestValue.sortedinsert(newValue);
    if (bestValue.size() > LIMIT)
        bestValue.erase(bestValue.begin());
}
void RemoveValue(int deletedValue) {
    bigArray.remove(deletedValue);
    bestValue.remove(deletedValue);
}
int GetBestValue() {
    if (bestValue.size()) {
        return bestValue.top();
    } else {
        int best = bigArray.findbest();
        bestValue.push(best);
        return best;
    }
}
The trick is to find, at runtime, the best value to use that covers
the solution requirement. The only way to do that is to check the
data at runtime. For this, either keep logs, or run the tests with
dynamic resizing based on feedback from the table’s query optimiser.
entries with the same key a lot easier than a tree implementation
as it can use the standard linked list of entries in a hash bucket to
collect all the messages with a certain sender/recipient.
Chapter 8
Sorting
call is alpha blended. This then allows the renderer to adjust the sort
at runtime to get the most out of the bandwidth available. In the
case of the rendering list sort, you could run the whole list through a
general sorting algorithm, but in reality, there’s no reason to sort the
alpha blended objects with the opaque objects, so in many cases you
can take a first step of putting the list into two separate buckets,
and save some n for your O. Also, choose your sorting algorithm
wisely. With opaque objects, the most important part is usually
sorting by textures then by depth, but that can change with how
much your fill rate is being trashed by overwriting the same pixel
multiple times. If your overdraw doesn’t matter too much but your
texture uploads do, then you probably want to radix sort your calls.
With alpha blended calls, you just have to sort by depth, so choose
an algorithm that handles that case best; usually a quicksort or a
merge sort brings very low cost but guaranteed accurate ordering.
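One common way to set up such a radix sort is to pack each draw call into a single sortable key, texture in the high bits so calls sharing a texture sort together, and quantised depth in the low bits to reduce overdraw within each texture bucket. The field widths here are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>

// Build a radix-sortable key for an opaque draw call: texture-major,
// front-to-back depth minor. 16 bits of each, purely for illustration.
uint32_t makeSortKey(uint32_t textureId, float depth, float maxDepth) {
    uint32_t d = (uint32_t)((depth / maxDepth) * 65535.0f); // quantise depth
    if (d > 65535u) d = 65535u;                             // clamp overshoot
    return (textureId << 16) | d;
}
```

Sorting the keys with a byte-wise radix sort then gives the texture-then-depth ordering in a fixed number of passes.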
http://seven-degrees-of-freedom.blogspot.co.uk/2010/07/question-of-sorts.html
utility header so you can utilise the benefits of miniaturisation, but don't drop in
a bloated std::sort
8.3 Sorting for your platform
A --+-- A'
    |
B --+-- B'

a' <= MAX(a,b)
b' <= MIN(a,b)
This is fast on any hardware. The MAX and MIN functions will
need different implementations for each platform and data type, but
in general, branch free code executes faster than code that includes
branches.
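A single comparator from the diagram above might be sketched like this; std::max and std::min typically compile down to conditional-move instructions rather than branches:

```cpp
#include <algorithm>
#include <cassert>

// One branch-free comparator: afterwards a holds the larger value and
// b the smaller, matching a' <= MAX(a,b) and b' <= MIN(a,b) above.
void compareExchange(int& a, int& b) {
    int hi = std::max(a, b);
    int lo = std::min(a, b);
    a = hi;
    b = lo;
}
```

On platforms without a conditional move, the same effect can be had with bit tricks, at the cost of readability.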
Introducing more elements:
A --+--+--------- A'
    |  |
B --+--|--+--+--- B'
       |  |  |
C --+--+--|--+--- C'
    |     |
D --+-----+------ D'
What you notice here is that the critical path is not long (just
three stages in total), and as these are all branch free functions,
the performance is regular over all data permutations. With such a
regular performance profile, we can use the sort in places where
variability in sorting time would otherwise get in the way, such as
just-in-time sorting for subsections of rendering. If we had radix sorted
our renderables, we can network sort any final required ordering as
we can guarantee a consistent timing.
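The four-input network above, written out as straight-line code: it runs the same five compare-exchanges on every input, which is what gives the consistent timing (descending order, to match a' taking the MAX on the top wire):

```cpp
#include <algorithm>
#include <cassert>

// Three-stage, five-comparator sorting network for four elements.
// Every input permutation executes exactly the same instructions.
void sortNetwork4(int v[4]) {
    auto cx = [](int& a, int& b) {
        int hi = std::max(a, b), lo = std::min(a, b);
        a = hi; b = lo;                       // larger value to the top wire
    };
    cx(v[0], v[1]); cx(v[2], v[3]);           // stage 1: (A,B) (C,D)
    cx(v[0], v[2]); cx(v[1], v[3]);           // stage 2: (A,C) (B,D)
    cx(v[1], v[2]);                           // stage 3: (B,C)
}
```

The two comparators in each of the first two stages have no dependency on each other, so they can execute in parallel or in SIMD lanes.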
Sorting networks are somewhat like predication, the branch free
way of handling conditional calculations. Because sorting networks
use a min / max function, rather than a conditional swap, they
gain the same benefits when it comes to the actual sorting of in-
dividual elements. Given that sorting networks can be faster than
radix sort for certain implementations, it goes without saying that
for some types of calculation, predication, even long chains of it, will
be faster than code that branches to save processing time. Just such
an example exists in the Pitfalls of Object Oriented Design presen-
tation, concluding that lazy evaluation costs more than the job it
tried to avoid. I have no hard evidence for it yet, but I believe a lot
of AI code could benefit in the same way, in that it would be wise to gather
information even when you are not sure you need it, as gathering it
might be quicker than deciding not to. For example, seeing if some-
one is in your field of vision, and is close enough, might be small
enough that it can be done for all AI rather than just the ones that
require it, or those that require it occasionally.
Chapter 9
Relational Databases
1 There are existing APIs that present strings in various ways dependent on
how they are used, for example the difference between human readable strings
(usually UTF-8) and ASCII strings for debugging. Adding sounds, textures, and
meshes to this seems quite natural once you realise that all these things are
resources for human consumption.
In the first step, you take all the data from the object creation
script, and fit it into rows in your table. To start with, we create a
table for each type of object, and a row for each instance of those
objects. We need to invent columns as necessary, and use NULL
everywhere that an instance doesn't have that attribute/column.
Once we have taken the construction script and generated ta-
bles, we find that these tables contain a lot of nulls. The nulls in
the rows replace the optional content of the objects. If an object
instance doesn't have certain attributes, then we replace those features
with nulls. When we plug this into a database, it will handle
it just fine, but storing all the nulls seems unnecessary and looks
like it’s wasting space. Present day database technology has moved
towards keeping nulls to the point that a lot of large, sparse table
databases have many more null entries than they have data. They
operate on these sparsely filled tables with functional programming
techniques. We, however, should first see how it worked in the past
with relational databases. Back when SQL was first invented, there
were three known stages of data normalisation (there are currently
six recognised numbered normal forms, plus Boyce-Codd, a stricter
variant of third normal form, and Domain-Key, which for the most
part defines a normalisation requiring domain knowledge instead of
stored data).
Meshes
MeshID MeshName
m1 ”filename”
m2 ”filename2”
Textures
TextureID TextureName
t1 ”texturefile”
t2 ”texturefile2”
Pickups
MeshID TextureID PickupType ColourTint PickupID
m5 t5 KEY Gold k1
m6 t6 POTION (null) k2
Rooms
RoomID MeshID TextureID WorldPos Pickup1 Pickup2 ...
r1 m1 t1 0,0 k1 k2 ...
r2 m2 t2 20,5 (null) (null) ...
r3 m3 t3 30,-5 (null) (null) ...
... Trapped DoorTo Locked Start Exit
... 10hp r2 (null) (null) false
... (null) r3 k1 22,7 false
... (null) (null) (null) (null) true
9.1 Normalisation
First normal form requires that every row be distinct and rely on
the key. We haven’t got any keys yet, so first we must find out what
these are.
In any table, there is a set of columns that, when compared to any
other row, is unique. Databases guarantee that this is possible by
not allowing complete duplicate rows to exist; there is always some
differentiation. Because of this rule, every table has a key, because
every table can use all the columns at once as its primary key. For
example, in the mesh table, the combination of meshID and filename
is guaranteed to be unique. The same can be said for the textureID
and filename in the textures table. In fact, if you use all the columns
of the room table, you will find that it can be used to uniquely define
that room (obviously really, as if it had the same key, it would in
Rooms
RoomID MeshID TextureID WorldPos Exit
r1 m1 t1 0,0 false
r2 m2 t2 20,5 false
r3 m3 t3 30,-5 true
Room Pickup
RoomID Pickup1 Pickup2
r1 k1 k2
Rooms Doors
RoomID DoorTo Locked
r1 r2 (null)
r2 r3 k1
Rooms Traps
RoomID Trapped
r1 10hp
Start Point
RoomID Start
r2 22,7
The data, in this format, takes much less space in larger projects,
as the number of NULL entries would only have increased with the
increased complexity of the level file. Also, by laying out the data this
way, we can add new features without having to revisit the original
objects. For example, if we wanted to add monsters, normally we
would not only have to add a new object for the monsters, but also
add them to the room objects. In this format, all we need to do is
add a new table:
Monster
MonsterID Attack HitPoints StartRoom
M1 2 5 r1
M2 2 5 r3
And now we have information about the monster and what room
it starts in without touching any of the original level data.
Room Pickup
RoomID Pickup
r1 k1
r1 k2
9.1.2 Reflections
What we see here as we normalise our data is a tendency to add
information when it becomes necessary, but at a cost of the keys
that associate the data with the entity. Looking at many third
party engines and APIs, you can see parallels with the results of
these normalisations. It’s unlikely that the people involved in the
design and evolution of these engines took their data and applied
database normalisation techniques, but sometimes the separations
between objects and components of objects can be obvious enough
that you don’t need a formal technique in order to realise some
positive structural changes.
In some games the entity object is not just an object that can
be anything, but is instead a specific subset of the types of entity
involved in the game. For example, in one game there might be an
object type for the player character, and one for each major type
of enemy character, and another for vehicles. This object oriented
approach puts a line, invisible to the player but intrusive to the
developer, between classes of object and instances. It is intrusive
because every time a class definition is used instead of differing
data, the amount of code required to utilise that specific entity type
in a given circumstance increases. In many codebases there is the
concept of the player, and the player has different attributes to other
entities, such as lacking AI controls, or having player controls, or
having regenerating health, or having ammo. All these different
traits can be inferred from data decisions, but sometimes they are
made literal through code in classes. When these differences are
put in code, interfacing between different classes becomes a game of
adapting, a known design pattern that should be seen as a symptom
Rooms
RoomID MeshID TextureID WorldPos Exit
r1 m1 t1 0,0 false
r2 m2 t2 20,5 false
r3 m3 t3 30,-5 true
Room Pickup
RoomID Pickup
r1 k1
r1 k2
Rooms Doors
RoomID DoorTo
r1 r2
r2 r3
Rooms Doors Locks
RoomID DoorTo Locked
r2 r3 k1
Rooms Traps
RoomID Trapped
r1 10hp
Start Point
RoomID Start
r2 22,7
Monster
MonsterID Attack HitPoints StartRoom
M1 2 5 r1
M2 2 5 r3
9.1.5 Sum up
At this point we can see it is perfectly reasonable to store any highly
complex data structures in a database format, even game data with
its high interconnectedness and rapidly changing design criteria. What
is still of concern is the less logical and more binary forms of data
such as material data, mesh data, audio clips and runtime integra-
tion with scripting systems and control flow.
Platform specific resources such as audio, texture, vertex or video
formats are opaque to most developers, and moving towards a ta-
ble oriented approach isn’t going to change that. In databases, it’s
common to find column types of boolean, number, and string, and
when building a database that can handle all the natural atomic
elements of our engine, it’s reasonable to assume that we can add
new column types for these platform or engine intrinsics. The tex-
tureID and meshID column used in room examples could be smart
pointers to resources. Creating new meshes and textures may be as
simple as inserting a new row with the asset URL, then keeping track
of whether an entry has arrived in the fulfilled streaming request
table, or in the URL-to-assetID table that could be populated by
a behind-the-scenes task scheduler.
As for integration with scripting systems and using tables for
logic and control flow, chapters on finite state machines, existence
based processing, condition tables and hierarchical level of detail
show how tables don’t complicate, but instead provide opportunity
to do more with fewer resources, as results flow without the constraints
normally associated with object- or entity-linked data structures.
through the fact that it is being rendered, and that it has a position.
In fact, the last table has taken advantage of the fact that the room
is now implicit, and changed from a bool representing whether the
room is an exit room or not into a one column table. The entry in
that table implies the room is the exit room. This will give us an
immediate memory trade of one bool per room against one row per
exit room. It also becomes slightly easier to determine whether the
player is in an exit room: we simply check for that room's existence
in the ExitRoom table and nothing more.
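The check can be sketched as nothing more than a set membership test; the container and ID type below are illustrative assumptions:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

using RoomID = std::string;

// The ExitRoom "table" is just a set of room IDs: being an exit room
// is nothing more than existing in it.
bool isExitRoom(const std::unordered_set<RoomID>& exitRooms, const RoomID& r) {
    return exitRooms.count(r) != 0;
}
```

Making a room an exit is an insert; making it a normal room again is an erase, with no per-room flag to keep in sync.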
Another benefit of implicit entities is the non-intrusive linking of
existing entities. In our data file, there is no space for multiple doors
per room. With the pointer littering technique, having multiple
doors would require an array of doors, or maybe a linked list of
doors, but with the database style table of doors, we can add more
rows whenever a room needs a new door.
Doors
RoomID bfRoomID
r1 r2
r2 r3
Locked Doors
RoomID bfRoomID Locked
r2 r3 k1
Doors
DoorID RoomID RoomID
d1 r1 r2
d2 r2 r3
Locked Doors
DoorID Locked
d2 k1
Doors
DoorID RoomID
d1 r1
d1 r2
d2 r2
d2 r3
Locked Doors
DoorID Locked
d2 k1
we can choose to use this or ignore it, but only because we know
that doors have two and only two rooms. A good reason to ignore
it could be that we assume the door opens into the first room. Thus
these two room IDs actually need slightly different names, such as
DoorInto and DoorFrom.
Sometimes, like with RoomPickups, the rows in a table can be
thought of as instances of objects. In RoomPickups, the rows rep-
resent an instance of a PickupType in a particular Room. This is a
many to many relationship, and it can be useful in various places,
even when the two types are the same, such as in the RoomDoors
table.
When most programmers begin building an entity system they
industry.
then when you iterate over a large collection of them, all the data
has to be pulled into the cache at once. If we assume that a cacheline
is 128 bytes, and the size of floats is 4 bytes, the Keyframe struct is
16 bytes. This means that every time you look up a key time, you
accidentally pull in four keys and all the associated keyframe data.
If you are doing a binary search of a 128-key stream, that could
mean you end up loading 128 bytes of data and only using 4 bytes
of it, in up to 6 chops. If you change the data layout so that the
searching takes place in one array, and the data is stored separately,
then you get structures that look like this:
struct KeyData
{
    float x, y, z;
};
struct stream
{
    float* times;
    KeyData* values;
};
Doing this means that for a 128 key stream, a binary search is
going to pull in at most three out of four cachelines, and the data
lookup is guaranteed to only require one.
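A sketch of the search over the separated layout: the binary chops touch only the packed times array, and the value is fetched once by index after the search finishes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Index of the last key whose time is <= t (0 if t precedes all keys).
// Only the float array is touched during the chops; the key data array
// is read once, by the caller, at the returned index.
size_t findKeyIndex(const std::vector<float>& times, float t) {
    auto it = std::upper_bound(times.begin(), times.end(), t);
    return it == times.begin() ? 0 : (size_t)(it - times.begin()) - 1;
}
```

The times array must be ascending, as key streams naturally are.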
Database technology saw this too: it's called column-oriented
databases, and they provide better throughput for data processing
than traditional row-oriented relational databases, simply because
irrelevant data is not loaded when doing column aggregations or
filtering.
When you prepare a primitive render for a graphics card, you set
up constants such as the transform matrix, the texture binding, any
lighting values, or which shader you want to run. When you come
to run the shader, each vertex and pixel may have its own scratch
pad of local variables, but they never write to globals or refer to
a global scratchpad. Enforcing this, we can guarantee parallelism
because the order of operations is ensured to be irrelevant. If a
shader was allowed to write to globals, there would be locking, or it
would become an inherently serial operation. Neither of these are
good for massive core count devices like graphics cards, so that has
been a self imposed limit and an important factor in their design.
lot of memory locations at once, then save them out once they are
all in the cache. If your input data can be modified by your output
buffer, then you have to tread very carefully. Now consider this:
int q = 10;
int p[10];
for (int i = 0; i < q; ++i)
    p[i] = i;
The compiler can figure out that q is unaffected, and can happily
unroll this loop or replace the check against q with a register value.
However, looking at this code instead:
void foo(int* p, const int& q)
{
    for (int i = 0; i < q; ++i)
        p[i] = i;
}

int q = 10;
int p[10];
foo(p, q);
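Here the compiler must assume a store through p might change q, since q is a reference to the same type p points at, so q is reloaded every iteration. Copying q to a local restores the earlier optimisation; a sketch:

```cpp
#include <cassert>

// Taking a local copy of q means stores through p can no longer be
// assumed to change the loop bound, so the compiler may keep it in a
// register and unroll the loop again.
void foo(int* p, const int& q) {
    const int n = q; // snapshot the bound once
    for (int i = 0; i < n; ++i)
        p[i] = i;
}
```

The restrict qualifier (in C, or as a compiler extension in C++) is the other common way to make the no-aliasing promise explicit.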
Chapter 10
Optimisations and Implementations
10.1 Tables
To keep things simple, advice from multiple sources indicates that
keeping your data in vectors has many benefits. There are reasons
not to use the STL, including extended compile and link times, as
well as issues with memory allocations. Whether you use std::vector
or roll your own dynamically sized array, it is a good
starting place for any future optimisations. Most of the processing
you will do will be transforming one array into another, or modifying
a table in place. In both these cases, a simple array will suffice for
most tasks.
For the benefit of your cache, structs of arrays can be more cache
friendly if the data is not related. It’s important to remember that
this is only true when the data is not meant to be accessed all at once,
as one advocate of the data-oriented design movement assumed that
structures of arrays were intrinsically cache friendly, then put the
x,y, and z coordinates in separate arrays of floats. The reason that
this is not cache friendly should be relatively easy to spot. If you
need to access the x,y, or z of an element in an array, then you more
than likely need to access the other two axes as well. This means
that for every element he would have been loading three cache-lines
of float data, not one. This is why it is important to think about
where the data is coming from, how it is related, and how it will
be used. Data-oriented design is not just a set of simple rules to
convert from one style to another.
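The rule of thumb can be made concrete: fields that are always read together stay interleaved, while a field that is processed on its own earns its own array. A sketch with invented fields:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position components are read as a unit, so they stay interleaved:
// one cache line brings in whole elements, not one axis of many.
struct Particle3 { float x, y, z; };

struct Particles {
    std::vector<Particle3> position; // always accessed together
    std::vector<float>     age;      // scanned alone when expiring
};

// An expiry pass touches only the packed age array, never positions.
size_t countExpired(const Particles& ps, float maxAge) {
    size_t n = 0;
    for (float a : ps.age)
        n += (a > maxAge);
    return n;
}
```

Splitting x, y, and z into three arrays would triple the cache lines loaded by any pass that reads positions whole, which is exactly the mistake described above.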
If you use dynamic arrays, and you need to delete elements from
them, and these tables refer to each other through some IDs, then
you may need a way to splice the tables together in order to process
them. If the tables are sorted by the same value, then it can be
written out as a simple merge operation, such as in Listing 10.1.
This works as long as the == operator knows about the table
types and can find the specific column to check against, and as
long as the tables are sorted based on this same column. But what
about the case where the tables are zipped together without being
sorted by the same columns? For example, if you have a lot of
whether your tables are big1 or not. If your tables are too big to use
such a trivial join, then you will need an alternative strategy.
You can join by iteration, or you can join by lookup2, or you can
even join once and keep a join cache around.
The first thing we could do is use the ability to have tables sorted in
multiple ways at the same time. Though this seems impossible, it's
perfectly feasible to add auxiliary data that will allow for traversal of
a table in a different order. We do this the same way databases allow
for any number of indexes into a table. Each index is created and
kept up to date as the table is modified. In our case, we implement
each index the way we need to. Maybe some tables are written to
in bursts and an insertion sort would be slow; it might be better to
1 dependent on the target hardware, how many rows and columns, and
whether you want the process to run without trashing too much cache
2 often a lookup join is called a join by hash, but as we know our data, we
can use better row search algorithms than a hash when they are available
sort on first read. In other cases, the sorting might be better done
on write, as the writes are infrequent, or always interleaved with
reads.
Concatenation trees provide a quick way to traverse a list. Conc-
trees usually are only minimally slower than a linear array due to
the nature of the structure. A conc-tree is a high level structure
that points to a low level structure, and many elements can pass
through a process before the next leaf needs to be loaded. The code
for a conc-tree doesn’t remain in memory all the time like other list
handling code, as the list offloads to an array iteration whenever it
is able. This alone means that sparse conc-trees end up spending
little time in their own code, and offer the benefit of not having to
rebuild when an element goes missing from the middle of the array.
In addition to using concatenation trees to provide a standard
iterator for a constantly modified data store, it can also be used as a
way of storing multiple views into data. For example, perhaps there
is a set of tables that are all the same data, and they all need to be
processed, but they are stored as different tables for some reason,
such as what team they are on. Using the same conc-tree code,
they can be iterated as a full collection with any code that accepts
a conc-tree instead of an array iterator.
Modifying a class can be difficult, especially at runtime when
you can’t affect the offsets in running code without at least stopping
the process and updating the executable. Adding new elements to
a class can be very useful, and in languages that allow it, such as
Ruby and Python, and even JavaScript, it can be used as a substitute
for virtual functions and composition. In these languages we
can add new functions and new data, and use them at runtime. In
C++, we cannot change existing classes, unless they are defined by
some other mechanism than the compile time structure syntax. We
can add elements if we define a class by its schema, a data driven
representation of the elements in the class. The benefit of this is
that schema can include more, or less, than normal given a new con-
text. For example, in the case of some merging where the merging
happens on two different axes, there could be two different schema
10.2 Transforms
Taking the concept of schemas another step, a static schema defini-
tion can allow for a different approach to iterators. Instead of iterat-
ing over a container, giving access to an element, a schema iterator
can become an accessor for a set of tables, meaning the merging work
can be done during iteration, generating a context upon which the
transform operates. This would benefit large, complex merges that
do little with the data, as there would be less memory usage creat-
ing temporary tables. It would not benefit complex transforms as it
would reduce the likelihood that the next set of data is in memory
ready for the next cycle.
For large jobs, a smarter iterator will help in task stealing, the
concept of taking work away from a process that is already running
in order to finish the job faster. A scheduler, or job management
system, built for such situations, would monitor how tasks were
progressing and split remaining work amongst any idle processors.
Sometimes this will happen because other processes took less time
to finish than expected, sometimes because a single task just takes
longer than expected. Whatever the reason, a transform based de-
sign makes task stealing much simpler than the standard sequential
One trick found in most updating hierarchies is the dirty bit, the
flag that says whether the child or parent members of a tree imply
that this object needs updating. When traversing the hierarchy, this
dirty bit causes branching based on data that has only just loaded,
usually meaning there is no chance to guess the outcome and thus
in most cases, causes a pipeline flush and an instruction lookup.
If your calculation is expensive, then you might not want to go
the route that renderer engines now use. In render engines, it’s
cheaper to do every scene matrix concatenation every frame than it
is to do only the necessary ones and figure out which they are.
For example, in the GCAP 2009 Pitfalls of Object Oriented
Programming presentation by Tony Albrecht, in the early slides
he declares that checking a dirty flag is less useful than not checking
it: if the check fails (the case where the object is not dirty), the
calculation that would have taken 12 cycles is dwarfed by the cost
of a branch misprediction (23-24 cycles).
If your calculation is expensive, you don’t want to bog down the
game with a large number of checks to see if the value needs updating.
This is the point at which existence-based processing comes
into its own again, as existence in the dirty table implies that an
element needs updating, and as a dirty element is updated, it can
push new dirty elements onto the end of the table, even prefetching
if that can improve bandwidth.
A    B    C    D
|    |    |    |
a    ab   c    cd
|    |    |    |
a    ab   abc  abcd
Then once you have the last element, backfill all the other el-
ements you didn’t finish on your way to making the last element.
When you come to write this in code, you will find that these back
filled values can be done in parallel while making the longest chain.
They have no dependency on the final value so can be given over to
another process, or managed by some clever use of SIMD.
a    ab   c    abcd
|    |    |    |
a    ab   abc  abcd
Also, for cases where the entity count can rise and fall, you need
a way of adding and deleting without causing any hiccups. For this,
if you intend to transform your data in place, you need to handle
the case where one thread can be reading and using the data that
3 This would be from false sharing, or even real sharing in the case of different
bits in the same byte
4 Take a look at the section headed The Massively Vectorized Virtual Machine
Chapter 11
Concurrency
simple ground rules that provide very stable tools for developing
truly concurrent software.
int shared = 0;
void foo() {
int a = shared;
a += RunSomeCalculation();
shared = a;
}
Protecting the read-modify-write with a mutex turns it into:
int shared = 0;
Mutex sharedMutex;

void foo() {
    sharedMutex.acquire();      // no other thread may enter until release
    int a = shared;
    a += RunSomeCalculation();
    shared = a;
    sharedMutex.release();
}
And now it works. No matter what, this function will now always finish its task without some other process damaging its data. Every time one of the hardware threads encounters this code, it stops all processing until the mutex is acquired. Once it’s acquired, no other hardware thread can enter into these instructions until the current thread releases the mutex at the far end.
Every time a thread-safe function uses a mutex section, the whole machine stops to do just one thing. Every time you do stuff inside a mutex, you make the code thread-safe by making it serial. Every time you use a mutex, you make your code run badly on infinite core machines.
Thread-safe, therefore, is another way of saying: not concurrent, but won’t break anything. Concurrency is when multiple threads are doing their thing without any mutex calls, semaphores, or other forms of serialisation of tasks. Concurrent means at the same time. A lot of the problems that are solved by academics using thread-safety to develop their multi-threaded applications needn’t be mutex bound. There are many ways to skin a cat, and many ways to avoid a mutex. Mutexes are only necessary when more than one thread shares write privileges on a piece of memory. If you can redesign your algorithms so they only ever require one thread to be given write privilege, then you can work towards a fully concurrent system.
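As a sketch of that single-writer redesign (the function name and worker count are illustrative, and at least one worker is assumed), the shared-counter race from the earlier listing disappears if each thread is the sole writer of its own output slot and the combine happens only after the join:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each worker thread is the sole writer of its own slot in `partials`,
// so no memory is ever shared for writing and no mutex is required.
// The combine step runs after all writers have joined, when ownership
// of every slot has returned to the calling thread. workers > 0.
long long sum_without_mutex(const std::vector<int>& data, unsigned workers) {
    std::vector<long long> partials(workers, 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;              // thread-local accumulator
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partials[w] = local;              // sole writer of this slot
        });
    }
    for (std::thread& t : threads) t.join();
    return std::accumulate(partials.begin(), partials.end(), 0LL);
}
```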
Ownership is key to developing most concurrent algorithms. Con-
currency only happens when the code cannot be in a bad state, not
because it checks before doing work, but because the design is such
that no process can interfere with another in any way.
that can be compared with the write head to find out if there are any waiting elements. Given this, a gathering gateway or reduce gateway can be made by cycling through all known gateways and popping any data from the gateway read heads. This is a fully concurrent technique and would be just as at home in variable CPU timing solutions as it is in standard programming practices, as it implies a consistent state through ownership of the respective parts. A write will stall until the read has allowed space in the queue. A read will either return “no data” or stall until the write head shows that there is something more to read.
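A minimal sketch of such a gateway (the name and fixed capacity are illustrative, and exactly one producer and one consumer are assumed): each head has exactly one writer, so no mutex or other serialisation is needed.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer, single-consumer gateway. The producer owns the
// write head, the consumer owns the read head; each side only ever
// reads the other's head to test for space or waiting elements.
template <typename T, std::size_t N>
struct Gateway {
    T buffer[N];
    std::atomic<std::size_t> write_head{0};
    std::atomic<std::size_t> read_head{0};

    bool try_push(const T& value) {            // producer side only
        const std::size_t w = write_head.load(std::memory_order_relaxed);
        const std::size_t r = read_head.load(std::memory_order_acquire);
        if (w - r == N) return false;          // full: caller may stall/retry
        buffer[w % N] = value;
        write_head.store(w + 1, std::memory_order_release);
        return true;
    }
    bool try_pop(T& out) {                     // consumer side only
        const std::size_t r = read_head.load(std::memory_order_relaxed);
        const std::size_t w = write_head.load(std::memory_order_acquire);
        if (w == r) return false;              // "no data"
        out = buffer[r % N];
        read_head.store(r + 1, std::memory_order_release);
        return true;
    }
};
```

The acquire/release pairing ensures the consumer sees a fully written element before it sees the advanced write head, and vice versa for freed slots.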
When it comes to combining all the data into a final output, sometimes it’s not worth recombining it into a different shape, in which case we can use conc-trees, a technique that allows for cache-friendly transforming while maintaining most of the efficiency of contiguous arrays. Conc-trees are a tree representation of data that almost singlehandedly allows for cache-oblivious efficient storage of arbitrary lists of data. Cache-oblivious algorithms don’t need to know the size of the cache in order to fully utilise it, and normally consist of some kind of divide-and-conquer core algorithm; conc-trees do this by representing either the null set, a concatenation, or some data. If we use conc-trees with output from transforms, we can optimise the data that is being concatenated so it fits well in cache lines, and we can also do the same with conc-tree concatenate nodes, making them concatenate more than just two nodes, thus saving space and memory accesses. We can tune these structures per platform or by data type.
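A minimal sketch of such a structure (names are illustrative; a tuned version would size leaf chunks to cache lines and balance the tree): a node is either empty, a chunk of data, or a k-ary concatenation, so appending a whole chunk never copies existing data.

```cpp
#include <memory>
#include <vector>

// Conc-tree sketch: a leaf holds a contiguous chunk of elements, an
// inner node holds a concatenation of children (more than two, as
// suggested above). Concatenation is O(1); iteration still walks
// mostly contiguous arrays.
struct Conc {
    std::vector<int> chunk;                   // leaf data (empty for inner nodes)
    std::vector<std::shared_ptr<Conc>> kids;  // k-ary concatenation
};

std::shared_ptr<Conc> leaf(std::vector<int> data) {
    auto n = std::make_shared<Conc>();
    n->chunk = std::move(data);
    return n;
}
std::shared_ptr<Conc> concat(std::vector<std::shared_ptr<Conc>> kids) {
    auto n = std::make_shared<Conc>();
    n->kids = std::move(kids);
    return n;
}
// Flattening visits each contiguous chunk in order.
void flatten(const Conc& n, std::vector<int>& out) {
    out.insert(out.end(), n.chunk.begin(), n.chunk.end());
    for (const auto& k : n.kids) flatten(*k, out);
}
```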
Given that we have a friendly structure for storing data, and that these can be built as concatenations rather than some data soup, we have the basis for a nicely concurrent yet associative transform system. If we don’t need to keep the associative nature of the transform, then we can optimise further, but as it comes at little to no cost, and most reduce operations rely on associativity, it’s good that we have that option.
Moving away from transforms, there is the issue of now that
crops up whenever we talk about concurrent hardware. Sometimes,
when you are continually updating data in real time, not per frame,
but actual real time, there is no safe time to get the data, or process
it. Truly concurrent data analysis has to handle reading data that
has literally only just arrived, or is in the process of being retired.
Data-oriented development helps us find a solution to this by not
having a concept of now but a concept only of the data. If you are
writing a network game, and you have a lot of messages coming in
about a player, and their killers and victims, then to get accurate
information about their state, you have to wait until they are already
dead. With a data-oriented approach, you only try to generate the
information that is needed, when it’s needed. This saves trying to
analyse packets to build some predicted state when the player is
definitely not interesting, and gives more up to date information as
it doesn’t have to wait until the object representing the data has
been updated before any of the most recent data can be seen or
acted on.
The idea of now is present only in systems that are serial. There
are multiple program counters when you have multiple cores, and
when you have thousands of cores, there are thousands of different
nows. Humans think of things happening at a certain time, but
sometimes you can take so long doing something that more data has
arrived by the time you’re finished that you should probably go back
and try again.
Take for example the idea of a game that tries to get the lowest
possible latency between player control pad and avatar reaction. In
a lot of games, you have to put up with the code reading the pad
state, the pad state being used to adjust animations, the animations
adjusting the renderables, the rendering system rastering and the
raster system swapping buffers. In most games it takes at least
three frames
Chapter 12

In Practice

12.1 Data-manipulation

12.1.1 The Cube
In 22cans’ first experiment, there was a call to handle a large amount of traffic from a large number of clients, potentially completely fragmented, and yet also completely synchronised. The plan was to develop a service capable of handling a large number of incoming packets of update data while also compiling it into compressed output data that could be sent to the client, all with a minimal turnaround time.
The system was developed in C++11 on Linux, and followed the
basic tenets of data-oriented design from day one. The data was
loaded, but not given true context. Manipulations on the data were
procedures that operated on the data given other data. This was
benefit you get from optimisations when your main bottleneck is the VU and GS.
The same was true here, we had a simple bottleneck of the size of the data and
the operations to run on it. Optimisations had minimal impact on processing
speed, but they did impact our ability to debug the service when it did crash.
been much more efficient, not just saving CPU cycles, but also allowing for platform-specific optimisations at run time by having the renderer analyse the current set of renderables and organise the whole list of jobs by what caused the fewest program changes, texture changes, constant changes, and primitive render calls. The system seemed too good to be true, and though we tried to write the system, Broadsword never finished it. After Broadsword dissolved, when I did finish it, I didn’t have access to a console2, so I was only able to test out the performance on a PC. As expected, the new system was faster, but only marginally. With hindsight, I now see that the engine was only marginally more efficient because the tests were being run on a system that would only see marginal improvements, but even the x86 architecture saw improvements, which can be explained away as slightly better control flow and generally better cache utilisation.
First let me explain the old system.
All renderables came from a scene graph. There could be mul-
tiple scene graph renders per frame, which is how the 2D and 3D
layers of the game were composited, and each of them could use
any viewport, but most of them just used the fullscreen viewport as
the engine hadn’t been used for a split screen game, and thus the
code was probably not working for viewports anyway. Each scene
graph render to viewport would walk the scene3 collecting trans-
forms, materials, colour tints, and meshes, to render, at which point
the node that contained the mesh would push a render element onto
the queue for rendering into that viewport. This queueing up of
elements to render seemed simple and elegant at the time. It meant
that programmers could quickly build up a scene and render it, most
of the time using helpers that loaded up the assets and generated the
right nodes for setting textures, transforms, shader constants, and
meshes. These rendering queues were then sorted before rendering.
2 The target platforms of the engine included the Playstation 2 and the Nin-
Sorted by material only for solid textures, and sorted back to front
for alpha blended materials. This was the old days of fixed function
pipelines and only minimal shader support on our target platforms.
Once the rendering was done, all the calculated combinations were
thrown away. This meant that for everything that was rendered,
there was definitely a complete walk of the scene graph.
The new system, which was born before we were aware of data-oriented design but was definitely born of looking at the data, was different in that it no longer required walking the scene graph. We wanted to maintain the same programmer-friendly front-end API, so we maintained the facade of a scene graph walk, but instead of walking the graph, we only added a new element to the rendering when the node was added, and only removed it when it was removed from the scene graph. This meant we had a lot of special code that looked for multiple elements registered in multiple viewport lists but, other than that, a render merely looked up into the particular node it cared about, gathered the latest data, and processed it, pulling data when it required it rather than being pushed things it didn’t even care about.
The new system benefited from being a simple list of pointers from which to fetch data (the concept of a dirty transform was removed; all transforms were considered to be dirty every frame), and the computation was simplified for sorting, as all the elements that were solid were already sorted from the previous render, and all the alpha blended elements were sorted because they belonged to the set of alpha blended elements. This lack of options accounted for some saved time, but the biggest saving probably came from the way the data was being gathered per frame rather than being generated from an incoherent tree. A tree that, to traverse, required many pointer lookups into virtual tables, as all the nodes were base classed to a base node type and all update and render calls were virtual, causing many misses all the way through each of the tree walks.
In the end the engine was completely abandoned as Broadsword dissolved, but that was the first time we had taken an existing codebase and (though inadvertently) converted it to data-oriented.
many different queries to run over the core data, and allowed for anyone to add another new query easily because there was no data-hiding getting in the way of any new kind of access. Object-oriented approaches to this kind of data handling often provide a gateway to the data but marshall it so that the queries become heavyweight and consistently targeted for optimisation. With a very simple data structure and open access to the data in any form, any new query could be as easily optimised as any existing one.
If the first change was a success, then the second was a major
success. Instead of keeping track of when a weapon was ready to
fire again, the time that it would be available was inserted into a
sorted list of recharge times. Only the head of the list was checked
every global update, meaning that a large number of weapons could
be recharging at once and none of them caused any data access until
they were very nearly or actually ready.
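A sketch of that second change (names are illustrative; a min-heap stands in for the sorted list, since both keep the earliest recharge time at the head): each firing inserts a ready time, and the global update touches only the head, so a thousand recharging weapons cost one comparison per update until one is actually due.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// (time ready, weapon id) pairs, ordered so the earliest time is on top.
using Recharge = std::pair<double, int>;
using RechargeQueue =
    std::priority_queue<Recharge, std::vector<Recharge>, std::greater<Recharge>>;

// Pop every weapon whose recharge time has passed; only the head is
// ever inspected, so nothing recharging in the future costs any access.
std::vector<int> pop_ready(RechargeQueue& q, double now) {
    std::vector<int> ready;
    while (!q.empty() && q.top().first <= now) {
        ready.push_back(q.top().second);
        q.pop();
    }
    return ready;
}
```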
solution is to give the profiler and optimiser a better chance by leaving them
less to do
8 http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy/
Chapter 13

Maintenance and Reuse
13.1 Debugging
The prime causes of bugs are the unexpected side effects of a trans-
form, or an unexpected corner case where a conditional didn’t return
the correct value. In object-oriented programming, this can manifest
in many ways, from an exception caused by de-referencing a null, to
ignoring the interactions of the player because the game logic hadn’t
noticed it was meant to be interactive.
13.1.1 Lifetimes
One of the most common causes of the null dereference is when an object’s lifetime is handled by a separate object to the one manipulating it. For example, if you are playing a game where the badguys can die, you have to be careful to update all the objects that are using them whenever a badguy gets deleted, otherwise you can end up dereferencing invalid memory, which can lead to dereferencing null pointers because the class has destructed. Data-oriented development tends towards this being impossible, as the existence of an entity in a table implies its processability, and if you leave part of an entity around in a table, you haven’t deleted the entity fully. This is a different kind of bug, but it’s not a crash bug, and it’s easier to find and kill, as it’s just a matter of making sure that when an entity is destroyed, all the tables it can be part of destroy their elements too.
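A sketch of that lifetime rule (table names are illustrative): entities exist only as rows keyed by id, so destruction is erasing the id from every table it can appear in, and nothing ever holds a pointer that could dangle.

```cpp
#include <map>
#include <vector>

using EntityId = int;

// An entity "exists" only as rows keyed by its id; existence in a
// table implies processability. Destroying it erases the id from
// every table so no partial entity can be left behind.
struct World {
    std::map<EntityId, float> healths;
    std::map<EntityId, float> positions;
    std::vector<std::map<EntityId, float>*> all_tables{&healths, &positions};

    void destroy(EntityId id) {
        for (auto* table : all_tables)
            table->erase(id);
    }
    bool exists(EntityId id) const {
        return healths.count(id) || positions.count(id);
    }
};
```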
13.2 Reusability
A feature commonly cited by the object-oriented developers that
seems to be missing from data-oriented development is reusability.
The idea that you won’t be able to take already written libraries
reusable on the outside, the fact is that it maintains the same amount of information in a simpler form, so it’s more reusable, as it doesn’t carry the baggage of related data or functions like object-oriented programming, and doesn’t require complex transforms to generate the input and extract from the output like procedural programming tends to generate due to the normalising.
Duck typing, not normally available in object-oriented program-
ming due to a stricter set of rules on how to interface between data,
can be implemented with templates to great effect, turning code that
might not be obviously reusable into a simple strategy, or a sequence
of transforms that can be applied to data or structures of any type,
as long as they maintain a naming convention.
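As a sketch (the structs and member names are illustrative), a template transform compiles against any row type that follows the naming convention, with no shared base class or interface required:

```cpp
#include <vector>

// Two unrelated row types that happen to share the `pos`/`vel`
// naming convention; they have no common base class.
struct Particle { float pos, vel; };
struct Door     { float pos, vel; int material; };

// Duck typing via templates: the transform works for any Row type
// providing `pos` and `vel` members.
template <typename Row>
void integrate(std::vector<Row>& rows, float dt) {
    for (Row& r : rows)
        r.pos += r.vel * dt;
}
```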
The object-oriented C++ idea of reusability is a mixture of in-
formation and architecture. Developing from a data-oriented trans-
form centric viewpoint, architecture just seems like a lot of fluff code.
The only good architecture that’s worth saving is the actualisation
of data-flow and transform. There are situations where an object-
oriented module can be used again, but they are few and far between
because of the inherent difficulty interfacing object-oriented projects
with each other.
The most reusable object-oriented code appears as interfaces to
agents into a much more complex system. The best example of an
object-oriented approach that made everything easier to handle, was
highly reusable, and was fully encapsulated was the FILE type from
stdio.h that is used as an agent into whatever the platform and
OS would need to open, access, write, and read to and from a file
on the system.
into the same object and requiring complex setup state before a
test can be carried out, has given unit testing a stunted start in
games as object-oriented programming caused simple tests to be
hard to write. Making tests is further complicated by the addition
of the non-obvious nature of how objects are transformed when they
represent entities in a game world. It can be very hard to write
unit tests unless you’ve been working with them for a while, and
the main point of unit tests is that someone that doesn’t fully grok
the system can make changes without falling foul of making things
worse.
13.4 Refactoring
During refactoring, it’s always important to know that you’ve not broken anything by changing the code. Allowing for such simple unit testing gets you halfway there. Another advantage of data-oriented development is that, at every turn, it peels away the unnecessary elements, so you might find that refactoring is more a case of switching out the order of transforms than changing how things are represented. Refactoring normally involves some new data representation, but as long as you normalise your data, there’s going to be little need of that. When it is needed, tools for converting from one schema to another can be written once and used many times.
Chapter 14
Design Patterns
14.2.1 A to B transform
A stateless function that takes elements from one container and pro-
duces outputs into another container. All variables are constant over
the lifetime of the transform.
The normal way of working, and the best way to guarantee that your output and input tables don’t clash, is to create a new table and transform into it. The A to B transform is also the basis of double buffered state and provides a powerful mechanism for enabling concurrent processing, as you can ensure that A is read-only and B is write-only for the whole time slice. The A to B transform is the transform performed by shaders and other compute kernels, and the same conditions apply: constants do not change, local variables only, globals are considered to be constants, and data can be processed in any order and without any communication to any other element in the transform.
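A minimal A to B transform might look like this (row type and names are illustrative): table A is read-only, table B is write-only, the remaining inputs are constants, and rows are independent, so the compute-kernel contract above holds.

```cpp
#include <vector>

struct PosRow { float x, y; };

// A to B transform: reads table A, writes a fresh table B. Each row
// is processed independently, in any order, with no communication
// between elements; dx/dy are constants for the whole time slice.
std::vector<PosRow> move_all(const std::vector<PosRow>& a,  // A: read only
                             float dx, float dy) {          // constants
    std::vector<PosRow> b;                                  // B: write only
    b.reserve(a.size());
    for (const PosRow& r : a)
        b.push_back({r.x + dx, r.y + dy});
    return b;
}
```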
A to B transforms aren’t expensive unless the row counts change, and even then you can mitigate those more expensive transforms with conc-trees and other in-place transforms that reduce in the case of filtering transforms. There are two things to consider when transforming from one table to another: is this a transform to create new state or to update existing state, and is this transform going to significantly change the values before they hit the output table?
The simplest transform is an update. This allows for concur-
rent state updates across multiple associated systems without cou-
pling. The condition table transform is generally used to signif-
icantly change the values, putting them into a new table with a
different schema, ready for decision transforms. Sometimes a trans-
form will do more than one of these, such as when an update also
emits rows into an event table ready for subscribing entities to react
to a state change on the data.
An important requirement of the transform is that it is stateless.
Transforms operate on data, use constants to adjust how they trans-
form that data, and output in a standardised manner. Some amount
back once at most for each transform, otherwise aliasing rules can bite: for example, when writing to a local accumulator whose type is the same as a type being read from, aliasing rules may require your accumulator to be re-read from memory every time it is updated.
One way around this is to mark all your pointers as restrict, thus ensuring that the compiler will assume it’s safe to use the last value without reloading, but this might be vendor-specific, so test by looking at the assembly produced before declaring that restrict will fix the problems inherent in accumulators or other load-modify-store values that may be used in in-place transforms.
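As a sketch of the accumulator case (assuming the common `__restrict` extension spelling; standard C spells it `restrict`, and its effect varies by compiler, so check the assembly as suggested above):

```cpp
// Without restrict, the compiler must assume `out` might alias `in`,
// so each store to out[i] could invalidate its cached view of in[]
// and force reloads. With both pointers restricted, the accumulator
// can live in a register for the whole loop.
void scale_sum(const float* __restrict in, float* __restrict out,
               int n, float k) {
    float acc = 0.0f;              // load-modify-store value
    for (int i = 0; i < n; ++i) {
        acc += in[i] * k;          // safe to keep acc in a register:
        out[i] = acc;              // this store cannot alias `in`
    }
}
```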
Another thing to mention is that for both the in-place transform
and the A to B transform, using a structure’s static member is just
as invalid as a global variable.
14.2.6 Gatherer
A method by which large tasks can be split off into multiple parallel
tasks and can then be reduced back into one container for further
processing.
When you run concurrent processes, but you require the output
to be presented in one table, there is a final stage of any transform
that turns all the different outputs into one final output. You can
reduce the amount of work and memory taken up by using the con-
catenate function and generate a table that is a conc-tree, or you
can use a gather transform. A gather transform has as many inputs
as you have parallel tasks leading into the final output table, and
each input is a queue of data waiting to enter the final transform.
As long as the data is not meant to be sorted, then the gatherer can run while the concurrent transforms are running, peeling the transformed rows out of the processes’ independent output queues and dropping them into the final output table. This allows for completely lockless parallelism of many tasks that don’t require the data to be in any particular order at the final stage.
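A sketch of the unsorted gather (std::deque stands in here for each task's independent output queue; a real version would use the single-writer gateways described in the previous chapter): the gatherer cycles over the queues and peels off whatever has arrived.

```cpp
#include <deque>
#include <vector>

// Cycle over each task's independent output queue and drop whatever
// is waiting into the final table. The gatherer is the sole reader
// of every queue; row order is simply arrival order.
void gather(std::vector<std::deque<int>>& queues,
            std::vector<int>& final_table) {
    for (auto& q : queues) {            // one pass over all known queues
        while (!q.empty()) {            // peel off everything waiting
            final_table.push_back(q.front());
            q.pop_front();
        }
    }
}
```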
If the gatherer is meant to produce the final table in the same order as the input table, then you will need to pre-parse the data or have the gatherer use a more complex collating technique. When you pre-parse the data, you can produce a count that can be used to initialise the output positions, making the output table safe for the processing transforms to write directly into it, but if there is no way to pre-parse the data, then the gatherer might need to co-operate with the task scheduler so that the processes are biased to return earlier work faster, making final collation a just-in-time process for the transfer to the next transform.
14.2.7 Tasker
When a container is very large, break off pieces and run transforms
as concurrent tasks.
When you have a single large table, you can transform it with one transform, or many. Using a tasker to fake multiple independent tables allows for concurrent transforming, as each transform can handle its specific part of the input table without having to consider any of the other parts in any way.
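A minimal tasker sketch (the transform and task count are illustrative, with at least one task assumed): disjoint [begin, end) ranges fake the independent tables, so each element has exactly one writer and no synchronisation is needed beyond the final join.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Treat one large table as several independent sub-tables by handing
// each worker a disjoint range. Ranges never overlap, so the
// transforms run fully concurrently. tasks > 0.
void transform_chunked(std::vector<float>& table, unsigned tasks) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (table.size() + tasks - 1) / tasks;
    for (unsigned t = 0; t < tasks; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(table.size(), begin + chunk);
        workers.emplace_back([&table, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                table[i] *= 2.0f;        // the transform, on this slice only
        });
    }
    for (std::thread& w : workers) w.join();
}
```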
14.2.8 Dispatcher
Rather than switch which instructions to run based on data in the
stream, prefer producing new tables that can each be transformed
more precisely.
When operating on data, sometimes the transform is dependent on the data. Normally any enumerable types or booleans that imply a state that can change what transform is meant to run are hidden behind existence-based processing tables, but sometimes it’s good to store the bool or enum explicitly for size’s sake. When that is the case, it can be beneficial to pre-process a table and generate a job table: a table with a row for each transform that is meant to be run on the data.
A dispatcher takes an input table and emits into multiple tables based on the data. This allows for rapid execution of the tasks without the overhead of instruction cache misses due to changing what instructions are going to be run based on the data, as all the instructions are collected into each of the job tables.
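A dispatcher sketch (the enum, struct, and table names are illustrative): the only data-dependent branch happens once in the pre-pass, after which each job table can be processed with straight-line code and a warm instruction cache.

```cpp
#include <vector>

enum class State { Walking, Flying };
struct Entity { State state; float speed; };

// One job table per transform that might run.
struct JobTables {
    std::vector<Entity> walking;
    std::vector<Entity> flying;
};

// Pre-pass: split the input table by the stored enum, so the per-state
// transforms that follow never branch on data.
JobTables dispatch(const std::vector<Entity>& in) {
    JobTables jobs;
    for (const Entity& e : in)         // the only data-dependent branch
        (e.state == State::Walking ? jobs.walking : jobs.flying).push_back(e);
    return jobs;
}
```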
the second is the type requested for creation. This inherently object-
oriented design problem stems from the desire to reuse code to drive
object creation, but as C++ stops you from using objects by their
features unless already declared in some way, it is hard to retroac-
tively apply old creation techniques to new types. Normally we set
up our classes by inheriting from an abstract base class so we can
later provide new classes under new creation objects. The abstract
factory is the eventual extension of allowing the creation function to
be chosen at runtime based on the type of object required (which
is statically checked) and the scheme or style of the object (which
is dynamically calculated by the abstract factory), so choosing the
right concrete factory object to create your new object.
When writing component entity systems, this is the equivalent
of having components being inserted into an entity based on two
dimensional data. The fact that a component of a certain type needs
creating (the create call) is emulated by the row in the component
create table, and the abstract factory calling into a concrete factory
is emulated by the value causing the create call to be dispatched to an
appropriate final factory table. Only once it has been dispatched to
the right table does it get inserted into the final concrete component
table.
This pattern is useful for multi-platform object-oriented devel-
opment where you might have to otherwise refactor the internals
of your object creators, but with a component driven approach, you
can switch your output tables, transforms, or complete tree of trans-
forms without losing consistency where it is necessary.
14.3.2 Builder
The builder pattern is all about wrapping up collections of reusable
low-level parts of a complex transform in order to provide a tool kit
for macro scale programming within a problem domain. The builder
pattern is used when you want to be able to switch which tool-kits
are used, thus providing a way of giving a high level program the
ability to direct the creation of objects through the interface of a
collection of tools.
This can be seen to parallel the tool-kit of transforms you tend to build when developing with tables. With a data-oriented functional programming approach, we don’t immediately call builders, but instead generate the necessary build instructions, which can then be dispatched to the correct parallel processes where possible before being reduced into the final data. Usually the builders are stateless transforms that convert the request to build into data outputs, or take a small amount of input data and build it into larger or translated output.
Once you work out that the builder pattern is mostly a rewording
of procedural reuse with the addition of switching which procedures
are called based on a state in a larger scope, it is easy to see why
the pattern is applicable in all programming paradigms.
14.3.4 Prototype
The prototype pattern makes a lot of sense in an object-oriented code base: the ability to generate more class instances from another instance, only modifying it slightly before releasing it as finished, almost gives you the flexibility to generate new classes at run-time. But any flexibility still has to be built into the classes before they can be used, as the classes can only inherit from types existing at compile time. Some object-oriented languages do let you generate new features at run-time, but C++ doesn’t allow for it without considerable coercion.
The existence-based-processing way of working stops you from creating a prototype, as its existence would imply that it is in the game. There is no place to hide a prototype. But there is also little call for one, as creating from a prototype is about creating from a plan of sorts, and because all the component creation can be controlled via read-only or just-in-time mementos, it’s very easy to make new classes with their components defined and specialised without resorting to hacks like keeping in memory a runtime template for a class you might need later. Unlike the object-oriented prototype, the component system memento does not have all the features of the final component set; it only has the features of a creation data-type and thus fits its purpose more purely. It cannot accidentally be used and damaged in the way prototypes can: for example, prototypes can be damaged by global update functions that accidentally modify their state. Because the only difference between a prototype and the runtime instance is whether the program knows that they are prototypes, it’s hard to secure them against unwanted messages.
14.3.5 Singleton
There is no reason to think that a single instance of a piece of data has any place in a data-oriented development. If the data is definitely unique, then it deserves to be exposed as a system-wide global. It is only when globals are used for temporary storage that they become dangerous. Singletons are an inherently object-oriented solution to a problem that only persists if you truly want to write platform-ignorant code. Initialising and deinitialising managers in a specific order, whether for component managers or for hardware or dependency sake, can just as easily be managed explicitly, and is less prone to surprising someone. When you change the initialisation order accidentally by referencing a singleton, that reference won’t be seen as changing the order of initialisation, and this has been seen to cause problems that are quite difficult to debug when the effect of the change in order is subtle enough to not crash immediately.
14.3.6 Adapter
An adapter provides an API wrapper around an otherwise incom-
patible class. Classes have their APIs locked down after they are
created. If two classes that need to work with each other come from
third party software libraries, the likelihood of them being naturally
compatible is low to non-existent. An adapter wraps the callee class
in a caller friendly API. The adapter turns the caller class friendly
API calls into callee class friendly API calls.
When writing programs data-oriented, most data is structs of
arrays or tables. In data-oriented development there is no reason for
an adapter, as in the first case, a transform can easily be made to
take arguments of the specific arrays used, and in the second case,
the transform would normally take two or more schema iterators that
traversed the table providing arguments to and capturing outputs
from the transform. In a sense, every table transform iterator is an
adapter because it converts the explicit form of the table into the
explicit form accepted by the transform.
Finally, data itself has no API, and thus the idea of an adapter is not appropriate, as there is nothing to adapt from, so this relegates the idea to the concept of transforming the data so it can be provided to third-party libraries. Preparing data for external processes is part of working with other code, and is one of the prices you pay for not having full control of the source code. The adapter pattern doesn’t help with this in any way; it only provides a way of describing the inevitable.
14.3.7 Bridge
The bridge pattern is an object-oriented technique for providing en-
capsulation from implementation when the class’s implementation
is the functionality that derived classes use to get by in their world.
For example, the platform-specific code for an object can be privately owned by the parent class, and the concrete class can derive from that class. Then, it can use the implementation of its parent class, but the implementation of the parent class member functions can be indirectly implemented differently depending on the platform. This way, the child class can call its parent’s functions to do what it needs to do, while the parent class can call the implementation’s functions, which can be defined at runtime or compile time depending on what level of polymorphism is required.
Because all platform specific functionality is defined inside trans-
forms and the way in which they are called, there is no reason to
consider this pattern for platform independent reasons. All that is
needed to benefit from the side effect of using this pattern is to recog-
nise the boundary between platform specific transforms and platform
independent transforms. For situations where the parent class imple-
mentation differs from instance to instance, you only need to notice
that components can be different, yet still register with the same
events. This means any differing implementations required can be added as additional components which register into the same event tables as the existing components and replace those on creation as part of the different creation sequence.
Also, transforms that take data and do with it something plat-
form specific would be written statically for each platform. This
kind of object-oriented approach to everything can be what gives
object-oriented development such a bad name when it comes to
performance. There's not really any benefit to making your choice of
implementation a runtime polymorphic call when the call can only
ever be to one concrete method per platform as, at least in C++,
there is no such thing as a platform independent build. For C#, it
can be different, but the benefit of doing it is just the same.
14.3.8 Composite
Unlike it’s programming namesake, the composite pattern is not
about an object being made out of multiple other classes by owning
them in the implementation. The composite pattern is the idea
that an object that is part of a system can be identified as a single
object, or as a group of objects. Usually, a game entity references one
renderable unit with a single AI instance, which is why this pattern
can be useful. If an enemy unit class can represent a single unit, or
a squad of units, then it becomes easier to manage commands in a
real time strategy game, or organise AI path finding.
When you don’t use objects, the level of detail of individual parts
of entities are not so hard tied to each other as the entities are not
objects. When you design your software around objects you get
caught up thinking about each object as being indivisible, and thus
the composite pattern lets you at least handle objects in groups. If
you don’t limit your data by tying it to objects, then the pattern is
less useful as you can already manipulate the data by group in any
way you require.
14.3.9 Decorator
The decorator acts much like adding more components to an entity:
an effect comes online simply because the component exists.
Decorators add features, which is exactly like entity components
with their ability to add attributes and functionality to a referenced
entity.
14.3.10 Façade
Even though there are no tangled messes of interacting objects in
the data-oriented approach, sometimes it might be better to channel
14.3.11 Flyweight
The flyweight pattern offers a way to use very fine grain elements in
the same way it uses larger constructs. In data-oriented development
there is no need to differentiate between the two, as there is no
inherent overhead in interfacing data in a data-oriented manner. The
reason for a flyweight is similar to the reason for the bridge pattern,
once you start thinking object-oriented and want to keep your whole
codebase working like that, you can end up painting yourself into a
corner about it and start adding objects to everything, even things
that are too small or fragile for such a heavy handed, human centric
way of seeing and manipulating the data.
14.3.12 Proxy
In a sense, the proxy pattern is much like the level of detail system
in most games. You don’t load final assets until you are near enough
to need them. This topic was fully covered in chapter 4.
14.3.14 Command
This is realised fully when using tables to initiate your actions. All
actions have to exist as a row in a table at some point, usually as a
condition table row in the first place.
14.3.15 Interpreter
This design pattern doesn’t seem very well suited to Object-oriented
programming at all, though it is rooted in data driven flow control.
The problem with the interpreter is it seems to be the design pattern
for parsers, but doesn’t add anything in itself.
14.3.16 Iterator
Although the table iterators that convert table schema into trans-
form arguments are iterators, they don’t fit into the design pattern
because they are a concrete solution to iterating over tables. There
is no scope for using them for arguments to functions or using them
to erase elements and continue, they are stateful, but restricted to
their one job. However, because there are no containers in table
oriented solutions other than the tables, we can safely ignore true
iterators as a design. In a stream based development, iterators are
so common and pure, they don’t really exist in the design pattern
sense.
14.3.17 Mediator
The mediator can be thought of as a set up involving attaching
event transforms to table updates. When a table entry updates,
the callback can then inform the other related tables that they need
to consider some new information. However, this data hiding and
decoupling is not as necessary in data-oriented development as most
solutions are considering all the data and thus don’t need to be
decoupled at this level.
14.3.18 Memento
Stashing a compressed state in a table is very useful when changing
level of detail. Much was made of this in count lodding. Mementos
also provide a mechanism for undoing changes, which can be essential
to low-memory-usage network state prediction systems, so they
can roll back an update and fast forward to the new predicted state.
14.3.20 Observer
The publish and subscribe model presented in the table based event
system is an observer pattern and should be used whenever data
relies on some other data. The observer pattern is high latency, but
provides for callbacks, event handling, and job starting. In most
games there will be an observer registry for the frame sync, the
physics tick, the AI tick and for any IO completion. User defined
callback tables will be everywhere, including how many finite state
machines are implemented.
14.3.21 State
Data controlling the flow of the program is considered an anti-pattern
in data-oriented development, so this design pattern is to
be understood, and never used in the way it is presented. Instead,
have your state be implicitly defined by the table in which your
row currently resides.
14.3.22 Strategy
14.3.23 Template Method
The template method can still be used but overriding the calls by
removing the template callbacks and adding the specific new call-
backs would be how you organise the equivalent of overriding the
virtual calls.
14.3.24 Visitor
The visitor pattern would be fine if it were not for the implicit
allowance for it to be stateful. If the visitor was not allowed to
accumulate during its structure traversal, then it would be an in-place
transform. By not defining it as a stateless entity, but a stateful ob-
ject that walks the container, it becomes an inherently serial, cache
unfriendly container analyser. This pattern in particular is quoted
by people who don’t understand the fundamental requirements of a
transform function. If you talk to an Object-Oriented programmer
about taking a list of data and doing a transform on it, they will
tend to mention the visitor pattern, because there is a tendency
to fit solutions to problems with design patterns. The best way to
define a visitor pattern is an object that changes state based on the
incoming data. The final state is dependent on the order in which
the data is given to the visitor, thus it is a serial operation. The
best analogy to a transform is not compatible with a visitor because
a transform pattern denies the possibility of accumulation.
problem of all the methods in one class, then we know we’re fine as
there are no such boundaries.
14.4.5 Stovepipe
The more that bugs are fixed and corner cases are added, the harder
it can be to refactor an existing Object-oriented codebase. This is
not as true in data-oriented development, as towards the end of a
project the difficulty in refactoring the design should be mitigated
by the understanding and transparency of the whole program.
What’s wrong?
than C++. There are some arguments against the results, but there are
others that back them up. Read them, and make up your own mind.
spawning off sound effects. This seems to be a good trade off, but
it was only a good trade off when these particular instructions were
cheap.
Two out of the four instructions are loads, which don’t seem like
they should cost much, but in actual fact, unless you hit the cache, a
load takes around 600 cycles on modern consoles. The add is cheap,
to modify the register value to address the correct function pointer,
but the branch is not always cheap as it doesn’t know where it’s going
until the second load completes. This could cause a cache miss in the
instruction stream. All in all, it's common to see around 1800 cycles wasted
due to a single virtual call in any significantly large scale game. In
1800 cycles, the floating point unit alone could have finished naively
calculating over fifty dot products, or up to 27 square roots. In the
best case, the virtual table pointer will already be in memory, the
object type the same as last time, so the function pointer address will
be the same, and therefore the function pointer will be in cache too,
and in that circumstance it’s likely that the unconditional branch
won’t stall as the instructions are probably still in the cache too.
But this best case is fairly uncommon.
The implementation of C++ doesn’t like how we iterate over
objects. The standard way of iterating over a set of heterogeneous
objects is to literally do that, grab an iterator and call the virtual
function on each object in turn. In normal game code, this will
involve loading the virtual table pointer for each and every object.
This causes a wait while loading the cache line, and cannot easily
be avoided. Once the virtual table pointer is loaded, it can be used,
with the constant offset (the index of the virtual method), to find the
function pointer to call, however due to the size of virtual functions
commonly found in games development, the table won’t be in the
cache. Naturally, this will cause another wait for load, and once this
load has finished, we can only hope that the object is actually the
same type as the previous element, otherwise we will have to wait
some more for the instructions to load.
The reason virtual functions in games are large is that games
developers have had it drilled into them that virtual functions are
okay, as long as you don’t use them in tight loops, which invariably
leads to them being used for more architectural considerations such
as hierarchies of object types, or classes of solution helpers in tree
like problem solving systems (such as path finding, or behaviour
trees).
In C++, classes’ virtual tables store function pointers by their
class. The alternative is to have a virtual table for each function and
switch function pointer on the type of the calling class. This works
fine in practice and does save some of the overhead as the virtual
table would be the same for all the calls in a single iteration of a
group of objects. However, C++ was designed to allow for runtime
linking to other libraries, libraries with new classes that may inherit
from the existing codebase, and due to this, there had to be a way to
allow a runtime linked class to add new virtual methods, and have
them able to be called from the original running code. If C++ had
gone with function oriented virtual tables, the language would have
had to runtime patch the virtual tables whenever a new library was
linked whether at link time for statically compiled additions, or at
runtime for dynamically linked libraries. As it is, using a virtual
table per class offers the same functionality but doesn’t require any
link time or runtime modification to the virtual tables as the ta-
bles are oriented by the classes, which by the language design are
immutable during link time.
Combining the organisation of virtual tables and the order in
which games tend to call methods, even when running through lists
in a highly predictable manner, cache misses are commonplace. It’s
not just the implementation of classes that causes these cache misses,
it’s any time the instructions to run are controlled by data loaded.
Games commonly implement scripting languages, and these lan-
guages are often interpreted and run on a virtual machine. However
the virtual machine or JIT compiler is implemented, there is always
an aspect of data controlling which instructions will be called next,
and this causes branch misprediction. This is why, in general, inter-
preted languages are slower, they either run code based on loaded
data in the case of bytecode interpreters, or they compile code just
in time, which though it creates faster code, causes cache misses and
thrashing of its own.
When a developer implements an Object-oriented framework
without using the built-in virtual functions, virtual tables and this
pointers present in the C++ language, it doesn't reduce the chance
of cache miss unless they use virtual tables by function rather than
by class. But even when the developer has been especially careful,
the very fact that they are doing Object-oriented programming with
games developer access patterns, that of calling single functions on
arrays of heterogeneous objects, they are still going to have the same
cache misses as found when the virtual table has survived the call
into the object. That is, the best they can hope for is one less cache
miss per virtual call. That still leaves the opportunity for two cache
misses, and an unconditional branch.
So, with all this apparent inefficiency, what makes games devel-
opers stick with Object-oriented coding practices? As games devel-
opers are frequently cited as a source of how the bleeding edge of
computer software development is progressing, why have they not
moved away wholesale from the problem and stopped using Object-
oriented development practices all together?
for not just one target platform, but usually at least two. In the past
it was between console manufacturers, and now even games develop-
ers on the PC have to manage both Windows(tm) and MacOS(tm)
platforms. The abstractions in the past were mostly hardware ac-
cess abstractions, and naturally some gameplay abstractions as well,
but as the game development industry matured, we found common
forms of abstractions for areas such as physics, AI, and even player
control. Finding these common abstractions allowed for third party
libraries, and many of these use Object-oriented design as well. It’s
quite common for libraries to interact with the game through agents.
These agent objects contain their own state data, whether hidden or
publicly accessible, and provide functions by which they can be ma-
nipulated inside the constraints of the system that provided them.
The game design inspired objects (such as ship, player, level) keep
hold of agents and use them to find out what’s going on in their
world. A player object might use their physics agent to find out
where they are going to be, and what is possible given where they
are, and using their input agent, find out the player’s intention, and
in return providing some feedback to their physics agent about the
forces, and information to their animation agent about what ani-
mations they need to play. A non-player character could mediate
between the world physics agent and its AI agent. A car object
could contain references to physics data for keeping it on the road,
and also runtime modifiable rendering information can be chained
to the collision system to show scratches and worse.
The entities in Object-oriented design are containers for data and
a place to keep all the functions that manipulate that data. Don’t
confuse these entities with those of entity systems, as the entities
in Object-oriented design are immutable over their lifetime. An
Object-oriented entity does not change class during its lifetime in
C++ because there is no process by which to reconstruct a class
in place in the language. As can be expected, if you don’t have
the right tools for the job, a good workman works around it. Games
developers don’t change the type of their objects at runtime, instead
they create new and destroy old in the case of a game entity that
with some stubs stuck on it. We can happily classify the same item
but with different dimensions into three distinct classes of object.
Table, coffee table, lump of wood with some little bits of wood on
it. But, at what point does the lump of wood become a coffee table?
Is it somewhere between 4 and 8 inch long legs? This is the same
problem as presented about sand, when does it transition from grains
of sand, to a pile of sand? How many grains are a pile, are a dune?
The answer must be that there is no answer. The answer is also
helpful in understanding how a computer thinks. It doesn’t know
the specific difference between our human classifications because to
a certain degree even humans don’t.
In most games engines, the Object-oriented approach leads to a
lot of objects in very deep hierarchies. A common ancestor chain for
an entity might be: PlayerEntity → CharacterEntity → MovingEn-
tity → PhysicalEntity → Entity → Serialisable → ReferenceCounted
→ Base.
These deep hierarchies virtually guarantee you’ll never have a
cache hit when calling virtual methods, but they also cause a lot
of pain when it comes to cross-cutting code, that is code that af-
fects or is affected by concerns that are unrelated, or incongruous
to the hierarchy. Consider a normal game with characters moving
around a scene. In the scene you will have characters, the world,
possibly some particle effects, lights, some static and some dynamic.
In this scene, all these things need to be rendered, or used for ren-
dering. The traditional approach is to use multiple inheritance or
to make sure that there is a Renderable base class somewhere in
every entity’s inheritance chain. But what about entities that make
noises? Do you add an audio emitter class as well? What about
entities that are serialised vs those that are explicitly managed by
the level? What about those that are so common that they need
a different memory manager (such as the particles), or those that
only optionally have to be rendered (like trash, flowers, or grass in
the distance). Traditionally this has been solved by putting all the
most common functionality into the core base class for everything
in the game, with special exceptions for special circumstances, such
about how the class fulfils that contract, there won’t be any new
bugs introduced by change. In theory, the object implementation
can change in any way as long as the contract is not modified, but
only extended. This is the open closed principle. A class should be
open for extension, but closed for modification.
A contract is meant to provide some guarantees about how a
complex system works. In practice, only unit testing can provide
these guarantees.
Sometimes, programmers unwittingly rely on hidden features of
objects’ implementations. Sometimes the object they rely on has
a bug that just so happens to fit their use case. If that bug is
fixed, then the code using the object no longer works as expected.
The use of the contract, though it was kept intact, has not helped
the other piece of code to maintain working status across revisions.
Instead it provided false hope that the returned values would not
change. It doesn’t even have to be a bug. Temporal couplings inside
objects, or accidental or undocumented features that go away in
later revisions can also damage the code using the contract without
breaking it.
One example would be the case where an object has a method
that returns a list. If the internal representation is such that it
always maintains a sorted list, and when the method is called it
returns some subset of that list, it would be natural to assume that
in most implementations of the method the subset would be sorted
also. A concrete example could be a pickup manager that kept a
list of pickups sorted by name. If the function returns all the pickup
types that match a filter, then the caller could iterate the returned
list until it found the pickup it wanted. To speed things up, it could
early-out if it found a pickup with a name later than the item it was
looking for, or it could do a binary search of the returned list. In
both those cases, if the internal representation changed to something
that wasn’t ordered by name, then the code would no longer work.
If the internal representation was changed so it was ordered by hash,
then the early-out and binary search would be completely broken.
Another example of the contract being too little information, in
away from that through data driven means. No-one has so far cre-
ated a game API, because to do so, it would have to be so generic
that it wouldn’t provide anything more than what we already have
with the languages we use for development.
Reuse, being hankered after by production, and thought of so
highly by anyone without much experience in making games, has
become an end in itself for many games developers. The pitfall of
generics is a focus on keeping a class generic enough to be reused or
re-purposed without thought as to why, or how. The first, the why,
is a major stumbling block and needs to be taught out of developers
as quickly as possible. Making something generic, for the sake of
generality, is not a valid goal. It adds time to development without
adding value. Some developers would cite this as short sighted,
however, it is the how that deflates this argument. How do you
generalise a class if you only use it in one place? The implementation
of a class is testable only so far as it can be tested, and if you
only use a class in one place, you can only test that it works in
one situation. If you then generalise the class, yet don’t have any
other test cases than the first situation, then all you can test is that
you didn’t break the class when generalising it. So, if you cannot
guarantee that the class works for other types or situations, all you
have done by generalising the class is added more code for bugs to
hide in. The resultant bugs are now hidden in code that works,
possibly even tested, which means that any bugs introduced during
this generalising have been stamped and approved.
Test driven development implicitly denies generic coding until
the point where it is a good choice to do so. The only time when
it is a good choice to move code to a more generic state, is when it
reduces redundancy through reuse of common functionality.
Generic code has to fulfil more than just a basic set of features
if it is to be used in many situations. If you write a templated
array container, access to the array through the square bracket op-
erators would be considered a basic feature, but you will also want
to write iterators for it and possibly add an insert routine to take
the headache out of shuffling the array up in memory. Little bugs
can creep in if you rewrite these functions whenever you need them,
and linked lists are notorious for having bugs in quick and dirty im-
plementations. To be fit for use by all users, any generic container
should provide a full set of methods for manipulation, and the STL
does that. There are hundreds of different functions to understand
before you can be considered an STL-expert, and you have to be an
STL-expert before you can be sure you’re writing efficient code with
the STL. There is a large amount of documentation available for
the various implementations of the STL. Most of the implementa-
tions of the STL are very similar if not functionally the same. Even
so, it can take some time for a programmer to become a valuable
STL programmer due to this need to learn another language. The
programmer has to learn a new language, the language of the STL,
with its own nouns, verbs, and adjectives. To limit this, many games
companies have a much reduced feature set reinterpretation of the
STL that optionally provides better memory handling (because of
the awkward hardware), more choice for the containers (so that you
may choose a hash map or trie, rather than just a map), or ex-
plicit implementations of simpler containers such as stack or singly
linked lists and their intrusive brethren. These libraries are normally
smaller in scope and are therefore easier to learn and hack than the
STL variants, but they still need to be learnt and that takes some
time. In the past this was a good compromise, but now the STL has
extensive online documentation, there is no excuse not to use the
STL except where memory or compilation time is concerned.
The takeaway from this however, is that generic code still needs
to be learnt in order for the coder to be efficient, or not cause ac-
cidental performance bottlenecks. If you go with the STL, then at
least you have a lot of documentation on your side. If your game
company implements an amazingly complex template library, don’t
expect any coders to use it until they’ve had enough time to learn it,
and that means that if you write generic code, expect people to not
use it unless they come across it accidentally, or have been explicitly
told to, as they won’t know it’s there, or won’t trust it. In other
words, starting out by writing generic code is a good way to write a
Looking at hardware
The first thing a good software engineer does when starting work
on a new platform is read the contents listings in all the hardware
manuals. The second thing is usually try to get hello world up and
running. It’s uncommon for a games development software engineer
to decide it’s a good idea to read all the documentation available.
When they do, they will be reading them literally, and still probably
not getting all the necessary information. When it comes to
understanding hardware, there are the theoretical restrictions implied by
the comments and data sheets in the manuals, but there are also the
practical restrictions that can only be found through working with
the hardware at an intimate level.
As most of the contemporary hardware is now API driven, with
hardware manuals only being presented to the engineers responsible
for graphics, audio, and media subsystems, it’s tempting to start
programming on a new piece of hardware without thinking about
the hardware at all. Most programmers working on big games in
big studios don’t really know what’s going on at the lower levels of
their game engines they’re working on, and to some extent that’s
probably good as it frees them up to write more gameplay code, but
there comes a point in every developer’s life when they have to bite
the bullet and find out why their code is slow. Some day, you’re
going to be five weeks from ship and need to claw back five frames
a second on one level of the game that has been optimised in every
other area other than yours. When that day comes, you’d better
know why your code is slow, and to do that, you have to know what
the hardware is doing when it’s executing your code.
Some of the issues surrounding code performance are relevant
to all hardware configurations. Some are only pertinent to config-
urations that have caches, or do write combining, or have branch
prediction, but some hardware configurations have very special re-
strictions that can cause odd, but simple to fix performance glitches
caused by decisions made during the chip’s design process. These
glitches are the gotchas of the hardware, and as such, need to be
learnt in order to be avoided.
When it comes to the overall design of console CPUs, the XBox360
and PS3 are RISC based, low memory speed, multi-core machines,
and these do have a set of considerations that remain somewhat mis-
understood by mainstream game developers. Understanding how
these machines differ from the desktop x86 machines that most pro-
grammers start their development life on, can be highly illuminating.
The coming generation of consoles and other devices will change the
hardware considerations again, but understanding that you do need
to consider the hardware can sometimes only be learned by looking
at historic data.
memory the next 127 times you look. For a list of floats, that works
out as one memory load for 32 values. If you have an animation that
has less than 32 keys per bone, then you can guarantee that you will
only need to load as many cache-lines as you have bones in order to
find all the key indexes into your arrays of transforms.
In practice, you will find that this only applies if the process you
run doesn’t load in lots of other data to help process your stream.
That’s not to say that trying to organise your data sequentially isn’t
important, but that it’s just as important to ensure that the data
being accessed is being accessed in patterns that allow the processors
to leverage the benefits of that form. There is no point in making
data sequential if all you are going to do is use it so slowly that the
cache fills up between reads.
Sequential data is also easier to split among different processors
as there is little to no chance of cache sharing. When your data is
stored sequentially, rather than randomly, you know where in mem-
ory the data is, and so you can dispatch tasks to work on guaranteed
unshared cache-lines. When multiple CPUs compete to write to a
particular cache-line, they have to keep storing and loading to keep
things consistent. If the data is randomly placed, such as when you allocate
from a memory pool, or directly from the heap, you cannot be sure
what order the data is in and can’t even guarantee that you’re not
asking two different CPUs to work on the same cache-line of data.
The data-oriented approach, even when you don’t use structs
of arrays, still maintains that sequential data is better than ran-
dom allocations. Not only is it good for the hardware, it’s good for
simplicity of code as it generally promotes transforms rather than
object-oriented messaging.
on, from the current generation of consoles such as Sony’s PS3 and
Microsoft’s Xbox360, but also to hand helds such as the Nintendo
DS, the iPhone, and other devices.
Pipelines provide a way for CPUs to trade gains in speed for
latency and branch penalties. A non-pipelined CPU finishes every
instruction before it begins the next, however a pipelined CPU starts
instructions and doesn’t necessarily finish them until many cycles
later. If you imagine a CPU as a factory, the idea is the equivalent
of the production line, where each worker has one job, rather than
each worker seeing and working on a product from start to finish. A
CPU is better able to process more data faster this way because by
increasing the latency, in well thought out programs, you only add
a few cycles to any processing during prologue or epilogue. During
the transform, latency can be mitigated by doing more work on non-
related data while waiting for any dependencies. Because the CPUs
have to do a lot less per cycle, the cycles take less time, which is
what allows CPUs to get faster. What’s happening is that it still
takes just as long for a CPU to do an operation as it always has (give
or take), but because the operation is split up into a lot of smaller
stages, it is possible to do a lot more operations per second as all
of the separate stages can operate in parallel, and any efficient code
concentrates on doing this after all other optimisations have been
made.
When pipelining, the CPU consists of a number of stages, firstly
the fetch and decode stages, which in some hardware are the same
stage, then an execute stage which does the actual calculation. This
stage can take multiple cycles, but as long as the CPU has all the
cycles covered by stages, it won’t affect throughput. The CPU then
finally stores the result in the last stage, dropping the value back
into the output register or memory location.
With instructions having many stages, it can take many cycles
for them to complete, but because only one part of the instruction
is in use at each stage, a new instruction can be loaded as soon as
the first instruction has got to the second stage. This pipelining
allows us to issue many more instructions than we could otherwise,
16.3. MICROCODE 197
even though they might have high latency in themselves, and saves
on transistor count as more transistors are being used at any one
time. There is less waste. In the beginning, the main reason for this
was that the circuits would take a certain amount of time to stabilise.
Logic gates, in practice, don’t immediately switch from one logic
state to another. If you add in noise, resonance, and manufacturing
error, you can begin to see that CPUs would have to wait quite a
while between cycles, massively reducing the CPU frequency. This
is why FPGAs cannot easily run at GHz speeds: they are arrays of
flexible gate systems, which means they suffer the most from
stability problems, and the effect is amplified by the gates being
located a long way from each other, in the sense that they are not
packed right up next to each other the way logic circuits are inside
an inflexible ASIC such as a production CPU.
Pipelines require that the instructions are ready. This can be
problematic if the data an instruction is waiting on is not ready, or
if the instruction itself is not yet loaded. Anything stopping an
instruction from being issued can cause a stall, or, in the case of
branching, can cause instructions to be run and then trashed as the
pipeline is flushed ready to begin processing the correct branch. If
the instruction pointer is determined by some data value, it can
mean a long wait while the next instruction is loaded from main
memory. If the next instruction depends on a conditional branch,
then the branch has to wait on the condition, causing a number of
instructions to begin processing that the branch could invalidate.
As well as instructions needing to be nearby, the registers must
already be populated, otherwise something will have to wait.
So, look to your types, and see if you can add a bit of SIMD to
your development without even breaking out the vector intrinsics.
Future
around much has been offloading processing outside the local
machine. Even though there has been plenty of work done with very
high latency architectures such as Beowulf clusters or grid
computing, the idea that they could be useful for games is still in its
infancy due to the virtually real-time nature of processing for games.
It could be a while off, but we may see it come one day, and it's
better not to let it sneak up on us. Once we need it, we will need
a language that handles offloading to external processing like any
other threaded task. Erlang is one such language. Erlang is
functional (with a tendency for immutable state, which helps very
much with concurrency) and highly process-oriented, with a very
simple migration path from single-machine multi-threaded to remote
processing or grid-compute style distributed processing, and a large
number of options for fault tolerance. Node.js offers some of the
same parallelism and a much shorter learning time, as it is based
on JavaScript. Functional languages would seem to dominate, but
OpenCL isn't purely functional, and C++ AMP only requires you to
consider how to amplify your code with parallelism, which might not
be enough for however many cores we really end up with. Whatever
language we do end up using to leverage the power of parallel
processing, once it's ubiquitous, we'd better be sure we're not still
thinking serially.
But parallel processing isn't the only thing on the horizon. Along
with many cores come the beginnings of a new issue, one we're only
starting to see traces of outside supercomputing: performance per
watt. In mobile gaming, we strive to make a game that works, but
one thing developers aren't regularly doing, and one that will affect
sales in the years to come, is deliberately keeping power consumption
down. Mobile device users will put up with us eking out the last of
the performance for only so long. There is more at stake than having
the best graphical performance and the fastest AI code; our
applications also need to be small enough to be kept on the device,
and must not eat up so much battery that users drop them like a
hot potato. Data-oriented design can address this issue, but only if
we add it to the list of considerations when developing a game. As
with parallelism,
the reason GPUs have been growing in power so rapidly over the
last few years (2008-2013), and show no signs of significantly slowing
in their GFLOPS growth, is the workload: there are always more
pixels to render, always more vertices to transform, always more
textures to look up. The bottleneck in a graphics card is still the
number of cores, and that's why it's relatively easy to increase their
power compared to general purpose chips like CPUs.
In addition to the number of jobs the cores have to do, the relative
similarity of those jobs makes it easier to move towards a job-board
style approach to dispatch. In every parallel architecture so far, the
instructions to run are decided specifically per compute core. In the
future, an implicit job system may make a difference by instigating
the concept of a public read-only job spec, and setting up the
compute cores to look out for jobs and apply their own variant
information to them. As a good example, a graphics card may set
all the compute cores to run the same algorithm, but give each core
some constants about where it should get its data, such as offsets
into the stream and different output rectangles of the display.
The second limiting factor is dependency. We touched on it with
the argument for pixel shaders: the set-up time is a dependency. The
number of cores that can be utilised stays serial up until the first
step of set-up is finished, that is, the call to render a primitive. Once
the render call is done, we can bifurcate our way to parallelism, but
even then we waste a lot of power through dependence on previous
steps. One way to avoid this in hardware is to have many cores wait
for tasks by assuming they will each be in charge of some particular
part of the transform. We see it in SIMD processing, where the CPU
issues one instruction to multiply a vector and each sub-core carries
out its own multiply: even though a multiply for the whole vector
was issued, each sub-core can be sure it is the one multiplying
element n of that vector. This is very useful, and possibly why future
hardware optimisations could work towards a runtime configurable
hardware layout that runs specified computations on demand, but
without explicit instruction. Instead, as soon as any data enters the
transformation arena, it begins
17.2 Distributed Computing
For example, if you know that your scene is going to animate, and
you want to do a ray cast from some entities to other entities, but
don't yet know which, you can task the grid engine with advancing
the animation and doing all the ray casts. Then, once you are ready,
read from only those ray casts you decided you needed after doing
whatever calculations you had to do. That's a lot of wasted
processing power, but it reduces latency. When you're thinking
about the future, it is only sensible to think about latency, as the
future contains a virtually infinite number of processing cores.
Using this one example, you can ramp it back to current
generation hardware by offloading processes to begin large-scale ray
casting, and, during the main thread's calculation stages, culling
tasks from the thread that is managing the ray casts. This means
you can get started on tasks early, but only waste cycles when you
don't yet know enough to avoid the work.