Dealing with the “Itanium Effect”
Steve Richfield
Consultant
5498 124th Avenue East
Edgewood, WA 98372
00-1-505-934-5200
[email protected]
ABSTRACT
The “Itanium Effect” is a subtle organizational phenomenon
leading to the wide adoption of a few widely applicable technologies, and the abandonment of many powerful but more narrowly
applicable technologies.
The main elements of the Itanium Effect are:
1. Technology loops
2. Compartmentalized conferences
3. Little PhD student participation
4. Procedural exclusion of futurist and top-down discussions
5. Keeping problems secret, so that no one else can help
The Itanium Effect has become the leading barrier to advancement of high performance computing. This is why defects continue to impair yield. This is what now stands in the way of wafer
scale integration.
Prospective glue technologies examined in this paper include:
1. Logarithmic arithmetic
2. Medium-grained and multi-grained FPGAs
3. Coherent memory mapping
4. Variable data chaining
5. Fast aggregation across ALUs
6. Blurring the SIMD/MIMD distinction
7. A simple horizontal microcoding interface for applications
8. Failsoft configuration on power-up
9. Failsoft partial reconfiguration during execution
10. Symmetrical pinout to facilitate the use of defective components
11. An architecture-independent universal compiler to compile a new APL-level language
Category and Subject Descriptor
B.0 [Hardware]: General.
Keywords
coherent memory mapping, failsoft reconfiguration, logarithmic
arithmetic, medium granularity, symmetrical pinout, universal
compiler.
1. INTRODUCTION
A really incredible thing happened in 2001, which went completely unnoticed throughout the chip-making industry. Intel released the Itanium, a nearly precise monolithic copy of the biggest boondoggle in computer history – the 1961 IBM Project STRETCH. Further, there had been much contemporary analysis of the mistakes leading to the STRETCH boondoggle, partially documented in the book Planning a Computer System. All of the recognized mistakes made on STRETCH, like the absence of a “guess bit” in conditional branch instructions, were recreated in the Itanium, despite the recommendations made by various commentators 40 years earlier.
Now this same process is continuing with GPU designs that are paralleling the products of Floating Point Systems Inc. of ~30 years earlier. Sure, we have jumped about one decade ahead in this loop, but why not simply skip 3 more decades of now-obsolete designs of the past and move forward?
How could such crazy repetitions of history possibly happen? How is this same phenomenon continuing to radically depress the capabilities of all present-day CPU, GPU, and FPGA designs? How could correcting this phenomenon in your company be worth billions of dollars to your shareholders? Let’s examine this continuing phenomenon.
Photo 1. IBM 7030 STRETCH Control Panel
2. TECHNOLOGY LOOPS
That things could proceed along a “linear” development path to perfectly close a gigantic loop, as the Itanium did, suggests a path with technological cobblestones laid in gigantic circles. Present
technology is defined by the hundreds of college courses and
thousands of textbooks that teach the various past methods. Often, new methods appear on the scene, but remain in product-specific manuals that disappear as soon as products become obsolete (like the superior methods of fault tolerance pioneered by Tandem Corp), or appear at a time when the industry just isn’t ready for them (like Ashenhurst and Metropolis’ 1959 paper on significance arithmetic). If you look at the many technologies that
appear in textbooks as points in a multidimensional space,
people will predictably add points indefinitely until a closed loop
is formed.
The STRETCH boondoggle foreclosed on the future of extensive
instruction lookahead for 40 years, until engineers at Intel read
books whose content was traceable to the last computer with
extensive instruction lookahead, and substantially recreated the
STRETCH, complete with its dead ends. In short, they didn’t
truly reinvent the STRETCH, but rather they copied its concepts
and arrived at essentially the identical architecture, as you might
reasonably expect from good engineers developing a common
concept. STRETCH and its descendant, the Itanium, constitute
just one of those many points in the multidimensional technological space. The precision of this re-creation shows that we have
very nearly reached the end of incremental improvements in
computer architecture, so now it is time to take some radical
steps if we are to continue to move forward.
Note that some present-day coarse-grained FPGA proposals are
traceable all the way back to the 1949 IBM-407 accounting machine that was able to use similar methods to achieve electronic
speeds using slow electromechanical components.
Once a few closed loops are formed, there is no longer any pressing need to add new technologies. Project development will
then predictably jump from point to nearby point, as needed to
push out new products, without ever reaching a point where anything radically new is needed to produce the next product. This
has created a situation where corporate managers now think they
need only hire PhDs who understand the various technologies,
and then those PhDs can work for the rest of their lives, without
ever having to leave the comfort zone of their own knowledge.
This would ordinarily leave everyone in the industry vulnerable
to a maverick corporation that willingly adopts obscure technologies, if not for the extreme expense of developing new chips.
Now, any such maverick corporation would probably need a billion or so dollars just to “ante in” to this game. Hence, the challenge is to find some way for existing corporations to morph their
methods to break out of this loop and conquer their competition.
Laggards will be forced out of business.
3. ENGINEERING ISOLATION
I usually attend the yearly WORLDCOMP, which consists of 20
separate conferences on all aspects of computing. This keeps me
current on the very latest thinking in everything ranging from
computer design to AI. WORLDCOMP includes separate conferences on computer and FPGA design, including various panel
discussions about hot topics of the year. Dozens of PhDs present
their work, pushing the frontiers of computing a bit further
ahead. One thing these conferences do not include is any representation from major hardware or software vendors, with some
rare curious exceptions countable on the fingers of one hand. I
routinely look up these very rare individuals to determine their
place in their respective corporations, and their reasons for attending. Invariably I discover that they are not in a position to
design anything for their employers, and have traveled on their
own time and money, for the same reasons that I have come – to
stay on top of computer technology.
Meanwhile, major players like Microsoft hold their own conferences covering their own new products, and there are various
separate conferences on Supercomputers, FPGAs, etc., sprinkled
around at various times and places. In short, the entire industry is
highly compartmentalized, and thereby effectively isolated from
outside innovation.
4. SACRIFICING OUR FUTURE
The argument that most of the PhD theses and other outside
work presented at conferences like WORLDCOMP is sophomoric, and hence not worth studying, is one industry defense. This is
the predictable result of keeping outside work in isolation. Entire
development departments exist where no PhD students are being
mentored, thereby sealing the last avenue of cross-pollination.
With no common conferences other than WORLDCOMP, no
cross-specialty interaction because of the lack of manufacturer
participation in WORLDCOMP, very few PhD students, etc.,
there can be little if any significant technological advancement
(outside of the usual technology loops) from the isolated developers in each corporation. This is the underlying problem, not the
questionable quality of outside work that is kept completely isolated from the harsh realities of fabrication. Sure, keep secret the
proprietary methods that are presently being used to address the
technological challenges at hand, but keeping the challenges
themselves secret, e.g. the nature and distribution of real-world
defects, is suicidal to the entire industry.
5. CORRUPTION OF PROCESS
Much of the computing community has lost the “re” in “research”, corrupting it to mean the exposition of past developments while specifically excluding speculation as to where things could go with careful (re)direction. Indeed, an earlier version of this paper was denounced on this basis by reviewers, with comments like “there are no analyses of any of these suggestions, only proposals”, “it does not fit easily into any subtopic area”, and “ideas are cheap, especially recycled ideas.”
As a result, the entire field of high performance computing has been sucked into a bottom-up design process, simply by dismissing top-down discussions on the basis that they are not “research”. Consequently, high performance computing still lacks an effective architecture, when all indications are that a top-down approach could probably have produced a good architecture a decade ago.
This obviously can not be corrected at compartmentalized conferences like the FPGA conference. Only multidisciplinary conferences like WORLDCOMP have any real hope of encouraging
the top-down methods needed to merge the disparate technologies to forge the future of high performance computing.
6. WHERE WE SHOULD BE HEADING
It seems pretty obvious that each segment of the industry has
some of the “missing pieces” needed by the other segments. Intel
is now developing CPUs with embedded GPU-like capability,
which could be greatly enhanced with some reconfigurable logic.
Similarly, a processor embedded into an FPGA could orchestrate power-on fault-tolerant reconfiguration of the reconfigurable logic.
Just about everyone already seems to agree that all memory must
be coherent. In short, it appears that each segment of the industry
is proceeding toward a single common point, a computational
singularity promising a hundredfold improvement in performance
over traditional methods. However, the present looping of technology is tugging at every segment, pulling it away from that
computational singularity. The main blockage seems to come
from ignorance of various obscure enabling technologies.
7. ENABLING TECHNOLOGIES
The following synergistic enabling technologies are not merely additive, or even multiplicative, in their combined benefit. Projected advanced technology processors
will need nearly all of these enabling technologies to function at
all, because a very high level of parallelism is needed to support
power-on and on-the-fly reconfiguration, and dynamic reconfiguration is needed to economically manufacture chips having so
many components. Possibly the most challenging application that
will ever be run on these processors will be contained in an attached ROM that is executed when power is applied, or when a
fault is detected during execution, to reconfigure many faults
“into the ether” leaving a fully functioning chip.
This “wall of obscure technology” presents a sort of chicken-or-egg situation that has so far blocked the construction of envisioned advanced technology processors, and hence denied the
hundredfold improvement in performance that is expected from
such architectures. Corporations limit their technological risks by
taking risks one-at-a-time, and therefore wouldn’t think of trying
many new things on a single new device. These enabling technologies are not individually transformative, so they have remained obscure.
Note that taken together, these enabling technologies define a
processor that is very different from present day processors, and
would look a bit like “alien technology” to present day technologists. I think of this as the “rubber band effect”. These methods
were individually rejected because they presented no great advantage, providing an ever increasing incentive to embrace other
new methods as they appear. Now, that rubber band is stretched
quite tight, providing a gigantic incentive for adopting enabling
technologies.
WARNING: Most of these technologies have outward characteristics that are very similar to other, more common but less powerful methods. This has resulted in many experts dismissing them, during the decades since they were first proposed, as being something that they are not. Indeed, this phenomenon has become an important driving force
behind the continuation of the Itanium Effect.
7.1 Logarithmic Arithmetic
Everyone is taught in school that you can multiply and divide
with logarithms, but not add and subtract. This is simply not true
(see below). Pipelined logarithmic ALUs are much simpler than
floating-point ALUs, as they only need 3 adders, a small ROM,
and some glue. Unfortunately, they are consigned to low precision. These enable super performance for low precision applications, like image and speech recognition, neural networks, etc.,
and make medium granularity designs quite practical.
Note that, for the most part, future supercomputers will be working on very different problems than do present day computers.
Future computing will involve AI applications centered on visual,
audio, and neural network (NN) applications, all of which deal in
low precision that is within the range of logarithmic arithmetic.
Adding with logarithms is easy:
1. Take the ratio of the two arguments. Since the arguments are represented as the logarithms of numbers, you can divide simply by subtracting their logarithms.
2. Use that ratio to look up the appropriate entry in a “fudge factor” table that contains the logarithms of fudge factors. The size of this table limits the precision available with this method to less than IEEE-754 single precision. Interpolation methods extend the available precision. Carefully computed table entries avoid consistent round-off errors, much as IEEE-754 does.
3. Multiply the numerator by the fudge factor, which is accomplished by adding their logarithms.
Subtraction and signed arguments are handled with obvious extensions of this simple strategy, involving the use of two sign bits: one sign bit for the sign of the logarithm of the absolute value of the number being represented, and the other sign bit for the sign of the number being represented.
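As a rough illustration of the three steps above, the following minimal Python sketch performs addition entirely in the logarithmic domain, using a precomputed “fudge factor” table of log2(1 + 2^-d) values. The table size and resolution chosen here are arbitrary assumptions, and the interpolation mentioned above is omitted for brevity.

    import math

    SCALE = 256        # table entries per unit of logarithm difference (assumed resolution)
    MAX_DIFF = 32      # beyond this difference, the smaller argument is below the precision floor
    # fudge[i] holds log2(1 + 2**(-i/SCALE)), the logarithm of the "fudge factor".
    fudge = [math.log2(1.0 + 2.0 ** (-i / SCALE)) for i in range(MAX_DIFF * SCALE + 1)]

    def log_add(a, b):
        """Return log2(2**a + 2**b) using only subtraction, a table lookup, and addition."""
        if b > a:
            a, b = b, a                  # treat the larger argument as the "numerator"
        d = a - b                        # step 1: the ratio of the arguments, taken as a log
        i = int(round(d * SCALE))        # quantize the difference to a table index
        if i >= len(fudge):
            return a                     # the smaller argument no longer affects the result
        return a + fudge[i]              # steps 2 and 3: add the logarithm of the fudge factor

    # Example: 6 + 2 = 8, so the result should be very close to log2(8) = 3.
    print(log_add(math.log2(6.0), math.log2(2.0)))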
Since most of the “logic” of logarithmic arithmetic is contained
in the contents of its tables, SECDED error correction logic can
correct most faults, and detect the faults that it cannot correct.
Note that the practical success of power-on and on-the-fly reconfiguration depends on having a high enough flops/transistor ratio to support real-time diagnosis and reconfiguration, so the speed and simplicity of logarithmic arithmetic make it a clear winner in these devices.
7.2 Medium Grain and Multi-Grain FPGA
Architectures
Until now, everyone designing FPGAs either designed with full
ALUs (coarse granularity), or with just gates (fine granularity).
However, logarithmic arithmetic and “fixed point” (like integer,
but the decimal point can have any pre-assigned location) arithmetic require much simpler blocks. Full IEEE-754 floating point
ALUs can be chopped both horizontally (into functional units)
and vertically (into digits). These blocks can be combined to
achieve full pipelined result-per-clock-cycle capability as needed,
though at the cost of additional flow-through time.
The pieces of a logarithmic ALU are adders and tables with attached lookup logic. The pieces of full IEEE-754 floating point
ALUs are priority encoders, shift matrixes, adders, multipliers,
etc., depending on the implementation. These pieces can be used
for other things, e.g. binary multiplying a logarithm by an integer
is actually raising the number represented by the logarithm to the
integer exponent. By providing a “parts store” of both ALU pieces and full ALUs in typically needed proportions, akin to the letter assortment in a Scrabble game, users can create “super duper operations” that use most of them, then instantly reconfigure them (see Horizontal Microcoding below) to do other, different “super duper operations” as needed. The goal here is not
(yet) to provide everything needed to do entire programs in a
single data chained operation, but rather to simply reduce the
number of times that the same data needs to be (re)handled by an
order of magnitude or so.
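As a tiny illustration of the point above about reusing ALU pieces, the sketch below shows that multiplying a stored logarithm by an integer is the same as raising the represented number to that integer power; the function name is an invention for this example, not from any product.

    import math

    def log_pow(log_x, n):
        # Multiplying the logarithm by n represents (2**log_x) ** n.
        return log_x * n

    # Example: 3**4 = 81.
    print(2.0 ** log_pow(math.log2(3.0), 4))   # ~81.0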
Note that in most cases this obviates some of the need for an
optimizing compiler, because there is no benefit to optimizing a
configuration unless, without optimization, there aren’t enough
components available to perform a complex series of computations and produce a result every clock cycle.
Multi-grain would doubtless be more complex to use than either
fine or coarse grained approaches, but its use would improve the
flops/transistor ratio to facilitate real-time reconfiguration.
7.3 Coherent Memory Mapping
Often, “local” memory has been attached to a particular ALU,
and “global” memory has been attached to a particular bus. This
isolation of local memory is needless, and associating “global”
memory to a particular bus sucks performance. Multiple memory
busses and interleaved memories have been used since the 1960s
to exceed single-bus performance, and it takes very little logic to
be able to attach local memories to global bus systems so that all
memory is uniquely addressable (coherent).
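A minimal sketch of the idea, assuming a hypothetical fixed amount of local memory per cluster: every word of every local memory gets a unique global address, so any ALU can name any word, local or not.

    LOCAL_WORDS = 4096   # words of local memory per cluster (an assumption for illustration)

    def to_global(cluster, offset):
        """Map a (cluster, local offset) pair to a unique flat address."""
        return cluster * LOCAL_WORDS + offset

    def to_local(addr):
        """Recover which cluster's local memory holds a global address."""
        return divmod(addr, LOCAL_WORDS)

    assert to_local(to_global(7, 123)) == (7, 123)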
Note that it is important to provide redundant busses if very large
chips are to be reliably made. The trick is to provide many redundant busses, and then use whatever works and is available. In
addition to “in use” bits, they would also need “functional” bits,
along with the logic to honor and operate these bits, and use
whatever bus is ready to carry the traffic.
Switch fabric architecture is evolving. Note that by simply organizing clusters of functional components in a 2-D fashion on the chip, and providing redundant busses for every row and column,
it becomes possible for many cluster-to-cluster communications
to be simultaneously taking place. As a result, traditionally slow
operations like scatter and gather can run at many times traditional speeds.
However, the biggest payoff from coherent memory is in its ability to easily save the state of a subsystem by simply copying out
the memory in the subsystem. Without coherent memory, partial
reconfiguration would be extremely difficult.
7.4 Variable Data Chaining
The idea of pasting ALUs together in chains, as is now done in
some coarse-grained FPGA designs, was called “data chaining”
in early supercomputers, like the CDC Cyber 205. They simply
switched ALU connections from memory to pipeline registers, in
order to interconnect ALUs to get a sort of simplistic coarse-grained FPGA-like performance. Hence, the perceived distinction between coarse-grained FPGA designs and multi-ALU data chaining in supercomputers is illusory.
By simply providing pipeline registers between 2-D arranged
clusters, and configuring ALU ports to connect to those pipeline
registers, otherwise isolated slave processors can chain together
to perform extremely complex operations.
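A minimal software sketch of the effect of chaining, with the functions standing in for configured ALUs and the hand-off between them standing in for a pipeline register; nothing here is taken from any particular product.

    def chain(*stages):
        """Compose configured 'ALUs' so data flows stage to stage without memory round-trips."""
        def chained(x):
            for stage in stages:     # each hand-off models a pipeline register
                x = stage(x)
            return x
        return chained

    fused = chain(lambda x: x * 2, lambda x: x + 3, lambda x: max(x, 0))
    print(fused(5))   # 13, produced in one pass through the chain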
7.5 Fast Aggregation across ALUs
One thing missing from all contemporary designs, and now presenting a major barrier to GPU advancement, is the interaction
between parallel ALUs needed to perform aggregation functions
like adding the elements of an array, finding the maximum value,
etc., in log n time instead of n time. First every even ALU interacts with the adjacent odd ALU, and then the winners interact
with other winners, etc. Graphics applications are unique in their
general lack of need for aggregation, which has been the basis for
GPU successes in that arena, and the basis for their lackluster
performance in other areas.
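The following sketch shows the pairwise pattern just described: ALU i combines with ALU i + stride at each step, so n values reduce in about log2(n) steps rather than n - 1 sequential ones. In hardware the inner loop would run in parallel; here it is shown sequentially for clarity.

    def tree_reduce(values, combine):
        vals = list(values)
        stride = 1
        while stride < len(vals):
            # Every surviving "ALU" combines with its neighbor 'stride' positions away.
            for i in range(0, len(vals) - stride, 2 * stride):
                vals[i] = combine(vals[i], vals[i + stride])
            stride *= 2
        return vals[0]

    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_reduce(data, lambda a, b: a + b))   # 31, the sum
    print(tree_reduce(data, max))                  # 9, the maximum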
Aggregation also facilitates the merging of thousands of individually performed diagnostics in thousands of slave processors,
thereby speeding up real-time reconfiguration.
7.6 Blurring the SIMD/MIMD Distinction
Using Small Local Program Memories
A certain amount of temporal autonomy is needed for slave processors to deal with slow operations, re-routing over busy and
defective busses, memory cycles lost to other processors, waiting
for slow global busses, etc. A small amount of FIFO or memory
would provide the buffering needed to hold a few slave processor
instructions broadcast by the central processor. However, it
would take little more to implement a rudimentary instruction set
in that memory, complete with conditional branch instructions,
etc. With this, individual slave processors could each do their
part to perform complex operations, without holding up the entire processor when they individually slow down or stumble into
each other. This could provide the combined advantages of SIMD
and MIMD architectures.
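A toy sketch of the idea, with an invented three-instruction repertoire: the central processor broadcasts a short program into each slave's small local memory, and the slave then steps through it at its own pace, conditional branch included, without further hand-holding.

    def run_slave(program, regs):
        """Interpret a tiny broadcast program held in a slave's local memory."""
        pc = 0
        while pc < len(program):
            op, *args = program[pc]
            if op == "load_imm":                  # regs[dst] = constant
                regs[args[0]] = args[1]
            elif op == "add":                     # regs[dst] += regs[src]
                regs[args[0]] += regs[args[1]]
            elif op == "branch_if_nonzero":       # rudimentary conditional branch
                if regs[args[0]] != 0:
                    pc = args[1]
                    continue
            pc += 1
        return regs

    # Sum 3 + 2 + 1 into r0 by counting r1 down to zero.
    prog = [("load_imm", "r0", 0), ("load_imm", "r2", -1),
            ("add", "r0", "r1"), ("add", "r1", "r2"),
            ("branch_if_nonzero", "r1", 2)]
    print(run_slave(prog, {"r0": 0, "r1": 3, "r2": 0}))   # r0 ends up as 6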
Note that code running in faulty configurations will doubtless run
into many roadblocks. Slave processors must be able to function
autonomously, in order to continue running when other slaves
have died.
7.7 A Simple Horizontal Microcoding
Interface for Applications
HP is believed to have been the first to make some of their horizontal microprogramming memory accessible to applications
programmers. This was implemented as an option on HP 21-MX
minicomputers. This way, users could define new operations that
ran at hard-wired speed. In FPGA terms, this is akin to instant
partial reconfiguration from memory. Instead of the usual FPGA
design that uses long shift registers to hold a particular configuration, suppose that some of the configuration bits are replaced
with several bits and a global mechanism of selecting which bit
from each group to use. This could be as simple as having short
circular shift registers controlling each potential connection, and
a global mechanism of rotating all circular shift registers in unison. Loading would work as usual, except that when all of the bits
have been loaded for one configuration, the short circular shift
registers would all be rotated by one bit and loaded with new
contents for that position, with this process continuing until all of
the bits in every circular shift register have been loaded. Runtime reloading, equivalent to re-defining operation codes of the
computer being implemented, could be accomplished by rotating
the circular shift register into position, and transferring the contents of a dedicated place in memory into the main FPGA programming input, while inhibiting changing any memory or
registers on the device during reprogramming. This way, compilers could invent perfect “operation codes” that are custom made
to perform the work of a page of code, and use just about everything on large devices to achieve incredible speed.
This is critical for diagnostics in preparation for reconfiguration,
because it allows access to the basic “grains” of the system.
Without this, really complex diagnostics would be needed that
deduce the sources of malfunctions from combinations of observed failures from various complex configurations.
7.8 Failsoft Configuration on Power-Up
Blowing fuses during manufacture to deal with faults makes no provision for in-service failures. However, (re)configuring on power-up cures in-service failures and assures a nearly limitless lifespan. This is no problem with a general-purpose programmable device that is fast enough to do the job in a reasonable amount of time. However, this functionality establishes a high lower limit on the performance of future processors, as processors must be able to fully diagnose and repair numerous malfunctions on power-up within a second or so. Further, devices must incorporate appropriate technologies to support dynamic reconfiguration. This presents a chicken-or-egg challenge, as super-performance is needed for practical power-up reconfiguration, and power-up reconfiguration is a practical necessity to implement super-performance at the high levels envisioned here.
Note that advanced configuration methods typically involve genetic algorithms (GA) to discover workable configurations, so the time needed to configure is variable and potentially unbounded. Hence, a large engineering margin in performance will be needed to assure timely (re)configuration.
Once this has been achieved, there is no limit on the size and complexity of future computers while maintaining ~100% yield, provided that sufficient spares are included in the design, and there is fall-back to smaller configurations in the event of an excessive number of failures. For example, a chip might have dual main processors where only one is needed, and, say, 4096 slave processors plus a few hundred spares, configured in a way that the system could run with 2048, or even 1024. Further, the main processors might have IEEE-754 floating-point ALUs, which could be emulated in software should the ALUs be faulty. There would be countless fuses in the power distribution network to isolate any shorted logic, etc. In short, if much of anything worked, the chip would work well enough to sell, at least into applications that didn’t need its maximum potential performance. This would completely remove the present tradeoff between chip size and yield.
7.9 Failsoft Partial Reconfiguration During Execution
It is also possible to survive most new faults during operation. Mostly implemented in the firmware running on a future processor, applications that run as multiple tasks that post their results when done could continue operation, even with the failure of associated computational components. Applications would use Assert logic to confirm correct operation, and watchdog timers to recognize dead tasks. When a task is seen to be malfunctioning, a partial reconfiguration would be triggered, using the same (re)configuration logic used on power-up, but restricted to the subset of hardware used by the malfunctioning task.
When reconfiguration is complete, the failed task would then be restarted on the reconfigured hardware. A repeated identical failure would indicate a programming error.
To further improve the likelihood that in-service failures will be recognized and corrected, idle time should be spent running diagnostics, diagnostic failures should be used to invalidate recent results, and diagnostics should be invoked before accepting highly unusual results.
Fail-soft partial reconfiguration would require a suitable application program architecture, and would introduce a considerable momentary “glitch” into DSP applications. The alternative to such glitches is the present situation of permanent, irreversible component failure.
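The sketch below is only a schematic of the recovery loop described above, with invented helper names standing in for the firmware's assert checks, watchdog, partial reconfiguration, and task restart.

    def supervise(task, hardware, run, passes_asserts, reconfigure, deadline):
        """Run a task; on a hang or a failed assert, reconfigure its hardware and retry once."""
        for attempt in range(2):
            result = run(task, hardware, timeout=deadline)     # watchdog: returns None on timeout
            if result is not None and passes_asserts(result):
                return result                                  # task posted a good result
            # Partial reconfiguration restricted to the subset used by the failed task.
            hardware = reconfigure(hardware)
        raise RuntimeError("repeated identical failure: likely a programming error")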
7.10 Physically Symmetrical Pinout to Facilitate the Use of Defective Components
By carefully assigning the pinout of new chip designs, it could easily become possible to plug them in any of up to four different ways, with as many separate-but-equal processors and associated I/O pins as there are ways of plugging them in. This way, there would be up to four prospective pins #1. After testing, a factory technician would cut off the corner pin associated with each fully functional processor. A circuit board designer could then eliminate any combination of corner pin holes, depending on which processors and I/O pins were not essential to the design. Typical circuit board designs would have 3 missing corner pin holes, so that a fully functioning chip could be plugged in any of 4 different ways (because all of its corner pins would be missing). However, a chip with a malfunctioning processor or I/O pin(s) could be plugged in only one way, with the associated pin #1 connecting to the one remaining corner pin hole. Note that hexagonal chips could be plugged in any of 6 ways, and only 1/6 of the chip would be lost to a bad I/O pin.
Designers would be motivated to use fewer I/O pins, because malfunctioning chips would be less expensive than ones where all of the processors and I/O pins function correctly. Designs needing less than half of the I/O pins would provide additional corner pin holes on the circuit board design, so that chips with even more bad sections could be utilized.
Another pin (perhaps pin #2) would be asserted on the circuit board to identify the “primary” section of the chip, so that the chip can tell which section is to control the others, and identify the chip’s rotational position on the board, so that the functions of I/O pins can be properly determined.
Of course, a designer could use all of the processors and all but two pins in every corner, but then only fully functional chips could be utilized.
Symmetrical pinout is the last line of defense for when fail-soft methods fail. This provides the factory with a method of selling devices that cannot fully repair themselves, and provides service personnel with a method for repairing equipment incorporating devices that can no longer fully repair themselves. Repair personnel would simply remove a malfunctioning device, rotate it, and reinstall it.
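A small sketch of the orientation step, under the assumption of a four-way symmetrical package with a fixed number of I/O pins per section: firmware checks which section sees the board's “primary” pin asserted, derives the rotation, and remaps logical I/O functions to physical pins accordingly. The numbers and the remapping formula are illustrative assumptions only.

    SECTIONS = 4
    PINS_PER_SECTION = 64      # an assumption for illustration

    def detect_rotation(primary_seen):
        """primary_seen[s] is True for the one section that sees pin #2 asserted."""
        return primary_seen.index(True)          # 0..3 quarter-turns

    def physical_pin(logical_pin, rotation):
        """Remap a logical I/O function to the physical pin for this rotation."""
        return (logical_pin + rotation * PINS_PER_SECTION) % (SECTIONS * PINS_PER_SECTION)

    rotation = detect_rotation([False, False, True, False])
    print(rotation, physical_pin(5, rotation))   # section 2 is primary; logical pin 5 -> pin 133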
7.11 An Architecture-Independent Universal
Compiler To Compile A New APL-Level
Language
There are several candidate compiler architectures that could
compile to almost any imaginable computer. Methods range from
rewriting the code-generator portion, to rewriting table entries
containing snippets that perform various functions, to using a
syntax directed meta-compiler and simply altering the output
instructions. All of these methods and more have been used in
commercial production compilers, but complicating factors, like
the desire for code optimization, muddy the waters. Note that
there are orders of magnitude to be gained from a radically new
compiler, yet optimization typically only buys <2:1. Hence, for
the time being, until architectures make their radical jump and
then settle down, optimization concerns should be set aside in
favor of getting the technology off of its present hump.
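As a minimal illustration of the "table of snippets" approach mentioned above, the sketch below retargets the same abstract operations by swapping the snippet table; both targets and their mnemonics are invented for the example, including a hypothetical logarithmic-ALU target where multiplication compiles to an addition of logarithms.

    CONVENTIONAL_SNIPPETS = {"add": "add {dst}, {src}", "mul": "mul {dst}, {src}"}
    LOG_ALU_SNIPPETS = {"add": "logadd {dst}, {src}",   # table-driven log-domain addition
                        "mul": "add {dst}, {src}"}      # multiply = add the logarithms

    def generate(ops, snippets):
        """Emit target code for abstract (op, dst, src) triples by filling in snippets."""
        return [snippets[op].format(dst=dst, src=src) for op, dst, src in ops]

    program = [("mul", "r1", "r2"), ("add", "r1", "r3")]
    print(generate(program, CONVENTIONAL_SNIPPETS))
    print(generate(program, LOG_ALU_SNIPPETS))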
A major complication is the need for a new source language with
semantics akin to APL in which to program high performance
applications. A truly flexible and capable compiler for compiling
from an APL-level language to variable targets is desperately
needed for the high-performance industry to progress. This can
only come as an industry-wide effort, or at government expense.
If a new language is being designed, it should probably also have
ADA-like variable declarations to facilitate both compile-time
and run-time checking, and support a COBOL-like verbose listing mode to facilitate “desk checking”.
7.12 Putting it All Together
It seems clear that it is possible to build self-repairing processors of nearly limitless size and performance, and to manufacture them with ~100% yield, with reasonable extensions to the technology at hand. However, their architecture won’t be a CPU, GPU, or FPGA. Instead, it will be a combination of all of these and a little more. Such a processor could run languages like C++ just as inefficiently as present products do, but would need a language with APL-like semantics to support future capabilities. Further, there are some big-money hurdles to overcome. Probably the biggest hurdle is the need for a universal compiler to support future high performance computing.
Is a hundredfold improvement in performance worth the ~billion dollar cost to engineer such devices? That is the question that should logically determine whether this path is taken.
Note the prior call for a technological revolution in computation, made 15 years ago by André DeHon in his PhD thesis Reconfigurable Architectures for General-Purpose Computing. His technology was eventually commercialized into a company called Silicon Spice, which was sold to Broadcom at the height of the tech boom for ~$1.2 billion. Technology of that time restricted practical implementation to DSP applications, whereas no such limitations now persist. However, with more complex methods supporting loftier goals, including general purpose computing, the cost will be considerably higher.
8. BEFUDDLEMENT
If you ask others why they are in their particular corner of the industry, or you ask yourself why you are in your particular corner of the industry, the answer usually boils down to befuddlement with the unknown worlds outside of a particular domain. Nowhere is this clearer than with memory designers, who usually want absolutely nothing to do with the internal complexities of computer architecture, operating systems, etc. There is a similar though lesser effect with FPGA designers. At the opposite end of the spectrum are the operating systems and applications designers, few of whom have any experience with an oscilloscope or logic analyzer. Just about everyone sees the industry as way too complex to think about in any sort of gestalt way. However, with the industry chopped up into various fixed domains that are gradually becoming less and less appropriate for future computational needs, the gaps in other-domain capability are being filled in with extreme in-domain complexity. In this mode, we all become part of the problem, rather than part of the solution.
There was a similar situation during the 1960s, when radically new computers were being introduced on a yearly basis. Many excellent technologists simply dropped out of that scramble to keep up with technology. I anticipate the same phenomenon during the transition to the envisioned advanced technology processors, where few present technologists will be able to keep up with the coming radical changes.
9. COMPUTATIONAL SINGULARITY
Here are three differently stated but identical architectures, described from three different points of view:
1. An array processor whose slave processors are much more powerful than in the past, including reconfigurable logic, so that on-the-fly array processing instruction definition becomes practical.
2. A multi-grain FPGA with ALUs and parts of ALUs, memory banks, etc., organized into reconfigurable clusters under the control of central processor(s).
3. GPUs with reconfigurable logic, integrated into a CPU.
Each of these descriptions presumes the application of the necessary enabling technologies as explained earlier.
So, where does the hundredfold improvement in performance come from?
1. Future projected applications won’t need the precision of present devices, so precision can be traded for speed, e.g. more logarithmic ALUs in place of fewer floating-point ALUs. Also, this speed will facilitate fast power-on diagnostics.
2. With the methods presented, much larger chips can be fabricated at the same overall cost, because defects will no longer impair the yield. Greater size brings the components needed to do more in parallel, along with the assortment of components needed to implement long computational pipelines.
10. CONCLUSION FOR DEVELOPERS
Just because your co-workers remain isolated is no reason for you
to do the same. Attend WORLDCOMP and other conferences
that are out of your present scope of work. Present papers that
discuss your vision for the distant future of your present scope of
work. Make it known at local universities that you will mentor
PhD students. Make friends who are familiar with potentially
important future technologies, e.g.:
1. Someone who worked on 1980s supercomputers.
2. Someone who has extensive computing experience that predates microcomputers.
3. A CS/EE professor who is up on just about everything that has ever been made.
Soon, you will become known as THE guy who knows just about
everything. You will soon be able to move into management,
whereupon…
11. CONCLUSION FOR MANAGEMENT
The issues presented in this paper all center around what would
have been called “professionalism” in the era before microcomputers. Professionals would represent their employers at multidisciplinary conferences, present papers at conferences, stay up
on all potential enabling technologies, mentor students, etc. This
sense of professionalism has been completely lost in “modern”
technology, and with it has gone the productivity of the designers
at the major chip makers. Everyone is now threatened by this
lack of a sense of professionalism. Thirty years ago I might have
recommended simply firing such designers for cause and hiring
more professional designers, but that time has passed and we
must now “dig our way out” of this mess, using the people that
we now have working in the industry.
Rather than taking years or decades to do so slowly, my
present recommendation is to draw up clear company standards
of professionalism that ALL product designers must follow, and
demote or fire those who don’t follow them. Sure you may have
to make an example of one or two technically capable developers, but this is necessary to demonstrate your resolve. Soon everyone will be attending conferences other than just the “inside”
conferences that only pertain to their present narrow work. Everyone will mentor PhD students, will be presenting papers inviting public scrutiny before committing designs to silicon, will
bring in both very new and very old enabling technologies, etc.
12. OVERALL CONCLUSION
This paper advances an approach to achieve a renaissance in
computational architecture, promising a hundredfold increase in
performance. However, the big challenge here is organizational
rather than architectural. Existing microcomputer, GPU, and
FPGA manufacturers have been unable/unwilling to adapt to
produce these products, or this would have happened a decade or
more ago. This appears to require revolutionary management
changes, presenting an excellent opportunity for takeover bids by
astute investors who are not attached to present methods of engineering management.
13. REFERENCES
[1] Strenski, D., Sundararajan, P., and Wittig, R. 2010. The Expanding Floating-Point Performance Gap Between FPGAs and Microprocessors. HPCwire, November 22. http://www.hpcwire.com/features/The-Expanding-Floating-Point-Performance-Gap-Between-FPGAs-and-Microprocessors-109982029.html?viewAll=y
[2] Computer History Museum’s collection of IBM Project STRETCH materials. http://www.computerhistory.org/collections/ibmstretch
[3] WORLDCOMP web site. http://www.world-academy-of-science.org/
[4] Tandem Computers, Inc. company history. http://www.fundinguniverse.com/company-histories/TANDEM-COMPUTERS-INC-Company-History.html
[5] Floating Point Systems. http://en.wikipedia.org/wiki/Floating_Point_Systems
[6] DeHon, André. 1996. Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586. http://www.seas.upenn.edu/~andre/pdf/aitr1586.pdf
[7] Pickett, L. 1993. U.S. Patent 5,184,317. A discussion of advanced logarithmic arithmetic methods.
[8] Richfield, S. 1987. A Logarithmic Vector Processor for Neural Net Applications. Proceedings of the IEEE First Annual International Conference on Neural Networks. IEEE Catalog #87TH0191-7.
[9] Buchholz, Werner. 1962. Planning a Computer System: Project Stretch. McGraw-Hill, Hightstown, NJ, USA. ISBN: B0000CLCYO
[10] Ashenhurst, R. L. and Metropolis, N. 1959. Unnormalized Floating Point Arithmetic. Journal of the ACM (JACM), Volume 6, Issue 3, pp. 415-428. ISSN: 0004-5411