2012

Dealing with the "Itanium Effect"

Steve Richfield
Consultant
5498 124th Avenue East
Edgewood, WA 98372
00-1-505-934-5200
[email protected]

ABSTRACT
The "Itanium Effect" is a subtle organizational phenomenon leading to the wide adoption of a few widely applicable technologies, and the abandonment of many powerful but more narrowly applicable technologies. The main elements of the Itanium Effect are:
1. Technology loops
2. Compartmentalized conferences
3. Little PhD student participation
4. Procedural exclusion of futurist and top-down discussions
5. Keeping problems secret, so that no one else can help

The Itanium Effect has become the leading barrier to the advancement of high performance computing. It is why defects continue to impair yield, and it is what now stands in the way of wafer scale integration.

Prospective glue technologies examined in this paper include:
1. Logarithmic arithmetic
2. Medium-grained and multi-grained FPGAs
3. Coherent memory mapping
4. Variable data chaining
5. Fast aggregation across ALUs
6. Blurring the SIMD/MIMD distinction
7. A simple horizontal microcoding interface for applications
8. Failsoft configuration on power-up
9. Failsoft partial reconfiguration during execution
10. Symmetrical pinout to facilitate the use of defective components
11. An architecture-independent universal compiler to compile a new APL-level language

Category and Subject Descriptor
B.0 [Hardware]: General.

Keywords
coherent memory mapping, failsoft reconfiguration, logarithmic arithmetic, medium granularity, symmetrical pinout, universal compiler.

1. INTRODUCTION
A really incredible thing happened in 2001, and it went completely unnoticed throughout the chip-making industry. Intel released the Itanium, a nearly precise monolithic copy of the biggest boondoggle in computer history, the 1961 IBM Project STRETCH. Further, there had been much contemporary analysis of the mistakes leading to the STRETCH boondoggle, partially documented in the book Planning a Computer System. All of the recognized mistakes made on STRETCH, like the absence of a "guess bit" in conditional branch instructions, were recreated in the Itanium, despite the recommendations of various commentators 40 years earlier.

Now this same process is continuing with GPU designs that parallel the products of Floating Point Systems Inc. of ~30 years earlier. Sure, we have jumped about one decade ahead in this loop, but why not simply skip three more decades of now-obsolete designs and move forward?

How could such crazy repetitions of history possibly happen? How is this same phenomenon continuing to radically depress the capabilities of all present-day CPU, GPU, and FPGA designs? How could correcting this phenomenon in your company be worth billions of dollars to your shareholders? Let's examine this continuing phenomenon.

Photo 1. IBM 7030 STRETCH Control Panel

2. TECHNOLOGY LOOPS
That things could proceed along a "linear" development path to perfectly close a gigantic loop, as the Itanium did, suggests a path with technological cobblestones laid in gigantic circles.

Present technology is defined by the hundreds of college courses and thousands of textbooks that teach the various past methods. Often, new methods appear on the scene, but remain in product-specific manuals that disappear as soon as the products become obsolete (like the superior methods of fault tolerance pioneered by Tandem Corp), or appear at a time when the industry just isn't ready for them (like Ashenhurst and Metropolis' 1959 paper on significance arithmetic).
If you look at the many technologies that appear in textbooks as points in a multidimensional space, people will predictably add points indefinitely until a closed loop is formed. The STRETCH boondoggle foreclosed on the future of extensive instruction lookahead for 40 years, until engineers at Intel read books whose content was traceable to the last computer with extensive instruction lookahead, and substantially recreated the STRETCH, complete with its dead ends. In short, they didn't truly reinvent the STRETCH; rather, they copied its concepts and arrived at essentially the identical architecture, as you might reasonably expect from good engineers developing a common concept. STRETCH and its descendant, the Itanium, constitute just one of those many points in the multidimensional technological space. The precision of this re-creation shows that we have very nearly reached the end of incremental improvements in computer architecture, so now it is time to take some radical steps if we are to continue to move forward. Note that some present-day coarse-grained FPGA proposals are traceable all the way back to the 1949 IBM-407 accounting machine, which was able to use similar methods to achieve electronic speeds using slow electromechanical components.

Once a few closed loops are formed, there is no more pressing need to add additional technologies. Project development will then predictably jump from point to nearby point, as needed to push out new products, without ever reaching a point where anything radically new is needed to produce the next product. This has created a situation where corporate managers now think they need only hire PhDs who understand the various technologies, and then those PhDs can work for the rest of their lives without ever having to leave the comfort zone of their own knowledge. This would ordinarily leave everyone in the industry vulnerable to a maverick corporation that willingly adopts obscure technologies, if not for the extreme expense of developing new chips. Now, any such maverick corporation would probably need a billion or so dollars just to "ante in" to this game. Hence, the challenge is to find some way for existing corporations to morph their methods to break out of this loop and conquer their competition. Laggards will be forced out of business.

3. ENGINEERING ISOLATION
I usually attend the yearly WORLDCOMP, which consists of 20 separate conferences on all aspects of computing. This keeps me current on the very latest thinking in everything ranging from computer design to AI. WORLDCOMP includes separate conferences on computer and FPGA design, including various panel discussions about the hot topics of the year. Dozens of PhDs present their work, pushing the frontiers of computing a bit further ahead.

One thing these conferences do not include is any representation from major hardware or software vendors, with some rare curious exceptions countable on the fingers of one hand. I routinely look up these very rare individuals to determine their place in their respective corporations, and their reasons for attending. Invariably I discover that they are not in a position to design anything for their employers, and have traveled on their own time and money, for the same reasons that I have come: to stay on top of computer technology. Meanwhile, major players like Microsoft hold their own conferences covering their own new products, and there are various separate conferences on supercomputers, FPGAs, etc., sprinkled around at various times and places.
In short, the entire industry is highly compartmentalized, and thereby effectively isolated from outside innovation.

4. SACRIFICING OUR FUTURE
One industry defense is the argument that most of the PhD theses and other outside work presented at conferences like WORLDCOMP is sophomoric, and hence not worth studying. This is the predictable result of keeping outside work in isolation. Entire development departments exist where no PhD students are being mentored, thereby sealing the last avenue of cross-pollination.

With no common conferences other than WORLDCOMP, no cross-specialty interaction because of the lack of manufacturer participation in WORLDCOMP, very few PhD students, etc., there can be little if any significant technological advancement (outside of the usual technology loops) from the isolated developers in each corporation. This is the underlying problem, not the questionable quality of outside work that is kept completely isolated from the harsh realities of fabrication. Sure, keep secret the proprietary methods that are presently being used to address the technological challenges at hand, but keeping the challenges themselves secret, e.g. the nature and distribution of real-world defects, is suicidal to the entire industry.

5. CORRUPTION OF PROCESS
Much of the computing community has lost the "re" in "research", corrupting it to mean the exposition of past developments, and specifically excluding speculation as to where things could go with careful (re)direction. Indeed, an earlier version of this paper was denounced on this basis by reviewers, with comments like "there are no analyses of any of these suggestions, only proposals", "it does not fit easily into any subtopic area", and "ideas are cheap, especially recycled ideas."

As a result, the entire field of high performance computing has been sucked into a bottom-up design process, simply by dismissing top-down discussions on the basis that they are not "research". Consequently, high performance computing still lacks an effective architecture, when all indications are that a top-down approach could probably have produced a good architecture a decade ago. This obviously cannot be corrected at compartmentalized conferences like the FPGA conference. Only multidisciplinary conferences like WORLDCOMP have any real hope of encouraging the top-down methods needed to merge the disparate technologies and forge the future of high performance computing.

6. WHERE WE SHOULD BE HEADING
It seems pretty obvious that each segment of the industry has some of the "missing pieces" needed by the other segments. Intel is now developing CPUs with embedded GPU-like capability, which could be greatly enhanced with some reconfigurable logic. Similarly, a processor embedded into FPGAs could orchestrate power-on fault-tolerant reconfiguration of the reconfigurable logic. Just about everyone already seems to agree that all memory must be coherent.

In short, it appears that each segment of the industry is proceeding toward a single common point, a computational singularity promising a hundredfold improvement in performance over traditional methods. However, the present looping of technology is tugging at every segment, pulling it away from that computational singularity. The main blockage seems to come from ignorance of various obscure enabling technologies.
7. ENABLING TECHNOLOGIES
The following synergistic enabling technologies are not merely additive, or even multiplicative. Projected advanced technology processors will need nearly all of these enabling technologies to function at all, because a very high level of parallelism is needed to support power-on and on-the-fly reconfiguration, and dynamic reconfiguration is needed to economically manufacture chips having so many components. Possibly the most challenging application that will ever be run on these processors will be contained in an attached ROM that is executed when power is applied, or when a fault is detected during execution, to reconfigure many faults "into the ether", leaving a fully functioning chip.

This "wall of obscure technology" presents a sort of chicken-or-egg situation that has so far blocked the construction of the envisioned advanced technology processors, and hence denied the hundredfold improvement in performance that is expected from such architectures. Corporations limit their technological risks by taking risks one at a time, and therefore wouldn't think of trying many new things on a single new device. These enabling technologies are not individually transformative, so they have remained obscure.

Note that, taken together, these enabling technologies define a processor that is very different from present-day processors, and would look a bit like "alien technology" to present-day technologists. I think of this as the "rubber band effect". These methods were individually rejected because they presented no great advantage, providing an ever increasing incentive to embrace other new methods as they appear. Now, that rubber band is stretched quite tight, providing a gigantic incentive for adopting enabling technologies.

WARNING: Most of these technologies have outward characteristics that are very similar to other, more common but less powerful methods. This has resulted in many experts dismissing them, in the decades since they were first proposed, after mistaking them for something else. Indeed, this phenomenon has become an important driving force behind the continuation of the Itanium Effect.

7.1 Logarithmic Arithmetic
Everyone is taught in school that you can multiply and divide with logarithms, but not add and subtract. This is simply not true (see below). Pipelined logarithmic ALUs are much simpler than floating-point ALUs, as they only need three adders, a small ROM, and some glue. Unfortunately, they are consigned to low precision. They enable super performance for low-precision applications, like image and speech recognition, neural networks, etc., and make medium-granularity designs quite practical.

Note that, for the most part, future supercomputers will be working on very different problems than present-day computers do. Future computing will involve AI applications centered on visual, audio, and neural network (NN) applications, all of which deal in low precision that is within the range of logarithmic arithmetic.

Adding with logarithms is easy:
1. Take the ratio of the two arguments. Since the arguments are represented as the logarithms of numbers, you can divide by simply subtracting their logarithms.
2. Use that ratio to look up the appropriate entry in a "fudge factor" table that contains the logarithms of fudge factors. The size of this table limits the precision available with this method to less than IEEE-754 single precision. Interpolation methods extend the available precision. Carefully computed table entries avoid consistent round-off errors, much as IEEE-754 does.
3. Multiply the numerator by the fudge factor, which is accomplished by adding their logarithms.
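To make the three steps above concrete, here is a minimal sketch of log-domain addition in C, assuming base-2 logarithms; the 1024-entry table, its index scaling, and the omission of interpolation are illustrative choices, not values taken from the paper.

/* Minimal sketch of log-domain addition: values are represented as their
   base-2 logarithms.  Table size and index scaling are illustrative. */
#include <math.h>
#include <stdio.h>

#define FUDGE_BITS   10                 /* 1024-entry fudge-factor table */
#define FUDGE_SIZE   (1 << FUDGE_BITS)
#define FUDGE_RANGE  16.0               /* table covers d in [0, 16)     */

static double fudge[FUDGE_SIZE];        /* fudge[i] = log2(1 + 2^-d_i)   */

static void init_fudge(void)
{
    for (int i = 0; i < FUDGE_SIZE; i++) {
        double d = (i + 0.5) * FUDGE_RANGE / FUDGE_SIZE;  /* cell midpoint */
        fudge[i] = log2(1.0 + pow(2.0, -d));
    }
}

/* Given la = log2(a) and lb = log2(b) for positive a and b,
   return an approximation of log2(a + b). */
static double log_add(double la, double lb)
{
    if (lb > la) { double t = la; la = lb; lb = t; } /* la is the larger ("numerator") */
    double d = la - lb;                  /* step 1: divide by subtracting logs */
    if (d >= FUDGE_RANGE) return la;     /* b is negligible relative to a      */
    int i = (int)(d * FUDGE_SIZE / FUDGE_RANGE);     /* step 2: table lookup   */
    return la + fudge[i];                /* step 3: multiply by adding logs    */
}

int main(void)
{
    init_fudge();
    printf("log2(3+5) ~ %f (exact %f)\n", log_add(log2(3.0), log2(5.0)), log2(8.0));
    return 0;
}

In hardware terms, the subtraction, the table lookup, and the final addition map directly onto the three pipeline stages described above.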
Subtraction and signed arguments are handled with obvious extensions of this simple strategy, involving the use of two sign bits: one for the sign of the logarithm of the absolute value of the number being represented, and the other for the sign of the number being represented. Since most of the "logic" of logarithmic arithmetic is contained in the contents of its tables, SECDED error correction logic can correct most faults, and detect the faults that it cannot correct.

Note that the practical success of power-on and on-the-fly reconfiguration depends on having a high enough flops/transistor ratio to support real-time diagnosis and reconfiguration, so the speed and simplicity of logarithmic arithmetic make it a clear winner in these devices.

7.2 Medium-Grain and Multi-Grain FPGA Architectures
Until now, everyone designing FPGAs has designed either with full ALUs (coarse granularity) or with just gates (fine granularity). However, logarithmic arithmetic and "fixed point" arithmetic (like integer, but the decimal point can have any pre-assigned location) require much simpler blocks. Full IEEE-754 floating-point ALUs can be chopped both horizontally (into functional units) and vertically (into digits). These blocks can be combined to achieve full pipelined result-per-clock-cycle capability as needed, though at the cost of additional flow-through time.

The pieces of a logarithmic ALU are adders and tables with attached lookup logic. The pieces of full IEEE-754 floating-point ALUs are priority encoders, shift matrices, adders, multipliers, etc., depending on the implementation. These pieces can be used for other things; e.g., binary-multiplying a logarithm by an integer is actually raising the number represented by the logarithm to the integer exponent.

By providing a "parts store" of both ALU pieces and full ALUs in typically needed proportions, akin to the letter assortment in a Scrabble game, users can create "super duper operations" that use most of them, then instantly reconfigure them (see Horizontal Microcoding below) to do other, different "super duper operations" as needed. The goal here is not (yet) to provide everything needed to do entire programs in a single data-chained operation, but rather to simply reduce the number of times that the same data needs to be (re)handled, by an order of magnitude or so. Note that in most cases this obviates some of the need for an optimizing compiler, because there is no benefit to optimizing a configuration unless, without optimization, there aren't enough components available to perform a complex series of computations and produce a result every clock cycle.

Multi-grain would doubtless be more complex to use than either fine- or coarse-grained approaches, but its use would improve the flops/transistor ratio to facilitate real-time reconfiguration.

7.3 Coherent Memory Mapping
Often, "local" memory has been attached to a particular ALU, and "global" memory has been attached to a particular bus. This isolation of local memory is needless, and associating "global" memory with a particular bus sucks performance. Multiple memory busses and interleaved memories have been used since the 1960s to exceed single-bus performance, and it takes very little logic to attach local memories to global bus systems so that all memory is uniquely addressable (coherent).
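As a minimal illustration of making every local memory uniquely addressable, the following sketch treats a flat global address as a (cluster, local offset) pair; the field widths and the 32-bit address size are assumptions made for illustration, not taken from the paper.

/* Sketch of one possible coherent address encoding: every word of every
   cluster's local memory also has a unique global address.  Field widths
   below are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define LOCAL_BITS    16u            /* 64 K words of local memory per cluster */
#define CLUSTER_BITS  12u            /* up to 4096 slave clusters              */

typedef struct {
    uint32_t cluster;                /* which cluster's local memory           */
    uint32_t offset;                 /* word offset within that local memory   */
} phys_loc;

/* Decode a flat global address into (cluster, offset). */
static phys_loc decode(uint32_t global_addr)
{
    phys_loc loc;
    loc.offset  = global_addr & ((1u << LOCAL_BITS) - 1u);
    loc.cluster = (global_addr >> LOCAL_BITS) & ((1u << CLUSTER_BITS) - 1u);
    return loc;
}

/* Encode (cluster, offset) back into a flat global address. */
static uint32_t encode(uint32_t cluster, uint32_t offset)
{
    return (cluster << LOCAL_BITS) | (offset & ((1u << LOCAL_BITS) - 1u));
}

int main(void)
{
    uint32_t a = encode(37u, 0x1234u);
    phys_loc l = decode(a);
    printf("global 0x%08x -> cluster %u, offset 0x%04x\n",
           (unsigned)a, (unsigned)l.cluster, (unsigned)l.offset);
    return 0;
}

With such a mapping, saving the state of a subsystem, as discussed next, reduces to copying out a contiguous range of global addresses.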
Note that it is important to provide redundant busses if very large chips are to be made reliably. The trick is to provide many redundant busses, and then use whatever works and is available. In addition to "in use" bits, the busses would also need "functional" bits, along with the logic to honor and operate these bits and to use whatever bus is ready to carry the traffic. Switch fabric architecture is still evolving. Note that by simply organizing clusters of functional components in a 2-D fashion on the chip, and providing redundant busses for every row and column, it becomes possible for many cluster-to-cluster communications to take place simultaneously. As a result, traditionally slow operations like scatter and gather can run at many times traditional speeds.

However, the biggest payoff from coherent memory is in its ability to easily save the state of a subsystem by simply copying out the memory in that subsystem. Without coherent memory, partial reconfiguration would be extremely difficult.

7.4 Variable Data Chaining
The idea of pasting ALUs together in chains, as is now done in some coarse-grained FPGA designs, was called "data chaining" in early supercomputers like the CDC Cyber 205. They simply switched ALU connections from memory to pipeline registers, in order to interconnect ALUs and get a sort of simplistic coarse-grained FPGA-like performance. Hence, the perceived distinction between coarse-grained FPGA designs and multi-ALU data chaining in supercomputers is illusory. By simply providing pipeline registers between the 2-D arranged clusters, and configuring ALU ports to connect to those pipeline registers, otherwise isolated slave processors can chain together to perform extremely complex operations.

7.5 Fast Aggregation across ALUs
One thing missing from all contemporary designs, and now presenting a major barrier to GPU advancement, is the interaction between parallel ALUs needed to perform aggregation functions, like adding the elements of an array or finding the maximum value, in log n time instead of n time. First every even ALU interacts with the adjacent odd ALU, then the winners interact with other winners, and so on. Graphics applications are unique in their general lack of need for aggregation, which has been the basis for GPU successes in that arena, and the basis for their lackluster performance in other areas. Aggregation also facilitates the merging of thousands of individually performed diagnostics in thousands of slave processors, thereby speeding up real-time reconfiguration.
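To make the log n aggregation pattern concrete, here is a minimal sketch of pairwise combining at doubling strides; the serial loops stand in for rounds that the ALUs would execute in parallel, and the ALU count is an illustrative assumption.

/* Minimal sketch of log(n) aggregation: pairwise combining across ALUs at
   doubling strides.  Each pass of the outer loop is one hardware "round";
   the combines within a round are independent and would run in parallel. */
#include <stdio.h>

#define N_ALUS 8                      /* illustrative number of slave ALUs */

static double combine(double a, double b) { return a + b; }  /* or a max, etc. */

/* Reduce alu[0..n-1] into alu[0] in ceil(log2(n)) rounds. */
static double aggregate(double alu[], int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            alu[i] = combine(alu[i], alu[i + stride]);
    return alu[0];
}

int main(void)
{
    double alu[N_ALUS] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g in ~log2(%d) rounds\n", aggregate(alu, N_ALUS), N_ALUS);
    return 0;
}

Replacing combine() with a maximum, or with a merge of per-slave fault flags, gives the other aggregations mentioned above, including the merging of distributed diagnostic results.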
7.6 Blurring the SIMD/MIMD Distinction Using Small Local Program Memories
A certain amount of temporal autonomy is needed for slave processors to deal with slow operations, re-routing over busy and defective busses, memory cycles lost to other processors, waiting for slow global busses, etc. A small amount of FIFO or memory would provide the buffering needed to hold a few slave-processor instructions broadcast by the central processor. However, it would take little more to implement a rudimentary instruction set in that memory, complete with conditional branch instructions, etc. With this, individual slave processors could each do their part to perform complex operations, without holding up the entire processor when they individually slow down or stumble into each other. This could provide the combined advantages of SIMD and MIMD architectures.

Note that code running in faulty configurations will doubtless run into many roadblocks. Slave processors must be able to function autonomously, in order to continue running when other slaves have died.

7.7 A Simple Horizontal Microcoding Interface for Applications
HP is believed to have been the first to make some of its horizontal microprogramming memory accessible to applications programmers. This was implemented as an option on HP 21-MX minicomputers. This way, users could define new operations that ran at hard-wired speed. In FPGA terms, this is akin to instant partial reconfiguration from memory.

Instead of the usual FPGA design that uses long shift registers to hold a particular configuration, suppose that some of the configuration bits are replaced with several bits and a global mechanism for selecting which bit from each group to use. This could be as simple as having short circular shift registers controlling each potential connection, and a global mechanism for rotating all of the circular shift registers in unison. Loading would work as usual, only when all of the bits have been loaded for one configuration, the short circular shift registers would all be rotated by one bit and loaded with new contents for that position, with this process continuing until all of the bits in every circular shift register have been loaded. Runtime reloading, equivalent to redefining the operation codes of the computer being implemented, could be accomplished by rotating the circular shift registers into position and transferring the contents of a dedicated place in memory into the main FPGA programming input, while inhibiting changes to any memory or registers on the device during reprogramming.

This way, compilers could invent perfect "operation codes" that are custom made to perform the work of a page of code, and use just about everything on large devices to achieve incredible speed. This is also critical for diagnostics in preparation for reconfiguration, because it allows access to the basic "grains" of the system. Without it, really complex diagnostics would be needed to deduce the sources of malfunctions from combinations of observed failures in various complex configurations.

7.8 Failsoft Configuration on Power-Up
Blowing fuses during manufacture to deal with faults makes no provision for in-service failures. However, (re)configuring on power-up cures in-service failures and assures a nearly limitless lifespan. This is no problem with a general-purpose programmable device that is fast enough to do the job in a reasonable amount of time. However, this functionality establishes a high lower limit on the performance of future processors, as processors must be able to fully diagnose and repair numerous malfunctions on power-up within a second or so. Note that advanced configuration methods typically involve genetic algorithms (GA) to discover workable configurations, so the time needed to configure is variable and potentially unbounded. Hence, a large engineering margin in performance will be needed to assure timely (re)configuration.

Further, devices must incorporate appropriate technologies to support dynamic reconfiguration. This presents a chicken-or-egg challenge, as super-performance is needed for practical power-up reconfiguration, and power-up reconfiguration is a practical necessity to implement super-performance at the high levels envisioned here.

Once this has been achieved, there is no limit on the size and complexity of future computers while maintaining ~100% yield, provided that sufficient spares are included in the design and that there is fall-back to smaller configurations in the event of an excessive number of failures. For example, a chip might have dual main processors where only one is needed, and, say, 4096 slave processors plus a few hundred spares, configured in a way that the system could run with 2048, or even 1024. Further, the main processors might have IEEE-754 floating-point ALUs, which could be emulated in software should the ALUs be faulty. There would be countless fuses in the power distribution network to isolate any shorted logic, etc. In short, if much of anything worked, the chip would work well enough to sell, at least into applications that didn't need the maximum potential performance. This would completely remove the present tradeoff between chip size and yield.
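The power-up policy just described, diagnosing every slave and then falling back to the largest configuration the survivors can populate, can be sketched as follows; the slave counts, the fallback ladder, and the self-test stub are illustrative assumptions, not design data from the paper.

/* Sketch of a power-on reconfiguration policy: test every slave, then pick
   the largest supported configuration that the surviving slaves can fill. */
#include <stdio.h>

#define PHYSICAL_SLAVES 4300          /* e.g. 4096 wanted plus a few hundred spares */

static int fallback_sizes[] = { 4096, 2048, 1024, 0 };   /* 0 = chip unusable */

/* Stand-in for the real per-slave diagnostic run from the boot ROM. */
static int slave_self_test(int id)
{
    return (id % 1000) != 999;        /* pretend roughly 1 in 1000 slaves is defective */
}

int main(void)
{
    static int good[PHYSICAL_SLAVES]; /* ids of slaves that passed diagnostics */
    int n_good = 0;

    for (int id = 0; id < PHYSICAL_SLAVES; id++)
        if (slave_self_test(id))
            good[n_good++] = id;

    /* Choose the largest configuration that the good slaves can populate. */
    int chosen = 0;
    for (int i = 0; fallback_sizes[i] != 0; i++)
        if (n_good >= fallback_sizes[i]) { chosen = fallback_sizes[i]; break; }

    printf("%d of %d slaves passed; configuring as a %d-slave machine\n",
           n_good, PHYSICAL_SLAVES, chosen);
    /* A real implementation would now route the first `chosen` entries of
       good[] into the logical slave map and retire the rest as spares.    */
    return 0;
}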
7.9 Failsoft Partial Reconfiguration During Execution
It is also possible to survive most new faults during operation. Mostly implemented in the firmware running on a future processor, applications that run as multiple tasks that post their results when done could continue operation even with the failure of associated computational components. Applications would use assert logic to confirm correct operation, and watchdog timers to recognize dead tasks. To further improve the likelihood that in-service failures will be recognized and corrected, idle time should be spent running diagnostics, diagnostic failures should be used to invalidate recent results, and diagnostics should be invoked before accepting highly unusual results.

When a task is seen to be malfunctioning, a partial reconfiguration would be triggered, using the same (re)configuration logic used on power-up, but restricted to the subset of hardware used by the malfunctioning task. When reconfiguration is complete, the failed task would be restarted on the reconfigured hardware. A repeated identical failure would indicate a programming error.

Fail-soft partial reconfiguration would require a suitable application program architecture, and would introduce a considerable momentary "glitch" into DSP applications. The alternative to such glitches is the present situation of permanent, irreversible component failure.
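A minimal sketch of the supervision loop implied here: a watchdog notices a silent task, only that task's hardware subset is reconfigured, the task is restarted, and a repeat failure is flagged as a likely programming error. The task structure, the tick-based watchdog, and the reconfiguration stub are illustrative assumptions.

/* Sketch of runtime fail-soft supervision: watchdog, subset reconfiguration,
   task restart, and escalation on repeated failure. */
#include <stdio.h>

typedef struct {
    const char *name;
    int hw_subset;        /* which cluster(s) the task was mapped onto      */
    int ticks_silent;     /* watchdog: ticks since the task last posted     */
    int failures;         /* failures seen on freshly configured hardware   */
} task_t;

#define WATCHDOG_LIMIT 3

static void reconfigure_subset(int hw_subset)
{
    printf("  reconfiguring hardware subset %d around the fault\n", hw_subset);
}

static void restart_task(task_t *t)
{
    printf("  restarting task %s\n", t->name);
    t->ticks_silent = 0;
}

static void watchdog_tick(task_t *t)
{
    if (++t->ticks_silent < WATCHDOG_LIMIT)
        return;                              /* task is still presumed alive */
    if (++t->failures >= 2) {                /* failed again after reconfig  */
        printf("task %s: repeated failure - suspect a programming error\n", t->name);
        return;
    }
    printf("task %s: watchdog expired\n", t->name);
    reconfigure_subset(t->hw_subset);        /* same logic as power-up, but  */
    restart_task(t);                         /* restricted to this subset    */
}

int main(void)
{
    task_t fft = { "fft", 7, 0, 0 };
    for (int tick = 0; tick < 4; tick++)     /* simulate a task that never posts */
        watchdog_tick(&fft);
    return 0;
}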
7.10 Physically Symmetrical Pinout to Facilitate the Use of Defective Components
By carefully assigning the pinout of new chip designs, it could easily become possible to plug them in in any of up to four different ways, with as many separate-but-equal processors and associated I/O pins as there are ways of plugging them in. This way, there would be up to four prospective pins #1. After testing, a factory technician would cut off the corner pin associated with each fully functional processor. A circuit board designer could then eliminate any combination of corner pin holes, depending on which processors and I/O pins were not essential to the design. Typical circuit board designs would have three missing corner pin holes, so that a fully functioning chip could be plugged in any of four different ways (because all of its corner pins would be missing). However, a chip with a malfunctioning processor or I/O pin(s) could be plugged in only one way, with the associated pin #1 connecting to the one remaining corner pin hole. Note that hexagonal chips could be plugged in any of six ways, and only 1/6 of the chip would be lost to a bad I/O pin.

Designers would be motivated to use fewer I/O pins, because malfunctioning chips would be less expensive than ones where all of the processors and I/O pins function correctly. Designs needing less than half of the I/O pins would provide additional corner pin holes on the circuit board, so that chips with even more bad sections could be utilized.

Another pin (perhaps pin #2) would be asserted on the circuit board to identify the "primary" section of the chip, so that the chip can tell which section is to control the others, and to identify the chip's rotational position on the board, so that the functions of the I/O pins can be properly determined. Of course a designer could use all of the processors and all but two pins in every corner, but then only fully functional chips could be utilized.

Symmetrical pinout is the last line of defense for when fail-soft methods fail. It provides the factory with a method of selling devices that cannot fully repair themselves, and provides service personnel with a method for repairing equipment incorporating devices that can no longer fully repair themselves. Repair personnel would simply remove a malfunctioning device, rotate it, and reinstall it.

7.11 An Architecture-Independent Universal Compiler to Compile a New APL-Level Language
There are several candidate compiler architectures that could compile to almost any imaginable computer. Methods range from rewriting the code-generator portion, to rewriting table entries containing snippets that perform various functions, to using a syntax-directed meta-compiler and simply altering the output instructions. All of these methods and more have been used in commercial production compilers, but complicating factors, like the desire for code optimization, muddy the waters. Note that there are orders of magnitude to be gained from a radically new compiler, yet optimization typically buys less than 2:1. Hence, for the time being, until architectures make their radical jump and then settle down, optimization concerns should be set aside in favor of getting the technology off of its present hump.

A major complication is the need for a new source language with semantics akin to APL in which to program high performance applications. A truly flexible and capable compiler for compiling from an APL-level language to variable targets is desperately needed for the high-performance industry to progress. This can only come as an industry-wide effort, or at government expense. If a new language is being designed, it should probably also have Ada-like variable declarations to facilitate both compile-time and run-time checking, and support a COBOL-like verbose listing mode to facilitate "desk checking".
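As a toy illustration of the "rewriting table entries" approach to retargeting, the following sketch holds one APL-level operation and several per-target code templates; the target names and the emitted text are invented for illustration only and do not describe any existing compiler.

/* Toy illustration of table-driven retargeting: one APL-level operation,
   several per-target code templates selected by name. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *target;      /* name of a hypothetical back end             */
    const char *vec_add;     /* template for elementwise A + B -> C         */
} codegen_entry;

static const codegen_entry table[] = {
    { "scalar-cpu",  "for (i = 0; i < n; i++) C[i] = A[i] + B[i];"  },
    { "log-alu",     "CONFIG log_add_chain; STREAM A,B -> C;"        },
    { "multi-grain", "LOAD ucode vec_add; CHAIN alu[0..k] A,B -> C;" },
};

/* Emit code for "C <- A + B" on the chosen target, or report it as unknown. */
static void emit_vec_add(const char *target)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].target, target) == 0) {
            printf("%s: %s\n", target, table[i].vec_add);
            return;
        }
    printf("%s: no code generator table entry\n", target);
}

int main(void)
{
    emit_vec_add("scalar-cpu");
    emit_vec_add("multi-grain");
    return 0;
}

Retargeting the compiler then amounts to supplying a new column of table entries rather than rewriting the compiler itself, which is the property the text argues is needed while architectures are still in flux.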
7.12 Putting It All Together
It seems clear that it is possible to build self-repairing processors of nearly limitless size and performance, and to manufacture them with ~100% yield, with reasonable extensions to the technology at hand. However, their architecture won't be a CPU, GPU, or FPGA. Instead, it will be a combination of all of these and a little more. Such a processor could run languages like C++ just as inefficiently as present products do, but would need a language with APL-like semantics to support future capabilities.

Further, there are some big money hurdles to overcome. Probably the biggest hurdle is the need for a universal compiler to support future high-performance computing. Is a hundredfold improvement in performance worth the ~billion dollar cost to engineer such devices? That is the question that should logically determine whether this path is taken.

Note the prior call for a technological revolution of computation, made 15 years ago by André DeHon in his PhD thesis Reconfigurable Architectures for General-Purpose Computing. His technology was eventually commercialized into a company called Silicon Spice, which was sold to Broadcom at the height of the tech boom for ~$1.2 billion. The technology of that time restricted practical implementation to DSP applications, whereas no such limitations now persist. However, with more complex methods supporting loftier goals, including general purpose computing, the cost will be considerably higher.

8. BEFUDDLEMENT
If you ask others why they are in their particular corner of the industry, or ask yourself why you are in yours, the answer usually boils down to befuddlement with the unknown worlds outside of a particular domain. Nowhere is this more clear than with memory designers, who usually want absolutely nothing to do with the internal complexities of computer architecture, operating systems, etc. There is a similar though lesser effect with FPGA designers. At the opposite end of the spectrum are the operating systems and applications designers, few of whom have any experience with an oscilloscope or logic analyzer. Just about everyone sees the industry as way too complex to think about in any sort of gestalt way. However, with the industry chopped up into various fixed domains that are gradually becoming less and less appropriate for future computational needs, the gaps in other-domain capability are being filled in with extreme in-domain complexity. In this mode, we all become part of the problem, rather than part of the solution.

There was a similar situation during the 1960s, when radically new computers were being introduced on a yearly basis. Many excellent technologists simply dropped out of that scramble to keep up with technology. I anticipate the same phenomenon during the transition to the envisioned advanced technology processors, where few present technologists will be able to keep up with the coming radical changes.

9. COMPUTATIONAL SINGULARITY
Here are three differently stated but identical architectures, described from three different points of view:
1. An array processor whose slave processors are much more powerful than in the past, including reconfigurable logic, so that on-the-fly array processing instruction definition becomes practical.
2. A multi-grain FPGA with ALUs and parts of ALUs, memory banks, etc., organized into reconfigurable clusters under the control of central processor(s).
3. GPUs with reconfigurable logic, integrated into a CPU.

Each of these descriptions presumes the application of the necessary enabling technologies explained earlier. So, where does the hundredfold improvement in performance come from?
1. Future projected applications won't need the precision of present devices, so precision can be traded for speed, e.g. more logarithmic ALUs in place of fewer floating-point ALUs. This speed will also facilitate fast power-on diagnostics.
2. With the methods presented, much larger chips can be fabricated at the same overall cost, because defects will no longer impair the yield. Greater size brings the components needed to do more in parallel, along with the assortment of components needed to implement long computational pipelines.

10. CONCLUSION FOR DEVELOPERS
Just because your co-workers remain isolated is no reason for you to do the same. Attend WORLDCOMP and other conferences that are outside your present scope of work. Present papers that discuss your vision for the distant future of your present scope of work. Make it known at local universities that you will mentor PhD students.
Make friends who are familiar with potentially important future technologies, e.g.:
1. Someone who worked on 1980s supercomputers.
2. Someone who has extensive computing experience that predates microcomputers.
3. A CS/EE professor who is up on just about everything that has ever been made.

Soon, you will become known as THE guy who knows just about everything. You will soon be able to move into management, whereupon…

11. CONCLUSION FOR MANAGEMENT
The issues presented in this paper all center around what would have been called "professionalism" in the era before microcomputers. Professionals would represent their employers at multidisciplinary conferences, present papers at conferences, stay up on all potential enabling technologies, mentor students, etc. This sense of professionalism has been completely lost in "modern" technology, and with it has gone the productivity of the designers at the major chip makers. Everyone is now threatened by this lack of a sense of professionalism.

Thirty years ago I might have recommended simply firing such designers for cause and hiring more professional designers, but that time has passed, and we must now "dig our way out" of this mess using the people we now have working in the industry. Rather than taking years or decades to do so slowly, my present recommendation is to draw up clear company standards of professionalism that ALL product designers must follow, and to demote or fire those who don't follow them. Sure, you may have to make an example of one or two technically capable developers, but this is necessary to demonstrate your resolve. Soon everyone will be attending conferences other than just the "inside" conferences that pertain only to their present narrow work. Everyone will mentor PhD students, present papers inviting public scrutiny before committing designs to silicon, bring in both very new and very old enabling technologies, etc.

12. OVERALL CONCLUSION
This paper advances an approach to achieving a renaissance in computational architecture, promising a hundredfold increase in performance. However, the big challenge here is organizational rather than architectural. Existing microcomputer, GPU, and FPGA manufacturers have been unable or unwilling to adapt to produce these products, or this would have happened a decade or more ago. This appears to require revolutionary management changes, presenting an excellent opportunity for takeover bids by astute investors who are not attached to present methods of engineering management.

13. REFERENCES
[1] Strenski, D., Sundararajan, P., and Wittig, R. 2010. The Expanding Floating-Point Performance Gap Between FPGAs and Microprocessors. HPCwire, November 22. http://www.hpcwire.com/features/The-Expanding-FloatingPoint-Performance-Gap-Between-FPGAs-andMicroprocessors-109982029.html?viewAll=y
[2] Computer History Museum's collection of IBM Project STRETCH materials. http://www.computerhistory.org/collections/ibmstretch
[3] WORLDCOMP web site. http://www.world-academy-of-science.org/
[4] Tandem Computers Inc. company history. http://www.fundinguniverse.com/companyhistories/TANDEM-COMPUTERS-INC-CompanyHistory.html
[5] Floating Point Systems. http://en.wikipedia.org/wiki/Floating_Point_Systems
[6] DeHon, André. 1996. Reconfigurable Architectures for General-Purpose Computing. A.I. Technical Report No. 1586. http://www.seas.upenn.edu/~andre/pdf/aitr1586.pdf
[7] Pickett, L. 1993. U.S. Patent 5,184,317. A discussion of advanced logarithmic arithmetic methods.
[8] Richfield, S. 1987. A Logarithmic Vector Processor for Neural Net Applications. Proceedings of the IEEE First Annual International Conference on Neural Networks. IEEE Catalog #87TH0191-7.
[9] Buchholz, Werner. 1962. Planning a Computer System: Project Stretch. McGraw-Hill, Hightstown, NJ, USA.
[10] Ashenhurst, R. L., and Metropolis, N. 1959. Unnormalized Floating Point Arithmetic. Journal of the ACM 6, 3, 415-428.