Intelligent RAM (IRAM): the Industrial Setting, Applications, and Architectures
David Patterson, Krste Asanovic, Aaron Brown, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Stylianos Perissakis, Randi Thomas, Noah Treuhaft, and Katherine Yelick
Computer Science Division, University of California, Berkeley CA 94720-1776
Abstract

The goal of Intelligent RAM (IRAM) is to design a cost-effective computer by designing a processor in a memory fabrication process, instead of in a conventional logic fabrication process, and including memory on-chip. To design a processor in a DRAM process one must learn about the business and culture of the DRAM industry, which is quite different from that of microprocessors. We describe some of those differences, and then our current vision of IRAM applications, architectures, and implementations.

1. Potential and Challenges of IRAM

Intelligent RAM (IRAM) may lead to a different style of computer than those based on conventional microprocessors. IRAM technology offers the following potential:

• Improve memory latency by factors of 5 to 10 and memory bandwidth by factors of 50 to 100, by redesigning the memory interface and exploiting the proximity of on-chip memory [1][2];
• Improve energy efficiency of memory by factors of 2 to 4, primarily by going off-chip less frequently [3][4];
• Reduce design effort tenfold by filling the die with replicated memory rather than with custom logic [5];
• Make the memory size and organization fit the intended workload;
• Reduce board area by factors of 4 or much greater by integrating many components on a single chip; and
• Improve I/O bandwidth by factors of 4 to 8 by replacing the conventional I/O bus with multiple high-speed, point-to-point, serial lines [6-8].

This list makes IRAM an exciting opportunity.

One IRAM challenge is matching the performance of microprocessors. Performance obstacles include:

• IRAM is fabricated in a process that has been oriented towards small memory size and low charge leakage rather than fast transistor speed;
• This same DRAM fabrication process offers fewer metal layers than a logic process to lower costs, since routing speed is less of an issue in a memory;
• DRAMs are designed to work in plastic packages and dissipate less than 2 watts, while desktop microprocessors dissipate 20 to 50 watts using ceramic packages;
• DRAM refresh rates go up with operating temperature, approximately doubling for every 10 degrees C rise;
• Some applications may not fit within the on-chip memory of an IRAM, and hence IRAMs must access either conventional DRAMs or other IRAMs over a much slower path than on-chip accesses.

Another major IRAM challenge is matching the cost of DRAM memory. Cost obstacles include:

• DRAMs include redundant memory so that fabrication flaws can be circumvented to improve yield and therefore lower cost. Microprocessors traditionally have no redundant logic to improve yield. Hence the on-chip logic may effectively determine the yield of the IRAM.
• Testing time affects chip costs. Given both logic and DRAM on the same die, an IRAM die may need to be tested on both logic and memory testers.
• To help close the performance gap for logic in a DRAM process, merged-logic DRAM processes are being created with faster transistors and more metal layers, which increases the cost per wafer by 10% to 30%.

The business model for IRAM also has challenges. Although an IRAM may be classified as a single chip computer and sold like desktop or embedded microprocessors, the initial companies most interested in pursuing IRAMs are DRAM companies, and they generally have little experience in the microprocessor market. Some challenges are:

• DRAMs are “generic” parts, used in many places without impacting the software. Putting a processor in the DRAM limits the software that can run on the IRAM.
• The DRAM economic model depends on producing a very high volume of parts (billions of DRAMs are made each year), while some microprocessors sell less than a million per year.
• DRAM companies do not need to worry about a supply of support and application software for their chips. IRAM would change that requirement.

This paper first goes into more depth on the DRAM industry to motivate initial solutions to the IRAM challenges, looks at potential IRAM applications and architectures, and then concludes with our target implementation alternatives that could be taped out in 1999.

2. DRAM and Microprocessor Industries

Figure 1 highlights some of the differences between the DRAM industry and the desktop microprocessor industry. DRAM companies agree on new standard interfaces for new generations and configurations of DRAMs. These standards include almost everything: pinout, package, addressing, refresh rates, and so on.

Each microprocessor manufacturer generally sets its own instruction set standards to ensure software compatibility with prior generations, but is free to invent new interfaces with different packages and pins, different memory interfaces, and so on. Whereas microprocessors follow their own architecture standards with varying implementations over time, DRAM manufacturers standardize at the package level and innovate in the size of the memory cell and in the efficiency of the manufacturing process.

                      DRAM                                Microprocessor
Standards             pinout, package, refresh rate,      binary compatibility,
                      addressing, capacity, width,        IEEE 754 Floating Point,
                      fast transfer mode, failure rate    I/O bus
Sources               multiple                            single
Key figures of merit  1) capacity, cost/bit               1) performance on standard
                      2) bandwidth                           benchmarks
                      3) latency                          2) cost
Rate of improvement   1) 60%/year, 25%/year               1) 60%/year
                      2) 20%/year                         2) little change
                      3) 7%/year

Figure 1. Business models of DRAM and desktop microprocessor industries.

2.1. Differing Design Targets

Not only does the multiple source versus single source business model affect the design of the chips, but the figures of merit also vary between the two cultures. DRAM designers pride themselves on improving storage capacity per chip by fourfold every three years (60%/year) and on having the smallest memory cell so as to have the lowest cost per bit. The capacity increases are generally achieved by reducing cell size by about a factor of 2.5 and increasing the die size by a factor of 1.5. The increasing die size is a major reason that the cost per bit changes more slowly than the capacity per chip. A secondary design target is bandwidth in the fast access mode, and the trailing concern is latency to access a random bit in memory.

A DRAM company’s business goal is typically to supply 10% of a single DRAM generation. As there were 6.25 billion DRAMs shipped in 1996, such an apparently modest target can lead to hundreds of millions of chips.

Desktop microprocessor designers tend to have a design cost target, expressed as die size, and then build the fastest chip they can for that size. Microprocessor volumes have more to do with an instruction set target than with the actual final performance, but given that microprocessor designers generally do not get to pick the instruction set, they aim for the highest performance. Of secondary importance is the cost. Recently the power dissipation has become so high that it is now a concern.

Embedded microprocessor designers have much lower cost targets and power budgets, and are more likely than designers of desktop microprocessors to sacrifice performance to meet the cost and power budgets.

The different figures of merit for memory designers and microprocessor designers have resulted in a performance gap between processor and memory in computer systems. The primary approach to bridging this gap has been increasing the amount of SRAM on a microprocessor to act as a cache. Today, many microprocessors dedicate between one-third and two-thirds of the area on chip to these caches.[2] Moreover, today there are often external SRAM chips to build secondary caches. Such chips add to cost and increase board area.

2.2. Differing Generation Strategies

Traditionally, DRAM manufacturers would design a new memory cell and a new fabrication process simultaneously. The company then produces tens of thousands of “engineering samples” until both the fabrication process and memory cell design are fully “characterized.” Characterization means that the resulting dies will operate at minimum refresh rates over the full temperature range, supplying data with acceptable bit error rates.

Once characterized, the subsequent chips are at the “first customer ship” milestone. There may also be a separate milestone of “mass production” when the part achieves the high volume that DRAM manufacturers strive for. Given that all DRAM manufacturers use the same semiconductor fabrication equipment and same wafers, the time to these milestones can determine what share of the market a company will achieve.

The size of the die, testing time, and yield determine the profit of a company that has a sizeable market share. As a result, DRAM manufacturers are much more secretive about Spice parameters and design rules than microprocessor companies. To lower costs they shrink the die to increase the number of chips per wafer and improve the fabrication process to improve yield. As they better understand a process, they will reduce the testing time and may even reduce the number of spare rows and columns to get slightly smaller dies. DRAMs typically go through 3 to 4 generations of die sizes over a 4 to 6 year lifetime.

Recently, DRAM manufacturers have separated the process and memory cell size from the capacity of the die. Hence the same line might make third generation 64 Mbit and first generation 256 Mbit parts depending on the demands of the market. Today it makes more sense to talk about the generations of memory cell size and process rather than just the generation of, say, a 64 Mbit part.

Once in mass production, DRAM die yields below 60% are considered disastrous. Such high yields come from small die, low defect density, and using redundant rows and columns to repair some flaws. Although real yields are closely guarded secrets, yields of 80% or 90% are apparently achieved by some efficient manufacturers.

Microprocessor manufacturers generally are not as tightly tied to the fabrication process as are DRAM designers. In fact, there are several “fabless” microprocessor manufacturers, but no major “fabless” DRAM manufacturers. Microprocessor designers tend to not worry as much about fully characterizing a design. The key milestones tend to be tape out, booting the operating system on an early chip, and then mass production, which occurs when the system using the chip is also shipped. Intel, which ships 10 to 100 times the volume of other microprocessor manufacturers, spends much more time on design verification and process tweaking to improve yield.

While every chip designer desires high yield, microprocessor designers typically design chips that almost fill the full reticle and hence may be very happy with initial yields of 20%.

The die is shrunk once as the technology scales, thereby improving yield and increasing clock rate. Companies with high volumes like Intel have a shrink team at work before the die is originally taped out, and will go through more generations of the die than lower volume manufacturers.

2.3. Differing Profits

Between 1994 and early 1996, DRAM price per megabyte did not decline by its historical 25% per year. Since technology continued to improve and thus costs continued to decline, the DRAM industry became increasingly profitable.

The economic law of supply and demand was invoked in 1996, as DRAM companies increased production and new companies entered the market. Between January 1996 and December 1996 the price of a 16 Mbit DRAM fell from about $40 per chip to $6 per chip, well below what the historical 25% per year price decline would predict. Stated alternatively, overall DRAM sales fell from $16.5B in 3Q95 to $7B in 1Q97. And although prices rose to $8 per 16 Mbit DRAM in March 1997, they returned to $6 in August 1997.[9]

At the same time Intel was posting record profits. In 1996 Intel’s net revenue was $20 billion, with a ten year growth rate of 30% per year. In recent quarters about a third of Intel’s income was profit.

In addition to the interesting potential of the IRAM technology, DRAM companies are hoping that IRAM would enable profits per wafer to be more like recent microprocessor wafers than like recent DRAM wafers.

3. Potential IRAM applications

For DRAM manufacturers to enjoy the profits of an Intel, they need to find potential IRAM applications that sell in the millions. The first three applications could meet that goal. The last two applications are predicated on the success of one or more of these first three, as they are unlikely to achieve such high volumes.

3.1. “Intelligent” Video Game

Nintendo sold 2.6 million of its latest video player for $150 in its first year. Each is based on a four-chip set: one 64-bit MIPS processor chip, one graphic accelerator chip, and two RAMBUS memory chips. Graphics and sound have always needed as much performance as possible, with 3D graphics being especially needy in memory bandwidth and floating point performance.

An IRAM combining the processor, graphics accelerator, and 4 to 16 megabytes of memory could exploit the orders of magnitude in memory bandwidth and small board area advantages of IRAM to offer an attractive chip for the next generation of video games.

3.2. “Intelligent” PDA

Palm-top PDAs are becoming increasingly popular. For example, 1 million Palm Pilots were sold in the product's first year, each for about $300. The Palm Pilot requires the user to learn a new alphabet and then enter the characters with a stylus on a touch sensitive screen. Other PDAs offer miniature keyboards.

If an IRAM could include sufficient computing power to enable speaker trained, isolated-word speech input to a PDA, the device would be much more useful. In such a machine the stylus would be used to correct the errors, usually by selecting from a pop-up list of potential words. At 90% to 95% word accuracy, achieved by systems like the Dragon Dictate, and if 80% to 90% of the time the correct word is found in the pop-up error menu, then speaking into a PDA could be as fast as typing on a full-sized keyboard.

An IRAM with sufficient performance and 4 to 16 MB of memory to hold the dictionary, when combined with the advantages of energy efficiency and small board area, could be an attractive building block for the next generation of PDAs.

3.3. “Intelligent” Disk

Tens of millions of magnetic disks are made each year, and they include integrated circuits with memory for a track cache and logic to calculate the error correction codes for each block. The track cache grows with the increasing linear density of a track, or about 1.3X per year. For example, the 9-GB Seagate Cheetah drive comes with a 0.5 Mbyte track cache and offers a 2.0 Mbyte cache as an option. The new Fibre Channel serial interfaces for disks increase bandwidth demands, requiring transfer rates to the cache of 100 Mbytes/second over two ports.

An IRAM with high-speed serial interfaces could easily supply the required memory capacity and network bandwidth. With sufficient computing power, in addition to calculating error correction codes, it could handle the network and security protocols. Such a disk could attach directly to a local area network, thereby avoiding a server. Such a network-attached secure disk may improve scalability and bandwidth over conventional systems.[10]

As disks will dissipate between 5 and 20 watts, an IRAM for an Intelligent disk must be power efficient. Disks also value small board area very highly, as the chips must fit on the back of 2.5 inch or 3.5 inch diameter disks.

An attractive chip for disk manufacturers might be a low-power IRAM with 4 to 16 MB of memory for disk caches and networking code plus serial I/O for the interface to disk and local area networks.

3.4. Scalable, Low-Cost, Data-Server Cluster

If IRAM proves successful in such high volume markets as those above, such chips may be available to construct much more cost-effective cluster-based servers than those based on conventional desktop microprocessors.

One example comes from the commercial world. One I/O benchmark is Minute Sort, which copies data from disk, sorts it, and then stores it back to disk. This application places the same demands on servers as decision support systems. The current world record is 8.6 GBytes using a cluster of 95 Sun Ultra 1 workstations connected via 160 Mbyte per sec links through a switch-based local area network.[11]

Using the serial lines to connect to disks should allow a single IRAM in two to three years to sort more than the current record. Using a few serial lines to connect a cluster of 16 to 32 IRAMs via a switch for network communication, and other serial lines to connect them to disks, could allow this cluster to sort more than 100 GB in a minute. Given that the high volume applications above need inexpensive IRAMs, the cost of 16-32 IRAMs would likely be much less than 10% of the disk infrastructure cost.[7]

Greg Papadopoulos, Chief Technical Officer of Sun Microsystems Computing Corporation, observed a trend in data mining.[12] While processors are doubling performance every 18 months, customers are doubling data storage every 5 months. Customers would like to “mine” this data overnight to shape their business practices, but data is being accumulated faster than affordable computers can process the information. Combining Intelligent Disks with an IRAM cluster might lead to scalable processing for data mining that can keep up with “Greg’s Law” at a fraction of the costs of the disks.

3.5. Low-Cost TeraFLOPS Cluster

A traditional but even lower volume market is supercomputing. Using the same serial networks to connect IRAMs via cross bar switches, hundreds of small, low power IRAMs could be placed on a few small boards. If IRAMs for video games could compute at 1 GFLOPS, then in 2 to 3 years 1000 IRAMs and the disk system needed for the sorting above could offer TeraFLOPS computing for less than $500,000. Figure 2 compares key parameters to the $55,000,000 ASCI Red machine.

Note the smaller memory and higher I/O bandwidth of the IRAM cluster. The sort benchmark was able to trade off higher I/O bandwidth for smaller memory. Whether this would be true for supercomputing remains to be seen.

Even adjusting cost/performance of ASCI Red by a factor of 4 to 6 improvement for technological advances between 1996 and 2000, an IRAM cluster might be attractive for supercomputing.

              ASCI Red [13]        IRAM cluster
Processors    9000 Pentium Pros    1000 IRAMs
Memory        600 GB               16-24 GB
Disk          2000 GB              2100 GB
Peak Perf.    1.8 TeraFLOPS        1.0 TeraFLOPS
I/O speed     450 GB/s             2000 GB/s
Floor space   1600 sq. ft.         <10 sq. ft.
Cost          $55,000,000          <$500,000
Year          1996                 2000

Figure 2. Supercomputing clusters.

4. IRAM Architectures and Implementations

Putting a conventional cache-based, superscalar microprocessor in an IRAM does not lead to exciting performance.[7][14] Hence IRAM needs a new architecture.

If an architecture requires programmers to rewrite their programs, then it needs advantages of factors of at least 10 and as much as 50.[15] The reason for this high threshold is that software development is slow, and with conventional microprocessor performance doubling every 18 months, there must still be a large advantage after the programming is completed. Otherwise programmers will just wait, as in the long run novel machines are often unsuccessful commercially.

Given the silicon budgets of the next five or so years, it is unlikely that any alternative will have that large an advantage over conventional microprocessors for a large set of programs. Keep in mind the DRAM vendors want designs that can be fabricated in the millions, so it is likely that IRAMs will be targeted at many applications.

Hence, in selecting a new architecture, the key is finding a design that exploits the memory bandwidth potential of IRAM while leveraging software developed for traditional computing. Thus an architecture that offers mature compiler technology is at an advantage. A secondary consideration is energy efficiency. Given the applications in section 3, architectures that reduce power while preserving performance are very attractive for IRAM. Another consideration is small code size to reduce the amount of memory occupied by programs in IRAM.

We see four architectural alternatives: SIMD, VLIW, MIMD on a chip, or vector. While SIMD is a good match to the IRAM technology when the logic is distributed with memory modules, it has never been a general purpose solution. It also has received little compiler development for traditional programming languages. So we rejected it.

VLIW is very popular today in the architecture research community, but it has three negatives. The first is that the compiler technology has not been successful commercially, although it is an area of active compiler research. The second is that VLIW architectures traditionally have the largest code size of the alternatives. The third is object-code compatibility across multiple generations.

MIMD on a chip is a plausible direction for IRAM, and many have taken or are taking this track.[16-18] The MIMD commercial successes have been servers, where the performance is number of tasks per hour rather than time for a single task. While servers are found in section 3, they probably will not have the volumes to justify IRAM. Hence one question is whether a specific MIMD organization lends itself to compiler technology to automatically parallelize an application to run well on all processors within a single chip. A second question is the energy efficiency of fetching four independent instruction streams.

We selected a vector architecture for four reasons. The first is that the compiler technology is the most mature of the options, increasing the chances that programs would run on an IRAM with little or no change.

The second reason is that the specification of many parallel operations in a single instruction helps in the power-performance trade-off. Since the power is reduced by the square of a voltage reduction, two techniques allow us to lower power while maintaining performance: deeper pipelines and multiple pipes or lanes. Deeper pipelines make more sense in a vector architecture because the vector operation specifies 64 or 128 operations without a branch. Multiple pipes or lanes means that by including, say, 2 ALUs and cutting clock rate in half we can maintain performance while reducing voltage to lower power.

The third reason is that the multimedia support suggested by video games, PDAs, or data mining is an ideal application for vector architectures. Compared to multimedia extensions such as MMX, vectors are a more elegant way of specifying multiple subword operations. We can simply divide vector registers into smaller elements.

The fourth reason is that the use of multiple pipes or lanes gives the IRAM the ability to have redundant logic that can be discarded to improve yield. With four ALUs, for example, it may cost little in overall area but significantly reduce costs to include a fifth ALU as a spare.

5. Conclusion

Figure 3 shows the 1999 merged logic-DRAM technology, available from several companies, and parameter estimates of two potential vector IRAMs: low power and high performance. We believe the low power option is a better match to high volume applications such as video games, PDAs, or disks.

Target               Low Power                      High Performance
Technology           0.18-0.20 micron, 5-6 metal layers, fast xtor (both)
Die size             ≈200 mm^2 (both)
Memory               16-24 MB (both)
Vector lanes         4 64-bit (or 8 32-bit or 16 16-bit or 32 8-bit) (both)
Serial I/O           4 lines @ 1 Gbit/s             8 lines @ 2 Gbit/s
Power                ≈2 w @ 1-1.5 v logic           ≈10 w @ 1.5-2 v
Clock (university)   200 scalar / 100 vector MHz    250 scalar / 250 vector MHz
Clock (industry)     400 scalar / 200 vector MHz    500 scalar / 500 vector MHz
Perf (university)    0.8 GFLOPS (64-bit) to         2 GFLOPS (64-bit) to
                     6 G (8-bit)                    16 G (8-bit)
Perf (industry)      1.6 GFLOPS (64-bit) to         4 GFLOPS (64-bit) to
                     12 G (8-bit)                   32 G (8-bit)

Figure 3. Low power and high performance Vector IRAM goals to be taped out in 1999. The two clock rates are for the scalar unit and the vector unit, and the range of the performance is between 64-bit floating point and 8-bit integer.

We believe our small, academic design team can build an IRAM with half the performance of a larger and more experienced industrial team. Yet even this design would demonstrate the potential of IRAM to offer an interesting combination of performance, power, memory capacity, board space, and cost.

Several characteristics make IRAM an exciting research topic: large advantages on many dimensions, the design challenges that make success not obvious, the need to rethink the computer design for IRAM, its availability in a fairly standard manufacturing process, and its potential impact on two large industries. Only time can tell us the impact of this intriguing opportunity.

6. Acknowledgments

This research was supported by DARPA (DABT63-C-0056), the California State MICRO program, and by research grants from Intel, Samsung, Silicon Graphics/Cray Research, and Sun Microsystems.

7. References

[1] Patterson, D.; Anderson, T.; Cardwell, N.; Fromm, R.; Keeton, K.; Kozyrakis, C.; Thomas, R.; Yelick, K. “Intelligent RAM (IRAM): chips that remember and compute,” 1997 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, San Francisco, CA, USA, 6-8 Feb. 1997. p.224-5.

[2] Patterson, D.; Anderson, T.; Cardwell, N.; Fromm, R.; Keeton, K.; Kozyrakis, C.; Thomas, R.; and Yelick, K. “A case for intelligent RAM,” IEEE Micro, vol.17, (no.2), March-April 1997. p.34-44.

[3] Fromm, R.; Perissakis, S.; Cardwell, N.; Kozyrakis, C.; McGaughy, B.; Patterson, D.; Anderson, T.; Yelick, K. “The energy efficiency of IRAM architectures,” 24th Annual International Symposium on Computer Architecture (ISCA '97), Denver, CO, USA, 2-4 June 1997. p.327-37.

[4] Shimizu, T.; et al. “A multimedia 32 b RISC microprocessor with 16 Mb DRAM.” ISSCC Digest of Technical Papers, San Francisco, CA, USA, 8-10 Feb. 1996. p.216-17, 448.

[5] Perissakis, S.; Kozyrakis, C.; Anderson, T.; Asanovic, K.; Cardwell, N.; Fromm, R.; Golbus, J.; Gribstad, B.; Keeton, K.; Patterson, D.; Thomas, R.; Treuhaft, N.; and Yelick, K. “Scaling Processors to 1 Billion Transistors and Beyond: IRAM,” To appear in IEEE Computer, September 1997.

[6] Yang, C.K.K.; Horowitz, M.A. “A 0.8-µm CMOS 2.5 Gb/s oversampling receiver and transmitter for serial links.” IEEE Journal of Solid-State Circuits, 31:12, Dec. 1996. p.2015-23.

[7] Keeton, K.; Arpaci-Dusseau, R.; and Patterson, D. “IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, CO, USA, 1 June 1997. (http://iram.cs.berkeley.edu/isca97-workshop/w2-120-draft.ps)

[8] Saulsbury, A.; Nowatzyk, A. “Missing the memory wall: the case for processor/memory integration.” ISCA'96: The 23rd Annual International Conference on Computer Architecture, Philadelphia, PA, USA, 22-24 May 1996. p.90-101.

[9] Achilles Corporation; “DRAM Market Price Information in Japan,” 1 August 1997. (http://pweb.aix.or.jp/~maski-na/index1-1EG.html)

[10] Gibson, G.A.; Nagle, D.F.; Amiri, K.; Chang, F.W.; Feinberg, E.M.; Gobioff, H.; Lee, C.; Ozceri, B.; Riedel, E.; Rochberg, D.; Zelenka, J. “File server scaling with network-attached secure disks.” 1997 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 97), Seattle, WA, USA, 15-18 June 1997. p.272-84.

[11] Hellerstein, J.M.; Patterson, D.A. “High-performance sorting on networks of workstations.” SIGMOD 1997: ACM SIGMOD International Conference on Management of Data, Tucson, AZ, USA, 13-15 May 1997. p.243-54.

[12] Papadopoulos, G. “The Future of Computing.” Unpublished talk at NOW Workshop, Lake Tahoe, CA, USA, 27 July 1997.

[13] Rowell, J. “Intel Ships 20 Gflops Teraflops Installment to Sandia,” May 9, 1996. (http://www.ssd.intel.com/tflop1.html)

[14] Bowman, N.; Cardwell, N.; Kozyrakis, C.; Romer, C.; and Wang, H. “Evaluation of Existing Architectures in IRAM Systems,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, CO, USA, 1 June 1997. (http://iram.cs.berkeley.edu/isca97-workshop/w2-114.ps)

[15] Weems, C. “Considerations Leading to an Asynchronous SIMD Architectural Approach for Exploiting Mixed Logic and Memory,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, CO, USA, 1 June 1997. (http://iram.cs.berkeley.edu/isca97-workshop/w2-108.ps)

[16] Kogge, P.M.; Sunaga, T.; Miyataka, H.; Kitamura, K.; and others. “Combined DRAM and logic chip for massively parallel systems.” Proceedings. 16th Conference on Advanced Research in VLSI, Chapel Hill, NC, USA, 27-29 March 1995. p.4-16.

[17] Murakami, K.; Shirakawa, S.; Miyajima, H. “Parallel processing RAM chip with 256 Mb DRAM and quad processors.” 1997 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, San Francisco, CA, USA, 6-8 Feb. 1997. p.228-9, 528.

[18] Yamauchi, T.; Hammond, L.; and Olukotun, K. “Evaluation of Existing Architectures in IRAM Systems,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, Denver, CO, USA, 1 June 1997. (http://iram.cs.berkeley.edu/isca97-workshop/w2-106.ps)
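
A note on the growth rates quoted in Sections 2.1, 3.4, and 4: the paper moves freely between "N-fold every K years" and "percent per year" figures. The Python sketch below shows one way to check that arithmetic. The helper name annual_rate and the variable names are illustrative and do not come from the paper; the input figures (fourfold every three years, cell shrink of 2.5 with die growth of 1.5, doubling every 18 months, doubling every 5 months) are the ones stated in the text.

```python
# Illustrative arithmetic only; annual_rate is a hypothetical helper,
# not something defined in the paper.

def annual_rate(factor: float, years: float) -> float:
    """Compound annual growth rate implied by growing `factor`-fold in `years`."""
    return factor ** (1.0 / years) - 1.0

# DRAM capacity per chip: fourfold every three years (Section 2.1).
dram_capacity_rate = annual_rate(4.0, 3.0)

# One DRAM generation: cell size shrinks by about 2.5x, die grows by about 1.5x.
bits_per_generation = 2.5 * 1.5

# Microprocessor performance: doubling every 18 months (Sections 3.4 and 4).
processor_rate = annual_rate(2.0, 1.5)

# "Greg's Law": stored data doubling every 5 months (Section 3.4).
data_rate = annual_rate(2.0, 5.0 / 12.0)

if __name__ == "__main__":
    print(f"DRAM capacity growth:  {dram_capacity_rate:.0%} per year")
    print(f"Bits per generation:   {bits_per_generation:.2f}x")
    print(f"Processor performance: {processor_rate:.0%} per year")
    print(f"Stored data growth:    {data_rate:.0%} per year")
```

Under these inputs the capacity and processor rates both come out just under 60% per year, which matches the rounded 60%/year entries in Figure 1, and the cell-shrink and die-growth factors multiply to 3.75, close to the fourfold step per generation. The data growth rate is several times larger, which is the gap the data-mining argument in Section 3.4 rests on.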
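
A note on the second reason for choosing vectors in Section 4: the claim that two ALUs at half the clock rate can keep performance while lowering power can be sketched with a simple relative model. The sketch assumes dynamic power proportional to switched capacitance times voltage squared times clock frequency, with capacitance proportional to the number of lanes. That model, the function names, and the 0.7 voltage scale are all assumptions made for illustration; the paper states only that power falls with the square of the voltage reduction.

```python
# Relative (unitless) model. Assumption: dynamic power ~ lanes * V^2 * f.
# The 0.7 voltage scale below is a hypothetical value, not a figure from the paper.

def relative_throughput(lanes: int, clock_scale: float) -> float:
    """Operations per unit time relative to one lane at full clock."""
    return lanes * clock_scale

def relative_power(lanes: int, clock_scale: float, voltage_scale: float) -> float:
    """Dynamic power relative to one lane at full clock and full voltage."""
    return lanes * clock_scale * voltage_scale ** 2

# Baseline: one ALU at full clock and full voltage.
base_throughput = relative_throughput(1, 1.0)
base_power = relative_power(1, 1.0, 1.0)

# Two ALUs at half the clock; the slower clock is assumed to permit 0.7x voltage.
wide_throughput = relative_throughput(2, 0.5)
wide_power = relative_power(2, 0.5, 0.7)

if __name__ == "__main__":
    print(f"throughput: {base_throughput} -> {wide_throughput}")
    print(f"power:      {base_power} -> {wide_power:.2f}")
```

In this model the lane count and clock scale cancel, so throughput is unchanged and the entire saving comes from the voltage term. How far the voltage can actually drop at half the clock rate depends on the process and is not something this sketch predicts.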
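
A note on the Figure 3 performance rows: the quoted ranges appear consistent with multiplying the number of lanes by the vector clock rate and by two operations per lane per cycle. The factor of two (for example, a multiply and an add per cycle) is an assumption chosen because it reproduces the table; the paper does not state it. The sketch also shows the subword division described in Section 4, where four 64-bit lanes are treated as 8, 16, or 32 narrower lanes. Function and constant names are illustrative.

```python
# Assumption: each lane completes 2 operations per cycle. This value is
# inferred from the table entries, not stated in the paper.

DATAPATH_BITS = 4 * 64  # four 64-bit vector lanes, per Figure 3

def lanes_for(element_bits: int) -> int:
    """Number of lanes when the datapath is divided into smaller elements."""
    return DATAPATH_BITS // element_bits

def peak_gops(element_bits: int, vector_clock_mhz: float,
              ops_per_lane_per_cycle: int = 2) -> float:
    """Peak billions of operations per second for a given element width."""
    ops_per_second_millions = (
        lanes_for(element_bits) * vector_clock_mhz * ops_per_lane_per_cycle
    )
    return ops_per_second_millions / 1000.0

if __name__ == "__main__":
    # Vector clock rates taken from the Clock rows of Figure 3.
    for label, mhz in [("low power, university", 100),
                       ("low power, industry", 200),
                       ("high performance, university", 250),
                       ("high performance, industry", 500)]:
        print(f"{label}: {peak_gops(64, mhz):.1f} G (64-bit) "
              f"to {peak_gops(8, mhz):.1f} G (8-bit)")
```

With a 100 MHz vector clock this gives 0.8 G at 64 bits and 6.4 G at 8 bits, against the 0.8 and 6 listed in Figure 3, so the table's 8-bit figures for the low power design look rounded down.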