Embedding Read-Only Memory in Spin-Transfer Torque MRAM-Based On-Chip Caches
Abstract: We propose a design technique for embedding read-only memory (ROM) in spin-transfer torque MRAM (STT-MRAM) arrays by adding an extra bit-line in every column of the array. RAM and ROM data, which can be different, are stored in the same bit-cell and the ROM capacity may be as large as the RAM capacity. Furthermore, our proposed ROM-embedding technique is applicable to any resistive memory technology in which the bit-cell topology is identical to that of the STT-MRAM bit-cell. An additional sense amplifier is required in the peripheral circuitry, hence we propose an area-optimized peripheral circuitry to minimize the total area penalty of embedding ROM. Our analysis reveals that the ROM may be embedded in the STT-MRAM array without area overhead and without any penalty in the performance of the memory as RAM. Furthermore, our simulations show that the embedded ROM may be used to accelerate applications that use lookup tables with as much as 30% improvement in instructions per cycle of a processor using ROM-embedded STT-MRAM for its L2 cache.

Index Terms: Accelerating function evaluation, cache memories, emerging technologies, magnetic RAM, nonvolatile RAM, read-only memory (ROM), ROM-embedded spin-transfer torque (STT)-MRAM (R-MRAM), simulation, STT-MRAM.

as a result and the processor has to fetch the evicted data from off-chip memory again in order to continue program execution. Hence, realizing an on-chip ROM with minimal overhead allows static data to be stored closer to the processor, and may be used to accelerate the execution of applications.

A method for embedding ROMs in SRAM-based on-chip cache (called R-SRAM) was presented in [1]. R-SRAM may be viewed as a special type of resettable RAM. When ROM data are needed, the RAM data stored at the corresponding memory location is overwritten with ROM data. This is similar to a reset operation except that the state which the bit-cell resets to is determined by the physical connection of the bit-cell [1]. Thus, the RAM data stored at the corresponding ROM location needs to be copied to a buffer first. The reset operation is then performed in one clock cycle, and the ROM data are read out in the following cycle. Finally, RAM data in the buffer are copied back into the RAM memory location. The RAM capacity of R-SRAM is not impacted by the embedded ROM, since every memory cell stores both a RAM bit and a ROM bit; therefore, the ROM capacity can be as large as the RAM capacity. The high latency of the ROM read
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
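The R-SRAM ROM read sequence described in the introduction (save the RAM word to a buffer, reset the cells to their hard-wired ROM state, read, then restore) can be sketched behaviorally as below. This is only an illustrative model: the class name `RSramArray`, its methods, and the word-level granularity are assumptions made here, not details taken from [1].

```python
class RSramArray:
    """Behavioral sketch of a ROM-embedded SRAM array (hypothetical model)."""

    def __init__(self, rom_words):
        # ROM contents are fixed by the physical connection of each bit-cell.
        self.rom = tuple(rom_words)
        # Every cell also holds a RAM bit, so RAM capacity equals ROM capacity.
        self.ram = [0] * len(self.rom)

    def write_ram(self, addr, value):
        self.ram[addr] = value

    def read_ram(self, addr):
        return self.ram[addr]

    def read_rom(self, addr):
        buffer = self.ram[addr]          # 1) copy the RAM data to a buffer
        self.ram[addr] = self.rom[addr]  # 2) reset: cells take their ROM state
        value = self.ram[addr]           # 3) read the ROM data out of the cells
        self.ram[addr] = buffer          # 4) copy the RAM data back
        return value


array = RSramArray([7, 8, 9])
array.write_ram(1, 42)
print(array.read_rom(1), array.read_ram(1))
```

The four steps per ROM access are what make the R-SRAM ROM read slow compared with an ordinary RAM read, while the RAM contents survive the access.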
functionality is available at lower overheads than

However, design issues in standard STT-MRAM need to be overcome in order for STT-MRAM to become viable for high-performance ultralow power on-chip cache applications [2]. The two-terminal nature of standard STT-MRAM means that it suffers from source degeneration of the access transistor (ATx) during write operations and hence has high write energy. More importantly, sensing of data from Standard STT-MRAM bit-Cells (SSCs) must be done using single-ended sensing schemes. Such sensing schemes are slow in order to enhance tolerance against process variations. A complementary polarizer STT (CPSTT) MRAM was recently proposed for overcoming these design issues [6].

In this paper, we perform the following:
1) propose a methodology for embedding ROM in CPSTT-based cache [ROM-embedded CPSTT (R-CPSTT)];
2) propose peripheral circuitry for R-MRAM and for R-CPSTT that minimizes area overhead;
3) evaluate the efficacy of R-MRAM and R-CPSTT at the system level.

We perform an in-depth evaluation of the proposed design using a systematic device-circuit-architecture evaluation framework. Our results show that an iso-area replacement of SRAM cache with the proposed R-MRAM cache leads to significant benefits in performance and energy. As we will show later, our proposed technique for embedding ROM in the RAM array may be used in other resistive nonvolatile memory technology in which the bit-cell topology is the same as that in R-MRAM and R-CPSTT.

The rest of this paper is organized as follows. Section II discusses the preliminaries of STT-MRAM and CPSTT. Our proposal for R-CPSTT is then presented in Section III, and compared with R-MRAM. Area-optimized peripheral circuits for R-MRAM and for R-CPSTT are also presented. In Section IV, we present simulation results to evaluate the effectiveness of R-MRAM and R-CPSTT in accelerating the evaluation of complex math functions. The effectiveness of R-MRAM, R-CPSTT, and R-SRAM in accelerating several applications is also explored. Finally, the conclusion is drawn in Section V.

Fig. 1. (a) Structure of an SSC and the MTJ with perpendicular magnetic anisotropy. MTJ configurations and the current directions and the corresponding direction of magnetization reversal are also shown. (b) The sensing circuitry for read operations in SSC.

The SSC, as shown in Fig. 1(a), consists of an ATx and a magnetic tunnel junction (MTJ), which is the storage element. The MTJ consists of a magnetically pinned ferromagnetic layer (PL), a tunneling oxide barrier (MgO), and a free ferromagnetic layer (FL). Data are stored as the FL magnetization relative to the PL magnetization. When a charge current flows between the FL and the PL, the flowing electrons transfer their spin angular momentum to the FL and the PL. Since the PL is magnetically pinned, its magnetization does not change. However, the spin angular momentum transferred to the FL exerts a torque on the FL magnetization. When the electrons flow from the FL to the PL, the torque exerted anti-parallelizes the FL magnetization with that of the PL. On the other hand, when the electrons flow from the PL to the FL, the torque exerted parallelizes the FL magnetization with that of the PL. However, the amount of time needed for the torque to completely switch the FL magnetization depends on the amount of current flowing through the MTJ. The FL magnetization switches only when the amount of current flow is larger than the critical switching current ($I_C$) for switching time $t_{SW}$.

The relative magnetization of the FL and the PL may be sensed as the resistance of the MTJ ($R_{MTJ}$) using the circuitry shown in Fig. 1(b). $R_{MTJ}$ is low or $R_L$ (high or $R_H$) when FL and PL magnetizations are parallel (anti-parallel). The distinguishability between MTJ resistance states is called the tunneling magnetoresistance ratio, $\mathrm{TMR} = (R_H - R_L)/R_L$. $R_{MTJ}$ may be sensed by applying a voltage across the bit-cell ($V_{READ}$) and comparing the current flowing through it ($I_{READ}$) to a reference current ($I_{REF}$), also known as the current sensing scheme. Alternatively, a fixed $I_{READ}$ may be passed through the bit-cell and the $V_{READ}$ developed across it is compared with a reference voltage ($V_{REF}$), also known as the voltage sensing scheme.

Since current flows through the SSC during read and write operations, there are conflicting design requirements on the amount of current flow for read and for write. Furthermore, the two-terminal nature of SSC makes it difficult to optimize for read and write operations. Consider the need for bidirectional current flow to write data into SSCs, as shown in Fig. 1(a). The gate overdrive voltage of the ATx is $V_{GS} = V_{DD}$ when the write current flows from BL to source-line (SL). However, a potential drop across the MTJ when write current flows from SL to BL reduces the gate overdrive voltage of the ATx to $V_{GS} = V_{DD} - V_{MTJ}$. Hence, the ATx needs to be upsized to ensure that the write current is larger than $I_C$ in both directions of write current flow. Doing so may lead to excessive current flow when write current flows from BL to SL and hence excessive write power dissipation. The reliability of the tunneling barrier is also reduced. Furthermore, a larger ATx allows more current to flow through the MTJ during read operations and may cause accidental switching of the MTJ. More importantly, sensing of data from two-terminal STT-MRAM bit-cells is done using single-ended sensing schemes, which are not robust against process variations.

The CPSTT-MRAM bit-cell was proposed to mitigate the aforementioned design issues in SSC [6]. The CPSTT
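As a rough illustration of the TMR definition and of the current and voltage sensing schemes described for the SSC, the sketch below classifies an MTJ as parallel ("P") or anti-parallel ("AP"). All numerical values, the function names, and the midpoint choice of the references are assumptions for illustration; the access-transistor resistance is ignored.

```python
def tmr(r_high, r_low):
    """Tunneling magnetoresistance ratio, TMR = (R_H - R_L) / R_L."""
    return (r_high - r_low) / r_low


def current_sense(r_mtj, v_read, i_ref):
    """Current sensing: apply V_READ and compare I_READ with I_REF."""
    i_read = v_read / r_mtj  # ATx resistance ignored (assumption)
    return "P" if i_read > i_ref else "AP"


def voltage_sense(r_mtj, i_read, v_ref):
    """Voltage sensing: force I_READ and compare the developed V_READ with V_REF."""
    v_read = i_read * r_mtj
    return "AP" if v_read > v_ref else "P"


# Hypothetical device values, not taken from the paper.
R_L, R_H = 3000.0, 6000.0   # ohms
V_READ = 0.1                # volts, used for current sensing
I_READ = 20e-6              # amperes, used for voltage sensing
# References placed midway between the two states (assumption).
I_REF = 0.5 * (V_READ / R_L + V_READ / R_H)
V_REF = 0.5 * I_READ * (R_L + R_H)

print("TMR =", tmr(R_H, R_L))
for r in (R_L, R_H):
    print(r, current_sense(r, V_READ, I_REF), voltage_sense(r, I_READ, V_REF))
```

In both schemes a single cell is compared against a fixed reference, which is why a small TMR or a shifted reference under process variations directly erodes the sensing margin.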
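The write asymmetry caused by source degeneration in the SSC can be illustrated with a deliberately simple toy model. It assumes the ATx drive current is linear in overdrive, $I = g W (V_{GS} - V_{TH})$, with a hypothetical per-width transconductance `g_per_um` and threshold `v_th`, and a single MTJ resistance for both states; none of these values come from the paper. With $V_{GS} = V_{DD} - I R_{MTJ}$ in the SL-to-BL direction, the ATx width needed to reach $I_C$ is larger than in the BL-to-SL direction, and the same width then overdrives the BL-to-SL write.

```python
def min_width_um(i_c, r_mtj, v_dd, v_th, g_per_um):
    """Smallest ATx width (um) that delivers I_C in the degenerated SL-to-BL direction."""
    headroom = v_dd - v_th - i_c * r_mtj  # overdrive left after the MTJ drop
    if headroom <= 0.0:
        raise ValueError("I_C cannot be reached in the SL-to-BL direction")
    return i_c / (g_per_um * headroom)


def write_current_bl_to_sl(width_um, v_dd, v_th, g_per_um):
    """BL-to-SL write: V_GS = V_DD, no source degeneration."""
    return g_per_um * width_um * (v_dd - v_th)


def write_current_sl_to_bl(width_um, r_mtj, v_dd, v_th, g_per_um):
    """SL-to-BL write: V_GS = V_DD - I * R_MTJ, solved for I."""
    g = g_per_um * width_um
    return g * (v_dd - v_th) / (1.0 + g * r_mtj)


# Hypothetical values chosen only to make the arithmetic easy to follow.
V_DD, V_TH = 1.0, 0.3      # volts
G_PER_UM = 500e-6          # A/V per um of ATx width
R_MTJ = 3000.0             # ohms
I_C = 100e-6               # amperes

w = min_width_um(I_C, R_MTJ, V_DD, V_TH, G_PER_UM)
print("ATx width (um):", w)
print("SL-to-BL current (A):", write_current_sl_to_bl(w, R_MTJ, V_DD, V_TH, G_PER_UM))
print("BL-to-SL current (A):", write_current_bl_to_sl(w, V_DD, V_TH, G_PER_UM))
```

Under these assumed numbers the width that just meets $I_C$ in the degenerated direction delivers noticeably more than $I_C$ in the other direction, which is the excess write current and power the text refers to.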
Fig. 11. Current flow in R-CPSTT during ROM mode operations when the selected bit-cell is connected to BL0.

AT $V_{DD}$ = 1 V, 2 ns READ CYCLE

Fig. 13. Bit-cell area versus ATx width of SSC, CPSTT, R-MRAM, and R-CPSTT. Vertical lines: when the layout transitions to one using fingered ATxs. The bit-cell area does not change with ATx width if the layout is limited by contact or metal pitch.

TABLE III. ISO-$V_{READ}$ COMPARISON OF DISTURB MARGINS AT $V_{DD}$ = 1 V, 2 ns READ CYCLE

Fig. 16. Comparisons of evaluation latencies of (a) log(x) and (b) sin(x) using conventional SRAM cache (Conv.), R-MRAM, and R-CPSTT using 2-kB lookup tables. R-MRAM read latency is assumed to be twice that of SRAM and R-CPSTT.

Fig. 17. Comparisons of evaluation latencies of (a) log(x) and (b) sin(x) using conventional SRAM cache (Conv.), R-MRAM, and R-CPSTT using 128-kB lookup tables. R-MRAM read latency is assumed to be twice that of SRAM and R-CPSTT.

Fig. 18. Comparison of the total evaluation cycles for (top) log(x) and (bottom) sin(x) using different table sizes (and hence approximating polynomial) to achieve 65-bit accuracy.
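The trade-off between lookup table size and the degree of the approximating polynomial can be sketched for log(x) as follows. This is a generic table-plus-polynomial scheme written for illustration, not the paper's implementation: the table (which would live in the embedded ROM) stores log at interval midpoints, and a truncated series for log(1 + r) corrects for the offset within the interval. The function names and the use of double precision (rather than the 65-bit target in Fig. 18) are assumptions.

```python
import math


def build_log_table(n_entries):
    """Table of log(c_i) at the midpoints c_i of n_entries intervals of [1, 2)."""
    return [math.log(1.0 + (i + 0.5) / n_entries) for i in range(n_entries)]


def log_lut(x, table, degree):
    """Approximate log(x) with one table lookup plus a degree-`degree` polynomial."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    y = 2.0 * m                   # reduced argument, 1 <= y < 2
    n = len(table)
    i = min(int((y - 1.0) * n), n - 1)
    c = 1.0 + (i + 0.5) / n       # midpoint of the selected interval
    r = (y - c) / c               # |r| <= 1 / (2 * n)
    # Horner evaluation of log(1 + r) ~ r - r^2/2 + r^3/3 - ... up to r^degree.
    poly = 0.0
    for k in range(degree, 0, -1):
        sign = 1.0 if k % 2 == 1 else -1.0
        poly = r * (sign / k + poly)
    return (e - 1) * math.log(2.0) + table[i] + poly


samples = [0.1 + 0.37 * k for k in range(200)]
for n_entries, degree in ((16, 3), (16, 7), (256, 3)):
    table = build_log_table(n_entries)
    err = max(abs(log_lut(x, table, degree) - math.log(x)) for x in samples)
    print(n_entries, degree, err)
```

A larger table shrinks the residual r, so a lower-degree polynomial reaches the same accuracy; this shifts work from the processor to table reads, which is where a fast embedded ROM helps.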
performance sensitivity to lookup table accesses in R-CPSTT is reduced. Furthermore, the degree of the approximating polynomial is small, which reduces the processor workload and further improves performance. Note also that the execution time of the R-CPSTT design is lower than that of the Conv. design for lookup table sizes of 2 kB as well as 128 kB, demonstrating the efficacy of the proposed design.

V. CONCLUSION

We proposed R-MRAM and R-CPSTT caches and evaluated their efficiency in accelerating different compute kernels. Note that R-MRAM and R-CPSTT may be used to accelerate any application that uses lookup tables (complex math functions, decoding tables, test vectors for built-in self-test, and so on). Furthermore, the memory array can store the same amount of ROM and RAM data, and RAM data may be accessed independent of the ROM data stored. Also, the performance of the arrays in RAM mode is unaffected by the embedding of ROM. Our simulation results show that R-MRAM and R-CPSTT may help to lower the latency of kernel computation by more than 30%. The number of off-chip memory accesses is also dramatically reduced, which lowers the total system energy consumption.

REFERENCES

[1] D. Lee and K. Roy, “Area efficient ROM-embedded SRAM cache,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 9, pp. 1583–1595, Sep. 2013.
[2] Y. Huai, “Spin-transfer torque MRAM (STT-MRAM): Challenges and prospects,” AAPPS Bull., vol. 18, no. 6, pp. 33–40, 2008.
[3] H. Yoda et al., “Progress of STT-MRAM technology and the effect on normally-off computing systems,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2012, pp. 11.3.1–11.3.4.
[4] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT MRAM cache,” IEEE Electron Device Lett., vol. 34, no. 10, pp. 1256–1258, Oct. 2013.
[5] S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, “Future cache design using STT MRAMs for improved energy efficiency: Devices, circuits and architecture,” in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2012, pp. 492–497.
[6] X. Fong and K. Roy, “Complementary polarizers STT-MRAM (CPSTT) for on-chip caches,” IEEE Electron Device Lett., vol. 34, no. 2, pp. 232–234, Feb. 2013.
[7] X. Fong, S. K. Gupta, N. N. Mojumder, S. H. Choday, C. Augustine, and K. Roy, “KNACK: A hybrid spin-charge mixed-mode simulator for evaluating different genres of spin-transfer torque MRAM bit-cells,” in Proc. Int. Conf. Simulation Semiconductor Process. Devices, Sep. 2011, pp. 51–54.
[8] C. J. Lin et al., “45 nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1 T/1 MTJ cell,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2009, pp. 1–4.
[9] T. Kishi et al., “Lower-current and fast switching of a perpendicular TMR for high speed and high density spin-transfer-torque MRAM,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2008, pp. 1–4.
[10] Object-Oriented MicroMagnetic Framework (OOMMF). [Online]. Available: http://math.nist.gov/oommf, accessed Aug. 25, 2014.
[11] HSPICE. [Online]. Available: http://www.synopsys.com/Tools/Verification/AMSVerification/CircuitSimulation/HSPICE/, accessed Aug. 25, 2014.
[12] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2007, pp. 3–14.
[13] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, Feb. 2002.
[14] J. Harrison, T. Kubaska, S. Story, and P. T. Tang, “The computation of transcendental functions on the IA-64 architecture,” Intel Technol. J., vol. 4, pp. 234–251, Nov. 1999.
[15] L. Chen, “Pattern classification by assembling small neural networks,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 3. Jul./Aug. 2005, pp. 1947–1952.
[16] H. Qiao, J. Peng, Z.-B. Xu, and B. Zhang, “A reference model approach to stability analysis of neural networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 33, no. 6, pp. 925–936, Jan. 2003.
[17] S. Razavi and B. A. Tolson, “A new formulation for feedforward neural networks,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1588–1598, Oct. 2011.
[18] C. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA, USA: Addison-Wesley, 1980.

Xuanyao Fong (S’06) received the B.Sc. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, USA, in 2006 and 2014, respectively.
He was an Intern Engineer with the Boston Design Center, Advanced Micro Devices, Inc., Boxborough, MA, USA, in 2007. He was a Research Assistant to Prof. K. Roy with the Nanoelectronics Research Laboratory, Purdue University, where he is currently a Post-Doctoral Research Assistant to Prof. K. Roy. His current research interests include device/circuit/architecture co-design for silicon and nonsilicon nanoelectronics, design of VLSI logic and memory systems using spintronic devices, circuits, and architectures, and non-Boolean and analog computing paradigms using emerging technologies.
Mr. Fong was a recipient of the AMD Design Excellence Award at Purdue University in 2008, and the best paper award at the International Symposium on Low Power Electronics and Design in 2006.

Rangharajan Venkatesan (S’09) received the B.Tech. degree in electronics and communication engineering from IIT Roorkee, Roorkee, India, in 2009, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2014.
He was a Research Intern with Intel Corporation, Hillsboro, OR, USA, in 2012 and 2013, during his Ph.D., where he is currently involved in developing low power design methodologies for graphics processors and designing circuits for enabling fine-grained power gating. His current research interests include circuit-architecture co-design for emerging technologies, neuromorphic hardware architectures, approximate computing, and variation-aware design methodologies.
Mr. Venkatesan received the Ross Fellowship from 2009 to 2010, and the Bilsland Dissertation Fellowship from the Graduate School, Purdue University, from 2013 to 2014. He was a recipient of the best paper award at the International Symposium on Low Power Electronics and Design in 2012, and the Best Paper Nomination in Design Automation Test in Europe in 2015.

Dongsoo Lee (M’13) received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2002, 2004, and 2013, respectively.
He was with Samsung Electronics Ltd., Suwon, Korea, from 2004 to 2008, where he was involved in research on designing circuits for DTV one-chip solutions. In 2011, he was a Graduate Intern with Qualcomm Incorporated, San Diego, CA, USA, and Intel Corporation, Hillsboro, OR, USA, in 2012. He has been with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member since 2013. His current research interests include low-power design, on-chip memory design, and design for test.
Anand Raghunathan (F’12) received the B.Tech. degree in electrical and electronics engineering from IIT Madras, Chennai, India, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, USA.
He was a Senior Research Staff Member with NEC Laboratories America Inc., Princeton. He held the Gopalakrishnan Visiting Chair with the Department of Computer Science and Engineering, IIT Madras. He is currently a Professor and the Chair of VLSI with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, where he leads the Integrated Systems Laboratory. He has co-authored a book entitled High-Level Power Analysis and Optimization, eight book chapters, and over 200 refereed journal and conference papers. He holds 21 U.S. patents. His current research interests include domain-specific architecture, system on chip design, embedded systems, and heterogeneous parallel computing.
Prof. Raghunathan is a Golden Core Member of the IEEE Computer Society. He was a recipient of the IEEE Meritorious Service Award in 2001, and the Outstanding Service Award in 2004. He received the Patent of the Year Award (recognizing the invention with the highest impact), and two Technology Commercialization Awards from NEC Laboratories America Inc. He was chosen by MIT’s Technology Review among the TR35 (top 35 innovators under 35 years, across various disciplines of science and technology) for his work on making mobile secure in 2006. His publications have been recognized with eight best paper awards and four best paper nominations. He has served on the technical program and organizing committees of several leading conferences and workshops. He has chaired the ACM/IEEE International Symposium on Low Power Electronics and Design, the ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, the IEEE VLSI Test Symposium, and the IEEE International Conference on VLSI Design. He has served as an Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, the IEEE TRANSACTIONS ON VLSI SYSTEMS, ACM Transactions on Design Automation of Electronic Systems, the IEEE TRANSACTIONS ON MOBILE COMPUTING, ACM Transactions on Embedded Computing Systems, the IEEE Design and Test of Computers, and the Journal of Low Power Electronics.

Kaushik Roy (F’01) received the B.Tech. degree in electronics and electrical communications engineering from IIT Kharagpur, Kharagpur, India, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 1990.
He was a Purdue University Faculty Scholar from 1998 to 2003. He was the M. K. Gandhi Distinguished Visiting Faculty Member with IIT Bombay, Mumbai, India. He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, TX, USA, where he was involved in field-programmable gate array architecture development and low-power circuit design. He joined the Electrical and Computer Engineering Faculty, Purdue University, West Lafayette, IN, USA, in 1993, where he is currently an Edward G. Tiedemann Jr. Distinguished Professor. He has authored over 600 papers in refereed journals and conferences, and graduated 65 Ph.D. students. He holds 15 patents. He has co-authored two books entitled Low Power CMOS VLSI Design (John Wiley and McGraw Hill). His current research interests include spintronics, device-circuit co-design for nanoscale silicon and nonsilicon technologies, low-power electronics for portable computing and wireless communications, and new computing models enabled by emerging technologies.
Dr. Roy was a recipient of the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the ATT/Lucent Foundation Award, the SRC Technical Excellence Award in 2005, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, the IEEE Circuits and Systems Society Technical Achievement Award in 2010, the Distinguished Alumnus Award from IIT Kharagpur, best paper awards at the International Test Conference in 1997, the IEEE International Symposium on Quality of IC Design in 1997, the IEEE Latin American Test Workshop in 2003, the IEEE Nano in 2003, the IEEE International Conference on Computer Design in 2004, and the IEEE/ACM International Symposium on Low Power Electronics and Design in 2006, the IEEE Circuits and System Society Outstanding Young Author Award (Chris Kim) in 2005, the IEEE TRANSACTIONS ON VLSI SYSTEMS Best Paper Award in 2006, the ACM/IEEE International Symposium on Low Power Electronics and Design Best Paper Award in 2012, and the IEEE TRANSACTIONS ON VLSI SYSTEMS Best Paper Award in 2013. He is serving as a Fulbright-Nehru Distinguished Chair, and a DoD National Security Science and Engineering Faculty Fellow from 2014 to 2019. He was a Research Visionary Board Member of Motorola Laboratories in 2002. He has been on the Editorial Board of the IEEE Design and Test, the IEEE TRANSACTIONS ON CIRCUITS [...] of the Special Issue on Low-Power VLSI of the IEEE Design and Test [...] Proceedings–Computers and Digital Techniques in 2002, and the IEEE [...] SYSTEMS in 2011.