Embedding Read-Only Memory in Spin-Transfer Torque MRAM-Based On-Chip Caches

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Xuanyao Fong, Student Member, IEEE, Rangharajan Venkatesan, Student Member, IEEE,
Dongsoo Lee, Member, IEEE, Anand Raghunathan, Fellow, IEEE, and Kaushik Roy, Fellow, IEEE
Abstract: We propose a design technique for embedding read-only memory (ROM) in spin-transfer torque MRAM (STT-MRAM) arrays by adding an extra bit-line in every column of the array. RAM and ROM data, which can be different, are stored in the same bit-cell and the ROM capacity may be as large as the RAM capacity. Furthermore, our proposed ROM-embedding technique is applicable to any resistive memory technology in which the bit-cell topology is identical to that of the STT-MRAM bit-cell. An additional sense amplifier is required in the peripheral circuitry, hence we propose an area-optimized peripheral circuitry to minimize the total area penalty of embedding ROM. Our analysis reveals that the ROM may be embedded in the STT-MRAM array without area overhead and without any penalty in the performance of the memory as RAM. Furthermore, our simulations show that the embedded ROM may be used to accelerate applications that use lookup tables with as much as 30% improvement in instructions per cycle of a processor using ROM-embedded STT-MRAM for its L2 cache.

Index Terms: Accelerating function evaluation, cache memories, emerging technologies, magnetic RAM, nonvolatile RAM, read-only memory (ROM), ROM-embedded spin-transfer torque (STT)-MRAM (R-MRAM), simulation, STT-MRAM.

Manuscript received August 26, 2014; revised January 17, 2015; March 22, 2015; and May 2, 2015; accepted May 14, 2015. This work was supported in part by STARnet, a Semiconductor Research Corporation Program through MARCO and Defense Advanced Research Projects Agency, in part by the Semiconductor Research Corporation, and in part by Intel Corporation.
X. Fong, R. Venkatesan, A. Raghunathan, and K. Roy are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]).
D. Lee is with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2015.2439733
1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Many applications, such as digital signal processing, math libraries, and on-chip built-in self-test, use data that are determined at design time and stay constant during runtime (which we refer to as static data). Static data may be stored as lookup tables in on-chip read-only memory (ROM). However, storing large amounts of static data in on-chip ROM incurs significant area and power overheads. An alternative method is to store static data off-chip. In this case, the processor has to fetch the required static data into on-chip cache during program execution, which leads to performance degradation. The problem is further exacerbated when data for each program running concurrently are mapped to the same cache location (also called cache thrashing). The static data used by a program may need to be evicted from cache as a result and the processor has to fetch the evicted data from off-chip memory again in order to continue program execution. Hence, realizing an on-chip ROM with minimal overhead allows static data to be stored closer to the processor, and may be used to accelerate the execution of applications.

A method for embedding ROMs in SRAM-based on-chip cache (called R-SRAM) was presented in [1]. R-SRAM may be viewed as a special type of resettable RAM. When ROM data are needed, the RAM data stored at the corresponding memory location is overwritten with ROM data. This is similar to a reset operation except that the state which the bit-cell resets to is determined by the physical connection of the bit-cell [1]. Thus, the RAM data stored at the corresponding ROM location needs to be copied to a buffer first. The reset operation is then performed in one clock cycle, and the ROM data are read out in the following cycle. Finally, RAM data in the buffer are copied back into the RAM memory location. The RAM capacity of R-SRAM is not impacted by the embedded ROM, since every memory cell stores both a RAM bit and a ROM bit; therefore, the ROM capacity can be as large as the RAM capacity. The high latency of the ROM read operation, which requires multiple steps as outlined above, limits the performance improvement of applications that use the embedded ROM.

Spin-transfer torque MRAM (STT-MRAM) has emerged recently as the leading technology candidate for nonvolatile on-chip cache memory [2], [3]. Furthermore, a methodology for embedding ROM in STT-MRAM was proposed in [4]. The proposed ROM-embedded STT-MRAM (R-MRAM) behaves as a dual mode (RAM mode and ROM mode) memory system in contrast to R-SRAM: RAM data are stored in the storage element in the bit-cell, whereas ROM data are stored as the selective connection of the bit-cell to one of two bit-lines (BLs) in the column. As we will show later, the RAM data are not overwritten in R-MRAM when ROM data are accessed. In contrast to R-SRAM, backup and restore operations and buffer storage to temporarily store RAM data are not needed in R-MRAM. Furthermore, R-MRAM is nonvolatile and may be completely turned OFF during idle to save on leakage power, whereas R-SRAM will suffer from data loss if it is turned OFF during idle. Moreover, R-MRAM-based cache may offer as much as 3× higher capacity than R-SRAM-based cache at iso-array area [5]. The larger ROM capacity in R-MRAM allows more static data to be stored closer to the processor for accelerating applications that use them. Hence, R-MRAM is a promising alternative to R-SRAM whereby embedded ROM
functionality is available at lower overheads than R-SRAM.

However, design issues in standard STT-MRAM need to be overcome in order for STT-MRAM to become viable for high-performance ultralow power on-chip cache applications [2]. The two-terminal nature of standard STT-MRAM means that it suffers from source degeneration of the access transistor (ATx) during write operations and hence has high write energy. More importantly, sensing of data from standard STT-MRAM bit-cells (SSCs) must be done using single-ended sensing schemes. Such sensing schemes are slow in order to enhance tolerance against process variations. A complementary polarizer STT (CPSTT) MRAM was recently proposed for overcoming these design issues [6].

In this paper, we perform the following:
1) propose a methodology for embedding ROM in CPSTT-based cache [ROM-embedded CPSTT (R-CPSTT)];
2) propose peripheral circuitry for R-MRAM and for R-CPSTT that minimizes area overhead;
3) evaluate the efficacy of R-MRAM and R-CPSTT at the system level.

We perform an in-depth evaluation of the proposed design using a systematic device-circuit-architecture evaluation framework. Our results show that an iso-area replacement of SRAM cache with the proposed R-MRAM cache leads to significant benefits in performance and energy. As we will show later, our proposed technique for embedding ROM in the RAM array may be used in other resistive nonvolatile memory technology in which the bit-cell topology is the same as that in R-MRAM and R-CPSTT.

The rest of this paper is organized as follows. Section II discusses the preliminaries of STT-MRAM and CPSTT. Our proposal for R-CPSTT is then presented in Section III, and compared with R-MRAM. Area-optimized peripheral circuits for R-MRAM and for R-CPSTT are also presented. In Section IV, we present simulation results to evaluate the effectiveness of R-MRAM and R-CPSTT in accelerating the evaluation of complex math functions. The effectiveness of R-MRAM, R-CPSTT, and R-SRAM in accelerating several applications is also explored. Finally, the conclusion is drawn in Section V.

II. PRELIMINARIES

The SSC, as shown in Fig. 1(a), consists of an ATx and a magnetic tunnel junction (MTJ), which is the storage element. The MTJ consists of a magnetically pinned ferromagnetic layer (PL), a tunneling oxide barrier (MgO), and a free ferromagnetic layer (FL). Data are stored as the FL magnetization relative to the PL magnetization. When a charge current flows between the FL and the PL, the flowing electrons transfer their spin angular momentum to the FL and the PL. Since the PL is magnetically pinned, its magnetization does not change. However, the spin angular momentum transferred to the FL exerts a torque on the FL magnetization. When the electrons flow from the FL to the PL, the torque exerted anti-parallelizes the FL magnetization with that of the PL. On the other hand, when the electrons flow from the PL to the FL, the torque exerted parallelizes the FL magnetization with that of the PL. However, the amount of time needed for the torque to completely switch the FL magnetization depends on the amount of current flowing through the MTJ. The FL magnetization switches only when the amount of current flow is larger than the critical switching current (I_C) for switching time t_SW.

Fig. 1. (a) Structure of an SSC and the MTJ with perpendicular magnetic anisotropy. MTJ configurations and the current directions and the corresponding direction of magnetization reversal are also shown. (b) The sensing circuitry for read operations in SSC.

The relative magnetization of the FL and the PL may be sensed as the resistance of the MTJ (R_MTJ) using the circuitry shown in Fig. 1(b). R_MTJ is low or R_L (high or R_H) when FL and PL magnetizations are parallel (anti-parallel). The distinguishability between MTJ resistance states is called the tunneling magnetoresistance ratio, TMR = (R_H − R_L)/R_L. R_MTJ may be sensed by applying a voltage across the bit-cell (V_READ) and comparing the current flowing through it (I_READ) to a reference current (I_REF), also known as the current sensing scheme. Alternatively, a fixed I_READ may be passed through the bit-cell and the V_READ developed across it is compared with a reference voltage (V_REF), also known as the voltage sensing scheme.

Since current flows through the SSC during read and write operations, there are conflicting design requirements on the amount of current flow for read and for write. Furthermore, the two-terminal nature of SSC makes it difficult to optimize for read and write operations. Consider the need for bidirectional current flow to write data into SSCs, as shown in Fig. 1(a). The gate overdrive voltage of the ATx is V_GS = V_DD when the write current flows from BL to source-line (SL). However, a potential drop across the MTJ when write current flows from SL to BL reduces the gate overdrive voltage of the ATx to V_GS = V_DD − V_MTJ. Hence, the ATx needs to be upsized to ensure that the write current is larger than I_C in both directions of write current flow. Doing so may lead to excessive current flow when write current flows from BL to SL and hence excessive write power dissipation. The reliability of the tunneling barrier is also reduced. Furthermore, a larger ATx allows more current to flow through the MTJ during read operations and may cause accidental switching of the MTJ. More importantly, sensing of data from two-terminal STT-MRAM bit-cells is done using single-ended sensing schemes, which are not robust against process variations.

The CPSTT-MRAM bit-cell was proposed to mitigate the aforementioned design issues in SSC [6]. The CPSTT
structure consists of an FL, a tunneling oxide layer, and two complementary PLs, as shown in Fig. 2(a). At any time, the FL magnetization is parallel to one PL and anti-parallel to the other PL. Hence, data in CPSTT may be sensed using a latch to compare the resistances between the FL and each PL, as shown in Fig. 2(b). The self-referencing and differential nature of data sensing in CPSTT is robust against process variations and may be very fast. Write operations in CPSTT, shown in Fig. 2(c), occur by steering write current (I_WRITE) through one of two complementary polarized PLs. For example, BL, wordline (WL), and SLR are charged to V_DD, and SLL is discharged to GND to write a ‘0’. The write current parallelizes the FL magnetization with that of the left PL. To write ‘1’ instead, SLL is charged to V_DD, and SLR is discharged to GND, so the write current parallelizes the FL magnetization with the right PL. Since write current always flows from BL to SLL or SLR, source degeneration of the ATxs is avoided. Hence, the size of each ATx may be much smaller than in SSCs.

Fig. 2. (a) Structure of a CPSTT-MRAM bit-cell. A latch is used for the read operation, as shown in (b). Bias voltages of the CPSTT bit-cell during (b) read and (c) write operation are shown.

In this paper, we propose SSC-based and CPSTT-based caches that can operate in RAM mode or in ROM mode, which are called R-MRAM and R-CPSTT, respectively. Every bit-cell in R-MRAM and R-CPSTT is a single-level cell that stores both RAM and ROM data, which do not have to be the same. Data may be written to or read from any memory address during the RAM mode of operation. In the ROM mode of operation, only data that are programmed into the bit-cell layout during design time may be read from any memory address. Since RAM data are stored in the storage element in each bit-cell, and ROM data are stored as the connection of the bit-cell, the proposed bit-cell designs do not compromise the density benefits of spin-based memories. Section III describes R-MRAM and R-CPSTT in detail.

III. EMBEDDING READ-ONLY MEMORY IN STT-MRAM-BASED ON-CHIP CACHES

The key insight used to enable R-MRAM and R-CPSTT is the fact that an additional BL may be added to the cache arrays without bit-cell area penalty if the ATx is sufficiently large. ROM data may then be programmed as the selective connection of the bit-cell to one of the two available BLs (BL0 and BL1 in Fig. 3) during design time. During ROM mode operation, data are sensed by determining whether the bit-cell is connected to BL0 or to BL1. On the other hand, BL0 and BL1 are electrically connected during RAM mode of operation. Note that ROM access and RAM access cannot occur simultaneously.

Fig. 3. Selective connection of (a) SSC and (b) CPSTT bit-cells to BL0 or BL1 allows ROM data to be programmed. Two BLs (BL0 and BL1) are needed but there is no area overhead when the ATx width is sufficiently large.

A. ROM-Embedded STT-MRAM

One design of R-MRAM was proposed in [4]. Fig. 4 shows an example of a column of the R-MRAM array proposed in [4] storing some ROM data. BL0, BL1, and SL are shared along the column of the array, whereas WL is shared along a row. The R-MRAM array requires two sense amplifiers, because BL0 and BL1 are not physically connected. Bit-cells that are connected to BL0 are programmed to store ROM data value of ‘0’, whereas those connected to BL1 are programmed to store ROM data value of ‘1’. The WL is turned ON to select a row of cells and current may flow through only one bit-cell in the column. During RAM write operations, the write driver ensures that both BL0 and BL1 are at the same voltage. The relative voltages of SL and the BLs depend on DataIn. During RAM read operations, SL is discharged to GND and the read bias generators act as a current source that drives current into BL0 and BL1. The sense amplifiers compare the voltage on BL0 and BL1 to a common reference voltage, which is lower than V_DD. Note that the reference voltage depends on whether the ROM or RAM data are required. If the voltage on the BL is higher than the reference voltage, the sense amplifier outputs a ‘1’, and a ‘0’ otherwise.

Fig. 4. Structure of the R-MRAM proposed in [4]. Every bit-cell may be programmed with RAM data. In addition, the physical connection of the bit-cell to BL0 or BL1 stores the ROM data. Bit-cells connected to BL0 store ROM data ‘0’, whereas those connected to BL1 store ROM data ‘1’.

Fig. 5 shows an example column of R-MRAM where a bit-cell storing ROM data bit ‘1’ is selected for reading. The
unselected cells in the column are marked with an ‘X’, and the selected bit-cell is connected to BL1. The output of the sense amplifier connected to BL1 depends on the resistance of the selected bit-cell. Since BL0 is a high impedance node, the current from the read bias generator charges BL0 to a voltage close to V_DD. Hence, the sense amplifier connected to BL0 will output a ‘1’. For a ROM read operation, the output of the sense amplifier connected to BL0 gives the result and is sent to the array output (ROMOut in Fig. 5). For a RAM read operation, the result of the read operation must be determined by the resistance of the selected bit-cell. The sense amplifier connected to BL1 in Fig. 5 gives the correct result for the RAM read operation. However, if the selected bit-cell was connected to BL0 instead of BL1, the correct result of the RAM read operation is given by the sense amplifier connected to BL0. Note that if a BL is a high impedance node, the sense amplifier connected to it will output ‘1’ during read operations. During RAM read operation, one of the two sense amplifiers will output ‘1’ because the BL connected to it is the high impedance node. The output of the other sense amplifier depends on the resistance of the selected bit-cell. Hence, the result of the RAM read operation is obtained by AND-ing the outputs of both sense amplifiers (RAMOut in Fig. 5).

Fig. 5. Current flow in a selected bit-cell connected to BL1.

Fig. 6. Our proposed ROM-embedded MRAM uses pass gates to electrically connect BL0 and BL1 during RAM mode operation so only one sense amplifier is needed for RAM mode read operations. ROM mode read operations use a latch to determine which BL is the high impedance node.

Fig. 7. Current flow in a selected bit-cell connected to BL0 during RAM mode operation.
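The read logic of the design in [4], as described above, can be summarized with a small behavioral model. The sketch below is illustrative only: the supply voltage, read current, MTJ resistances, and both reference voltages are assumed numbers chosen for the example (none are taken from this paper or from [4]), a floating BL is idealized as sitting exactly at the supply voltage, and the names `Cell`, `read_rom`, and `read_ram` are hypothetical. The encoding of a high resistance state as RAM ‘1’ follows the convention used later in Section IV-C.

```python
from dataclasses import dataclass

# All numeric values below are assumptions made for illustration.
V_DD = 1.0              # supply voltage (V)
I_READ = 20e-6          # current from each read bias generator (A)
R_L, R_H = 3e3, 6e3     # low / high MTJ resistance (ohm)

# RAM reference sits between the two possible cell voltages;
# ROM reference sits between the highest cell voltage and V_DD.
V_REF_RAM = I_READ * (R_L + R_H) / 2.0
V_REF_ROM = (I_READ * R_H + V_DD) / 2.0


@dataclass
class Cell:
    ram_bit: int   # 0 -> low resistance, 1 -> high resistance
    rom_bit: int   # 0 -> cell wired to BL0, 1 -> cell wired to BL1


def bitline_voltages(cell):
    """Return (V_BL0, V_BL1) when this cell is the selected one."""
    v_cell = I_READ * (R_H if cell.ram_bit else R_L)
    # The BL without the selected cell is a high impedance node and is
    # charged to (approximately) V_DD by its read bias generator.
    if cell.rom_bit == 0:
        return v_cell, V_DD
    return V_DD, v_cell


def sense(v_bl, v_ref):
    """Sense amplifier: '1' if the BL voltage exceeds the reference."""
    return int(v_bl > v_ref)


def read_rom(cell):
    """ROMOut: output of the sense amplifier on BL0."""
    v_bl0, _ = bitline_voltages(cell)
    return sense(v_bl0, V_REF_ROM)


def read_ram(cell):
    """RAMOut: AND of both sense amplifier outputs."""
    v_bl0, v_bl1 = bitline_voltages(cell)
    return sense(v_bl0, V_REF_RAM) & sense(v_bl1, V_REF_RAM)


if __name__ == "__main__":
    for ram in (0, 1):
        for rom in (0, 1):
            c = Cell(ram_bit=ram, rom_bit=rom)
            print(c, "ROMOut =", read_rom(c), "RAMOut =", read_ram(c))
```

In this model the floating BL always reads as ‘1’, so the AND leaves only the contribution of the BL that carries the selected bit-cell, which is why either sense amplifier may end up determining the RAM result.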
Fig. 8. Current flow in a selected bit-cell connected to BL0 during ROM mode of operation.

In the aforementioned design, both sense amplifiers need to be designed to reduce sensing failures during RAM mode of operation, because the result of the RAM read operation can come from either of them. Thus, the area overhead from the sense amplifiers may be significant. Furthermore, the ROM mode read operation may be limited by the sensing speed of the sense amplifiers, which must meet RAM mode read operation requirements. To overcome these issues, we propose modifications to the peripheral circuitry, as shown in Fig. 6. Two sense amplifiers are still needed but one is used exclusively for RAM mode read operations and the other is used exclusively for ROM mode read operations. Consider the operation of the array when the selected bit-cell in the column is connected to BL0, as shown in Fig. 7. During RAM mode operations, EnRAM is asserted to turn ON the pass transistors so that BL0 and BL1 are electrically connected. The write driver can directly drive both BLs and SL during RAM mode write operations. During RAM mode read operations, current from the read bias generator flows through the pass transistors and the selected bit-cell to SL. As a result, a voltage appears on the positive input of the sense amplifier. The value of this voltage depends on the resistance of the selected bit-cell. The sense amplifier compares the voltage at its positive input to a reference voltage and outputs a ‘0’ if the reference voltage is higher. Otherwise, the sense amplifier outputs a ‘1’. During ROM mode read operations, EnRAM is deasserted to turn OFF the pass transistors. The latch is turned ON to determine which BL is the high impedance node. When the latch is turned ON, there is a current path from BL0 to V_DD through M1–M4, and a current path from BL1 to V_DD through M1 and M6–M8 (Fig. 6). Due to the cross-coupled inverter action in the latch, the BL that is the high impedance node will get charged to V_DD, while the other BL is discharged to GND. During the ROM read operation in the scenario shown in Fig. 8, BL0 is discharged to GND and ROMOut outputs a ‘0’. If the selected bit-cell is connected to BL1 instead, BL0 is charged to V_DD and ROMOut outputs a ‘1’. Since only one of BL0 or BL1
has a direct path to GND through the selected SSC, a minimum-sized latch may be used as the sense amplifier for ROM mode read operations. In comparison with the design in [4], which requires two large op-amps per column to perform RAM and ROM sensing, our proposed design uses only one identical op-amp and one minimum-sized latch per column for sensing RAM and ROM data, respectively, resulting in significantly lower area overhead.

B. ROM-Embedded CPSTT

CPSTT-based cache was proposed in [6]. However, the BL is shared across the row and hence embedding of ROM is not possible. Fig. 9 shows the modifications needed to enable R-CPSTT. The BLs are shared across the column instead of across the row, as shown in Fig. 9. An additional pair of pass transistors allows the sense amplifier to sense data from the SLs during read operation in RAM mode, and from the BLs during read operation in ROM mode. Consider the RAM mode operation on a selected bit-cell that is connected to BL0, as shown in Fig. 10. The unselected cells are marked with an ‘X’. ROMColSel is deasserted, so the pass transistors connecting BL0 and BL1 to the latch-based sense amplifier are turned OFF. During write operations, the latch is turned OFF and the BL driver drives both BL0 and BL1 to V_DD. If WrData is ‘0’, the write driver drives SLL to GND and SLR to V_DD. If instead WrData is ‘1’, the write driver drives SLR to GND and SLL to V_DD. For RAM mode read operations, the write driver is turned OFF and both BL0 and BL1 are driven to GND by the BL driver instead. RAMColSel is asserted to turn ON the pass transistors, and the sense amplifier is turned ON to compare the resistances through the selected bit-cell from SLL to BL0 and from SLR to BL1. Due to the cross-coupled inverter effect in the latch-based sense amplifier, SLR will be charged to V_DD, and SLL is discharged to GND if the resistance from SLL to BL0 is lower than that from SLR to BL0. Hence, the RAM mode behavior of R-CPSTT is just like in the conventional CPSTT.

Fig. 9. (a) R-CPSTT (right) is implemented from CPSTT (left) by sharing BLs columnwise instead. (b) An additional pair of pass transistors allows the sense amplifier [shown in Fig. 6 (bottom)] to sense RAM data from the SLs and ROM data from the BLs. Bit-cells storing ROM 0 (1) are connected to BL0 (BL1).

Fig. 10. Current flow in R-CPSTT during RAM mode operation when the selected bit-cell is connected to BL0.

Consider instead the ROM mode behavior of R-CPSTT, as shown in Fig. 11. In ROM mode, RAMColSel is deasserted and the SL driver drives both SLL and SLR to GND. Also, ROMColSel is asserted to connect the sense amplifier inputs to BL1 and BL0. During ROM read operations, the negative input of the sense amplifier is a high impedance node, while there is a current path from the positive input to GND through BL0 and the selected bit-cell. Hence, BL1 gets charged to V_DD, and BL0 is discharged to GND by the cross-coupled inverter action of the sense amplifier. BL0 is charged to V_DD, and BL1 is discharged to GND if the selected bit-cell is connected to BL1 instead of BL0. If BL0 is discharged to GND, the sense amplifier outputs ‘0’. If BL0 is at V_DD instead, the sense amplifier outputs ‘1’. Hence, ROM data stored as the selective connection of the CPSTT bit-cells to BL0 and BL1 may be read out. Note that the sense amplifier is shared between RAM mode read operations and ROM mode read operations. Hence, the area overhead in peripheral circuitry in R-CPSTT is just an additional set of pass transistors and additional logic in the BL and SL drivers.

Fig. 11. Current flow in R-CPSTT during ROM mode operations when the selected bit-cell is connected to BL0.

IV. RESULTS AND DISCUSSION

A. Simulation Framework

We implemented the simulation framework in [7] so as to evaluate the performance of R-MRAM and R-CPSTT. Our simulation models were first calibrated to experimentally measured data [8], [9] before evaluating R-MRAM and R-CPSTT. Micromagnetic simulations were performed using the Object-Oriented MicroMagnetic Framework (OOMMF) [10] to estimate the critical switching current (I_C) of R-MRAM
and R-CPSTT in RAM mode. The R-MRAM and R-CPSTT bit-cells are then simulated in the HSPICE circuit simulator [11] to evaluate the bit-cell level performance. The transient behavior of R-MRAM and R-CPSTT in RAM and ROM modes is also verified in HSPICE. The parameters used in our simulations are tabulated in Table I. Results from HSPICE simulations are then used for simulations in a modified version of the CACTI 6.5 cache modeling tool [12] and the SimpleScalar architectural simulator [13].

TABLE I. BIT-CELL SIMULATION PARAMETERS

For RAM mode operation, we performed architectural simulations using a wide range of workloads from the SPEC2K6 benchmark suite. All our simulations in RAM mode were performed for 1 billion instructions after warming up the cache by fast forwarding for 1 billion instructions. To evaluate the benefits of the proposed R-MRAM and R-CPSTT in ROM mode, we have developed four custom microbenchmarks that represent the compute kernels in various scientific and engineering applications. The microbenchmarks consist of multiple function calls to different compute kernels that are typically implemented using lookup tables, due to their high computational complexity. In this paper, we exploit the ROM space in R-MRAM and R-CPSTT to store these lookup tables and accelerate the computation of the kernels that use these tables. Inputs and outputs of the functions are IEEE double precision floating point numbers with at least 65-bit accuracy. The average latency for each function evaluation is determined and used as the metric for the effectiveness of R-MRAM and R-CPSTT. A brief description of these benchmarks is presented as follows.

1) Math Functions: We designed two microbenchmarks that compute sine (sin) and logarithm (log), two widely used math functions. Three steps are generally needed in the evaluation of complex math functions using Intel’s math library [1], [14]: 1) range reduction; 2) approximation; and 3) reconstruction. A power series is evaluated in step 2 to approximate the result of the function evaluation. A lookup table is used in step 3 and combined with the result from step 2 to obtain the accurate result of the function evaluation. The evaluation of the approximating polynomial and looking up data in the table may be executed in parallel. The degree of accuracy of the function evaluation depends on the size of the lookup table and degree of the approximating polynomial. To achieve a given accuracy in the result of the function evaluation, the degree of the approximating polynomial used needs to be high if the size of the lookup table is small. Alternatively, the degree of the approximating polynomial may be reduced by increasing the size of the lookup table. As shown in [1], the evaluation latency can be large if the degree of the polynomial used for step 2 is high (since it takes longer to evaluate the polynomial) or the lookup table used in step 3 is large (since the chance of a cache miss is high). In this paper, we utilize a lookup table consisting of 16 k entries and an approximating polynomial of degree 4 to compute sin and log (unless stated otherwise).

2) Neural Network: Neural networks are being increasingly used in a wide range of recognition and mining applications, such as face detection in Google+ and voice recognition in Apple Siri. Neural networks typically consist of several layers of neurons, each of which employs a nonlinear activation function that determines its firing rate [15]–[17]. The most commonly used activation function is the sigmoidal function given by V_o = 1/(1 + exp(−V_i)), where V_i is the input potential and V_o is the output potential of the neuron. In this paper, we developed a microbenchmark that computes the activation function using a lookup table of the exponential function. The benchmark uses a lookup table of size 128 kB and computes 200 iterations of the activation function for different inputs.

3) Gamma Correction: Gamma correction is a commonly used image processing compute kernel that is used to improve the quality of an image in the region sensitive to the human eye. Mathematically, gamma correction is given by V_out = V_in^γ, where V_in and V_out are the input and output values, respectively. A gamma value of 0.45 is typically used for gamma encoding. In this paper, we implemented gamma correction using a lookup table of size 16 k values.

B. Layout Comparisons

Fig. 12. Array layout of (a) R-MRAM and (b) R-CPSTT.

To compare R-MRAM and R-CPSTT at iso-bit-cell area, their layouts are used to determine the size of the ATx in R-MRAM and the sizes of the ATxs in R-CPSTT. Several layouts for R-MRAM and R-CPSTT, drawn using λ-based layout rules [18], were explored, and Fig. 12 shows the R-MRAM and R-CPSTT layouts used for our comparisons in the rest of this paper. The bit-cell area versus ATx width for R-MRAM and R-CPSTT are plotted in Fig. 13. Comparisons with SSC and CPSTT show that ROM may be embedded without bit-cell area penalty if the ATx is large (Fig. 13). Note that the minimum ATx size that can be used may be limited by the write current requirement. Techniques, such as
WL voltage boosting [6], may be used to reduce the ATx size and can cause the bit-cell area of R-MRAM and R-CPSTT to become metal pitch limited. The bit-cell area overhead of embedding ROM when this is the case can be obtained by comparing the bit-cell areas graphed in Fig. 13. For the following comparisons between R-MRAM and R-CPSTT, the bit-cell area is fixed at 0.1664 μm². The corresponding ATx widths in R-MRAM and in R-CPSTT are shown in Fig. 13.

Fig. 13. Bit-cell area versus ATx width of SSC, CPSTT, R-MRAM, and R-CPSTT. Vertical lines: when the layout transitions to one using fingered ATxs. The bit-cell area does not change with ATx width if the layout is limited by contact or metal pitch.

TABLE II. ISO-V_READ COMPARISON OF SENSING MARGINS AT V_DD = 1 V, 2 ns READ CYCLE

TABLE III. ISO-V_READ COMPARISON OF DISTURB MARGINS AT V_DD = 1 V, 2 ns READ CYCLE

TABLE IV. ISO-WRITE MARGIN V_DD AND AVERAGE WRITE POWER/BIT
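The three-step, table-assisted evaluation described for the math-function microbenchmarks in Section IV-A (range reduction, polynomial approximation, reconstruction from a lookup table) can be sketched for sin as follows. This is a minimal illustration under stated assumptions, not the Intel math library routine and not the authors' benchmark code: the uniform breakpoint spacing, the angle-addition reconstruction, the Taylor coefficients, and the names `SIN_ROM`, `COS_ROM`, and `lut_sin` are all choices made for this example. Only the table size (16 k entries) and the polynomial degree (at most 4) are taken from the text, and the two Python lists merely stand in for data that would reside in the embedded ROM.

```python
import math

TABLE_SIZE = 16 * 1024                 # 16 k entries, as in the text
STEP = 2.0 * math.pi / TABLE_SIZE      # assumed uniform breakpoint spacing

# Stand-ins for the ROM contents: sin and cos at every breakpoint.
SIN_ROM = [math.sin(k * STEP) for k in range(TABLE_SIZE)]
COS_ROM = [math.cos(k * STEP) for k in range(TABLE_SIZE)]


def lut_sin(x):
    """Approximate sin(x) with a table lookup plus a low-degree polynomial."""
    # Step 1: range reduction to [0, 2*pi], then split into the nearest
    # breakpoint index k and a small residual r with |r| <= STEP / 2.
    x = math.fmod(x, 2.0 * math.pi)
    if x < 0.0:
        x += 2.0 * math.pi
    k = int(round(x / STEP))
    r = x - k * STEP
    k %= TABLE_SIZE                    # k == TABLE_SIZE wraps to breakpoint 0

    # Step 2: polynomial approximation of sin(r) and cos(r), degree <= 4.
    r2 = r * r
    sin_r = r * (1.0 - r2 / 6.0)
    cos_r = 1.0 - r2 / 2.0 + r2 * r2 / 24.0

    # Step 3: reconstruction, sin(a + r) = sin(a)cos(r) + cos(a)sin(r).
    return SIN_ROM[k] * cos_r + COS_ROM[k] * sin_r


if __name__ == "__main__":
    for value in (0.5, 3.0, -2.0, 10.0):
        print(value, lut_sin(value), math.sin(value))
```

Shrinking `TABLE_SIZE` enlarges the residual `r`, so a higher-degree polynomial would be needed for the same accuracy, which is the trade-off between table size and polynomial degree noted in Section IV-A.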
C. RAM Mode Performance Evaluation

The RAM mode read performance of R-MRAM and R-CPSTT depends on the sensing scheme used. Since a self-referenced differential sensing scheme can be used for R-CPSTT but not for R-MRAM, the comparison of RAM mode read performance is done using a dc current sensing scheme for both R-MRAM and R-CPSTT. For the RAM mode read operation of R-MRAM, a fixed read voltage (VREAD) is applied across the bit-cell and the read current flowing through it (IREAD) is compared with a reference current, IREF. IREF is the average of IREAD,L (IREAD when the bit-cell stores a low resistance state or ‘0’) and IREAD,H (IREAD when the bit-cell stores a high resistance state or ‘1’). The sense amplifier outputs ‘1’ when IREAD < IREF, and ‘0’ when IREAD > IREF. For the RAM mode read operation of R-CPSTT, VREAD is applied to both SLs, which are shown in Fig. 2(a), while the BLs are grounded. The sense amplifier compares the IREAD flowing through SLL and through SLR. When IREAD through SLL is higher (lower) than IREAD through SLR, the sense amplifier outputs ‘0’ (‘1’). Note that in R-CPSTT, the bit-cell stores ‘0’ if the resistances between BL and SLL and between BL and SLR are low and high, respectively. The bit-cell stores ‘1’ instead if the resistances between BL and SLL and between BL and SLR are high and low, respectively. These are the only configurations possible in R-CPSTT, since the free layer is parallel to only one of the two pinned layers at any time. Hence, the sensing margin for R-MRAM is defined as min(|IREAD,L − IREF|, |IREAD,H − IREF|)/IREF, whereas it is defined as |IREAD,L − IREAD,H|/min(IREAD,L, IREAD,H) for R-CPSTT. The sensing margins of R-MRAM and R-CPSTT are compared in Table II. Read energy per bit of R-CPSTT is comparable with that of R-MRAM, because although a separate IREF is needed in R-MRAM, IREAD,AP and IREAD,P flow through each current path in R-CPSTT. Note that data stored in the bit-cell may be accidentally overwritten, because IREAD is flowing through the bit-cell during read operation, resulting in read disturb failure. Read disturb failures are minimized by ensuring that there is sufficient disturb margin, defined as (IC − IREAD)/IC. Note that the direction of IREAD is fixed and hence only one type of disturb failure can occur during read operations: either a stored ‘0’ being overwritten or a stored ‘1’ being overwritten. Table III compares the disturb margins of R-MRAM and R-CPSTT. Furthermore, HSPICE simulations performed to evaluate the read performance of R-CPSTT using a latch for sensing RAM data, as shown in Fig. 9(b), show that read operations up to 1.7 GHz are possible.

The complementary pinned layers in R-CPSTT need to be separated by an amount dependent on the layout rules. Hence, the FL in R-CPSTT needs to be enlarged so as to interface with both pinned layers, resulting in a larger IC (‘0’) compared with R-MRAM, as shown in Table I. However, R-MRAM requires bidirectional write current flow to program the bit-cells in RAM mode, whereas R-CPSTT always parallelizes the free layer with a pinned layer. Hence, IC (‘1’) of R-CPSTT can be lower than that of R-MRAM, as shown in Table I. Furthermore, the ATx’s are never source degenerated during R-CPSTT RAM mode write operations. Hence, the VDD for R-CPSTT to meet the required write margins (defined as write margin = (IWRITE − IC)/IC, where IWRITE is the current flowing through the bit-cell during write operation) can be substantially lower than that in R-MRAM to meet the same write margin. This is shown in Table IV. Note that the average IWRITE is higher in R-CPSTT than in R-MRAM, although VDD is lower. Hence, the average write power per bit may be higher in R-CPSTT than in R-MRAM.
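
The margin definitions used in this comparison can be written out as a short sketch. The function names and the example currents below are hypothetical and chosen only for illustration; they are not the values reported in Tables I to IV. The handling of IREAD = IREF is also an assumption, since the text does not define that case.

```python
"""Illustrative sketch of the RAM mode margin definitions (not the authors' code)."""


def rmram_reference_current(i_read_l, i_read_h):
    """IREF is the average of IREAD,L and IREAD,H."""
    return 0.5 * (i_read_l + i_read_h)


def rmram_sense(i_read, i_ref):
    """DC current sensing in R-MRAM: '1' when IREAD < IREF, '0' when IREAD > IREF.

    The tie IREAD == IREF is not defined in the text; it is treated as '0' here
    (assumption).
    """
    return 1 if i_read < i_ref else 0


def rcpstt_sense(i_sll, i_slr):
    """Differential sensing in R-CPSTT: '0' when the SLL current is higher,
    '1' when it is lower (ties treated as '1' here, an assumption)."""
    return 0 if i_sll > i_slr else 1


def rmram_sensing_margin(i_read_l, i_read_h):
    """min(|IREAD,L - IREF|, |IREAD,H - IREF|) / IREF."""
    i_ref = rmram_reference_current(i_read_l, i_read_h)
    return min(abs(i_read_l - i_ref), abs(i_read_h - i_ref)) / i_ref


def rcpstt_sensing_margin(i_read_l, i_read_h):
    """|IREAD,L - IREAD,H| / min(IREAD,L, IREAD,H)."""
    return abs(i_read_l - i_read_h) / min(i_read_l, i_read_h)


def disturb_margin(i_c, i_read):
    """(IC - IREAD) / IC."""
    return (i_c - i_read) / i_c


def write_margin(i_write, i_c):
    """(IWRITE - IC) / IC."""
    return (i_write - i_c) / i_c


if __name__ == "__main__":
    # Hypothetical currents in microamperes, for illustration only.
    i_l, i_h = 20.0, 10.0
    print("R-MRAM sensing margin :", rmram_sensing_margin(i_l, i_h))
    print("R-CPSTT sensing margin:", rcpstt_sensing_margin(i_l, i_h))
    print("Disturb margin        :", disturb_margin(i_c=40.0, i_read=i_h))
    print("Write margin          :", write_margin(i_write=60.0, i_c=40.0))
```

For the same pair of read currents, the differential definition used for R-CPSTT yields a larger number than the reference-based definition used for R-MRAM, because the former compares the two currents with each other while the latter compares each with their average.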
TABLE V
ARCHITECTURAL SIMULATION PARAMETERS

Fig. 14. RAM mode comparisons of R-MRAM and R-CPSTT at the architecture level.

Since the comparison of energy consumption at the bit-cell level does not account for the fact that read operations are more frequent than write operations in many cache applications, a system level simulation was done to compare the RAM mode performance of R-MRAM and R-CPSTT. Table V shows the processor configuration used to evaluate R-MRAM and R-CPSTT in the SimpleScalar architectural simulator [13] for a wide range of SPEC2K6 benchmarks. Fig. 14 shows the simulation results, which are normalized to R-MRAM results, for 2-MB L2 caches based on R-CPSTT and R-MRAM. The R-CPSTT-based L2 cache achieved 4% improvement in performance at 9% lower energy consumption as compared to the R-MRAM L2 cache.

D. ROM Mode Performance Evaluation

Since current flows through the selected bit-cells during ROM operation as well, the RAM data stored in them may be accidentally overwritten if this current is too large. Our simulation results show that the current flowing through R-CPSTT bit-cells during ROM mode and RAM mode read operations may be modeled as 80-ps-wide triangular pulses. The peak currents flowing through the selected bit-cells are 15 and 17 μA for ROM mode and RAM mode read operations, respectively. Since the ROM mode read current is smaller than the RAM mode read current and the duration for which the read current flows is the same, we may conclude that disturb failures due to embedding ROM in the R-CPSTT array are not worse compared with the CPSTT array without embedded ROM.

In the case of R-MRAM, current flows from SL to BL during all read operations. As explained earlier, a 2-ns read current pulse flows through selected bit-cells in R-MRAM during RAM mode operations. The nominal peak read current flowing through the bit-cell when it stores ‘0’ is 12.62 μA, as shown in Table I. During ROM mode read operations, the read current pulse is a 50-ps-wide triangular pulse with 10-μA peak current. Since both the current pulsewidth and peak read current for ROM mode read operations are smaller than the corresponding values for RAM mode read operations, we may conclude that disturb failures due to embedding ROM in the R-MRAM array are not worse compared with the STT-MRAM array without embedded ROM.

Fig. 15. Summary of ROM results.

Many applications, such as the evaluation of complex math functions, digital signal processing, and on-chip built-in self-test, use constant data in the form of lookup tables. Consider, for example, the evaluation of transcendental math functions. Math libraries are commonly used for the evaluation of transcendental math functions. These libraries utilize lookup tables that are stored off-chip and hence a significant number of memory accesses take place when complex math functions are first called or when there are cache misses. Consider, for example, the first call to a complex math function during the execution of a computer program. A mandatory cache miss occurs and the processor needs to fetch the required lookup table data from the off-chip memory, which could take hundreds of clock cycles. Furthermore, the data already in cache may need to be evicted to accommodate the lookup table. As a result, the evaluation of such math functions may incur a significant number of clock cycles before completion.

These lookup tables may be stored closer to the processor as ROM that is embedded in on-chip cache. Hence, any access to the lookup tables will be a cache hit and the processor does not need to go off-chip to fetch the required data (which may take hundreds of cycles). To evaluate the efficacy of R-MRAM and R-CPSTT proposed in this paper, we study the benefits of R-MRAM and R-CPSTT to accelerate the evaluation of different compute kernels from various applications using microbenchmarks described in Section IV-A, compared with a conventional CMOS-based design. We also perform design space exploration to analyze the performance of the proposed designs for different implementations of the transcendental math functions.

Fig. 15 shows the comparison of the performance of the proposed R-MRAM and R-CPSTT designs with that of SRAM for different benchmarks. All the simulations are performed for 200 iterations of different compute kernels. We observe that R-MRAM and R-CPSTT designs can achieve up to
Fig. 16. Comparisons of evaluation latencies of (a) log(x) and (b) sin(x) using conventional SRAM cache (Conv.), R-MRAM, and R-CPSTT using 2-kB lookup tables. R-MRAM read latency is assumed to be twice that of SRAM and R-CPSTT.

Fig. 17. Comparisons of evaluation latencies of (a) log(x) and (b) sin(x) using conventional SRAM cache (Conv.), R-MRAM, and R-CPSTT using 128-kB lookup tables. R-MRAM read latency is assumed to be twice that of SRAM and R-CPSTT.

Fig. 18. Comparison of the total evaluation cycles for (top) log(x) and (bottom) sin(x) using different table sizes (and hence approximating polynomial) to achieve 65-bit accuracy.

29% and 31% improvement in execution time compared with SRAM. This is because the proposed designs store the lookup tables in the on-chip ROM, thereby eliminating the off-chip accesses that occur in a traditional CMOS design. Across all the benchmarks, the proposed R-MRAM and R-CPSTT designs achieve 20% and 22% improvement in performance, respectively.

1) Design Space Exploration: As explained in Section IV-A, the accuracy of the transcendental math functions’ evaluation depends on the size of the lookup table and degree of the approximating polynomial. In this section, we study the impact of the proposed designs for two different implementations of sin and log: 1) using a lookup table of size 128 kB and approximating polynomial of degree 4 and 2) using a lookup table of size 2 kB and approximating polynomial of degree 7. Fig. 16 shows the total evaluation latency of sin(x) and log(x) using the conventional SRAM cache (Conv.), R-MRAM, and R-CPSTT (normalized to the total evaluation latency using Conv.) when a 2-kB lookup table is used. As the number of function calls increases, we initially observe an increase in the improvement in performance relative to the Conv. case. Initial accesses to the lookup table result in cache misses in the Conv. case. Therefore, a larger fraction of execution time is dominated by accesses to the lookup table from memory when the number of function calls is small. As a result, increasing the number of function calls leads to large increases in execution time. However, further increase in the number of function calls increases the likelihood that the table data are completely loaded into L1 cache in the Conv. case. Hence, the improvement of R-MRAM and R-CPSTT over Conv. decreases when the number of function calls is more than 100. The improvements using R-MRAM over Conv. are ∼3% and ∼5% in evaluating log(x) and sin(x), respectively, while the improvements using R-CPSTT over Conv. in evaluating log(x) and sin(x) are ∼4% and ∼7%, respectively. We observed that the evaluation latency was dominated by the latency of evaluating the approximating polynomial when 2-kB lookup tables are used. Hence, the degree of the approximating polynomial is reduced to reduce evaluation latency. However, the total evaluation latency may become limited by cache read latency if the degree of the approximating polynomial is too low. Fig. 17 compares the improvement in performance while using 128-kB lookup tables for sin(x) and log(x). As shown in Fig. 17, R-MRAM and R-CPSTT can achieve more than 30% improvement in performance. The improvements remain high for a large number of function calls because the lookup table is not entirely in L1 cache. The inputs to the functions are random enough that some of the required entries of the lookup table may have been moved out of L1 cache in the Conv. case and need to be reloaded.

Fig. 18 shows the sensitivity of Conv. and R-CPSTT to the size of the lookup table with increasing number of function calls for log(x) (top) and sin(x) (bottom) evaluation, respectively. In the Conv. case, a small lookup table leads to lower execution times, because a large lookup table requires a large number of off-chip memory accesses. On the other hand, in the R-CPSTT case, the performance is optimal while using a larger lookup table size. The latency of table lookup is equal to the read latency of L2 cache in R-CPSTT. Thus, the
performance sensitivity to lookup table accesses in R-CPSTT is reduced. Furthermore, the degree of the approximating polynomial is small, which reduces the processor workload and further improves performance. Note also that the execution time of the R-CPSTT design is lower than that of the Conv. design for lookup tables of sizes 2 kB as well as 128 kB, demonstrating the efficacy of the proposed design.

V. CONCLUSION

We proposed R-MRAM and R-CPSTT caches and evaluated their efficiency in accelerating different compute kernels. Note that R-MRAM and R-CPSTT may be used to accelerate any application that uses lookup tables (complex math functions, decoding tables, test vectors for built-in self-test, and so on). Furthermore, the memory array can store the same amount of ROM and RAM data, and RAM data may be accessed independent of the ROM data stored. Also, the performance of the arrays in RAM mode is unaffected by the embedding of ROM. Our simulation results show that R-MRAM and R-CPSTT may help to lower the latency of kernel computation by more than 30%. The number of off-chip memory accesses is also dramatically reduced, which reduces the total system energy consumption.

REFERENCES

[1] D. Lee and K. Roy, “Area efficient ROM-embedded SRAM cache,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 9, pp. 1583–1595, Sep. 2013.
[2] Y. Huai, “Spin-transfer torque MRAM (STT-MRAM): Challenges and prospects,” AAPPS Bull., vol. 18, no. 6, pp. 33–40, 2008.
[3] H. Yoda et al., “Progress of STT-MRAM technology and the effect on normally-off computing systems,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2012, pp. 11.3.1–11.3.4.
[4] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT MRAM cache,” IEEE Electron Device Lett., vol. 34, no. 10, pp. 1256–1258, Oct. 2013.
[5] S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, and K. Roy, “Future cache design using STT MRAMs for improved energy efficiency: Devices, circuits and architecture,” in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2012, pp. 492–497.
[6] X. Fong and K. Roy, “Complimentary polarizers STT-MRAM (CPSTT) for on-chip caches,” IEEE Electron Device Lett., vol. 34, no. 2, pp. 232–234, Feb. 2013.
[7] X. Fong, S. K. Gupta, N. N. Mojumder, S. H. Choday, C. Augustine, and K. Roy, “KNACK: A hybrid spin-charge mixed-mode simulator for evaluating different genres of spin-transfer torque MRAM bit-cells,” in Proc. Int. Conf. Simulation Semiconductor Process. Devices, Sep. 2011, pp. 51–54.
[8] C. J. Lin et al., “45 nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1 T/1 MTJ cell,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2009, pp. 1–4.
[9] T. Kishi et al., “Lower-current and fast switching of a perpendicular TMR for high speed and high density spin-transfer-torque MRAM,” in Proc. IEEE Int. Electron Devices Meeting, Dec. 2008, pp. 1–4.
[10] Object-Oriented MicroMagnetic Framework (OOMMF). [Online]. Available: http://math.nist.gov/oommf, accessed Aug. 25, 2014.
[11] HSPICE. [Online]. Available: http://www.synopsys.com/Tools/Verification/AMSVerification/CircuitSimulation/HSPICE/, accessed Aug. 25, 2014.
[12] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0,” in Proc. 40th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Dec. 2007, pp. 3–14.
[13] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, Feb. 2002.
[14] J. Harrison, T. Kubaska, S. Story, and P. T. Tang, “The computation of transcendental functions on the IA-64 architecture,” Intel Technol. J., vol. 4, pp. 234–251, Nov. 1999.
[15] L. Chen, “Pattern classification by assembling small neural networks,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 3. Jul./Aug. 2005, pp. 1947–1952.
[16] H. Qiao, J. Peng, Z.-B. Xu, and B. Zhang, “A reference model approach to stability analysis of neural networks,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 33, no. 6, pp. 925–936, Jan. 2003.
[17] S. Razavi and B. A. Tolson, “A new formulation for feedforward neural networks,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1588–1598, Oct. 2011.
[18] C. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA, USA: Addison-Wesley, 1980.

Xuanyao Fong (S’06) received the B.Sc. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, USA, in 2006 and 2014, respectively.
He was an Intern Engineer with the Boston Design Center, Advanced Micro Devices, Inc., Boxborough, MA, USA, in 2007. He was a Research Assistant to Prof. K. Roy with the Nanoelectronics Research Laboratory, Purdue University, where he is currently a Post-Doctoral Research Assistant to Prof. K. Roy. His current research interests include device/circuit/architecture co-design for silicon and nonsilicon nanoelectronics, design of VLSI logic and memory systems using spintronic devices, circuits, and architectures, and non-Boolean and analog computing paradigms using emerging technologies.
Mr. Fong was a recipient of the AMD Design Excellence Award at Purdue University in 2008, and the best paper award at the International Symposium on Low Power Electronics and Design in 2006.

Rangharajan Venkatesan (S’09) received the B.Tech. degree in electronics and communication engineering from IIT Roorkee, Roorkee, India, in 2009, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2014.
He was a Research Intern with Intel Corporation, Hillsboro, OR, USA, in 2012 and 2013, during his Ph.D., where he is currently involved in developing low power design methodologies for graphics processors and designing circuits for enabling fine-grained power gating. His current research interests include circuit-architecture co-design for emerging technologies, neuromorphic hardware architectures, approximate computing, and variation-aware design methodologies.
Mr. Venkatesan received the Ross Fellowship from 2009 to 2010, and the Bilsland Dissertation Fellowship from the Graduate School, Purdue University, from 2013 to 2014. He was a recipient of the best paper award at the International Symposium on Low Power Electronics and Design in 2012, and the Best Paper Nomination in Design Automation Test in Europe in 2015.

Dongsoo Lee (M’13) received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, Korea, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2002, 2004, and 2013, respectively.
He was with Samsung Electronics Ltd., Suwon, Korea, from 2004 to 2008, where he was involved in research on designing circuits for DTV one-chip solutions. In 2011, he was a Graduate Intern with Qualcomm Incorporated, San Diego, CA, USA, and Intel Corporation, Hillsboro, OR, USA, in 2012. He has been with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member since 2013. His current research interests include low-power design, on-chip memory design, and design for test.

Anand Raghunathan (F’12) received the B.Tech. degree in electrical and electronics engineering from IIT Madras, Chennai, India, and the M.A. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, USA.
He was a Senior Research Staff Member with NEC Laboratories America Inc., Princeton. He held the Gopalakrishnan Visiting Chair with the Department of Computer Science and Engineering, IIT Madras. He is currently a Professor and the Chair of VLSI with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, where he leads the Integrated Systems Laboratory. He has co-authored a book entitled High-Level Power Analysis and Optimization, eight book chapters, and over 200 refereed journal and conference papers. He holds 21 U.S. patents. His current research interests include domain-specific architecture, system on chip design, embedded systems, and heterogeneous parallel computing.
Prof. Raghunathan is a Golden Core Member of the IEEE Computer Society. He was a recipient of the IEEE Meritorious Service Award in 2001, and the Outstanding Service Award in 2004. He received the Patent of the Year Award (recognizing the invention with the highest impact), and two Technology Commercialization Awards from NEC Laboratories America Inc. He was chosen by MIT’s Technology Review among the TR35 (top 35 innovators under 35 years, across various disciplines of science and technology) for his work on making mobile secure in 2006. His publications have been recognized with eight best paper awards and four best paper nominations. He has served on the technical program and organizing committees of several leading conferences and workshops. He has chaired the ACM/IEEE International Symposium on Low Power Electronics and Design, the ACM/IEEE International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, the IEEE VLSI Test Symposium, and the IEEE International Conference on VLSI Design. He has served as an Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN, the IEEE TRANSACTIONS ON VLSI SYSTEMS, ACM Transactions on Design Automation of Electronic Systems, the IEEE TRANSACTIONS ON MOBILE COMPUTING, ACM Transactions on Embedded Computing Systems, the IEEE Design and Test of Computers, and the Journal of Low Power Electronics.

Kaushik Roy (F’01) received the B.Tech. degree in electronics and electrical communications engineering from IIT Kharagpur, Kharagpur, India, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 1990.
He was a Purdue University Faculty Scholar from 1998 to 2003. He was the M. K. Gandhi Distinguished Visiting Faculty Member with IIT Bombay, Mumbai, India. He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, TX, USA, where he was involved in field-programmable gate array architecture development and low-power circuit design. He joined the Electrical and Computer Engineering Faculty, Purdue University, West Lafayette, IN, USA, in 1993, where he is currently an Edward G. Tiedemann Jr. Distinguished Professor. He has authored over 600 papers in refereed journals and conferences, and graduated 65 Ph.D. students. He holds 15 patents. He has co-authored two books entitled Low Power CMOS VLSI Design (John Wiley and McGraw Hill). His current research interests include spintronics, device-circuit co-design for nanoscale silicon and nonsilicon technologies, low-power electronics for portable computing and wireless communications, and new computing models enabled by emerging technologies.
Dr. Roy was a recipient of the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the ATT/Lucent Foundation Award, the SRC Technical Excellence Award in 2005, the SRC Inventors Award, the Purdue College of Engineering Research Excellence Award, the Humboldt Research Award in 2010, the IEEE Circuits and Systems Society Technical Achievement Award in 2010, the Distinguished Alumnus Award from IIT Kharagpur, best paper awards at the International Test Conference in 1997, the IEEE International Symposium on Quality of IC Design in 1997, the IEEE Latin American Test Workshop in 2003, the IEEE Nano in 2003, the IEEE International Conference on Computer Design in 2004, and the IEEE/ACM International Symposium on Low Power Electronics and Design in 2006, the IEEE Circuits and System Society Outstanding Young Author Award (Chris Kim) in 2005, the IEEE TRANSACTIONS ON VLSI SYSTEMS Best Paper Award in 2006, the ACM/IEEE International Symposium on Low Power Electronics and Design Best Paper Award in 2012, and the IEEE TRANSACTIONS ON VLSI SYSTEMS Best Paper Award in 2013. He is serving as a Fulbright-Nehru Distinguished Chair, and a DoD National Security Science and Engineering Faculty Fellow from 2014 to 2019. He was a Research Visionary Board Member of Motorola Laboratories in 2002. He has been on the Editorial Board of the IEEE Design and Test, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, the IEEE TRANSACTIONS ON VLSI SYSTEMS, and the IEEE TRANSACTIONS ON ELECTRON DEVICES. He was a Guest Editor of the Special Issue on Low-Power VLSI of the IEEE Design and Test in 1994, the IEEE TRANSACTIONS ON VLSI SYSTEMS in 2000, IEE Proceedings–Computers and Digital Techniques in 2002, and the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS in 2011.
